662. Revisiting Matrix Sketching in Linear Bandits: Achieving Sublinear Regret via Dyadic Block Sketching
💬 13
⭐ 6.50
📁 reinforcement learning
🏷️ Linear Bandits Matrix Sketching Multi-scale Sketching
Linear bandits have become a cornerstone of online learning and sequential decision-making, providing solid theoretical foundations for balancing exploration and exploitation.
Within this domain, matrix sketching serves as a critical component for achieving computational efficiency, especially when confronting high-dimensional problem instances.
The sketch-based approaches reduce per-round complexity from $\Omega(d^2)$ to $O(dl)$, where $d$ is the dimension and $l$ is the sketch size.
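For readers unfamiliar with the sketching primitive, the snippet below shows a generic Frequent Directions update that maintains an $l \times d$ sketch of a growing design matrix; it is illustrative only, and the paper's dyadic block scheme is not reproduced here.

```python
import numpy as np

def frequent_directions_update(B, x):
    """One Frequent Directions step: maintain an l x d sketch B of a data stream.

    Generic sketching primitive for illustration only (not the paper's dyadic
    block scheme). Start with B = np.zeros((l, d)) and feed feature vectors x;
    B.T @ B then approximates the stream's covariance X.T @ X.
    """
    l, d = B.shape
    # Place the new row into an empty (all-zero) slot if one exists.
    zero_rows = np.where(~B.any(axis=1))[0]
    if len(zero_rows) > 0:
        B[zero_rows[0]] = x
        return B
    # Otherwise shrink: SVD, then subtract the (l//2)-th squared singular value.
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    delta = s[l // 2] ** 2
    s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    B = np.diag(s_shrunk) @ Vt
    B[-1] = x  # last row is zero after shrinking; insert the new sample there
    return B
```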
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
663. ROC-n-reroll: How verifier imperfection affects test-time scaling
💬 9
⭐ 6.50
📁 learning theory
🏷️ test-time scaling inference-time scaling best of n theory data quality
Test-time scaling aims to improve language model performance by leveraging additional compute during inference.
Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier *imperfection* affects performance, a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
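For concreteness, here is a minimal sketch of the two verifier-based procedures the abstract analyzes; `generator`, `verifier`, and the acceptance `threshold` are hypothetical stand-ins, and the paper's theoretical setup is not reproduced.

```python
import numpy as np

def best_of_n(prompt, generator, verifier, n=8):
    """Best-of-N: draw n candidates, return the one the verifier scores highest."""
    candidates = [generator(prompt) for _ in range(n)]
    scores = [verifier(prompt, c) for c in candidates]
    return candidates[int(np.argmax(scores))]

def rejection_sampling(prompt, generator, verifier, threshold=0.5, max_tries=8):
    """Rejection sampling: keep drawing until the verifier accepts; otherwise
    fall back to the best-scoring candidate seen so far."""
    best, best_score = None, -np.inf
    for _ in range(max_tries):
        c = generator(prompt)
        s = verifier(prompt, c)
        if s >= threshold:
            return c
        if s > best_score:
            best, best_score = c, s
    return best
```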
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
664. SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery
💬 17
⭐ 6.50
📁 applications to computer vision, audio, language, and other modalities
🏷️ Satellite-to-Ground View Synthesis Cross-View Image Translation Diffusion-based Scene Generation
Generating multiview-consistent $360^\circ$ ground-level scenes from satellite imagery is a challenging task with broad applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view panoramas, often relying on auxiliary inputs like height maps or handcrafted projections, and struggle to produce multiview consistent sequences. In this paper, we propose SatDreamer360, a framework that generates geometrically consistent multi-view ground-level panoramas from a single satellite image, given a predefined pose trajectory. To address the large viewpoint discrepancy between ground and satellite images, we adopt a triplane representation to encode scene features and design a ray-based pixel attention mechanism that retrieves view-specific features from the triplane. To maintain multi-frame consistency, we introduce a panoramic epipolar-constrained attention module that aligns features across frames based on known relative poses.
To support the evaluation, we introduce VIGOR++, a large-scale dataset for generating multi-view ground panoramas from a satellite image, by augmenting the original VIGOR dataset with more ground-view images and their pose annotations. Experiments show that SatDreamer360 outperforms existing methods in both satellite-to-ground alignment and multiview consistency.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
665. Neural Collapse in Multi-Task Learning
💬 4
⭐ 6.50
📁 optimization
🏷️ Neural Collapse
Neural collapse (NC) plays a key role in understanding deep neural networks. However, existing empirical and theoretical studies of NC focus on a single task. This paper studies neural collapse in multi-task learning. We consider two standard feature-based multi-task learning scenarios: Single-Source Multi-Task Classification (SSMTC) and Multi-Source Multi-Task Classification (MSMTC). Interestingly, we find that the task-specific linear classifiers and features converge to the Simplex Equiangular Tight Frame (ETF) in the MSMTC setting. In the SSMTC setting, each task-specific linear classifier converges to a task-specific ETF, and these task-specific ETFs are mutually orthogonal. Moreover, the shared features across tasks converge to the scaled sum of the weight vectors associated with the task-specific labels in each task's classifier. We also provide theoretical guarantees for our empirical findings. Through detailed analysis, we uncover the mechanism of MTL by which each task learns task-specific latent features that together form the shared features. Moreover, we reveal an inductive bias in MTL: task correlation reconfigures the geometry of task-specific classifiers and promotes alignment among the features learned by each task.
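For reference, the standard definition of a $K$-class Simplex Equiangular Tight Frame from the neural-collapse literature (the multi-task geometry established in the paper is not restated here):

```latex
% Standard K-class simplex ETF: columns have equal norm and pairwise
% inner products equal to -1/(K-1).
\[
\mathbf{M} \;=\; \sqrt{\tfrac{K}{K-1}}\;\mathbf{P}\Big(\mathbf{I}_K - \tfrac{1}{K}\,\mathbf{1}_K\mathbf{1}_K^{\top}\Big),
\qquad \mathbf{P}\in\mathbb{R}^{d\times K},\;\; \mathbf{P}^{\top}\mathbf{P}=\mathbf{I}_K .
\]
```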
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
666. MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
💬 15
⭐ 6.50
📁 datasets and benchmarks
🏷️ Spatial Intelligence MLLM VLM VQA Benchmark 3D Understanding
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a stepwise reasoning process. We conduct extensive experiments and evaluate 37 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30\% accuracy and OpenAI's GPT-5 reasoning model reaches 40\%, while humans score 97\%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering insights for advancing spatial intelligence.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
667. Scalable Oversight via Partitioned Human Supervision
💬 16
⭐ 6.50
📁 foundation or frontier models, including LLMs
🏷️ LLM scalable oversight weak supervision agentic systems
As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.
Our focus is on tasks that require deep knowledge and skills of multiple domains. Unfortunately, even the best human experts are knowledgeable only in a single narrow area, and will not be able to evaluate the correctness of advanced AI systems on such superhuman tasks.
However, based on their narrow expertise, humans may provide a weak signal, i.e., a *complementary label* indicating an option that is incorrect. For example, a cardiologist could state that ``this is not related to cardiology,'' even if they cannot identify the true disease.
Based on this weak signal, we propose a scalable oversight framework that enables us to evaluate frontier AI systems without the need to prepare the ground truth.
We derive an *unbiased* estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators to combine scarce ordinary labels with abundant complementary labels.
We provide finite-sample deviation guarantees for both complementary-only and the mixed estimators.
Empirically, we show that we can evaluate the output of large language models without the ground truth, if we have complementary labels.
We further show that such weak signals can also be used to train AI systems: we demonstrate how to automatically design an agentic AI system that performs better under this partitioned human supervision.
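The abstract does not spell out the estimator; a minimal sketch of one such construction, under the simplifying assumption that each complementary label is drawn uniformly at random among the incorrect options, is shown below (the paper's actual estimator and variance analysis may differ).

```python
import numpy as np

def accuracy_from_complementary(preds, comp_labels, num_options):
    """Estimate top-1 accuracy when raters only mark one option as incorrect.

    Illustrative sketch, not the paper's estimator. If the complementary label
    is uniform over the K-1 wrong options, then
    P(comp == pred) = (1 - acc) / (K - 1), which inverts to an unbiased
    estimate of acc.
    """
    preds = np.asarray(preds)
    comp_labels = np.asarray(comp_labels)
    hit_rate = np.mean(preds == comp_labels)  # flagged-wrong option equals the prediction
    return 1.0 - (num_options - 1) * hit_rate
```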
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
668. Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
💬 11
⭐ 6.50
📁 alignment, fairness, safety, privacy, and societal considerations
🏷️ anthropomorphism human-AI interaction social AI multi-turn evaluation
The tendency of users to anthropomorphise large language models (LLMs) is of growing societal interest. Here, we present AnthroBench: a novel empirical method and tool for evaluating anthropomorphic LLM behaviours in realistic settings. Our work introduces three key advances: first, we develop a multi-turn evaluation of 14 distinct anthropomorphic behaviours, moving beyond single-turn assessment. Second, we present a scalable, automated approach by leveraging simulations of user interactions, enabling efficient and reproducible assessment. Third, we conduct an interactive, large-scale human subject study (N=1101) to empirically validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all evaluated LLMs exhibit similar behaviours, primarily characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use. Crucially, we observe that the majority of these anthropomorphic behaviours first occur only after multiple turns, underscoring the necessity of multi-turn evaluations for understanding complex social phenomena in human-AI interaction. Our work provides a robust empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
669. FakeXplain: AI-Generated Images Detection via Human-Aligned Grounded Reasoning
💬 4
⭐ 6.50
📁 foundation or frontier models, including LLMs
🏷️ Vision Language Models Image Forensics AIGC Detection
The rapid rise of image generation calls for detection methods that are both interpretable and reliable. Existing approaches, though accurate, act as black boxes and fail to generalize to out-of-distribution data, while multi-modal large language models (MLLMs) provide reasoning ability but often hallucinate. To address these issues, we construct the FakeXplained dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, forming the basis for human-aligned, visually grounded reasoning. Leveraging FakeXplained, we develop FakeXplainer, which fine-tunes MLLMs with a progressive training pipeline, enabling accurate detection, artifact localization, and coherent textual explanations. Extensive experiments show that FakeXplainer not only sets a new state-of-the-art in detection and localization accuracy (98.2% accuracy, 36.0% IoU), but also demonstrates strong robustness and out-of-distribution generalization, uniquely delivering spatially grounded, human-aligned rationales.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
670. RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
💬 18
⭐ 6.50
📁 alignment, fairness, safety, privacy, and societal considerations
🏷️ Image Manipulation Localization; Video Manipulation Localization
Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two main issues: resolution diversity, where resizing or padding distorts forensic traces and reduces efficiency, and the modality gap, as images and videos often require separate models. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens, which propagate structured context through a global-local relay attention (GLRA) mechanism. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior methods that rely on uniform resizing or sparse attention, RelayFormer naturally scales to arbitrary resolutions and video sequences without excessive overhead. Experiments across diverse benchmarks demonstrate that RelayFormer achieves state-of-the-art performance with notable efficiency, combining resolution adaptivity without interpolation or excessive padding, unified modeling for both images and videos, and a strong balance between accuracy and computational cost.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
671. Frozen Priors, Fluid Forecasts: Prequential Uncertainty for Low-Data Deployment with Pretrained Generative Models
💬 13
⭐ 6.50
📁 probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
🏷️ predictive uncertainty quantification prequential inference measure-valued martingales frozen generative models
Deploying ML systems with only a few real samples makes operational metrics (such as alert rates or mean scores) highly unstable. Existing uncertainty quantification (UQ) methods fail here: frequentist intervals ignore the deployed predictive rule, Bayesian posteriors assume continual refitting, and conformal methods offer per-example rather than long-run guarantees. We introduce a forecast-first UQ framework that blends the empirical distribution with a frozen pretrained generator using a unique Dirichlet schedule, ensuring time-consistent forecasts. Uncertainty is quantified via martingale posteriors: a lightweight, likelihood-free resampling method that simulates future forecasts under the deployed rule, yielding sharp, well-calibrated intervals for both current and long-run metrics without retraining or density evaluation. A single hyperparameter, set by a small-$n$ minimax criterion, balances sampling variance and model--data mismatch; for bounded scores, we provide finite-time drift guarantees. We also show how this framework informs optimal retraining decisions. Applicable off-the-shelf to frozen generators (flows, diffusion, autoregressive models, GANs) and linear metrics (means, tails, NLL), it outperforms bootstrap baselines across vision and language benchmarks (WikiText-2, CIFAR-10, and SVHN datasets); e.g., it achieves $\sim$90\% coverage on GPT-2 with 20 samples vs.\ 37\% for bootstrap. Importantly, our uncertainty estimates are operational under the deployed forecasting rule agnostic of the population parameters, affording practicable estimators for deployment in real world settings.
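As a rough illustration of the predictive-resampling idea behind martingale posteriors, the sketch below blends the empirical distribution with samples from a frozen generator via a Dirichlet-style weight and simulates future forecasts of a mean metric; the weights, horizon, and `sample_from_generator` interface are assumptions, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def martingale_posterior_mean(scores, sample_from_generator, alpha=5.0,
                              horizon=500, n_paths=200):
    """Illustrative predictive-resampling sketch for a mean metric.

    With probability alpha / (alpha + n) draw a score from the frozen generator,
    otherwise resample an observed score, append it, and repeat to `horizon`.
    The spread of the resulting means gives an uncertainty interval for the
    long-run metric under the deployed rule.
    """
    estimates = []
    for _ in range(n_paths):
        data = list(scores)
        while len(data) < horizon:
            n = len(data)
            if rng.random() < alpha / (alpha + n):
                data.append(sample_from_generator())
            else:
                data.append(data[rng.integers(n)])
        estimates.append(np.mean(data))
    lo, hi = np.percentile(estimates, [5, 95])
    return lo, hi
```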
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
672. Convergence of Muon with Newton-Schulz
💬 11
⭐ 6.50
📁 optimization
🏷️ Muon Newton-Schulz Orthogonalization Nonconvex Optimization
We analyze Muon as originally proposed and used in practice, i.e., using momentum orthogonalization with a few Newton-Schulz steps. Prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor that depends on the number of Newton-Schulz steps $q$. We further analyze this constant factor and prove that it converges to 1 doubly exponentially in $q$ and improves with $\kappa$, the degree of the polynomial used in each Newton-Schulz step to approximate the orthogonalization.
We also prove that Muon removes the typical square-root-of-rank loss compared to its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior in much less wall-clock time, and quantify how much orthogonalizing the momentum matrix via Newton-Schulz gains over the vector-based optimizer. Overall, our theory justifies the practical Newton-Schulz design of Muon, narrowing its practice-theory gap.
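For context, a minimal cubic Newton-Schulz orthogonalization is sketched below; practical Muon implementations use a tuned higher-degree polynomial per step, which is precisely the constant-factor question the analysis addresses.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximate the orthogonal (polar) factor of M with cubic Newton-Schulz.

    A minimal sketch: the textbook cubic update X <- 1.5 X - 0.5 X X^T X,
    applied to a momentum matrix scaled so the iteration converges.
    """
    X = M / (M.norm() + eps)           # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                         # work on the wider orientation
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if transposed else X
```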
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
673. Reliable Weak-to-Strong Monitoring of LLM Agents
💬 16
⭐ 6.50
📁 alignment, fairness, safety, privacy, and societal considerations
🏷️ Agent Safety Chain-of-Thought Monitoring Large Language Model
We stress test monitoring systems for detecting covert misbehavior in LLM agents (e.g., secretly exfiltrating data). We propose a monitor red teaming (MRT) workflow that varies agent and monitor awareness, adversarial evasion strategies, and evaluation across tool-calling (SHADE-Arena) and computer-use (CUA-SHADE-Arena) environments. We benchmark standard monitor scaffoldings and introduce a hybrid hierarchical--sequential design. Our experiments yield three findings. First, agent awareness dominates monitor awareness: agents that know they are monitored substantially degrade detection, while increasing monitor awareness helps less than expected. Second, monitor scaffolding matters: our hybrid design consistently outperforms baselines and enables weaker monitors to oversee stronger agents (a weak-to-strong effect). Third, targeted human oversight is key: escalating only pre-flagged cases improves TPR by 15% at FPR=0.01. Our work positions MRT as a standard workflow for stress-testing oversight, revealing robustness gaps in both LLM- and human-based monitoring. We release code, data, and logs to support further research.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
674. Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models
💬 13
⭐ 6.50
📁 generative models
🏷️ Diffusion Model Reinforcement Learning Multi-Objective Finetuning
Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, it can generate images aligned with any user-specified linear combination of rewards and regularization, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with three algorithms: DB-MPA for multi-reward alignment, DB-KLA for KL regularization control, and DB-MPA-LS for approximating DB-MPA without additional inference cost. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference-time.
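A minimal sketch of the blending idea at inference time, assuming each fine-tuned model exposes a standard noise-prediction interface; the paper's DB-MPA/DB-KLA algorithms define the combination along the reverse process more carefully.

```python
import torch

def blended_eps(models, weights, x_t, t, prompt_emb):
    """Blend the noise predictions of several reward-specific fine-tuned
    diffusion models with user-chosen weights at inference time.

    Illustrative sketch only; `models` are hypothetical epsilon-prediction
    networks sharing one interface.
    """
    weights = torch.tensor(weights)
    weights = weights / weights.sum()          # user-specified preference mix
    eps = torch.zeros_like(x_t)
    for w, model in zip(weights, models):
        eps = eps + w * model(x_t, t, prompt_emb)
    return eps
```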
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
675. Optimistic Task Inference for Behavior Foundation Models
💬 13
⭐ 6.50
📁 reinforcement learning
🏷️ Behavior Foundation Models Zero-Shot Reinforcement Learning Deep Reinforcement Learning Fast Adaptation
Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards or significant labeling effort. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
676. SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
💬 19
⭐ 6.50
📁 generative models
🏷️ Video Diffusion Model
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720×1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1\% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16× faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4× speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.
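The constant-memory property comes from the recurrent form of linear attention; the generic sketch below illustrates it for a single head and is not SANA-Video's block-wise design.

```python
import torch

def linear_attention_stream(qs, ks, vs, eps=1e-6):
    """Causal linear attention in recurrent form with a constant-size state.

    Instead of caching all past keys/values, keep running sums
    S = sum_j k_j v_j^T and z = sum_j k_j, so memory stays fixed regardless of
    sequence length. Assumes a non-negative feature map (e.g., elu+1) has
    already been applied to qs/ks. Shapes: qs, ks (seq, d); vs (seq, d_v).
    """
    S = torch.zeros(ks.shape[-1], vs.shape[-1])
    z = torch.zeros(ks.shape[-1])
    outputs = []
    for q, k, v in zip(qs, ks, vs):        # one token at a time
        S = S + torch.outer(k, v)
        z = z + k
        outputs.append((q @ S) / (q @ z + eps))
    return torch.stack(outputs)
```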
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
677. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
💬 12
⭐ 6.50
📁 datasets and benchmarks
🏷️ LLM based Agent Evaluation Deep Research
Deep Research Agents (DRAs) are emerging as one of the most practical classes of LLM-based agents. Given an open-ended research task, they find, analyze, and synthesize large numbers of online sources to produce a comprehensive report at the level of a research analyst. This can compress hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we introduce DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. To evaluate DRAs comprehensively, we propose two complementary and fully automated methodologies. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The second evaluates a DRA's information-retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. By conducting extensive human consistency experiments, we demonstrate that our evaluation methods are highly aligned with expert judges and faithfully reflect human judgments of quality differences among DRA-generated content. We are open-sourcing DeepResearch Bench and key components of these frameworks to accelerate the development of practical LLM-based agents.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
678. WorldGym: World Model as An Environment for Policy Evaluation
💬 4
⭐ 6.50
📁 reinforcement learning
🏷️ World model video generation policy evaluation generative simulators
Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model which serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model highly correlate with real-world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Due to requiring only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.
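A schematic of the evaluation loop the abstract describes, with hypothetical `world_model`, `policy`, and `reward_vlm` interfaces standing in for the actual components:

```python
def evaluate_policy_in_world_model(policy, world_model, reward_vlm,
                                   start_frames, horizon=100, rollouts=10):
    """Monte Carlo policy evaluation inside a learned world model.

    `world_model.step(frame, action)` autoregressively predicts the next frame
    and `reward_vlm(frames)` scores task success (1 or 0); both interfaces are
    hypothetical stand-ins for the components described in the abstract.
    """
    successes = 0
    for frame0 in start_frames:
        for _ in range(rollouts):
            frame, frames = frame0, [frame0]
            for _ in range(horizon):
                action = policy(frame)
                frame = world_model.step(frame, action)
                frames.append(frame)
            successes += reward_vlm(frames)
    return successes / (len(start_frames) * rollouts)
```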
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
679. Unified Vision-Language-Action Model
💬 4
⭐ 6.50
📁 applications to robotics, autonomy, planning
🏷️ world model robotics vision-language-action model
Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation.
However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This tokenized formulation naturally supports flexible multimodal task learning, particularly from large-scale video data, and further demonstrates that generative vision supervision can significantly enhance visual understanding. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, substantially outperforming prior methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing π₀-FAST's 85.5%. We further demonstrate its broad applicability through experiments on real-world ALOHA manipulation tasks and autonomous driving scenarios.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
680. Token-Importance Guided Direct Preference Optimization
💬 20
⭐ 6.50
📁 foundation or frontier models, including LLMs
🏷️ LLMs RLHF DPO Human Preference Alignment Token-Importance Triplet Loss
Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise and overlook the differential importance of individual tokens. Existing token-level approaches often rely on probability prediction or simplistic weighting schemes to obtain token importance, which still cannot fully address these issues. To solve this problem, we propose the Token-Importance Guided Direct Preference Optimization (TI-DPO), a framework that achieves fine-grained semantic control through two synergistic innovations.
First, we propose a novel hybrid weighting mechanism that combines gradient attribution with a Gaussian prior, ensuring both the accuracy and robustness of token importance scores. Second, we employ a triplet loss to provide structured guidance for the optimization, explicitly guiding model outputs to approach preferred responses and diverge from non-preferred ones. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
681. ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
💬 17
⭐ 6.50
📁 applications to computer vision, audio, language, and other modalities
🏷️ LLM Agents Memory Mechanism Reasoning Test-Time Scaling
With the growing adoption of large language model (LLM) agents in persistent, real-world roles, they naturally encounter continuous streams of tasks and interactions. A key limitation, however, is their failure to learn from this accumulated experience, forcing them to discard valuable insights and repeat past errors. Unlike prior works that primarily store raw experience or successful routines, we propose ReasoningBank, a novel memory framework that allows an agent to self-curate generalizable reasoning strategies from both its successful and failed experiences for future leverage. This mechanism enables agents to generalize across tasks and become more capable over time. To accelerate and diversify this test-time learning process, we further propose memory-aware test-time scaling (MaTTS), which leverages a powerful synergy between memory and test-time scaling. On one hand, relevant memory from ReasoningBank guides the scaling process toward more effective exploration and improved reliability. On the other, scaling, in both parallel and sequential settings, generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. Experiments on web browsing and software engineering tasks show that ReasoningBank consistently outperforms existing memory mechanisms in both effectiveness and efficiency, with MaTTS further amplifying the gains. These findings position memory-driven experience as a new dimension of test-time scaling, where emergent behaviors naturally arise and agents acquire self-evolving capabilities.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
682. CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
💬 9
⭐ 6.50
📁 datasets and benchmarks
🏷️ large language model statistical mechanics benchmark evaluation numerical methods scientific problem solving condensed matter physics quantum physics
Large language models (LLMs) have demonstrated remarkable progress in coding and mathematical problem-solving; however, evaluation on advanced research-level problems in the hard sciences remains scarce.
To fill this gap, we present CMT-Benchmark, a dataset of 50 original problems covering condensed matter theory (CMT) at the level of an expert researcher. The topics cover analytical and computational approaches commonly used in quantum many-body physics as well as classical statistical mechanics. This dataset was designed and verified by a panel of expert researchers from around the world.
We built the dataset through a collaborative environment that challenges the panel to write and refine difficult problems that the panel would like their research assistants to be able to solve, with topics including Hartree-Fock mean-field theory, exact diagonalization, quantum Monte Carlo, density matrix renormalization group, quantum statistical mechanics, classical statistical mechanics, and model building. We evaluate different LLMs by programmatically checking LLM-generated solutions against expert-supplied ground truth.
For this, we developed machine-grading mechanisms that are suitable for advanced physics research problems.
For example, we handle non-commuting operators that are essential for quantum many-body problems by symbolic manipulation and normal ordering.
Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes.
While the highest-performing model, GPT-5, correctly solves 30\% of the problems, average performance across 17 models (GPT, Gemini, Claude, DeepSeek, and Llama classes) is only 11.4$\pm$2.1\%. Moreover, our benchmark contains 18 problems that *not a single one* of the 17 models can correctly solve, and 26 problems that are solved by *at most* one model.
These currently unsolvable problems span the fields of Quantum Monte Carlo, Variational Monte Carlo, and Density Matrix Renormalization Group.
The answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe that this benchmark set provides valuable guidance for the future development of language models, aiming to achieve the goal of AI research assistants and tutors.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
683. MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science
💬 21
⭐ 6.50
📁 datasets and benchmarks
🏷️ Medical Reasoning LLM Agent Code Generation
We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
684. ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
💬 21
⭐ 6.50
📁 infrastructure, software libraries, hardware, systems, etc.
🏷️ Pruning Model Compression Block Coordinate Descent
Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block-diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to that of state-of-the-art pruning algorithms. Experiments on the Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
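For reference, the generic 2:4 projection that produces the sparse core looks like the sketch below; ARMOR's contribution is the learned block-diagonal wrappers and the block coordinate descent that chooses both, which are not shown.

```python
import torch

def project_2_to_4(W):
    """Project a weight matrix onto the 2:4 sparsity pattern by keeping the two
    largest-magnitude entries in every group of four along the input dimension.

    Generic 2:4 projection for illustration; not ARMOR's full algorithm.
    """
    out_dim, in_dim = W.shape
    assert in_dim % 4 == 0
    groups = W.reshape(out_dim, in_dim // 4, 4)
    # zero the 2 smallest-magnitude entries in each group of 4
    idx = groups.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, idx, 0.0)
    return (groups * mask).reshape(out_dim, in_dim)
```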
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
685. Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
💬 15
⭐ 6.50
📁 alignment, fairness, safety, privacy, and societal considerations
🏷️ Omni-Modal Models Reward Models Alignment
Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
686. Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
💬 10
⭐ 6.50
📁 foundation or frontier models, including LLMs
🏷️ Scaling Laws Model Architecture Inference-Efficient
Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how three key architectural factors influence both inference cost and accuracy: hidden size, the allocation of parameters between MLP and attention (the mlp-to-attention ratio), and grouped-query attention (GQA). We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1\% higher accuracy and 42\% greater inference throughput compared to LLaMA-3.2.
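As a reminder of what gets conditioned, a minimal fit of the unconditional Chinchilla backbone is sketched below; the architectural covariates (hidden size, mlp-to-attention ratio, GQA) that the paper adds are omitted, and the initial-guess values are placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, B, alpha, beta):
    """L(N, D) = E + A / N^alpha + B / D^beta (Chinchilla form).

    The paper conditions these terms on architecture; only the unconditional
    backbone is shown here.
    """
    N, D = X
    return E + A / N**alpha + B / D**beta

def fit_scaling_law(N, D, losses):
    """N: parameter counts, D: training tokens, losses: measured final losses."""
    p0 = [1.5, 400.0, 400.0, 0.3, 0.3]   # placeholder initial guesses
    popt, _ = curve_fit(chinchilla_loss, (np.asarray(N), np.asarray(D)),
                        np.asarray(losses), p0=p0, maxfev=20000)
    return popt
```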
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
687. Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression
💬 16
⭐ 6.50
📁 learning theory
🏷️ Deep learning theory Multi-epoch training Data-reuse Optimization Scaling law Large language model
Large Language Model (LLM) training often processes vast text corpora in a single pass, leaving much available data underutilized. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws. Concretely, given a $K$-epoch training on $N$ samples, how many fresh samples would one-pass training require to match the same performance? We quantify this using the \textit{effective reuse rate} of the data, $E(K, N)$, which we define as the factor by which the dataset must grow under one-pass training to match the test loss of multi-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings complement a recent empirical study by [Muennighoff et al. (2023)](https://arxiv.org/abs/2305.16264), which found that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, \textit{i.e.}, $E(K, N) \approx K$ for $K \le 4$ in our notation.
Supported by further empirical validation with LLMs, our results reveal how this behavior depends on the underlying data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
688. WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
💬 11
⭐ 6.50
📁 datasets and benchmarks
🏷️ large language models evaluation LLM-as-a-judge benchmark
The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
689. Contextual Similarity Distillation: Ensemble Uncertainties with a Single Model
💬 12
⭐ 6.50
📁 reinforcement learning
🏷️ Uncertainty Quantification Epistemic Uncertainty Reinforcement Learning Deep Ensembles Exploration Neural Tangent Kernel
Uncertainty quantification is a critical aspect of reinforcement learning and deep learning, with numerous applications ranging from efficient exploration and stable offline reinforcement learning to outlier detection in medical diagnostics. The scale of modern neural networks, however, complicates the use of many theoretically well-motivated approaches such as full Bayesian inference. Approximate methods like deep ensembles can provide reliable uncertainty estimates but still remain computationally expensive. In this work, we propose contextual similarity distillation, a novel approach that explicitly estimates the variance of an ensemble of deep neural networks with a single model, without ever learning or evaluating such an ensemble in the first place. Our method builds on the predictable learning dynamics of wide neural networks, governed by the neural tangent kernel, to derive an efficient approximation of the predictive variance of an infinite ensemble. Specifically, we reinterpret the computation of ensemble variance as a supervised regression problem with kernel similarities as regression targets. The resulting model can estimate predictive variance at inference time with a single forward pass, and can make use of unlabeled target-domain data or data augmentations to refine its uncertainty estimates. We empirically validate our method across a variety of out-of-distribution detection benchmarks and sparse-reward reinforcement learning environments. We find that our single-model method performs competitively and sometimes superior to ensemble-based baselines and serves as a reliable signal for efficient exploration. These results, we believe, position contextual similarity distillation as a principled and scalable alternative for uncertainty quantification in reinforcement learning and general deep learning.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
690. Attend to the Active: Structure-Aware Dynamic Attention in LLMs for Compositional Instruction Following
💬 15
⭐ 6.50
📁 foundation or frontier models, including LLMs
🏷️ Instruction Following; Dynamic Attention; Large Language Models
Large language models (LLMs) have exhibited strong instruction-following capabilities; however, they often struggle with compositional instructions involving multiple interleaved yet logically independent sub-tasks. These sub-tasks are typically organized in mutually exclusive structures, such as branching, chaining, or paralleling, where only one sub-task should be active at each generation step, while the others remain dormant. Despite their inactivity, dormant sub-tasks can inadvertently attract the model's attention due to structural entanglement within the input context or intermediate representations, leading to interference that compromises output fidelity. To address this challenge, we propose ATA, a structure-aware dynamic attention mechanism grounded in compositional structures, which dynamically identifies the active sub-task during generation while suppressing attention to inactive ones. By precisely steering the model's focus, ATA mitigates interference and explicitly enhances model adherence to the active sub-task. Importantly, ATA operates within a single forward pass without requiring parameter updates. Extensive experiments show that ATA consistently enhances LLMs' instruction-following ability across various compositional structures, effectively mitigating attention distraction and demonstrating a strong generalization ability.
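A simplified sketch of suppressing attention to dormant sub-task spans is shown below; how ATA identifies the active sub-task at each generation step is the paper's contribution and is not reproduced.

```python
import torch

def mask_dormant_subtasks(attn_scores, subtask_spans, active_idx):
    """Suppress attention to tokens belonging to inactive sub-tasks.

    `attn_scores` has shape (..., seq_len); `subtask_spans` is a list of
    (start, end) token ranges, one per sub-task, and `active_idx` marks the
    sub-task that should stay visible. Illustrative only.
    """
    masked = attn_scores.clone()
    for i, (start, end) in enumerate(subtask_spans):
        if i != active_idx:
            masked[..., start:end] = float("-inf")   # dropped after softmax
    return masked
```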
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
691. Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms
💬 15
⭐ 6.50
📁 learning theory
🏷️ high dimensional statistics algorithmic statistics computational learning theory coarse observations mean estimation linear regression friction
Coarse data arise when learners observe only partial information about samples; namely, a set containing the sample rather than its exact value. This occurs naturally through measurement rounding, sensor limitations, and lag in economic systems. We study Gaussian mean estimation from coarse data, where each true sample $x$ is drawn from a $d$-dimensional Gaussian distribution with identity covariance, but is revealed only through the set of a partition containing $x$. When the coarse samples, roughly speaking, have ``low'' information, the mean cannot be uniquely recovered from observed samples (i.e., the problem is not *identifiable*). Recent work by Fotakis et al. (2021) established that *sample*-efficient mean estimation is possible when the unknown mean is *identifiable* and the partition consists of only *convex* sets. Moreover, they showed that without convexity, mean estimation becomes NP-hard. However, two fundamental questions remained open:
1. When is the mean identifiable under convex partitions?
2. Is *computationally* efficient estimation possible under identifiability and convex partitions?
This work resolves both questions. We provide a geometric characterization of when a convex partition is identifiable, showing it depends on whether the convex sets form ``slabs'' in a direction. Second, we give the first polynomial-time algorithm for finding $\varepsilon$-accurate estimates of the Gaussian mean given coarse samples from an unknown convex partition, matching the optimal $\widetilde{O}(d/\varepsilon^2)$ sample complexity. Our results have direct applications to robust machine learning, particularly robustness to observation rounding. As a concrete example, we derive a sample- and computationally- efficient algorithm for linear regression with market friction, a canonical problem in using ML in economics, where exact prices are unobserved and one only sees a range containing the price (Rosett, 1959).
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
692. On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs
💬 20
⭐ 6.50
📁 foundation or frontier models, including LLMs
🏷️ Configuration-aware optimization Pareto-based configuration search Quantization Fine-tuning
As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices have inherently heterogeneous capabilities, while performing configuration-wise fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adjusts the LoRA adapter to arbitrary quantization configurations (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model critically depends on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. However, constructing a high-quality configuration set is non-trivial. We therefore design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike the state-of-the-art methods that require fine-tuning a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance to those methods.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
693. HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
💬 15
⭐ 6.50
📁 datasets and benchmarks
🏷️ MLLMs Benchmark Dataset Humanities and Social Sciences
Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
694. Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
💬 5
⭐ 6.50
📁 optimization
🏷️ Optimization LLM training memory efficiency
Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper makes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations of practical efficient optimizers for LLMs, involving the careful selection of structural assumptions to balance generality and efficiency, and enhancing memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves better than 2x faster convergence over Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
695. Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking
💬 4
⭐ 6.50
📁 applications to computer vision, audio, language, and other modalities
🏷️ unified understanding and generation multimodal reasoning multimodal generation
Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new thinking paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLM architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. Code and model will be available.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
696. CL-DPS: A Contrastive Learning Approach to Blind Nonlinear Inverse Problem Solving via Diffusion Posterior Sampling
💬 12
⭐ 6.50
📁 applications to computer vision, audio, language, and other modalities
🏷️ Diffusion Models Blind Inverse Problems Contrastive Learning
Diffusion models (DMs) have recently become powerful priors for solving inverse problems. However, most work focuses on non-blind settings with known measurement operators, and existing DM-based blind solvers largely assume linear measurements, which limits practical applicability where operators are frequently nonlinear. We introduce CL-DPS, a contrastively trained likelihood for diffusion posterior sampling that requires no knowledge of the operator parameters at inference. To the best of our knowledge, CL-DPS is the first DM-based framework capable of solving blind nonlinear inverse problems. Our key idea is to train an auxiliary encoder offline, using a MoCo-style contrastive objective over randomized measurement operators, to learn a surrogate for the conditional likelihood $p(\boldsymbol{y} \mid \boldsymbol{x}_t)$. During sampling, we inject the surrogate's gradient as a guidance term along the reverse diffusion trajectory, which enables posterior sampling without estimating or inverting the forward operator. We further employ overlapping patch-wise inference to preserve fine structure and a lightweight color-consistency head to stabilize color statistics. The guidance is sampler-agnostic and pairs well with modern solvers (e.g., DPM-Solver++ (2M)). Extensive experiments show that CL-DPS effectively handles challenging nonlinear cases, such as rotational and zoom deblurring, where prior DM-based methods fail, while remaining competitive on standard linear benchmarks. Code: \url{https://anonymous.4open.science/r/CL-DPS-4F5D}.
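A minimal sketch of where the surrogate's gradient enters the reverse process, in the spirit of diffusion posterior sampling; the sampler update and the surrogate interface are hypothetical placeholders.

```python
import torch

def guided_reverse_step(x_t, t, measurement, diffusion_step, surrogate_loglik,
                        guidance_scale=1.0):
    """One reverse-diffusion step with gradient guidance from a learned
    surrogate likelihood.

    `diffusion_step(x_t, t)` is a hypothetical unconditional sampler update
    (e.g., a DDPM or DPM-Solver step) and `surrogate_loglik(y, x_t, t)` is the
    contrastively trained score described in the abstract; this sketch only
    shows where its gradient is injected.
    """
    x_t = x_t.detach().requires_grad_(True)
    log_p = surrogate_loglik(measurement, x_t, t)
    grad = torch.autograd.grad(log_p.sum(), x_t)[0]
    x_prev = diffusion_step(x_t, t)                 # unconditional reverse update
    return (x_prev + guidance_scale * grad).detach()
```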
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
697. DrugTrail: Explainable Drug Discovery via Structured Reasoning and Druggability-Tailored Preference Optimization
๐ฌ 15
โญ 6.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ LLM-based drug discovery Explainability Structured reasoning Druggability-tailored preference optimization
Machine learning promises to revolutionize drug discovery, but its "black-box" nature and narrow focus limit adoption by experts. While Large Language Models (LLMs) offer a path forward with their broad knowledge and interactivity, existing methods remain data-intensive and lack transparent reasoning. To address these issues, we present DrugTrail, an LLM-based framework for explainable drug discovery that integrates structured reasoning trajectories with a Druggability-Tailored Preference Optimization (DTPO) strategy. It not only introduces structured reasoning traces that articulate the "how" and "why" behind its conclusions but also uses them to guide task-specific reasoning pathways within the LLM's vast knowledge space, thereby enhancing the interpretability and reliability of its final outputs. Furthermore, because optimizing for binding affinity alone does not equate to optimizing for druggability, DTPO explicitly moves beyond single-metric optimization and opens up a broader search space that balances affinity with other essential factors. Extensive experiments demonstrate the effectiveness of our approach and its generalizability to a wider range of biomolecular optimization domains, bridging the gap between LLM reasoning capabilities and trustworthy AI-assisted drug discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
698. CardioComposer: Leveraging Differentiable Geometry for Compositional Control of Anatomical Diffusion Models
๐ฌ 9
โญ 6.50
๐ generative models
๐ท๏ธ Diffusion Models Computational Geometry Anatomy Digital Twins Diffusion Guidance
Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllability and realism. We propose CardioComposer: a programmable, inference-time framework for generating multi-class anatomical label maps based on interpretable ellipsoidal primitives. These primitives represent geometric attributes such as the size, shape, and position of discrete substructures. We specifically develop differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during diffusion model sampling. We demonstrate that these losses can constrain individual geometric attributes in a disentangled manner and provide compositional control over multiple substructures. Finally, we show that our method is compatible with a wide array of anatomical systems containing non-convex substructures, spanning cardiac, vascular, and skeletal organs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
699. Any-Order Flexible Length Masked Diffusion
๐ฌ 16
โญ 6.50
๐ generative models
๐ท๏ธ Diffusion Model Generative Model Discrete Diffusion Stochastic Interpolant
Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. MDMs generate sequences in an any-order, parallel fashion, enabling fast inference and strong performance on non-causal tasks. However, a crucial limitation is that they do not support token insertions and are thus limited to *fixed-length* generations. To this end, we introduce **Flex**ible **M**asked **D**iffusion **M**odels (FlexMDMs), a discrete diffusion paradigm that can simultaneously model sequences of flexible length while provably retaining MDMs' flexibility of any-order inference. Grounded in an extension of the stochastic interpolant framework, FlexMDMs generate sequences by inserting mask tokens and unmasking them. Empirically, we show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity. On a synthetic maze planning task, they achieve $\approx$ 60\% higher success rate than MDM baselines. Finally, we show pretrained MDMs can easily be *retrofitted* into FlexMDMs: on 16 H100s, it takes only three days to fine-tune LLaDA-8B into a FlexMDM, achieving superior performance on math (GSM8K, 58\%$\to$67\%) and code infilling (52\%$\to$65\%).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
700. SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors
๐ฌ 10
โญ 6.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian splatting 2DGS feed forward reconstruction
Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization.
Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitives.
However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views.
To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitives, which provide stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures.
Furthermore, we introduce High-Resolution Rendering Consistency (HRRC), a new evaluation metric designed to evaluate high-resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high-fidelity 3D reconstruction from sparse inputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
701. XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning
๐ฌ 13
โญ 6.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning
Sample efficiency is a central property of effective deep reinforcement learning algorithms. Recent work has improved this through added complexity, such as larger models, exotic network architectures, and more complex algorithms, which are typically motivated purely by empirical performance. We take a more principled approach by focusing on the optimization landscape of the critic network. Using the eigenspectrum and condition number of the criticโs Hessian, we systematically investigate the impact of common architectural design decisions on training dynamics. Our analysis reveals that a novel combination of batch normalization (BN), weight normalization (WN), and a distributional cross-entropy (CE) loss produces condition numbers orders of magnitude smaller than baselines. This combination also naturally bounds gradient norms, a property critical for maintaining a stable effective learning rate under non-stationary targets and bootstrapping. Based on these insights, we introduce XQC: a well-motivated, sample-efficient deep actor-critic algorithm built upon soft actor-critic that embodies these optimization-aware principles. We achieve state-of-the-art sample efficiency across 55 proprioception and 15 vision-based continuous control tasks, all while using significantly fewer parameters than competing methods.
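To make the ingredient list concrete, here is a minimal sketch of a critic that combines batch normalization, weight normalization, and a categorical cross-entropy value head. The layer sizes, bin range, and target projection are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: a distributional Q-critic with BatchNorm, weight normalization, and a
# cross-entropy loss over value bins.
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class DistributionalCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, n_bins=101, v_min=-100.0, v_max=100.0):
        super().__init__()
        self.register_buffer("bins", torch.linspace(v_min, v_max, n_bins))
        self.net = nn.Sequential(
            weight_norm(nn.Linear(obs_dim + act_dim, hidden)), nn.BatchNorm1d(hidden), nn.ReLU(),
            weight_norm(nn.Linear(hidden, hidden)), nn.BatchNorm1d(hidden), nn.ReLU(),
            weight_norm(nn.Linear(hidden, n_bins)),
        )

    def forward(self, obs, act):
        logits = self.net(torch.cat([obs, act], dim=-1))          # [B, n_bins]
        q_value = (logits.softmax(-1) * self.bins).sum(-1)        # expected value of the categorical
        return logits, q_value

def critic_loss(logits, target_probs):
    # cross-entropy between the predicted value distribution and a projected TD target
    return -(target_probs * logits.log_softmax(-1)).sum(-1).mean()
```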
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
702. ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
๐ฌ 10
โญ 6.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ ai for math proof simplification
Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods (mainly agentic scaffolding with off-the-shelf LLMs) struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 50% on Seed-Prover's IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
703. Real-Time Reasoning Agents in Evolving Environments
๐ฌ 17
โญ 6.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Real-time reasoning Language model agents Parallel reasoning architecture
Agents in the real world must make not only logical but also *timely* judgments. This requires continuous awareness of the dynamic environment: hazards emerge, opportunities arise, and other agents act, while the agent's reasoning is still unfolding. Despite advances in language model reasoning, existing approaches fail to account for this dynamic nature. We introduce *real-time reasoning* as a new problem formulation for agents in evolving environments and build **Real-Time Reasoning Gym** to demonstrate it. We study two paradigms for deploying language models in agents: (1) **Reactive agents**, which employ language models with *bounded reasoning computation for rapid responses*. (2) **Planning agents**, which allow *extended reasoning computation for complex problems*. Our experiments show that even state-of-the-art models struggle with making logical and timely judgments in either paradigm. To address this limitation, we propose **AgileThinker**, which simultaneously engages *both reasoning paradigms*. **AgileThinker** consistently outperforms agents engaging only one reasoning paradigm as task difficulty and time pressure rise, effectively balancing reasoning depth and response latency. Our work establishes real-time reasoning as a critical testbed for developing practical agents and provides a foundation for research in temporally constrained AI systems, highlighting a path toward real-time capable agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
704. EXPO: Stable Reinforcement Learning with Expressive Policies
๐ฌ 9
โญ 6.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Imitation Learning
We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods, both when fine-tuning a pretrained policy given offline data and when leveraging offline data to train online.
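A minimal sketch of the on-the-fly action selection described above: sample candidates from an expressive base policy, perturb them with a lightweight Gaussian edit policy, and keep whichever candidate the critic values most. All module interfaces (`base_policy.sample`, `edit_policy`, `q_fn`) are assumed names for illustration.

```python
# Sketch: choose the value-maximizing action among base and edited candidates.
import torch

@torch.no_grad()
def on_the_fly_action(obs, base_policy, edit_policy, q_fn, n_base=4):
    candidates = []
    for _ in range(n_base):
        a = base_policy.sample(obs)                  # e.g. a diffusion / flow-matching policy
        mean, std = edit_policy(obs, a)              # lightweight Gaussian edit head
        a_edit = (a + mean + std * torch.randn_like(a)).clamp(-1.0, 1.0)
        candidates += [a, a_edit]
    candidates = torch.stack(candidates)             # [2 * n_base, act_dim]
    q = torch.stack([q_fn(obs, a) for a in candidates])
    return candidates[q.argmax()]                    # value-maximizing action
```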
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
705. CTRL&SHIFT: High-quality Geometry-Aware Object Manipulation in Visual Generation
๐ฌ 17
โญ 6.50
๐ generative models
๐ท๏ธ Diffusion Model Image Editing Video Editing
Object-level manipulation (relocating or reorienting objects in images or videos while preserving scene realism) is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present **Ctrl&Shift**, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that **Ctrl&Shift** achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. *To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.*
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
706. J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
๐ฌ 12
โญ 6.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM-as-a-Judge Reasoning Reinforcement Learning
The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for nonverifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
707. Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models
๐ฌ 9
โญ 6.50
๐ reinforcement learning
๐ท๏ธ Behavioral Foundation Models (BFMs) Zero-shot Reinforcement Learning Zero-shot RL Representation Learning Unsupervised RL
Behavioral Foundation Models (BFMs) have recently been successful in producing agents capable of adapting to any unknown reward or task. In reality, these methods are only able to produce near-optimal policies for reward functions that lie in the span of some pre-existing _state features_. Naturally, their efficiency relies heavily on the choice of state features they use. As a result, these BFMs have used a wide variety of complex objectives, often sensitive to environment coverage, to train task-spanning features with different inductive properties. With this work, our aim is to examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, which in turn shrinks the span of reward functions for which we can represent optimal policies. We propose an approach, RLDP, that adds a simple regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we demonstrate that prior approaches diverge in low-coverage scenarios where RLDP still succeeds.
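A minimal sketch of the idea described above, combining a latent next-state prediction loss with a simple feature-diversity regularizer; the specific decorrelation penalty shown here is an assumption chosen for illustration rather than the paper's exact regularizer.

```python
# Sketch: regularized latent dynamics prediction (prediction loss + diversity penalty).
import torch
import torch.nn.functional as F

def regularized_latent_prediction_loss(phi, phi_next, dynamics, action, div_weight=1.0):
    """phi, phi_next: [B, d] state features; `dynamics` predicts next-state features."""
    pred = dynamics(phi, action)
    pred_loss = F.mse_loss(pred, phi_next.detach())
    # diversity regularizer: decorrelate feature dimensions to avoid feature collapse
    z = (phi - phi.mean(0)) / (phi.std(0) + 1e-6)
    cov = z.T @ z / z.shape[0]
    off_diag = cov - torch.diag(torch.diagonal(cov))
    div_loss = off_diag.pow(2).mean()
    return pred_loss + div_weight * div_loss
```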
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
708. FutureFill: Fast Generation from Convolutional Sequence Models
๐ฌ 11
โญ 6.50
๐ generative models
๐ท๏ธ convolutional models fast inference
We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill, a general-purpose fast generation method for any sequence prediction algorithm based on convolutional operators. FutureFill reduces generation time from quadratic to quasilinear in the context length. Moreover, when generating from a prompt, it requires a prefill cache whose size grows only with the number of tokens to be generated, often much smaller than the caches required by standard convolutional or attention-based models. We validate our theoretical claims with language modeling experiments and demonstrate substantial efficiency gains when generating from a deep convolutional sequence prediction model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
709. Masked Generative Policy for Robotic Control
๐ฌ 18
โญ 6.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Imitation Learning Masked Generative Transformer Generative Model
We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9\% across 150 tasks while cutting per-sequence inference time by up to 35×. It further improves the average success rate by 60\% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail. Further results and videos are available at: https://anonymous.4open.science/r/masked_generative_policy-8BC6.
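A minimal sketch of confidence-based parallel decoding with refinement, the generation scheme described above; the model interface, number of refinement passes, and re-masking ratio are illustrative assumptions.

```python
# Sketch: parallel masked generation of action tokens with iterative refinement of
# the least confident positions.
import torch

@torch.no_grad()
def generate_action_tokens(model, obs, seq_len, mask_id, n_iters=4, keep_p=0.9):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for it in range(n_iters):
        logits = model(obs, tokens)                  # [seq_len, vocab_size]
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)                   # per-token confidence and prediction
        tokens = pred.clone()
        if it < n_iters - 1:
            # re-mask the least confident tokens so the next pass can refine them
            k = max(int((1.0 - keep_p) * seq_len), 1)
            low = conf.topk(k, largest=False).indices
            tokens[low] = mask_id
    return tokens
```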
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
710. Transformers Learn Latent Mixture Models In-Context via Mirror Descent
๐ฌ 13
โญ 6.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ in-context learning markov chain transformers mirror descent mixture models latent variables
Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a novel framework based on Mixture of Transition Distributions, whereby a latent variable, whose distribution is parameterized by a set of unobserved mixture weights, determines the influence of past tokens on the next. To correctly predict the next token, transformers need to learn the mixture weights in-context. We demonstrate that transformers can implement Mirror Descent to learn the mixture weights from the context. To this end, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch converge to this solution: attention maps match our construction, and deeper models' performance aligns with multi-step Mirror Descent.
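For reference, a single mirror-descent step with the entropic mirror map (exponentiated gradient) over the simplex of mixture weights has the textbook form below; this is shown for intuition and is not the paper's exact transformer construction:

$$
w^{(k+1)}_i \;=\; \frac{w^{(k)}_i \exp\!\big(-\eta\, \partial_i \ell(w^{(k)})\big)}{\sum_{j} w^{(k)}_j \exp\!\big(-\eta\, \partial_j \ell(w^{(k)})\big)},
$$

where $\ell$ is the in-context prediction loss and $\eta$ the step size; the update keeps $w^{(k+1)}$ on the probability simplex.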
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
711. A Recovery Guarantee for Sparse Neural Networks
๐ฌ 12
โญ 6.40
๐ learning theory
๐ท๏ธ compressed sensing neural networks model pruning sparse weight recovery
We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.
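For readers unfamiliar with the recovery algorithm mentioned above, here is a minimal, generic sketch of iterative hard thresholding; the gradient oracle and step size are placeholders, and the two-layer ReLU structure exploited in the paper is omitted.

```python
# Sketch: generic iterative hard thresholding (IHT) for recovering a k-sparse weight vector.
import numpy as np

def iht(grad_fn, dim, sparsity, step=0.1, n_iters=500):
    """grad_fn(w) returns the gradient of the data-fit loss at weights w."""
    w = np.zeros(dim)
    for _ in range(n_iters):
        w = w - step * grad_fn(w)                    # gradient step on the fit loss
        keep = np.argsort(np.abs(w))[-sparsity:]     # hard-threshold to the k largest entries
        mask = np.zeros(dim, dtype=bool)
        mask[keep] = True
        w[~mask] = 0.0
    return w
```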
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
712. To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
๐ฌ 22
โญ 6.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Transformer Visual Attention Sink Attention Sink Multimodal LLM Large Vision Language Model
Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has rarely been studied but is very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage this information and to examine how, and to what extent, it is interpreted by the LLM. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, including but not limited to mathematical problem solving, logical inference, and geometric understanding, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
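A minimal sketch of how high-norm ViT tokens could be located in practice; the z-score threshold and interface are illustrative assumptions rather than the paper's selection rule.

```python
# Sketch: identify high-norm visual tokens ("ViT attention sinks") from encoder outputs.
import torch

def find_vit_sinks(vit_tokens, z_thresh=3.0):
    """vit_tokens: [num_patches, d] image-token embeddings from the vision encoder."""
    norms = vit_tokens.norm(dim=-1)
    z = (norms - norms.mean()) / (norms.std() + 1e-6)
    sink_idx = (z > z_thresh).nonzero(as_tuple=True)[0]   # indices of unusually high-norm tokens
    return sink_idx, norms
```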
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
713. Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
๐ฌ 14
โญ 6.40
๐ learning theory
๐ท๏ธ chain of continuous thought training dynamics reasoning superposition
Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and a subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned from gradient-based training methods. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges during training in two training stages -- (i) a *thought-generation* stage that autoregressively expands the continuous thought, and (ii) a *prediction* stage that converts the thought into the final answer. Our analysis reveals that during training using continuous thought, the index-matching logit, an important quantity which reflects the strength of the model's local search ability, will first increase and then remain bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during the reasoning process: the model will exploit local problem structures to identify plausible search traces, and assign comparable weights to multiple such traces to explore when it is uncertain about which solution is correct, which results in superposition. Our experimental results tracking the growth of logits further validate our theory.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
714. Learning on a Razorโs Edge: Identifiability and Singularity of Polynomial Neural Networks
๐ฌ 22
โญ 6.40
๐ learning theory
๐ท๏ธ identifiability singularities critical points neuromanifolds polynomial activation algebraic geometry
We study function spaces parametrized by neural networks, referred to as neuromanifolds. Specifically, we focus on deep Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) with an activation function that is a sufficiently generic polynomial. First, we address the identifiability problem, showing that, for almost all functions in the neuromanifold of an MLP, there exist only finitely many parameter choices yielding that function. For CNNs, the parametrization is generically one-to-one. As a consequence, we compute the dimension of the neuromanifold. Second, we describe singular points of neuromanifolds. We characterize singularities completely for CNNs, and partially for MLPs. In both cases, they arise from sparse subnetworks. For MLPs, we prove that these singularities often correspond to critical points of the mean-squared error loss, which does not hold for CNNs. This provides a geometric explanation of the sparsity bias of MLPs. All of our results leverage tools from algebraic geometry.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
715. DA$^2$: Depth Anything in Any Direction
๐ฌ 21
โญ 6.40
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Panoramas Depth (Distance) Estimation
Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images.
Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision.
However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization.
Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (\textit{e.g.}, cubemaps),
which leads to suboptimal efficiency.
To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator.
Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective images, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K.
To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance.
A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38\% improvement on AbsRel over the strongest zero-shot baseline.
Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization.
Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches.
Both the code and the curated panoramic data will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
716. The Counting Power of Transformers
๐ฌ 24
โญ 6.40
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ FLaNN expressiveness attention formal languages
Counting properties (e.g. determining whether certain tokens occur more often than other tokens in a given input text) have played a significant role in the study of the expressiveness of transformers.
In this paper, we provide a formal framework for investigating the counting power of transformers. We argue that all existing results demonstrate transformers' expressivity only for (semi-)linear counting properties, i.e., those expressible as a boolean combination of linear inequalities.
Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties, i.e., those expressible as a boolean combination of arbitrary multivariate polynomial inequalities (of any degree). Among others, these generalize the counting properties that can be captured by support vector machines with a polynomial kernel in the vector space model.
To complement this result, we exhibit a natural subclass of (softmax) transformers that completely characterizes semialgebraic counting properties.
Through connections with Hilbert's tenth problem, this expressivity of transformers also yields a new undecidability result for analyzing an extremely simple transformer model --- surprisingly, with neither positional encodings (i.e. NoPE-transformers) nor masking.
We also experimentally validate the trainability of such counting properties.
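For concreteness, an illustrative example (not taken from the paper) of a counting property over the alphabet $\{a, b, c\}$ that is semialgebraic but not semilinear:

$$
\varphi(w) \;\equiv\; \#_a(w)\cdot \#_b(w) \;>\; \#_c(w)^2,
$$

where $\#_\sigma(w)$ counts occurrences of $\sigma$ in $w$; no boolean combination of linear inequalities over the counts defines this property.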
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
717. Monocular Normal Estimation via Shading Sequence Estimation
๐ฌ 18
โญ 6.40
๐ generative models
๐ท๏ธ Video Diffusion Model Shading Estimation Single-view Normal Estimation
Monocular normal estimation aims to estimate a normal map from a single RGB image of an object under arbitrary lighting. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have an overall correct color distribution, the reconstructed surfaces frequently fail to align with the geometry details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct spatially varying geometric details, as they are represented in normal maps only by relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to variations in geometry. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation. Codes and dataset will be released to facilitate reproducible research.
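A minimal sketch of the final conversion step described above: recovering per-pixel normals from a predicted shading sequence via ordinary least squares. Lambertian shading under known per-frame light directions is an illustrative assumption.

```python
# Sketch: per-pixel normals from a shading sequence by an ordinary least-squares solve.
import numpy as np

def normals_from_shading(shadings, light_dirs):
    """shadings: [K, H, W] predicted shading frames; light_dirs: [K, 3] unit light directions."""
    K, H, W = shadings.shape
    S = shadings.reshape(K, -1)                              # [K, H*W]
    # solve light_dirs @ n = s for every pixel simultaneously
    N, *_ = np.linalg.lstsq(light_dirs, S, rcond=None)       # [3, H*W]
    N = N / (np.linalg.norm(N, axis=0, keepdims=True) + 1e-8)
    return N.T.reshape(H, W, 3)
```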
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
718. Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
๐ฌ 21
โญ 6.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reasoning Vision-Language Models Contrasting
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent fine-tuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely compared with when given a single VQA sample. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-$55$K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. The code, dataset and trained models will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
719. $\ell_1$ Latent Distance based Continuous-time Graph Representation
๐ฌ 22
โญ 6.40
๐ learning on graphs and other geometries & topologies
๐ท๏ธ $\ell_1$ distance graph representation sequential survival process ultra-low-dimensional embedding
Continuous-time graph representation (CTGR) is a widely-used methodology in machine learning, physics, bioinformatics, and social networks. The sequential survival process in a latent space with the squared $\ell_2$ distance is an important ultra-low-dimensional embedding for CTGR. However, the squared $\ell_2$ distance violates the triangle inequality, which may distort the relative node positions in the latent space and thus degrade performance on social, contact, and collaboration networks. Reverting to the $\ell_2$ distance is infeasible because the corresponding integral computation is intractable. To solve these problems, we propose a theoretically-sound $\ell_1$ latent distance based continuous-time graph representation ($\ell_1$LD-CTGR). It facilitates a true latent metric space for the sequential survival process. Moreover, the integral of the hazard function turns out to be a closed-form piece-wise exponential integral, which is well suited to the ultra-low-dimensional embedding. To handle the non-differentiable $\ell_1$ norm, we find a descent direction of the hazard function to replace the gradient, enabling mainstream learning architectures to learn the parameters. Extensive experiments using both synthetic and real-world data show the competitive performance of $\ell_1$LD-CTGR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
720. Compositional Visual Planning via Inference-Time Diffusion Scaling
๐ฌ 18
โญ 6.40
๐ applications to robotics, autonomy, planning
๐ท๏ธ Planning Compositionality Diffusion Models Robotics
Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data.
Existing compositional approaches stitch together short segments by separately denoising each component and averaging overlapping regions. However, this suffers from instability as the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a novel combination of synchronous and asynchronous message passing that operates on Tweedie estimates, producing globally consistent guidance without requiring additional training. Our training-free framework demonstrates significant improvements over existing baselines across 100 simulation tasks spanning 4 diverse scenes, effectively generalizing to unseen start-goal combinations that were not present in the original training data. Project website: https://comp-visual-planning.github.io/
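As context for the boundary-agreement mechanism, Tweedie's formula gives the posterior-mean estimate of the clean sample from a noisy diffusion state; the variance-preserving form below is a standard statement, included for illustration rather than taken from the paper:

$$
\hat{x}_0(x_t) \;=\; \frac{x_t + (1-\bar\alpha_t)\,\nabla_{x_t}\log p_t(x_t)}{\sqrt{\bar\alpha_t}} \;\approx\; \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}},
$$

and the method described above enforces agreement between the $\hat{x}_0$ estimates of adjacent chunks on their overlapping frames, rather than between the noisy intermediate states $x_t$ themselves.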
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
721. Tokenisation over Bounded Alphabets is Hard
๐ฌ 12
โญ 6.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ Tokenisation tokenization language modelling compression LLM NLP
Recent works have proven tokenisation to be NP-complete.
However, their proofs' constructions rely on tokenisation being applied to inputs with alphabets of unbounded cardinality, which does not accurately reflect the real world.
Indeed, since practical applications of tokenisers involve fixed-size alphabets (e.g., Unicode or bytes), the implications of such a statement may be challenged.
In this work, we examine the computational complexity of tokenisation over bounded alphabets, considering two variants of this problem: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations (in bottom-up tokenisation) or a vocabulary (in direct tokenisation) whose application compresses a dataset to at most $\delta$ symbols.
When alphabets are bounded to have only 2 characters, we not only prove that bottom-up and direct tokenisation are NP-complete, but also show that there is no polynomial-time approximation scheme for either of these problems (unless P = NP).
Furthermore, even when alphabets are bounded to contain a single character, we can still prove the NP-completeness of direct tokenisation.
Although the single-character case is not practical on its own, proving hardness results for an $n$-ary alphabet allows us to prove the same results for alphabets of any larger size.
We thus conclude that direct tokenisation over any alphabet is NP-complete, and that both bottom-up and direct tokenisation do not admit polynomial-time approximation schemes for any alphabet of size 2 or larger.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
722. Adaptive gradient descent on Riemannian manifolds and its applications to Gaussian variational inference
๐ฌ 24
โญ 6.40
๐ optimization
๐ท๏ธ Adaptive method Riemannian optimization Variational Inference
We propose RAdaGD, a novel family of adaptive gradient descent methods on general Riemannian manifolds. RAdaGD adapts the step size parameter without line search, and includes instances that achieve a non-ergodic convergence guarantee, $f(x_k) - f(x_\star) \le \mathcal{O}(1/k)$, under local geodesic smoothness and generalized geodesic convexity. A core application of RAdaGD is Gaussian Variational Inference, where our method provides the first convergence guarantee in the absence of $L$-smoothness of the target log-density, under additional technical assumptions. We also investigate the empirical performance of RAdaGD in numerical simulations and demonstrate its competitiveness in comparison to existing algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
723. Language and Experience: A Computational Model of Social Learning in Complex Tasks
๐ฌ 15
โญ 6.40
๐ applications to neuroscience & cognitive science
๐ท๏ธ cognitive science; social learning; cultural learning; causal learning; bayesian models of cognition
The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models human social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference.
Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models. We further explore how knowledge can accumulate across generations through iterated learning experiments and demonstrate successful knowledge transfer between humans and modelsโrevealing how structured, language-compatible representations might enable human-machine collaborative learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
724. Predicting Kernel Regression Learning Curves from Only Raw Data Statistics
๐ฌ 14
โญ 6.40
๐ learning theory
๐ท๏ธ kernels kernel regression neural tangent kernel eigenstructure learning curves natural data MLPs
We study kernel regression with common rotation-invariant kernels on real datasets including CIFAR-5m, SVHN, and ImageNet.
We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function $f_*$.
The key new idea is an analytical approximation of a kernelโs eigenvalues and eigenfunctions with respect to an anisotropic data distribution.
The eigenfunctions resemble Hermite polynomials of the data, so we call this approximation the \textit{Hermite eigenstructure ansatz} (HEA).
We prove the HEA for Gaussian data, but we find that real image data is often ``Gaussian enough'' for the HEA to hold well in practice, enabling us to predict learning curves by applying prior results relating kernel eigenstructure to test risk.
Extending beyond kernel regression, we empirically find that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by the HEA.
Our HEA framework is a proof of concept that an end-to-end theory of learning which maps dataset structure all the way to model performance is possible for nontrivial learning algorithms on real datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
725. CheckMate! Watermarking Graph Diffusion Models in Polynomial Time
๐ฌ 22
โญ 6.40
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graphs Watermarking Diffusion Models Networks
Watermarking provides an effective means for data governance.
However, conventional post-editing graph watermarking approaches degrade the graph quality and involve NP-hard subroutines.
Alternatively, recent approaches advocate for embedding watermarking patterns in the noisy latent during data generation from diffusion models, but remain uncharted for graph models due to the hardness of inverting the graph diffusion process.
In this work, we propose CheckWate: the first watermarking framework for graph diffusion models embedding checkerboard watermark and providing polynomial time verification.
To address NP-completeness due to graph isomorphism, CheckWate embeds the watermark into the latent eigenvalues, which are isomorphism-invariant.
To detect the watermark through reversing the graph diffusion process,
CheckWate leverages the graph eigenvectors to approximately dequantize the discrete graph back to the continuous latent, with theoretical guarantees on the detectability and dequantization error.
We further introduce a latent sparsification mechanism to enhance the robustness of CheckWate against graph modifications.
We evaluate CheckWate on four datasets and four graph modification attacks, against three generation-time watermarking schemes.
CheckWate achieves remarkable generation quality while being detectable under strong attacks such as isomorphism, whereas the baselines are unable to detect the watermark.
Code available at: https://anonymous.4open.science/r/checkwate.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
726. A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space
๐ฌ 32
โญ 6.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ safety alignment safety preservation low-rank adaptation parameter-efficient fine-tuning
Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation.
Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models.
To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space.
First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism.
Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior.
Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods.
Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4\% to 3.6\%, while improving the accuracy from 26.0\% to 28.0\%.
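A minimal sketch of the null-space idea in the second component described above: build a projector from activations collected on harmful prompts and project adapter updates through it. How those activations are gathered, and the rank rule, are illustrative assumptions.

```python
# Sketch: null-space projector that keeps adapter updates from acting on the span of
# harmful-prompt activations.
import torch

def null_space_projector(harmful_acts, rank=None, eps=1e-5):
    """harmful_acts: [N, d] layer inputs gathered on harmful prompts.
    Returns P such that (delta_W @ P) is (approximately) zero on their span."""
    _, S, Vh = torch.linalg.svd(harmful_acts, full_matrices=False)
    r = rank if rank is not None else int((S > eps * S[0]).sum())
    U = Vh[:r].T                                      # [d, r] basis of the "harmful" input subspace
    eye = torch.eye(harmful_acts.shape[1], device=harmful_acts.device, dtype=harmful_acts.dtype)
    return eye - U @ U.T

# usage: project each adapter update before applying it, e.g. delta_W_safe = delta_W @ P
```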
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
727. On the Design of One-step Diffusion via Shortcutting Flow Paths
๐ฌ 21
โญ 6.40
๐ generative models
๐ท๏ธ Diffusion Model Flow Matching Few-step Diffusion Shortcut Model
Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (a.k.a. shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space.
To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256×256 under the classifier-free guidance setting. Remarkably, the model requires no pre-training, distillation, or curriculum learning.
We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
728. FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning
๐ฌ 32
โญ 6.40
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning Life-long Learning Brain-inspired AI Catastrophic Forgetting Prompt Tuning
General continual learning (GCL) challenges intelligent systems to learn from single-pass, non-stationary data streams without clear task boundaries. While recent advances in continual parameter-efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly's hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain-inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt's superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
729. cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning
๐ฌ 20
โญ 6.40
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ CAD 3D reconstruction LLM VLM point cloud DPO GRPO
Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, most existing methods focus on a single input modality: point clouds, images, or texts, which limits their generalizability and robustness, while the few multimodal approaches struggle to deliver competitive quality. Leveraging advances in vision-language models (VLM), we propose $\texttt{cadrille}$, a multimodal CAD reconstruction model that takes inputs of three modalities and outputs executable Python code for CAD reconstruction. Inspired by the large language model (LLM) training paradigm, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback, obtained programmatically. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, $\texttt{cadrille}$ sets a new state of the art on as many as 10 benchmarks across three modalities and four datasets, including a real-world one.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
730. ATLAS: Alibaba Dataset and Benchmark for Learning-Augmented Scheduling
๐ฌ 21
โญ 6.40
๐ datasets and benchmarks
๐ท๏ธ Scheduling with predictions Dataset and benchmark Machine learning Learning augmented scheduling Non-clairvoyant scheduling
Learning-augmented scheduling uses ML predictions to improve decision-making under uncertainty. Many algorithms in this class have been proposed with better theoretical guarantees than the classic methods. Translating these theoretical results into practice, however, requires an understanding of real workloads. Such an understanding is hard to develop because existing production traces either lack the ground-truth processing times or are not publicly available, while synthetic benchmarks fail to represent real-world complexity. We fill this gap by introducing *Alibaba Trace for Learning-Augmented Scheduling (ATLAS)*, a research-ready dataset derived from Alibaba's Platform of Artificial Intelligence (PAI) cluster trace, a production system that processes hundreds of thousands of ML jobs per day. The ATLAS dataset has been cleaned and feature-engineered to represent the inputs and constraints of non-clairvoyant scheduling, including user tags, resource requests (CPU/GPU/memory), and job structures with ground-truth processing times. We develop a prediction benchmark reporting prediction error metrics, along with feature importance analysis, and introduce a novel multiple-stage ML model. We also provide a scheduling benchmark for minimizing the total completion time, max-stretch, and makespan. ATLAS is a reproducible foundation for researchers to study learning-augmented scheduling on real workloads, available at https://anonymous.4open.science/r/non-clairvoyant-with-predictions-7BF8/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
731. Closing the Gap Between Text and Speech Understanding in LLMs
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Speech language models large language models multimodal language models modality alignment cross-modal alignment cross-modal transfer cross-modal distillation modality gap speech processing
Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from publicly available corpora.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
732. Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine
๐ฌ 10
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Self-Improvement Coding Agents Gödel Machine
Recent studies operationalize self-improvement through coding agents that edit their own codebases, growing a tree of self-modifications with expansion strategies that favor higher software-engineering benchmark performance, on the assumption that this implies more promising subsequent self-modifications. However, we identify a mismatch between an agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the \emph{Metaproductivity-Performance Mismatch}. Inspired by Huxley's concept of a clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of an agent's \emph{descendants} as an indicator of its potential for self-improvement. We show that the Gödel Machine, the optimal self-improving machine, is achieved with access to the true $\mathrm{CMP}$. We introduce the Huxley-Gödel Machine (HGM), which searches the tree of self-modifications by estimating $\mathrm{CMP}$ and using it as guidance. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding-agent search methods while using less wall-clock time. Moreover, when both use the GPT-5-mini backbone, the agent optimized by HGM on SWE-bench Verified outperforms SWE-agent on SWE-bench Lite, where SWE-agent, a leading human-engineered open-source coding agent, ranks best on the official leaderboard, demonstrating that HGM self-improvement indeed enhances genuine coding capability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
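The HGM entry defines CMP as an aggregate of the benchmark performances of an agent's descendants in the self-modification tree. The exact aggregator is not specified in the abstract, so the sketch below uses a simple mean over all descendants as a stand-in; the tree structure and `score` field are purely illustrative.

```python
# Illustrative sketch: aggregate descendant benchmark scores in a tree of
# self-modifications. The mean is a placeholder aggregator; the paper's CMP
# estimator may aggregate differently.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AgentNode:
    name: str
    score: float                      # benchmark performance of this agent
    children: List["AgentNode"] = field(default_factory=list)

def descendant_scores(node: AgentNode) -> List[float]:
    scores = []
    for child in node.children:
        scores.append(child.score)
        scores.extend(descendant_scores(child))
    return scores

def clade_metaproductivity(node: AgentNode) -> Optional[float]:
    """Mean benchmark score over all descendants (None for a leaf)."""
    scores = descendant_scores(node)
    return sum(scores) / len(scores) if scores else None

# toy tree: pick the node whose clade looks most promising to expand next
root = AgentNode("v0", 0.30, [
    AgentNode("v1", 0.35, [AgentNode("v3", 0.50)]),
    AgentNode("v2", 0.40),
])
best = max([root] + root.children,
           key=lambda n: clade_metaproductivity(n) or n.score)
print(best.name)  # "v1": its descendants score highest on average
```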
733. ASMIL: Attention-Stabilized Multiple Instance Learning for Whole-Slide Imaging
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Whole slide image Multiple instance learning
Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distribution. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73%. All code and data are publicly available at https://anonymous.4open.science/r/ASMIL-5018/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
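The ASMIL entry replaces softmax attention pooling with a normalized sigmoid and randomly drops instances to curb overfitting. Below is a minimal sketch of attention-based MIL pooling with those two substitutions; the scoring network, dimensions, and drop rate are placeholders, and the anchor-model stabilization loss is omitted.

```python
# Minimal sketch of attention-MIL pooling with (a) normalized-sigmoid weights
# instead of softmax and (b) random instance dropping. Dims/rates are placeholders.
import torch
import torch.nn as nn

class SigmoidMILPooling(nn.Module):
    def __init__(self, dim: int = 512, drop_rate: float = 0.3):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.drop_rate = drop_rate

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (num_instances, dim) patch features for one whole-slide bag
        if self.training and self.drop_rate > 0:
            keep = torch.rand(instances.size(0)) >= self.drop_rate
            instances = instances[keep] if keep.any() else instances
        s = torch.sigmoid(self.score(instances))   # (n, 1), each weight bounded in (0, 1)
        attn = s / s.sum().clamp_min(1e-8)         # normalize so weights sum to 1
        return (attn * instances).sum(dim=0)       # bag-level embedding, shape (dim,)

bag = torch.randn(200, 512)
print(SigmoidMILPooling()(bag).shape)  # torch.Size([512])
```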
734. Structured Flow Autoencoders: Learning Structured Probabilistic Representations with Flow Matching
๐ฌ 10
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Flow Matching Probabilistic Model Representation Learning Probabilistic Graphical Model Autoencoder
Flow matching has proven to be a powerful density estimator, yet it often fails to explicitly capture the rich inherent latent structure of complex data. To address this limitation, we introduce Structured Flow Autoencoders (SFA), a family of probabilistic models that augments Continuous Normalizing Flows (CNFs) with graphical models. At the core of SFA is a novel flow matching based objective, which explicitly accounts for latent variables, enabling simultaneous learning of likelihood and posterior. We demonstrate the versatility of SFA across settings, including models with continuous and mixture latent variables, as well as latent dynamical systems. Empirical studies show that SFA outperforms Variational Autoencoders (VAE) and their graphical model extensions, achieving better data fit while simultaneously retaining meaningful latent variables as structured representations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
735. Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reasoning Reinforcement Learning with Verifiable Rewards Long Chain-of-Thought
Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce **T**hinking-**F**ree **P**olicy **I**nitialization (**TFPI**), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkFree* operation, explicitly discarding the thinking content via a direct *</think>* append, to reduce token usage during inference. Training with *ThinkFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
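The TFPI entry describes the ThinkFree operation as discarding the thinking content by directly appending `</think>`, so the model answers without a long reasoning segment. A minimal sketch of how such an input adaptation might be constructed is below; the chat-template markers are assumptions about an R1-style distilled model, not the paper's exact format.

```python
# Hypothetical sketch of a "ThinkFree" input adaptation: close the thinking
# span immediately so the model skips slow-thinking content. The template
# markers below are assumptions, not the paper's exact prompt format.
def thinkfree_prompt(question: str) -> str:
    return (
        "<|user|>\n" + question + "\n<|assistant|>\n"
        "<think>\n</think>\n"   # empty thinking block: answer directly
    )

def thinking_prompt(question: str) -> str:
    return "<|user|>\n" + question + "\n<|assistant|>\n<think>\n"

q = "What is 17 * 24?"
print(thinkfree_prompt(q))
print(thinking_prompt(q))
```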
736. GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Category-Agnostic Pose Estimation Graph Learning Variational Autoencoder
Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the modelโs capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose \textbf{GenCape}, a \textbf{Gen}erative-based framework for \textbf{CAPE} that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedded layer-wise into the Graph Transformer Decoder for progressive structural priors refinement. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1- and 5-shot settings, while maintaining competitive performance against text-support counterparts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
737. CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis
๐ฌ 15
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting;Panoramic Novel View Synthesis;Cylindrical Triplane;Feed-forward;Multi-view Reconstruction
Feed-forward 3D Gaussian Splatting (3DGS) has shown great promise for real-time novel view synthesis, but its application to panoramic imagery remains challenging. Existing methods often rely on multi-view cost volumes for geometric refinement, which struggle to resolve occlusions in sparse-view scenarios. Furthermore, standard volumetric representations like Cartesian Triplanes poorly capture the inherent geometry of $360^\circ$ scenes, leading to distortion and aliasing.
In this work, we introduce CylinderSplat, a feed-forward framework for panoramic 3DGS that addresses these limitations. The core of our method is a new cylindrical Triplane representation, which is better aligned with panoramic data and real-world structures adhering to the Manhattan-world assumption. We use a dual-branch architecture: a pixel-based branch reconstructs well-observed regions, while a volume-based branch leverages the cylindrical Triplane to complete occluded or sparsely-viewed areas. Our framework is designed to flexibly handle a variable number of input views, from single to multiple panoramas. Extensive experiments demonstrate that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in both reconstruction quality and geometric accuracy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
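The CylinderSplat entry encodes panoramic scenes with a cylindrical triplane, but the abstract does not spell out the plane parameterization. The sketch below assumes one plausible choice, three feature planes indexed by (theta, z), (r, z), and (r, theta), sampled bilinearly and summed; plane resolutions and coordinate ranges are placeholders, not the paper's configuration.

```python
# Hypothetical cylindrical-triplane feature lookup (an assumption about the
# representation, not the authors' implementation). Points (x, y, z) are mapped
# to cylindrical coordinates and used to bilinearly sample three 2D planes.
import torch
import torch.nn.functional as F

def sample_plane(plane: torch.Tensor, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # plane: (C, H, W); u, v in [-1, 1]; returns (N, C)
    grid = torch.stack([u, v], dim=-1).view(1, 1, -1, 2)
    out = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)  # (1, C, 1, N)
    return out.squeeze(0).squeeze(1).t()

def cylindrical_triplane_features(points, plane_tz, plane_rz, plane_rt,
                                  r_max=10.0, z_max=3.0):
    x, y, z = points.unbind(dim=-1)
    theta = torch.atan2(y, x) / torch.pi                   # azimuth, normalized to [-1, 1]
    r = (torch.sqrt(x * x + y * y) / r_max).clamp(0, 1) * 2 - 1
    h = (z / z_max).clamp(-1, 1)
    return (sample_plane(plane_tz, theta, h)
            + sample_plane(plane_rz, r, h)
            + sample_plane(plane_rt, r, theta))

pts = torch.randn(4096, 3)
C, H, W = 32, 64, 128
feats = cylindrical_triplane_features(
    pts, torch.randn(C, H, W), torch.randn(C, H, W), torch.randn(C, H, W))
print(feats.shape)  # torch.Size([4096, 32])
```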
738. RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
๐ฌ 22
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reasoning abstractions; LLM; RL; Structured exploration; Reasoning
Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement algorithmic procedures that can be used to deduce answers to hard problems. Doing so requires reusing primitives, intermediate results, or procedures across multiple problems. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, the depth-first and brute-force nature of reasoning traces learned by these models suggests that this is far from a fulfilled promise. To enable more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing several useful abstractions given a problem, followed by RL training that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and an abstraction-conditioned solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that spending more test-time compute on generating abstractions is more beneficial for performance than generating more solutions at large inference-time budgets, illustrating the role of abstractions in guiding global exploration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
739. GNN Explanations that do not Explain and How to find Them
๐ฌ 13
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ graph neural networks explainability self-explainable auditing faithfulness
Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model's inner workings and for identifying potential misuse of sensitive attributes.
Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable.
In this work, we identify a critical failure of SE-GNN explanations: *explanations can be unambiguously unrelated to how the SE-GNNs infer labels.*
We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes.
Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing.
To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings.
Our code is available in the supplemental.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
740. Sample Efficient Offline RL via T-Symmetry Enforced Latent State-Stitching
๐ฌ 19
โญ 6.00
๐ reinforcement learning
๐ท๏ธ sample efficiency representation learning fundamental symmetry for dynamic modeling
Offline reinforcement learning (RL) has achieved notable progress in recent years. However, most existing offline RL methods require a large amount of training data to achieve reasonable performance and offer limited out-of-distribution (OOD) generalization capability due to conservative data-related regularizations. This seriously hinders the usability of offline RL in solving many real-world applications, where the available data are often limited. In this study, we introduce TELS, a highly sample-efficient offline RL algorithm that enables state-stitching in a compact latent space regulated by the fundamental time-reversal symmetry (T-symmetry) of dynamical systems. Specifically, we introduce a T-symmetry enforced inverse dynamics model (TS-IDM) to derive well-regulated latent state representations that greatly facilitate OOD generalization. A guide-policy can then be learned entirely in the latent space to optimize for the reward-maximizing next state, bypassing the conservative action-level behavioral regularization adopted in most offline RL methods. Finally, the optimized action can be extracted using the learned TS-IDM, together with the optimized latent next state from the guide-policy. In comprehensive experiments on both the D4RL benchmark tasks and a real-world industrial control test environment, TELS achieves superior sample efficiency and OOD generalization performance, significantly outperforming existing offline RL methods on a wide range of challenging small-sample tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
741. Hierarchical Concept-based Interpretable Models
๐ฌ 11
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Explainable Artificial Intelligence Concept-based Explainability Concept Discovery Concept Hierarchy Concept Bottleneck Models Concept Embedding Models Clustering Sparse Autoencoders
Modern deep neural networks remain challenging to interpret due to the opacity of their latent representations, impeding model understanding, debugging, and debiasing. Concept Embedding Models (CEMs) address this by mapping inputs to human-interpretable concept representations from which tasks can be predicted. Yet, CEMs fail to represent inter-concept relationships and require concept annotations at different granularities during training, limiting their applicability.
In this paper, we introduce *Hierarchical Concept Embedding Models* (HiCEMs), a new family of CEMs that explicitly model concept relationships through hierarchical structures. To enable HiCEMs in real-world settings, we propose *Concept Splitting*, a method for automatically discovering finer-grained sub-concepts from a pretrained CEM's embedding space without requiring additional annotations. This allows HiCEMs to generate fine-grained explanations from limited concept labels, reducing annotation burdens.
Our evaluation across multiple datasets, including a user study and experiments on *PseudoKitchens*, a newly proposed concept-based dataset of 3D kitchen renders, demonstrates that (1) Concept Splitting discovers human-interpretable sub-concepts absent during training that can be used to train highly accurate HiCEMs, and (2) HiCEMs enable powerful test-time concept interventions at different granularities, leading to improved task accuracy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
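The HiCEM entry discovers finer-grained sub-concepts by clustering a pretrained CEM's concept embedding space. The sketch below illustrates that core idea with scikit-learn: take the embeddings of samples where a parent concept is active and split them into k sub-concept clusters. The value of k, the clustering algorithm, and the embedding source are assumptions for illustration.

```python
# Illustrative sketch of "concept splitting": cluster the concept embeddings of
# samples where a parent concept is active to propose sub-concepts.
import numpy as np
from sklearn.cluster import KMeans

def split_concept(concept_embeddings: np.ndarray,
                  concept_active: np.ndarray,
                  k: int = 3):
    """concept_embeddings: (n_samples, emb_dim) embeddings for one parent concept.
    concept_active: boolean mask marking samples where the parent concept is present."""
    active = concept_embeddings[concept_active]
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(active)
    sub_labels = np.full(len(concept_embeddings), -1)      # -1: parent concept absent
    sub_labels[concept_active] = km.labels_
    return sub_labels, km.cluster_centers_

emb = np.random.randn(1000, 16)
mask = np.random.rand(1000) > 0.5
labels, centers = split_concept(emb, mask)
print(centers.shape, np.unique(labels))
```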
742. Why Adversarially Train Diffusion Models?
๐ฌ 19
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Robustness Adversarial training
Adversarial Training (AT) is a known, powerful, well-established technique for improving classifier robustness to input perturbations, yet its applicability beyond discriminative settings remains limited. Motivated by the widespread use of score-based generative models and their need to operate robustly under substantial noisy or corrupted input data, we propose an adaptation of AT for these models, providing a thorough empirical assessment.
We introduce a principled formulation of AT for Diffusion Models (DMs) that replaces the conventional *invariance* objective with an *equivariance* constraint aligned to the denoising dynamics of score matching. Our method integrates seamlessly into diffusion training by adding either random perturbations--similar to randomized smoothing--or adversarial ones--akin to AT.
Our approach offers several advantages: **(a)** tolerance to heavy noise and corruption, **(b)** reduced memorization, **(c)** robustness to outliers and extreme data variability and **(d)** resilience to iterative adversarial attacks.
We validate these claims on proof-of-concept low- and high-dimensional datasets with *known* ground-truth distributions, enabling precise error analysis. We further evaluate on standard benchmarks (CIFAR-10, CelebA, and LSUN Bedroom), where our approach shows improved robustness and preserved sample fidelity under severe noise, data corruption, and adversarial evaluation. Code available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
743. Probing Rotary Position Embeddings through Frequency Entropy
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Rotary Position Embedding Frequency Entropy Large Language Model
Rotary Position Embeddings (RoPE) are widely used in Transformers to encode positional information in token representations, yet the internal frequency structure of RoPE remains poorly understood. Previous studies have reported conflicting findings on the roles of high- and low-frequency dimensions, offering empirical observations but no unifying explanation. In this paper, we present a systematic framework that bridges these disparate results. We introduce Frequency Entropy (FE), a metric that quantifies the effective utilization of each RoPE frequency dimension, and we provide an analysis of how RoPE's sinusoidal components contribute to model representations on a per-dimension basis. Based on an analysis of the Llama-4 model, which incorporates both RoPE and NoPE layers, we find that the periodicity captured by FE appears in RoPE layers but not in NoPE layers. Furthermore, FE identifies dimensions in which energy concentrates under RoPE. These characteristics are observed across the spectrum rather than being confined to specific dimensions. Moreover, attenuating extreme-entropy dimensions at inference yields downstream accuracy that is statistically indistinguishable from the baseline, with modest perplexity improvements on average, suggesting that such dimensions are often redundant. Overall, FE provides a simple, general diagnostic for RoPE with implications for analysis and design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
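The entry above introduces Frequency Entropy as a per-frequency-dimension utilization measure, but the abstract does not give its formula. The toy diagnostic below is one plausible formalization for illustration only: for each 2D rotary pair, measure how the query energy in that pair is distributed over sequence positions and take its Shannon entropy. The pairing convention and the choice of distribution are assumptions, not the paper's definition.

```python
# Toy "frequency entropy" diagnostic (a plausible formalization, not the
# paper's exact metric). Assumes interleaved RoPE pairing (dims 2i, 2i+1).
import numpy as np

def frequency_entropy(q: np.ndarray) -> np.ndarray:
    """q: (seq_len, head_dim) query activations for one head, head_dim even.
    Returns one entropy value per rotary frequency pair: shape (head_dim // 2,)."""
    seq_len, head_dim = q.shape
    pairs = q.reshape(seq_len, head_dim // 2, 2)
    energy = (pairs ** 2).sum(axis=-1)                 # (seq_len, n_pairs)
    p = energy / energy.sum(axis=0, keepdims=True)     # distribution over positions
    p = np.clip(p, 1e-12, None)
    return -(p * np.log(p)).sum(axis=0)                # entropy per frequency dimension

q = np.random.randn(1024, 128)
fe = frequency_entropy(q)
print(fe.shape, float(fe.min()), float(fe.max()))      # (64,) ...
```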
744. AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs
๐ฌ 22
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LLM Quantization Anyprecision
The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency.
Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints.
To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when the compute operates directly at the bit-plane level and activates only the precision required by each request.
In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations.
By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic.
Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled.
We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead.
Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains competitive at higher precision, and achieves throughput gains of up to $3.0\times$ over half precision and $1.2\times$ over state-of-the-art multi-precision methods.
By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
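The AnyBCQ entry stores weights as binary bit-planes with scale factors, so a b-bit weight is a signed sum of scaled binary codes and precision is expanded by enabling more planes. The sketch below shows that arithmetic in NumPy; the greedy per-row scale fitting is a standard BCQ-style construction used purely for illustration, not the paper's progressive training procedure or kernel.

```python
# Sketch of binary-coded quantization (BCQ) arithmetic: weights are stored as
# {-1, +1} bit-planes B_k with scales alpha_k, and reconstructed at any
# precision by summing the first p planes. Greedy fitting is illustrative only.
import numpy as np

def bcq_encode(W: np.ndarray, bits: int):
    residual = W.copy()
    planes, scales = [], []
    for _ in range(bits):
        B = np.where(residual >= 0, 1.0, -1.0)                 # bit-plane in {-1, +1}
        alpha = np.abs(residual).mean(axis=1, keepdims=True)   # per-row scale factor
        planes.append(B)
        scales.append(alpha)
        residual = residual - alpha * B
    return planes, scales

def bcq_decode(planes, scales, precision: int) -> np.ndarray:
    """Reconstruct using only the first `precision` bit-planes."""
    return sum(a * B for B, a in zip(planes[:precision], scales[:precision]))

W = np.random.randn(16, 64).astype(np.float32)
planes, scales = bcq_encode(W, bits=4)
for p in (2, 3, 4):
    err = np.abs(W - bcq_decode(planes, scales, p)).mean()
    print(f"{p}-bit reconstruction error: {err:.4f}")          # error shrinks as p grows
```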
745. AWM: Accurate Weight-Matrix Fingerprint for Large Language Models
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ fingerprint large language models intellectual property
Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo (such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling) pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
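The AWM entry combines the Linear Assignment Problem with a CKA similarity to compare weight matrices while neutralizing parameter manipulations. The sketch below illustrates the two ingredients on toy matrices with SciPy/NumPy: match rows (e.g., permuted hidden units) via LAP, then score similarity with linear CKA. For brevity it uses the standard (biased) linear CKA rather than the unbiased estimator the paper calls for, and the cost matrix is a simple inner-product choice.

```python
# Illustrative LAP + CKA weight-matrix comparison (simplified: biased linear
# CKA, toy matrices). The paper uses an unbiased CKA estimator.
import numpy as np
from scipy.optimize import linear_sum_assignment

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two matrices with the same number of rows (samples)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

def lap_cka_similarity(W_base: np.ndarray, W_suspect: np.ndarray) -> float:
    """Align rows of the suspect matrix to the base matrix with LAP, then score with CKA."""
    cost = -W_base @ W_suspect.T                            # maximize row-wise inner products
    row_ind, col_ind = linear_sum_assignment(cost)
    return linear_cka(W_base[row_ind].T, W_suspect[col_ind].T)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
W_derived = W[rng.permutation(256)] + 0.05 * rng.normal(size=W.shape)  # permuted, perturbed copy
W_unrelated = rng.normal(size=(256, 512))
print(lap_cka_similarity(W, W_derived), lap_cka_similarity(W, W_unrelated))
```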
746. Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
๐ฌ 13
โญ 6.00
๐ learning theory
๐ท๏ธ Attention mechanism Interacting particle systems Minimax rates Nonparametric estimation
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending only on the smoothness $\beta$ of the activation, and crucially independent of token count, ambient dimension, or rank of the weight matrix. These results highlight a fundamental, dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
747. Learning an Image Editing Model without Image Editing Pairs
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ generative models image editing unsupervised learning personalization customization
Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
748. LMask: Learn to Solve Constrained Routing Problems with Lazy Masking
๐ฌ 12
โญ 6.00
๐ optimization
๐ท๏ธ Neural Combinatorial Optimization Routing Problem Deep Learning
Routing problems are canonical combinatorial optimization tasks with wide-ranging applications in logistics, transportation, and supply chain management. However, solving these problems becomes significantly more challenging when complex constraints are involved. In this paper, we propose LMask, a novel learning framework that utilizes dynamic masking to generate high-quality feasible solutions for constrained routing problems. LMask introduces the LazyMask decoding method, which lazily refines feasibility masks with the backtracking mechanism. In addition, it employs the refinement intensity embedding to encode the search trace into the model, mitigating representation ambiguities induced by backtracking. To further reduce sampling cost, LMask sets a backtracking budget during decoding, while constraint violations are penalized in the loss function during training to counteract infeasibility caused by this budget. We provide theoretical guarantees for the validity and probabilistic optimality of our approach. Extensive experiments on the traveling salesman problem with time windows (TSPTW) and TSP with draft limits (TSPDL) demonstrate that LMask achieves state-of-the-art feasibility rates and solution quality, outperforming existing neural methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
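The LMask entry decodes routes with a lazily refined feasibility mask and backtracks when the mask becomes empty, under a backtracking budget. The sketch below shows that control flow on a toy TSP-with-time-windows instance, with a greedy nearest-feasible rule standing in for the learned policy; the instance generator, the budget, and the feasibility check are illustrative, and the refinement-intensity embedding and training loss are not shown.

```python
# Toy sketch of lazy-mask decoding with backtracking for constrained routing
# (TSP with time windows). Greedy nearest-feasible selection replaces the policy.
import numpy as np

def decode_with_backtracking(dist, open_t, close_t, budget=50):
    n = len(dist)
    tour, time_now, backtracks = [0], 0.0, 0
    banned = [set() for _ in range(n + 1)]          # banned[d]: nodes disallowed at depth d

    while len(tour) < n:
        cur, depth = tour[-1], len(tour)
        cand = [j for j in range(n) if j not in tour and j not in banned[depth]]
        # lazy feasibility: arrival (after waiting for the window to open) must not be late
        feas = [j for j in cand
                if max(time_now + dist[cur][j], open_t[j]) <= close_t[j]]
        if not feas:
            if len(tour) == 1 or backtracks >= budget:
                return None, backtracks             # infeasible within the backtracking budget
            bad = tour.pop()                        # backtrack: undo the last decision ...
            banned[depth - 1].add(bad)              # ... and refine the mask at that depth
            banned[depth:] = [set() for _ in range(n + 1 - depth)]
            time_now = 0.0                          # recompute arrival time for the shorter prefix
            for a, b in zip(tour, tour[1:]):
                time_now = max(time_now + dist[a][b], open_t[b])
            backtracks += 1
            continue
        nxt = min(feas, key=lambda j: dist[cur][j]) # nearest feasible node stands in for the policy
        time_now = max(time_now + dist[cur][nxt], open_t[nxt])
        tour.append(nxt)
    return tour, backtracks

rng = np.random.default_rng(0)
pts = rng.random((8, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
open_t, close_t = np.zeros(8), np.full(8, 10.0)     # loose windows so the toy run succeeds
print(decode_with_backtracking(dist, open_t, close_t))
```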
749. SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Long Context Alignment Large Language Models Preference Optimization
Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named **S**h**o**rt-to-**Lo**ng **P**reference **O**ptimization (**SoLoPO**), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
750. Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models
๐ฌ 20
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Test-Time Compute Reasoning Effectiveness Efficiency
Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E3 metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to 70% accuracy gains, 39% token reduction, and 193.8% improvement in E3. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at anonymous.4open.science/r/P-and-B-6513/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
751. Fast and Interpretable Protein Substructure Alignment via Optimal Transport
๐ฌ 22
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein substructure alignment Residue-level representation Optimal transport Deep learning Structural bioinformatics
Proteins are essential biological macromolecules that execute life functions. Local motifs within protein structures, such as active sites, are the most critical components for linking structure to function and are key to understanding protein evolution and enabling protein engineering. Existing computational methods struggle to identify and compare these local structures, which leaves a significant gap in understanding protein structures and harnessing their functions. This study presents PLASMA, the first deep learning framework for efficient and interpretable residue-level protein substructure alignment. We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations. For a pair of input protein structures, PLASMA outputs a clear alignment matrix with an interpretable overall similarity score. Through extensive quantitative evaluations and three biological case studies, we demonstrate that PLASMA achieves accurate, lightweight, and interpretable residue-level alignment. Additionally, we introduce PLASMA-PF, a training-free variant that provides a practical alternative when training data are unavailable. Our method addresses a critical gap in protein structure analysis tools and offers new opportunities for functional annotation, evolutionary studies, and structure-based drug design. Reproducibility is ensured via our official implementation at https://anonymous.4open.science/r/plasma-5A5B/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
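The PLASMA entry casts residue-level substructure alignment as regularized optimal transport solved with differentiable Sinkhorn iterations. The sketch below runs entropic Sinkhorn on a toy cost matrix built from random residue embeddings; the embedding source, the cosine cost, epsilon, and uniform marginals are placeholders rather than the paper's learned components.

```python
# Toy entropic-OT alignment between two sets of residue embeddings via Sinkhorn
# iterations. Embeddings, cost, epsilon, and marginals are illustrative only.
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)     # uniform marginals
    K = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                  # soft alignment / transport matrix

rng = np.random.default_rng(0)
protein_a = rng.normal(size=(40, 64))                   # 40 residues x 64-dim features
protein_b = rng.normal(size=(55, 64))
cost = 1.0 - (protein_a @ protein_b.T) / (
    np.linalg.norm(protein_a, axis=1, keepdims=True) * np.linalg.norm(protein_b, axis=1))
plan = sinkhorn(cost)
similarity = float((plan * (1.0 - cost)).sum())         # one simple overall similarity score
print(plan.shape, round(similarity, 4))
```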
752. Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Differentiable simulation Reinforcement learning Policy gradient Model-based reinforcement learning Monte Carlo gradient estimation Reparameterization gradient Likelihood ratio gradient Score function gradient estimator Inverse variance weighting Randomized smoothing
In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
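The entry above describes IVW-H as a per-step inverse-variance combination that stabilizes variance without explicit discontinuity detection. The sketch below shows the textbook inverse-variance weighting arithmetic applied to samples from a 1st-order and a 0th-order gradient estimator; it illustrates the general idea only, assuming both estimators are unbiased, and is not the paper's per-step implementation.

```python
# Generic inverse-variance weighting of two unbiased gradient estimators:
# weight each estimator by the reciprocal of the per-coordinate variance of
# its sample mean. Illustrative only.
import numpy as np

def ivw_combine(grads_first_order: np.ndarray, grads_zeroth_order: np.ndarray) -> np.ndarray:
    """Each input: (n_samples, dim) per-sample gradient estimates."""
    def mean_and_var_of_mean(g):
        return g.mean(axis=0), g.var(axis=0, ddof=1) / g.shape[0]
    m1, v1 = mean_and_var_of_mean(grads_first_order)
    m0, v0 = mean_and_var_of_mean(grads_zeroth_order)
    w1, w0 = 1.0 / (v1 + 1e-12), 1.0 / (v0 + 1e-12)
    return (w1 * m1 + w0 * m0) / (w1 + w0)

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])
g1 = true_grad + 0.1 * rng.normal(size=(64, 3))          # low-variance 1st-order samples
g0 = true_grad + 2.0 * rng.normal(size=(64, 3))          # noisy 0th-order (REINFORCE-style) samples
print(ivw_combine(g1, g0))                                # close to the true gradient
```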
753. Social Agents: Collective Intelligence Improves LLM Predictions
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ wisdom of crowds LLM multiagent systems
In human society, collective decision making has often outperformed the judgment of individuals. Classic examples range from estimating livestock weights to predicting elections and financial markets, where averaging many independent guesses often yields results more accurate than experts. These successes arise because groups bring together diverse perspectives, independent voices, and distributed knowledge, combining them in ways that cancel individual biases. This principle, known as the Wisdom of Crowds, underpins practices in forecasting, marketing, and preference modeling. Large Language Models (LLMs), however, typically produce a single definitive answer. While effective in many settings, this uniformity overlooks the diversity of human judgments shaping responses to ads, videos, and webpages. Inspired by how societies benefit from diverse opinions, we ask whether LLM predictions can be improved by simulating not one answer but many. We introduce Social Agents, a multi-agent framework that instantiates a synthetic society of human-like personas with diverse demographic (e.g., age, gender) and psychographic (e.g., values, interests) attributes. Each persona independently appraises a stimulus such as an advertisement, video, or webpage, offering both a quantitative score (e.g., click-through likelihood, recall score, likability) and a qualitative rationale. Aggregating these opinions produces a distribution of preferences that more closely mirrors real human crowds. Across eleven behavioral prediction tasks, Social Agents outperforms single-LLM baselines by up to 67.45% on simple judgments (e.g. webpage likability) and 9.88% on complex interpretive reasoning (e.g. video memorability). Social Agents' individual persona predictions also align with human judgments, reaching Pearson correlations up to 0.71. These results position computational crowd simulation as a scalable, interpretable tool for improving behavioral prediction and supporting societal decision making.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
754. Directional Textual Inversion for Personalized Text-to-Image Generation
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ personalized generation text-to-image models textual inversion
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
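The DTI entry fixes the learned token embedding's norm and optimizes only its direction with Riemannian SGD on the unit hypersphere, with a vMF prior contributing a constant-direction gradient term. A minimal NumPy sketch of that update (project the Euclidean gradient onto the tangent space, step, renormalize, rescale) is below; the loss gradient, prior direction, fixed norm, and step size are placeholders.

```python
# Minimal Riemannian-SGD step on the unit sphere for a direction-only token
# embedding, with a constant-direction prior pull (vMF-style MAP term).
# All hyperparameters and the gradient source are placeholders.
import numpy as np

def dti_step(direction, euclid_grad, prior_dir, fixed_norm=28.0, lr=1e-2, kappa=0.1):
    g = euclid_grad - kappa * prior_dir                 # negative log vMF prior adds -kappa * mu
    tangent = g - np.dot(g, direction) * direction      # project onto tangent space at `direction`
    direction = direction - lr * tangent                # gradient step ...
    direction = direction / np.linalg.norm(direction)   # ... retracted back to the sphere
    return direction, fixed_norm * direction            # unit direction, actual embedding

rng = np.random.default_rng(0)
d = rng.normal(size=768); d /= np.linalg.norm(d)
prior = rng.normal(size=768); prior /= np.linalg.norm(prior)
for _ in range(100):
    grad = 0.01 * rng.normal(size=768)                  # stand-in for dL/d(embedding)
    d, emb = dti_step(d, grad, prior)
print(np.linalg.norm(d), np.linalg.norm(emb))           # 1.0 and the fixed magnitude
```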
755. Multiple Streams of Knowledge Retrieval: Enriching and Recalling in Transformers
๐ฌ 9
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability
When an LLM learns a new fact during finetuning (e.g., new movie releases, updated celebrity gossip, etc.), where does this information go? Are entities enriched with relation information, or do models recall information just-in-time before a prediction? Are "all of the above" true with LLMs implementing multiple redundant heuristics? Existing localization approaches (e.g., activation patching) are ill-suited for this analysis because they usually replace parts of the residual stream, thus overriding previous information.
To fill this gap, we propose dynamic weight grafting, a technique that selectively grafts weights from a finetuned model onto a pretrained model. Using this technique, we show two separate pathways for retrieving finetuned relation information: 1) "enriching" the residual stream with relation information while processing the tokens that correspond to an entity (e.g., "Zendaya" in "Zendaya co-starred with John David Washington") and 2) "recalling" this information at the final token position before generating a target fact. In some cases, models need information from both of these pathways to correctly generate finetuned facts while, in other cases, either the "enrichment" or "recall" pathway alone is sufficient. We localize the "recall" pathway to model components, finding that "recall" occurs via both task-specific attention mechanisms and an entity-specific extraction step in the feedforward networks of the final layers before the target prediction. By targeting model components and parameters, as opposed to just activations, we are able to understand the mechanisms by which finetuned knowledge is retrieved during generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
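The entry above grafts weights from a finetuned model onto a pretrained model, optionally only for certain components, to trace where new facts are stored. The sketch below shows a simplified, module-level version with plain state dicts: take finetuned parameters for components whose names match chosen patterns and keep pretrained weights elsewhere. Position-dependent (per-token) grafting during the forward pass, which the paper's "dynamic" variant implies, is not shown.

```python
# Simplified (module-level) weight grafting: build a hybrid state dict that
# takes selected components from a finetuned model and everything else from
# the pretrained model. The pattern names are placeholders.
import torch

def graft_state_dict(pretrained_sd: dict, finetuned_sd: dict, graft_patterns: list) -> dict:
    grafted = {}
    for name, tensor in pretrained_sd.items():
        use_finetuned = any(pat in name for pat in graft_patterns)
        grafted[name] = finetuned_sd[name].clone() if use_finetuned else tensor.clone()
    return grafted

# tiny toy check with fake state dicts (in practice: model.state_dict() of two LLMs)
sd_base = {"layers.0.mlp.weight": torch.zeros(2, 2), "layers.1.mlp.weight": torch.zeros(2, 2)}
sd_ft   = {"layers.0.mlp.weight": torch.ones(2, 2),  "layers.1.mlp.weight": torch.ones(2, 2)}
hybrid = graft_state_dict(sd_base, sd_ft, graft_patterns=["layers.1"])
print(hybrid["layers.0.mlp.weight"].sum().item(),  # 0.0: kept from the base model
      hybrid["layers.1.mlp.weight"].sum().item())  # 4.0: grafted from the finetuned model
```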
756. e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
๐ฌ 23
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM reasoning test-time compute RL exploration
Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test-time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically-designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
757. Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
๐ฌ 20
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning for LLMs LLM Reasoning Efficient Reasoning Policy Optimization
Large language models trained with reinforcement learning on verifiable rewards often inflate response length, trading brevity for accuracy. While longer reasoning can help on hard problems, many extra tokens are filler: verbose text making little progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem and only training on responses filtered by (1) length and (2) token efficiency (reward per token). By sampling more during training time, GFPO teaches models to think less at inference time. On Phi-4-reasoning, GFPO cuts GRPO's length inflation by up to 85\% across STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while preserving accuracy. We find that GFPO also outperforms Dr. GRPO in both accuracy and length reduction and generalizes across model sizes and families. We further propose Adaptive Difficulty GFPO, which allocates more training exploration to harder problems, yielding better efficiency-accuracy trade-offs on challenging questions. With only a 7\% increase in training time, GFPO reduces end-to-end latency by $\sim$30\%, cutting response time on hard queries by 90 seconds. GFPO trades modest training-time increases for lasting gains in inference: an effective recipe for efficient reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
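The GFPO entry samples a larger group of responses per problem and trains only on those that score well on length and on reward per token. The sketch below shows that filtering step in isolation; the top-k fractions, the response data structure, and the "keep the intersection" rule are illustrative choices, and the downstream policy-gradient update is not shown.

```python
# Illustrative group filtering: from a sampled group of responses, retain only
# those that are both short and token-efficient (reward per token).
from dataclasses import dataclass

@dataclass(frozen=True)
class Response:
    text: str
    num_tokens: int
    reward: float            # e.g. 1.0 if the final answer is verified correct

def gfpo_filter(group, keep_frac: float = 0.5):
    k = max(1, int(len(group) * keep_frac))
    shortest = set(sorted(group, key=lambda r: r.num_tokens)[:k])
    efficient = set(sorted(group, key=lambda r: r.reward / max(r.num_tokens, 1),
                           reverse=True)[:k])
    kept = [r for r in group if r in shortest and r in efficient]
    return kept or sorted(group, key=lambda r: r.num_tokens)[:1]   # never return an empty set

group = [Response(f"resp{i}", t, rw) for i, (t, rw) in
         enumerate([(1200, 1.0), (400, 1.0), (300, 0.0), (650, 1.0), (2500, 1.0), (500, 0.0)])]
for r in gfpo_filter(group):
    print(r.text, r.num_tokens, r.reward)   # keeps the short, correct response
```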
758. FLOWER: A Flow-Matching Solver for Inverse Problems
๐ฌ 10
โญ 6.00
๐ generative models
๐ท๏ธ Inverse Problems Image Reconstruction Generative Modeling Flow Matching Ancestral Sampling
We introduce Flower, a solver for linear inverse problems. It leverages a pre-trained flow model to produce reconstructions that are consistent with the observed measurements. Flower operates through an iterative procedure over three steps: (i) a flow-consistent destination estimation, where the velocity network predicts a denoised target; (ii) a refinement step that projects the estimated destination onto a feasible set defined by the forward operator; and (iii) a time-progression step that re-projects the refined destination along the flow trajectory. We provide a theoretical analysis that demonstrates how Flower approximates Bayesian posterior sampling, thereby unifying perspectives from plug-and-play methods and generative inverse solvers. On the practical side, Flower achieves state-of-the-art reconstruction quality while using nearly identical hyperparameters across various linear inverse problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
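The FLOWER entry iterates three steps: estimate the flow destination with the velocity network, refine it by projecting onto the measurement-consistent set, and re-project along the flow trajectory to advance time. The sketch below spells out one plausible instantiation for a linear operator under the rectified-flow convention x_t = (1 - t) x0 + t x1; the velocity field, the least-squares projection, and the step schedule are assumptions, and the paper's exact refinement may differ.

```python
# One plausible instantiation of a flow-based inverse-problem loop (a sketch,
# not the authors' exact algorithm): (i) destination estimate from the velocity
# field, (ii) projection onto {x : A x = y}, (iii) re-projection onto the
# trajectory at the next time. Uses a dummy, idealized velocity field.
import numpy as np

def data_consistency_projection(x1_hat, A, y):
    # minimum-norm correction so that A @ (x1_hat + c) = y
    correction, *_ = np.linalg.lstsq(A, y - A @ x1_hat, rcond=None)
    return x1_hat + correction

def flower_like_solver(velocity_fn, A, y, dim, n_steps=50, rng=None):
    rng = rng or np.random.default_rng(0)
    x = rng.normal(size=dim)                            # sample at t = 0 (pure noise)
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t)
        x0_hat = x - t * v                              # flow-consistent source estimate
        x1_hat = x + (1.0 - t) * v                      # (i) destination estimate
        x1_ref = data_consistency_projection(x1_hat, A, y)   # (ii) refinement
        x = (1.0 - t_next) * x0_hat + t_next * x1_ref   # (iii) time progression
    return x

# toy check with an idealized velocity field pointing at a known target signal
target = np.linspace(-1, 1, 32)
A = np.random.default_rng(1).normal(size=(12, 32))
y = A @ target
velocity_fn = lambda x, t: (target - x) / max(1.0 - t, 1e-3)   # ideal v = x1 - x0 on the line
x_rec = flower_like_solver(velocity_fn, A, y, dim=32)
print(float(np.linalg.norm(A @ x_rec - y)))             # small measurement residual
```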
759. Beyond Pairwise: Empowering LLM Alignment With (Ranked) Choice Modeling
๐ฌ 12
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Language Models Fine-tuning Discrete Choice Model Ranked Choice Model Alignment Preference Optimization Learning From Human Feedback
Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiwise comparisons and top-$k$ rankings. We introduce $\textit{Ranked Choice Preference Optimization}$ (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based models, subsumes several pairwise methods (such as DPO and SimPO) as special cases, and provides principled training objectives for richer feedback formats. We instantiate this framework with two representative models (Multinomial Logit and Mallows-RMJ). Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across in-distribution and out-of-distribution settings show that RCPO consistently outperforms competitive baselines. RCPO shows that directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. It offers an extensible foundation for incorporating (ranked) choice modeling into LLM training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
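The RCPO entry fits ranked preference data by maximum likelihood under choice models such as the Multinomial Logit; the standard ranking extension of MNL is the Plackett-Luce model. The sketch below computes a Plackett-Luce negative log-likelihood for a top-k ranking given per-response utilities; how those utilities are parameterized (e.g., scaled policy log-probabilities, as in DPO-style objectives) is a modeling choice not specified here.

```python
# Plackett-Luce negative log-likelihood of a (top-k) ranking over responses,
# given per-response utilities. The utilities would normally come from the
# policy being trained; here they are a plain tensor so the loss is runnable
# and differentiable w.r.t. them.
import torch

def plackett_luce_nll(utilities: torch.Tensor, ranking: list) -> torch.Tensor:
    """utilities: (n_responses,) real-valued scores.
    ranking: indices of the top-k responses, best first."""
    remaining = list(range(utilities.shape[0]))
    nll = utilities.new_zeros(())
    for chosen in ranking:
        logits = utilities[remaining]
        pos = remaining.index(chosen)
        nll = nll - torch.log_softmax(logits, dim=0)[pos]   # -log P(chosen | remaining)
        remaining.remove(chosen)
    return nll

u = torch.tensor([2.0, 0.5, -1.0, 0.0], requires_grad=True)
loss = plackett_luce_nll(u, ranking=[0, 3])                 # top-2 ranking: response 0, then 3
loss.backward()
print(float(loss), u.grad)
```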
760. Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ training re-evaluation curve data curriculum / data placement large language model (LLM) pre-training AdamW EMA timescale learning-rate schedules tokens-per-parameter ratio
Data curriculums have become central to successful LLM training, yet principles governing optimal data placement remain unclear. We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*. The TREC characterizes how well a trained model retains training data as a function of *when* the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is initially observable only after training, we demonstrate it can be *predicted in advance* from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima in order to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
761. VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
๐ฌ 20
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multi-modal Large Language Models Benchmark Visual Reasoning
Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These various types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30\% accuracyโonly slightly above the 25\% random baseline and far below the 51.4\% achieved by humansโrevealing significant gaps in visual reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
762. GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language models model pruning
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single-model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning; for example, for the Llama2-13B model families, our compressed models maintain approximately 97.3\% of the original performance while removing ~25\% of parameters, significantly outperforming previous state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
763. Sublinear Spectral Clustering Oracle with Little Memory
๐ฌ 14
โญ 6.00
๐ learning theory
๐ท๏ธ Graph Clustering Spectral Clustering Memory-Efficient Algorithms Sublinear Algorithms Space-Time Trade-offs
We study the problem of designing *sublinear spectral clustering oracles* for well-clusterable graphs. Such an oracle is an algorithm that, given query access to the adjacency list of a graph $G$, first constructs a compact data structure $\mathcal{D}$ that captures the clustering structure of $G$. Once built, $\mathcal{D}$ enables sublinear time responses to \textsc{WhichCluster}$(G,x)$ queries for any vertex $x$. A major limitation of existing oracles is that constructing $\mathcal{D}$ requires $\Omega(\sqrt{n})$ memory, which becomes a bottleneck for massive graphs and memory-limited settings. In this paper, we break this barrier and establish a memory-time trade-off for sublinear spectral clustering oracles. Specifically, for well-clusterable graphs, we present oracles that construct $\mathcal{D}$ using far less than $O(\sqrt{n})$ memory (e.g., $O(n^{0.01})$) while still answering membership queries in sublinear time. We also characterize the trade-off frontier between memory usage $S$ and query time $T$, showing, for example, that $S\cdot T=\widetilde{O}(n)$ for clusterable graphs with a logarithmic conductance gap, and we show that this trade-off is nearly optimal (up to logarithmic factors) for a natural class of approaches. Finally, to complement our theory, we validate the performance of our oracles through experiments on synthetic networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
764. Cost-of-Pass: An Economic Framework for Evaluating Language Models
๐ฌ 16
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ economic evaluation framework language-model evaluation cost-performance trade-off inference time techniques
The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. Building on production theory, we develop an economically grounded framework for evaluating language models by combining accuracy and inference cost. We formalize cost-of-pass, the expected monetary cost of generating a correct solution. We then define the frontier cost-of-pass as the minimum cost-of-pass achievable across available models or a human expert, using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers, estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost-reductions from common inference-time techniques (majority voting and self-refinement), and a budget-aware technique (TALE-EP). We find that performance-oriented methods with marginal performance gains rarely justify the costs, while TALE-EP shows some promise. Overall, our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
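The entry above defines cost-of-pass as the expected monetary cost of obtaining a correct solution and the frontier as the minimum over available models and a human-expert baseline. Under the natural reading that independent attempts are repeated until one is correct, the expected cost is the per-attempt inference cost divided by the success rate; the sketch below implements that arithmetic with purely illustrative numbers.

```python
# Cost-of-pass arithmetic: expected cost of a correct solution under repeated
# independent attempts is (cost per attempt) / (success rate). The frontier is
# the cheapest option among models and a human-expert baseline. Numbers are
# illustrative placeholders, not measured values.
def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    if success_rate <= 0.0:
        return float("inf")                              # this model can never pass
    return cost_per_attempt / success_rate

def frontier_cost_of_pass(models: dict, human_expert_cost: float) -> float:
    candidates = [cost_of_pass(c, r) for c, r in models.values()] + [human_expert_cost]
    return min(candidates)

models = {
    "lightweight": (0.002, 0.55),   # ($ per attempt, accuracy) -- illustrative
    "large":       (0.040, 0.80),
    "reasoning":   (0.150, 0.92),
}
for name, (c, r) in models.items():
    print(name, round(cost_of_pass(c, r), 4))
print("frontier:", round(frontier_cost_of_pass(models, human_expert_cost=5.0), 4))
```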
765. Lifelong Embodied Navigation Learning
๐ฌ 17
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied Navigation Lifelong Learning Robotics Learning
Embodied navigation agents powered by large language models have shown strong performance on individual tasks but struggle to continually acquire new navigation skills and suffer from catastrophic forgetting. We formalize this challenge as lifelong embodied navigation learning (LENL), where an agent is required to adapt to a sequence of navigation tasks spanning multiple scenes and diverse user instruction styles, while retaining previously learned knowledge. To tackle this problem, we propose Uni-Walker, a lifelong embodied navigation framework that decouples navigation knowledge into task-shared and task-specific components with Decoder Extension LoRA (DE-LoRA). To learn the shared knowledge, we design a knowledge inheritance strategy and an expert co-activation strategy to facilitate shared knowledge transfer and refinement across multiple navigation tasks. To learn the specific knowledge, we propose an expert subspace orthogonality constraint together with a navigation-specific chain-of-thought reasoning mechanism to capture specific knowledge and enhance instruction-style understanding. Extensive experiments demonstrate the superiority of Uni-Walker for building universal embodied navigation agents with lifelong learning. We also provide the code of this work in the Supplementary Materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
766. How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM pretraining Curriculum Learning Model Weight Average
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
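The entry above mitigates the curriculum/LR-decay incompatibility partly by replacing LR decay with model averaging, i.e., a weighted average of the final few checkpoints. The sketch below averages PyTorch state dicts with given weights; the checkpoint list, weights, and handling of non-float buffers are placeholders rather than the paper's exact recipe.

```python
# Weighted average of the final few checkpoints (a simple form of model
# averaging). Checkpoint files and weights in the usage comment are placeholders.
import torch

def average_checkpoints(state_dicts: list, weights: list = None) -> dict:
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    assert abs(sum(weights) - 1.0) < 1e-6
    avg = {}
    for key in state_dicts[0]:
        if state_dicts[0][key].is_floating_point():
            avg[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        else:
            avg[key] = state_dicts[-1][key].clone()      # e.g. integer buffers: take the last
    return avg

# usage sketch (paths and weights are hypothetical):
# ckpts = [torch.load(f"step_{s}.pt", map_location="cpu") for s in (27000, 28500, 30000)]
# model.load_state_dict(average_checkpoints(ckpts, weights=[0.2, 0.3, 0.5]))

# tiny runnable check
sd_a = {"w": torch.zeros(3), "steps": torch.tensor(1)}
sd_b = {"w": torch.ones(3),  "steps": torch.tensor(2)}
print(average_checkpoints([sd_a, sd_b]))                 # w averaged to 0.5, steps taken from last
```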
767. When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
๐ฌ 22
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Evolving Knowledge Injection; Large multimodal model; Benchmark and Dataset
Large Multimodal Models (LMMs) store vast amounts of pretrained knowledge but struggle to remain aligned with real-world updates, making it difficult to avoid capability degradation when acquiring evolving knowledge. Furthermore, most current work focuses on exploring static textual knowledge injection, neglecting dynamic multimodal evolving knowledge injection, leaving the potential of LMMs for multimodal knowledge injection as an open question. To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection. MMEVOKE contains 9,422 samples spanning 159 subtypes. Then, based on extensive experiments with MMEVOKE, we reveal challenges such as poor injection performance and capability degradation in existing knowledge injection methods through knowledge injection tests and general capability tests. Finally, to tackle these challenges, we introduce knowledge augmentation and knowledge retention methods, finding that knowledge-aware augmentation strengthens knowledge injection performance, and that Data Replay and MoE methods effectively mitigate capability degradation.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
768. PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery
💬 25
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ Model Pruning Large Language Model Data Selection Efficient Recovery
Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may also introduce negative effects to model capacity recovery. To address these challenges, we propose the **P**ost-training d**A**ta **S**election method for **E**fficient pruned large language model **R**ecovery (**PASER**). PASER aims to identify instructions to recover the most compromised model capacities with a certain data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. Then, the data budget is adaptively allocated across clusters by the degree of corresponding model capability degradation. In each cluster, we prioritize data samples that lead to the most decline of model performance. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4\%-20\% of the original post-training data. We provide the anonymous code repository in [Link](https://anonymous.4open.science/r/PASER-E606).
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
769. Deep Latent Variable Model based Vertical Federated Learning with Flexible Alignment and Labeling Scenarios
💬 21
⭐ 6.00
📍 probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
🏷️ Vertical Federated Learning Deep Latent Variable Model missing mechanism MCAR MAR MNAR
Federated learning (FL) has attracted significant attention for enabling collaborative learning without exposing private data.
Among the primary variants of FL, vertical federated learning (VFL) addresses feature-partitioned data held by multiple institutions, each holding complementary information for the same set of users.
However, existing VFL methods often impose restrictive assumptions such as a small number of participating parties, fully aligned data, or only using labeled data.
In this work, we reinterpret alignment gaps in VFL as missing data problems and propose a unified framework that accommodates both training and inference under arbitrary alignment and labeling scenarios, while supporting diverse missingness mechanisms.
In the experiments on 168 configurations spanning four benchmark datasets, six training-time missingness patterns, and seven testing-time missingness patterns, our method outperforms all baselines in 160 cases with an average gap of 9.6 percentage points over the next-best competitors.
To the best of our knowledge, this is the first VFL framework to jointly handle arbitrary data alignment, unlabeled data, and multi-party collaboration all at once.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
770. InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement
💬 11
⭐ 6.00
📍 generative models
🏷️ Interaction Generation Consistency Model Human Motion
Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions.
To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Code and datasets will be released upon acceptance.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
771. TrimR: Verifier-based Training-Free Thinking Trimming for Efficient Test-Time Scaling
💬 5
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ LLM; Reasoning; Thinking compression; Test-time scaling; Overthinking; Underthinking
Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key source of inefficiency is that LRMs often generate redundant thinking CoTs, which exhibit clear, structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and an asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on four benchmarks (MATH500, AIME24/25, and GPQA), the reasoning runtime of QwQ-32B, DeepSeek-R1-Distill-Qwen-32B, and Pangu-R-38B is reduced by up to 70% with negligible impact on accuracy.
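One way to read the trimming idea: segment the model's thinking into intermediate thoughts, score each growing prefix with the lightweight verifier, and truncate once consecutive thoughts stop adding value. A rough sketch with a stub verifier; the segmentation, threshold, and stopping rule below are assumptions, not the paper's exact algorithm:

```python
def trim_thinking(thoughts, verify, patience=2, eps=0.02):
    """Truncate a list of intermediate thoughts once the verifier score plateaus.

    `verify(prefix)` stands in for a lightweight pretrained verifier that scores
    how promising the reasoning prefix is; patience/eps are illustrative.
    """
    kept, best, stale = [], float("-inf"), 0
    for thought in thoughts:
        kept.append(thought)
        score = verify(" ".join(kept))
        if score > best + eps:
            best, stale = score, 0
        else:
            stale += 1
        if stale >= patience:   # several redundant thoughts in a row -> stop early
            break
    return kept
```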
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
772. Identifiability and recoverability in self-supervised models
💬 18
⭐ 6.00
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ identifiability self-supervised learning disentanglement
Self-supervised models exhibit a surprising stability in their internal representations. Whereas most prior work treats this stability as a single property, we formalize it as two distinct concepts: **statistical identifiability** (consistency of representations across runs) and **structural identifiability** (alignment of representations with some unobserved ground truth). Recognizing that perfect pointwise identifiability is generally unrealistic for modern representation learning models, we propose new model-agnostic definitions of statistical and structural near-identifiability of representations up to some error tolerance $\epsilon$.
Leveraging these definitions, we prove a statistical $\epsilon$-**near-identifiability** result for the representations of models with nonlinear decoders, generalizing existing identifiability theory beyond last-layer representations in e.g. generative pre-trained transformers (GPTs) to near-identifiability of the intermediate representations of a broad class of models including (masked) autoencoders (MAEs) and supervised learners.
Although these weaker assumptions confer weaker identifiability, we show that independent components analysis (ICA) can resolve much of the remaining linear ambiguity for this class of models, and validate and measure our near-identifiability claims empirically. With additional assumptions on the data-generating process, statistical identifiability extends to structural identifiability, yielding a simple and practical recipe for disentanglement: ICA post-processing of latent representations.
On synthetic benchmarks, this approach achieves state-of-the-art disentanglement using a vanilla autoencoder. With a foundation model-scale MAE for cell microscopy, it disentangles biological variation from technical batch effects, substantially improving downstream generalization.
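The disentanglement recipe at the end, ICA post-processing of latent representations, can be sketched with off-the-shelf tools; the encoder, latent dimensionality, and random data below are placeholders:

```python
import numpy as np
from sklearn.decomposition import FastICA

# latents: (n_samples, d) matrix of representations from any pretrained encoder,
# e.g. a vanilla autoencoder or an MAE; random data is used here as a stand-in.
latents = np.random.randn(10_000, 32)

# ICA resolves the remaining linear ambiguity up to permutation and scaling.
ica = FastICA(n_components=32, whiten="unit-variance", random_state=0)
disentangled = ica.fit_transform(latents)  # (n_samples, 32) candidate factors
```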
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
773. Generating Directed Graphs with Dual Attention and Asymmetric Encoding
💬 15
⭐ 6.00
📍 generative models
🏷️ graph generation directed graphs flow matching discrete diffusion
Directed graphs naturally model systems with asymmetric, ordered relationships, essential to applications in biology, transportation, social networks, or visual understanding. Generating such graphs enables simulation, data augmentation and novel instance discovery; however, this task remains underexplored. We identify two key reasons: first, modeling edge directionality introduces a substantially larger dependency space, making the underlying distribution harder to learn; second, the absence of standardized benchmarks hinders rigorous evaluation. Addressing the former limitation requires more expressive models that are sensitive to directional topologies. Thus, we propose Directo, the first generative model for directed graphs built upon the discrete flow matching framework. Our approach combines: (i) a dual-attention mechanism distinctly capturing incoming and outgoing dependencies, (ii) a robust, discrete generative framework, and (iii) principled positional encodings tailored to asymmetric pairwise relations. To address the second limitation and support evaluation, we introduce a novel and extensive benchmark suite covering synthetic and real-world datasets. Experiments show that our method outperforms existing directed graph generation approaches across diverse settings and competes with specialized models for particular classes, such as directed acyclic graphs. These results highlight the effectiveness and generality of our approach, establishing a solid foundation for future research in directed graph generation.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
774. Readout Representation: Redefining Neural Codes by Input Recovery
💬 28
⭐ 6.00
📍 applications to neuroscience & cognitive science
🏷️ neural representation readout representation representation size misrepresentation neural variability information recovery feature inversion hierarchical models robust representations artificial neural networks biological neural systems
Sensory representation is typically understood through a hierarchical-causal framework where progressively abstract features are extracted sequentially. However, this causal view fails to explain misrepresentation, a phenomenon better handled by an informational view based on decodable content. This creates a tension: how does a system that abstracts away details preserve the fine-grained information needed for downstream functions? We propose readout representation to resolve this, defining representation by the information recoverable from features, rather than their causal origin.
Empirically, we show that inputs can be accurately reconstructed even from heavily perturbed mid-level features, demonstrating that a single input corresponds to a broad, redundant region of feature space, challenging the causal mapping perspective.
To quantify this property, we introduce representation size, a metric linked to model robustness and representational redundancy. Our framework offers a new lens for analyzing how both biological and artificial neural systems learn complex features while maintaining robust, information-rich representations of the world.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
775. USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning Capabilities of LLMs as Urban Agents
💬 22
⭐ 6.00
📍 datasets and benchmarks
🏷️ large language model spatiotemporal reasoning urban science
Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite these benefits, existing studies primarily focus on evaluating urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection. Specifically, USTBench supports five diverse urban decision-making and four spatiotemporal prediction tasks, all running within our constructed interactive city environment UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-based evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of fourteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle with long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation to build more adaptive and effective LLM-based urban agents and broad smart city applications. Our project is available at https://anonymous.4open.science/r/USTBench.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
776. OD$^3$: Optimization-free Dataset Distillation for Object Detection
💬 11
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ dataset distillation object detection data-centric framework efficient machine learning
Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD$^3$, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We perform our data synthesis framework on MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the only prior dataset distillation method for detection and to conventional core set selection methods, OD$^3$ delivers superior accuracy and establishes new state-of-the-art results, surpassing the prior best method by more than 14% on COCO mAP$_{50}$ at a compression ratio of 1.0%. The code is in the supplementary material.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
777. DP-Fusion: Token-Level Differentially Private Inference for Large Language Models
💬 25
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ Privacy Large Language Models Document Privatization
Large language models (LLMs) do not preserve privacy at inference-time. The LLM's outputs can inadvertently reveal information about the model's context, which presents a privacy challenge when the LLM is augmented via tools or databases containing sensitive information. Existing privacy-preserving methods at inference-time have significant limitations since they (i) lack provable guarantees or (ii) have a poor utility/privacy trade-off. We propose DP-Fusion, a Differentially Private Inference (DPI) mechanism for LLMs that provably bounds the influence a set of tokens in the context can have on the LLM's output. DP-Fusion works as follows: (1) label a subset of sensitive tokens, (2) infer the LLM without any sensitive tokens to obtain a baseline, (3) infer the LLM with the sensitive tokens, and (4) blend distributions so that the final output remains within a bounded distance of the baseline distribution. While this per-token influence bound also mitigates jailbreak-style prompt injection, we focus on document privatization, where the goal is to paraphrase a document containing sensitive tokens, e.g., personally identifiable information, so that no attacker can reliably infer them from the paraphrased document while preserving high text quality. The privacy/utility trade-off is controlled by $\epsilon$, where $\epsilon=0$ hides sensitive tokens entirely, while higher values trade off privacy for improved text quality. We show that our method creates token-level provably privatized documents with substantially improved theoretical and empirical privacy, achieving $6\times$ lower perplexity than related DPI methods.
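The blending step (4) can be pictured as mixing the next-token distribution computed with the sensitive tokens into the baseline distribution only as far as a divergence bound allows. A toy numpy sketch of this idea; the binary search and the max-log-ratio bound are simplifying assumptions, not the paper's exact mechanism:

```python
import numpy as np

def blend_within_epsilon(p_base, p_priv, eps, iters=30):
    """Largest mixture of the private distribution whose pointwise log-ratio
    to the baseline stays within eps (a simplified stand-in for the DPI bound)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        lam = (lo + hi) / 2
        mix = lam * p_priv + (1 - lam) * p_base
        if np.max(np.abs(np.log(mix) - np.log(p_base))) <= eps:
            lo = lam          # bound holds, try mixing in more private signal
        else:
            hi = lam
    return lo * p_priv + (1 - lo) * p_base

p_base = np.array([0.5, 0.3, 0.2])   # next-token probs without sensitive tokens
p_priv = np.array([0.1, 0.1, 0.8])   # next-token probs with sensitive tokens
print(blend_within_epsilon(p_base, p_priv, eps=0.5))
```

With eps=0 the output collapses to the baseline (sensitive tokens fully hidden); larger eps trades privacy for text quality, mirroring the role of $\epsilon$ described above.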
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
778. SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism
💬 27
⭐ 6.00
📍 generative models
🏷️ Speculative Decoding LLM inference
Recently, speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance, and validating them in parallel with the large target model. However, existing SD methods remain fundamentally constrained by their serialized execution, which causes mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from branch prediction in modern processors and propose a novel framework \textbf{SpecBranch} to unlock branch parallelism in SD. Specifically, we first conduct an in-depth analysis of the potential of branch parallelism in SD, and recognize that the key challenge lies in the trade-offs between parallelization and token rollback. Based on the analysis, we strategically introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to enhance parallelism, we jointly orchestrate adaptive draft lengths with a hybrid combination of the implicit draft model confidence and explicit reusing of target model features. Extensive experiments across various models and benchmarks show that SpecBranch achieves over \textbf{1.8}$\times \sim$ \textbf{4.5}$\times$ speedups over auto-regressive decoding and reduces rollback tokens by $\textbf{50}$\% for poorly aligned models, while maintaining an identical sampling distribution.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
779. HoloPart: Generative 3D Part Amodal Segmentation
💬 10
⭐ 6.00
📍 generative models
🏷️ 3D Generation 3D Segmentation 3D Part
3D part amodal segmentation--decomposing a 3D shape into complete, semantically meaningful parts, even when occluded--is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surface patches, limiting their utility. Inspired by 2D amodal segmentation, we introduce this novel task to the 3D domain and propose a practical, two-stage approach, addressing the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data. First, we leverage existing 3D part segmentation to obtain initial, incomplete part segments. Second, we introduce HoloPart, a novel diffusion-based model, to complete these segments into full 3D parts. HoloPart utilizes a specialized architecture with local attention to capture fine-grained part geometry and global shape context attention to ensure overall shape consistency. We introduce new benchmarks based on the ABO and PartObjaverse-Tiny datasets and demonstrate that HoloPart significantly outperforms state-of-the-art shape completion methods. By incorporating HoloPart with existing segmentation techniques, we achieve promising results on 3D part amodal segmentation, opening new avenues for applications in geometry editing, animation, and material assignment.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
780. From Curiosity to Caution: Mitigating Reward Hacking for Best-of-$N$ with Pessimism
💬 15
⭐ 6.00
📍 alignment, fairness, safety, privacy, and societal considerations
🏷️ Reward Hacking Reward Models Pessimism Inference-time Scaling Large Language Models
Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is *Best-of-$N$* (BoN) sampling, where $N$ candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to *reward hacking*, where performance degrades as $N$ increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking---via stronger reward models or heavy-handed distributional regularization---either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of *pessimism* in reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed as *caution*, can be seen as the *reverse* of *curiosity*: where curiosity (e.g., via Random Network Distillation, RND) rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
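A compact way to see caution: train an error model on typical responses and subtract its prediction error from the reward before Best-of-$N$ selection. A schematic sketch under those assumptions; the penalty weight and the error-model interface are illustrative, not the paper's exact formulation:

```python
import numpy as np

def cautious_best_of_n(responses, reward_model, error_model, penalty=1.0):
    """Select the response maximizing reward minus an uncertainty penalty.

    `reward_model(r)` returns a scalar reward; `error_model(r)` returns the
    prediction error of a network trained on typical (in-distribution) responses,
    so large errors flag atypical responses whose reward estimates are unreliable.
    """
    scores = [reward_model(r) - penalty * error_model(r) for r in responses]
    return responses[int(np.argmax(scores))]
```

This is the reverse of an RND-style curiosity bonus: the same prediction-error signal is used as a penalty rather than a reward.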
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
781. Subquadratic Algorithms and Hardness for Attention with Any Temperature
💬 12
⭐ 6.00
📍 learning theory
🏷️ Attention Complexity
Despite the popularity of the Transformer architecture, the standard algorithm for computing Attention suffers from quadratic time complexity in context length $n$. Alman and Song showed that when the head dimension $d = \Theta(\log n)$, subquadratic Attention is possible if and only if the inputs have small entries bounded by $B = o(\sqrt{\log n})$ in absolute values, under the Strong Exponential Time Hypothesis ($\mathsf{SETH}$). Equivalently, subquadratic Attention is possible if and only if the softmax is applied with high temperature for $d=\Theta(\log n)$. Running times of these algorithms depend exponentially on $B$ and thus they do not lead to even a polynomial-time algorithm outside the specific range of $B$.
This naturally leads to the question: when can Attention be computed efficiently without strong assumptions on temperature? Are there fast attention algorithms that scale polylogarithmically with entry size $B$? In this work, we resolve this question and characterize when fast Attention for arbitrary temperatures is possible. First, for all constant $d = O(1)$, we give the first subquadratic $\tilde{O}(n^{2 - 1/d} \cdot \mathrm{polylog}(B))$ time algorithm for Attention with large $B$. Our result holds even for matrices with large head dimension if they have low rank. Combined with a reduction from Gradient Computation to Attention, we obtain a subquadratic algorithm for the full LLM training process. Furthermore, we show that any substantial improvement on our algorithm is unlikely. In particular, we show that even when $d = 2^{\Theta(\log^* n)}$, Attention requires $n^{2 - o(1)}$ time under $\mathsf{SETH}$.
Finally, in the regime where $d = \mathrm{poly}(n)$, the standard algorithm requires $O(n^{2} d)$ time while previous lower bounds only ruled out algorithms with truly subquadratic time in $n$. We close this gap and show that the standard algorithm is optimal under popular fine-grained complexity assumptions.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
782. Choices Speak Louder than Questions
💬 8
⭐ 6.00
📍 generative models
🏷️ large language model evaluation methodologies multiple choice question
Recent findings raise concerns about whether the evaluation of Multiple-Choice Question Answering (MCQA) accurately reflects the comprehension abilities of large language models. This paper explores the concept of \textit{choice sensitivity}, which refers to the tendency for model decisions to be more influenced by the answer options than by a genuine understanding of the question. We introduce a new scoring method called **Normalized Probability Shift by the Question (NPSQ)**, designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbols, and hybrid formats, we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when modifications are made to the answer options.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
783. AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport
💬 14
⭐ 6.00
📍 generative models
🏷️ Flow-based generative model; flow matching; Semi-discrete optimal transport
Flow-based Generative Models (FGMs) effectively transform noise into a data distribution, and coupling the noise and data in the training of FGM by Optimal Transport (OT) improves the straightness of the flow paths. However, existing OT-based couplings are difficult to combine with modern models and/or to scale to large datasets due to the curse of dimensionality in the sample complexity of (batch) OT.
This paper introduces AlignFlow, a new approach using Semi-Discrete Optimal Transport (SDOT) to enhance FGM training by establishing explicit alignment between noise and data pairs. SDOT computes a transport map by partitioning the noise space into Laguerre cells, each mapped to a corresponding data point. During the training of FGM, i.i.d.-sampled noise is matched with corresponding data by the SDOT map. AlignFlow bypasses the curse of dimensionality and scales effectively to large datasets and models.
Our experiments demonstrate that AlignFlow improves a wide range of state-of-the-art FGM algorithms and can be integrated as a plug-and-play solution with negligible additional cost.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
784. Beyond Short Steps in Frank-Wolfe Algorithms
💬 10
⭐ 6.00
📍 optimization
🏷️ Frank-Wolfe algorithm optimism primal-dual algorithms
We introduce novel techniques to enhance Frank-Wolfe algorithms by leveraging function smoothness beyond traditional short steps. Our study focuses on Frank-Wolfe algorithms with step sizes that incorporate primal-dual guarantees, offering practical stopping criteria. We present a new Frank-Wolfe algorithm utilizing an optimistic framework and provide a primal-dual convergence proof. Additionally, we propose a generalized short-step strategy aimed at optimizing a computable primal-dual gap. Interestingly, this new generalized short-step strategy is also applicable to gradient descent algorithms beyond Frank-Wolfe methods. Empirical results demonstrate that our optimistic algorithm outperforms existing methods, highlighting its practical advantages.
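For reference, the classical short-step Frank-Wolfe iteration with the FW (primal-dual) gap as a stopping criterion looks as follows; the optimistic and generalized short-step variants described above build on this template. The quadratic objective over the probability simplex is a toy example, not taken from the paper:

```python
import numpy as np

def frank_wolfe_simplex(grad, L, x0, tol=1e-6, max_iter=1000):
    """Frank-Wolfe over the probability simplex with the classical short step.

    The LMO over the simplex returns a vertex (coordinate of smallest gradient);
    the FW gap <g, x - s> upper-bounds primal suboptimality and serves as a
    practical stopping criterion.
    """
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0                  # linear minimization oracle
        d = s - x
        gap = -g @ d                           # primal-dual (FW) gap
        if gap <= tol:
            break
        gamma = min(1.0, gap / (L * (d @ d)))  # short step from L-smoothness
        x = x + gamma * d
    return x

# Toy problem: min 0.5 * ||x - b||^2 over the simplex (gradient x - b, L = 1).
b = np.array([0.1, 0.7, 0.2])
x_opt = frank_wolfe_simplex(lambda x: x - b, L=1.0, x0=np.ones(3) / 3)
```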
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
785. Light Differentiable Logic Gate Networks
💬 26
⭐ 6.00
📍 neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
🏷️ reparameterization logic gate networks vanishing gradients
Differentiable logic gate networks (DLGNs) exhibit extraordinary efficiency at inference while sustaining competitive accuracy.
But vanishing gradients, discretization errors, and high training cost impede scaling these networks.
Even with dedicated parameter initialization schemes from subsequent works, increasing depth still harms accuracy.
We show that the root cause of these issues lies in the underlying parametrization of logic gate neurons themselves.
To overcome this issue, we propose a reparametrization that also shrinks the parameter size logarithmically in the number of inputs per gate.
For binary inputs, this already reduces the model size by 4x, speeds up the backward pass by up to 1.86x, and converges in 8.5x fewer training steps.
On top of that, we show that the accuracy on CIFAR-100 remains stable and sometimes superior to the original parametrization.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
786. Error Feedback for Muon and Friends
💬 14
⭐ 6.00
📍 optimization
🏷️ optimization communication efficiency compression error feedback
Recent optimizers like Muon, Scion, and Gluon have pushed the frontier of large-scale deep learning by exploiting layer-wise linear minimization oracles (LMOs) over non-Euclidean norm balls, capturing neural network structure in ways traditional algorithms cannot. Yet, no principled distributed framework exists for these methods, and communication bottlenecks remain unaddressed. The very few distributed variants are heuristic, with no convergence guarantees in sight. We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. EF21-Muon supports stochastic gradients, momentum, and bidirectional compression with error feedback, marking the first extension of error feedback beyond the Euclidean setting. It recovers Muon/Scion when compression is off and specific norms are chosen, providing the first efficient distributed implementation of this powerful family. Our theory covers non-Euclidean smooth and the more general (L0, L1)-smooth settings, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices. We further extend the analysis to layer-wise (generalized) smoothness regimes, capturing the anisotropic structure of deep networks. Experiments on NanoGPT benchmarking EF21-Muon against uncompressed Muon/Scion/Gluon demonstrate up to 7× communication savings with no accuracy degradation.
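EF21-style error feedback maintains, per worker, a gradient estimate that is updated only through compressed differences, so only compressed messages cross the network. A minimal Euclidean top-k sketch of that skeleton; EF21-Muon itself works with non-Euclidean LMO-based updates, which this toy version omits:

```python
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries (a standard contractive compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_step(x, g_estimates, local_grads, lr=0.1, k=2):
    """One EF21 round: each worker sends only a compressed correction to its estimate."""
    new_estimates = []
    for g_i, grad_i in zip(g_estimates, local_grads):
        c_i = topk(grad_i - g_i, k)       # compressed message sent to the server
        new_estimates.append(g_i + c_i)   # worker and server keep the same estimate
    g_avg = np.mean(new_estimates, axis=0)
    return x - lr * g_avg, new_estimates
```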
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
787. CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs
💬 16
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Reinforcement learning LLM reasoning Curriculum learning
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by \textbf{+3.3} points and \textbf{+4.82} points with 1.5B and 7B models, respectively, and exceeds the best prior sample-efficient methods by \textbf{+2.12} points on average across eight math reasoning benchmarks. Our CurES also improves convergence speed compared to baselines such as GRPO.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
788. Control Tax: The Price of Keeping AI in Check
💬 10
⭐ 6.00
📍 alignment, fairness, safety, privacy, and societal considerations
🏷️ AI control scalable oversight AI safety
The rapid integration of agentic AI into high-stakes real-world applications requires robust oversight mechanisms. The emerging field of AI Control (AIC) aims to provide such an oversight mechanism, but practical adoption depends heavily on implementation overhead. To study this problem better, we introduce the notion of Control tax---the operational and financial cost of integrating control measures into AI pipelines. Our work makes three key contributions to the field of AIC: (1) we introduce a theoretical framework that quantifies the Control Tax and maps classifier performance to safety assurances; (2) we conduct comprehensive evaluations of state-of-the-art language models in adversarial settings, where attacker models insert subtle backdoors into code while monitoring models attempt to detect these vulnerabilities; and (3) we provide empirical financial cost estimates for control protocols and develop optimized monitoring strategies that balance safety and cost-effectiveness while accounting for practical constraints like auditing budgets. Our framework enables practitioners to make informed decisions by systematically connecting safety guarantees with their costs, advancing AIC through principled economic feasibility assessment across different deployment contexts.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
789. A General Spatio-Temporal Backbone with Scalable Contextual Pattern Bank for Urban Continual Forecasting
💬 21
⭐ 6.00
📍 learning on time series and dynamical systems
🏷️ general backbone contextual pattern bank continual spatio-temporal forecasting
With the rapid growth of spatio-temporal data fueled by IoT deployments and urban infrastructure expansion, accurate and efficient continual forecasting has become a critical challenge. Most existing Spatio-Temporal Graph Neural Networks rely on static graph structures and offline training, rendering them inadequate for real-world streaming scenarios characterized by node expansion and distribution shifts. Although Continual Spatio-Temporal Forecasting methods have been proposed to tackle these issues, they often adopt backbones with limited modeling capacity and lack effective mechanisms to balance stability and adaptability. To overcome these limitations, we propose STBP, a novel framework that integrates a general spatio-temporal backbone with a scalable contextual pattern bank. The backbone extracts stable representations in the frequency domain and captures dynamic spatial correlations through lightweight linear graph attention. To support continual adaptation and mitigate catastrophic forgetting, the contextual pattern bank is updated incrementally via parameter expansion, enabling the capture of evolving node-level heterogeneity and relevance. During incremental training, the backbone remains fixed to preserve general knowledge, while the pattern bank adapts to new scenarios and distributions. Extensive experiments demonstrate that STBP outperforms state-of-the-art baselines in both forecasting accuracy and scalability, validating its effectiveness for continual spatio-temporal forecasting.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
790. GAP: Gradient Adjustment with Phase-guidance for Robust Vision-Proprioception Policies in Robotic Manipulation
💬 23
⭐ 6.00
📍 applications to robotics, autonomy, planning
🏷️ Robotic Manipulation Vision-Proprioception Policy
Proprioceptive information is critical for precise servo control by providing real-time robotic states. Its collaboration with vision is expected to enhance the performance of manipulation policies in complex tasks. However, recent studies have reported inconsistent observations on the generalization of vision-proprioception policies. In this work, we investigate this by conducting temporally controlled experiments. We found that during motion-transition sub-phases of a task, which require target localization, the vision modality of the vision-proprioception policy plays a limited role. Further analysis reveals that the policy naturally gravitates toward concise proprioceptive signals that offer faster loss reduction during training, thereby dominating the optimization and suppressing the learning of the visual modality during motion-transition phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm that adaptively modulates the optimization of proprioception, enabling dynamic collaboration within the vision-proprioception policy. Specifically, we leverage proprioception to capture robotic states and estimate the probability of each timestep in the trajectory belonging to motion-transition phases. During policy learning, we apply a fine-grained adjustment that reduces the magnitude of proprioception's gradient based on estimated probabilities, leading to robust and generalizable vision-proprioception policies. Comprehensive experiments demonstrate that GAP is applicable in both simulated and real-world environments, across one-arm and dual-arm setups, and compatible with both conventional and Vision-Language-Action models. We believe this work can offer valuable insights into the development of vision-proprioception policies in robotic manipulation.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
791. Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series
💬 15
⭐ 6.00
📍 applications to physical sciences (physics, chemistry, biology, etc.)
🏷️ EEG ECG Deep learning Transformer
Accurate analysis of Medical time series (MedTS) data, such as Electroencephalography (EEG) and Electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibits two critical patterns: **temporal dependencies** within individual channels and **channel dependencies** across multiple channels. While recent advances in deep learning have leveraged Transformer-based models to effectively capture temporal dependencies, they often struggle to model channel dependencies. This limitation stems from a structural mismatch: ***MedTS signals are inherently centralized, whereas the Transformer's attention is decentralized***, making it less effective at capturing global synchronization and unified waveform patterns. To bridge this gap, we propose **CoTAR** (Core Token Aggregation-Redistribution), a centralized MLP-based module tailored to replace the decentralized attention. Instead of allowing all tokens to interact directly, as in attention, CoTAR introduces a global core token that acts as a proxy to facilitate the inter-token interaction, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a **12.13%** improvement on the APAVA dataset, with merely 33% memory usage and 20% inference time compared to the previous state-of-the-art. Code and all training scripts are available in this [**Link**](https://anonymous.4open.science/r/TeCh-24).
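The core-token idea can be prototyped as: pool all channel tokens into one global core token, then let every token read from that core, which keeps interaction cost linear in the number of tokens. A hedged PyTorch sketch; the mean-pooling aggregation and layer sizes are assumptions, not the exact CoTAR design:

```python
import torch
import torch.nn as nn

class CoreAggregateRedistribute(nn.Module):
    """Aggregate tokens into a global core token, then redistribute it to every token.

    Complexity is O(n * d) in the number of tokens n, versus O(n^2 * d) for attention.
    """
    def __init__(self, dim):
        super().__init__()
        self.aggregate = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.redistribute = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):                       # tokens: (batch, n, dim)
        core = self.aggregate(tokens.mean(dim=1))    # (batch, dim) global core token
        core = core.unsqueeze(1).expand_as(tokens)   # broadcast the core to every token
        return tokens + self.redistribute(torch.cat([tokens, core], dim=-1))

x = torch.randn(8, 19, 64)                           # e.g. 19 EEG channels, 64-dim tokens
print(CoreAggregateRedistribute(64)(x).shape)        # torch.Size([8, 19, 64])
```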
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
792. MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
💬 20
⭐ 6.00
📍 datasets and benchmarks
🏷️ multimodal reasoning multimodal benchmark multi-image benchmark thinking models
Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,676 multiple-choice questions based on 19,367 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
793. Improved high-dimensional estimation with Langevin dynamics and stochastic weight averaging
💬 7
⭐ 6.00
📍 learning theory
🏷️ Learning theory high-dimensional statistics non-convex optimization
Significant recent work has studied the ability of gradient descent to recover a hidden planted direction $\theta^\star \in S^{d-1}$ in different high-dimensional settings, including tensor PCA and single-index models. The key quantity that governs the ability of gradient descent to traverse these landscapes is the information exponent $k^\star$ (Ben Arous et al., (2021)), which corresponds to the order of the saddle at initialization in the population landscape. Ben Arous et al., (2021) showed that $n \gtrsim d^{\max(1, k^\star-1)}$ samples were necessary and sufficient for online SGD to recover $\theta^\star$, and Ben Arous et al., (2020) proved a similar lower bound for Langevin dynamics. More recently, Damian et al., (2023) showed it was possible to circumvent these lower bounds by running gradient descent on a smoothed landscape, and that this algorithm succeeds with $n \gtrsim d^{\max(1, k^\star/2)}$ samples, which is optimal in the worst case. This raises the question of whether it is possible to achieve the same rate without explicit smoothing. In this paper, we show that Langevin dynamics can succeed with $n \gtrsim d^{ k^\star/2 }$ samples if one considers the average iterate, rather than the last iterate. The key idea is that the combination of noise-injection and iterate averaging is able to emulate the effect of landscape smoothing. We apply this result to both the tensor PCA and single-index model settings. Finally, we conjecture that minibatch SGD can also achieve the same rate without adding any additional noise.
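The algorithmic recipe, Langevin dynamics on the sphere with averaging of the iterates, is simple to write down. Below is a toy numpy sketch for a planted-direction landscape with information exponent $k^\star = 3$; the step size, noise scale, and projection choices are illustrative, not the paper's analyzed schedule:

```python
import numpy as np

def averaged_langevin(grad, d, steps=5000, lr=0.01, temp=0.05, seed=0):
    """Spherical Langevin dynamics; returns both the last and the averaged iterate."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    running_sum = np.zeros(d)
    for _ in range(steps):
        noise = rng.standard_normal(d)
        x = x - lr * grad(x) + np.sqrt(2 * lr * temp) * noise
        x /= np.linalg.norm(x)          # project back onto the sphere S^{d-1}
        running_sum += x
    avg = running_sum / steps
    return x, avg / np.linalg.norm(avg)

# Toy population landscape -(x . theta*)^3 with planted direction theta* (k* = 3).
d = 50
theta_star = np.eye(d)[0]
grad = lambda x: -3.0 * (x @ theta_star) ** 2 * theta_star
last_iter, avg_iter = averaged_langevin(grad, d)
```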
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
794. Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model
💬 13
⭐ 6.00
📍 generative models
🏷️ diffusion caching image generation efficient deep learning diffusion transformers inference acceleration
Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose **E**volutionary **C**aching to **A**ccelerate **D**iffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference.
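At its core, ECAD searches over binary caching schedules (which blocks to recompute at which denoising step) with a genetic algorithm. A stripped-down sketch with a stub fitness function; population size, mutation rate, and the quality/latency scoring are placeholders, not the paper's configuration:

```python
import numpy as np

def evolve_cache_schedule(fitness, n_steps, n_blocks, pop=16, gens=20, mut=0.05, seed=0):
    """Genetic search over binary schedules; schedule[t, b] == 1 means recompute block b at step t.

    `fitness(schedule)` should combine generation quality and latency measured on a
    small set of calibration prompts; it is left as a user-supplied callable here.
    """
    rng = np.random.default_rng(seed)
    population = rng.integers(0, 2, size=(pop, n_steps, n_blocks))
    for _ in range(gens):
        scores = np.array([fitness(ind) for ind in population])
        parents = population[np.argsort(scores)[-pop // 2:]]          # keep the best half
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            mask = rng.integers(0, 2, size=a.shape).astype(bool)      # uniform crossover
            child = np.where(mask, a, b)
            child ^= (rng.random(child.shape) < mut)                  # bit-flip mutation
            children.append(child)
        population = np.concatenate([parents, np.array(children)])
    return population[np.argmax([fitness(ind) for ind in population])]
```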
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
795. PRISM: Enhancing PRotein Inverse Folding through Fine-Grained Retrieval on Structure-Sequence Multimodal Representations
💬 4
⭐ 6.00
📍 applications to physical sciences (physics, chemistry, biology, etc.)
🏷️ Retrieval Augmented Generation Protein Language Modeling Protein Inverse Folding Protein Sequence Design Multimodal Representation
Designing protein sequences that fold into a target 3-D structure, termed the inverse folding problem, is central to protein engineering. However, it remains challenging due to the vast sequence space and the importance of local structural constraints. Existing deep learning approaches achieve strong recovery rates; however, they lack explicit mechanisms to reuse fine-grained structure-sequence patterns conserved across natural proteins.
To mitigate this, we present PRISM, a multimodal retrieval-augmented generation framework for inverse folding. PRISM retrieves fine-grained representations of potential motifs from known proteins and integrates them with a hybrid self-cross attention decoder. PRISM is formulated as a latent-variable probabilistic model and implemented with an efficient approximation, combining theoretical grounding with practical scalability.
Experiments across multiple benchmarks, including CATH-4.2, TS50, TS500, CAMEO 2022, and the PDB date split, demonstrate the fine-grained multimodal retrieval efficacy of PRISM in yielding SoTA perplexity and amino acid recovery, while also improving the foldability metrics (RMSD, TM-score, pLDDT).
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
796. HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming
💬 15
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Video Streaming Highlight Detection Large Language Model Time Series Forecasting
Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address 3 non-trivial challenges: (1) To extend LLMs' limited modality and circumvent token limits, we propose a perception module to assess frames in a local context window, autoregressively building a coherent understanding of the video.
(2) For VOD with rating inconsistency across local windows, we propose a ranking module to perform global re-ranking with a novel LLM-guided merge-sort algorithm.
(3) For live streaming, which requires low-latency, online inference without future knowledge, we propose a prediction module that forecasts future weights with a multi-modal time series model comprising content-aware attention and an adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show that HiVid improves weight prediction accuracy by up to 11.5\% for VOD and 26\% for live streaming over SOTA baselines. A real-world user study validates that HiVid boosts streaming QoE correlation by 14.7\%.
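The LLM-guided global re-ranking in module (2) above is essentially a merge sort whose comparator is answered by an LLM. A minimal sketch with a stub comparator; `llm_prefers` is a hypothetical callable, and prompting and batching details are omitted:

```python
def llm_merge_sort(chunks, llm_prefers):
    """Merge sort where `llm_prefers(a, b)` returns True if chunk `a` is judged more
    important than chunk `b`; in practice this comparison is answered by an LLM prompt."""
    if len(chunks) <= 1:
        return list(chunks)
    mid = len(chunks) // 2
    left = llm_merge_sort(chunks[:mid], llm_prefers)
    right = llm_merge_sort(chunks[mid:], llm_prefers)
    merged = []
    while left and right:
        if llm_prefers(left[0], right[0]):
            merged.append(left.pop(0))
        else:
            merged.append(right.pop(0))
    return merged + left + right
```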
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
797. Skill Learning via Policy Diversity Yields Identifiable Representations for Reinforcement Learning
💬 15
⭐ 6.00
📍 reinforcement learning
🏷️ reinforcement learning representation learning identifiability ICA exploration unsupervised skill discovery
Self-supervised feature learning and pretraining methods in reinforcement learning (RL) often rely on information-theoretic principles, termed mutual information skill learning (MISL). These methods aim to learn a representation of the environment while also incentivizing exploration thereof. However, the role of the representation and mutual information parametrization in MISL is not yet well understood theoretically. Our work investigates MISL through the lens of identifiable representation learning by focusing on the Contrastive Successor Features (CSF) method. We prove that CSF can provably recover the environment's ground-truth features up to a linear transformation due to the inner product parametrization of the features and skill diversity in a discriminative sense. This first identifiability guarantee for representation learning in RL also helps explain the implications of different mutual information objectives and the downsides of entropy regularizers. We empirically validate our claims in MuJoCo and DeepMind Control and show how CSF provably recovers the ground-truth features both from states and pixels.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
798. Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction
💬 19
⭐ 6.00
📍 datasets and benchmarks
🏷️ Speech-to-Speech (S2S) Systems Human-Likeness Evaluation Turing-Test
The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably as Turing test judges. In response, we propose an interpretable model that leverages the fine-grained human-likeness ratings and delivers accurate and transparent human-vs-machine discrimination, offering a powerful tool for automatic human-likeness evaluation. Our work establishes the first human-likeness evaluation for S2S systems and moves beyond binary outcomes to enable detailed diagnostic insights, paving the way for human-like improvements in conversational AI systems.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
799. Protection against Source Inference Attacks in Federated Learning
💬 11
⭐ 6.00
📍 alignment, fairness, safety, privacy, and societal considerations
🏷️ Federated Learning Source Inference Attack Shuffle Model Residue Number System
Federated Learning (FL) was initially proposed as a privacy-preserving machine learning paradigm. However, FL has been shown to be susceptible to a series of privacy attacks. Recently, there has been concern about the Source Inference Attack (SIA), where an honest-but-curious central server attempts to identify exactly which client owns a given data point that was used in the training phase. Alarmingly, standard gradient obfuscation techniques with Differential Privacy have been shown to be ineffective against SIAs, at least without severely diminishing the accuracy.
In this work, we propose a defense against SIAs within the widely studied shuffle model of FL, where an honest shuffler acts as an intermediary between the clients and the server. First, we demonstrate that standard naive shuffling alone is insufficient to prevent SIAs. To effectively defend against SIAs, shuffling needs to be applied at a more granular level; we propose a novel combination of parameter-level shuffling with the residue number system (RNS). Our approach provides robust protection against SIAs without affecting the accuracy of the joint model and can be seamlessly integrated into other privacy protection mechanisms.
We conduct experiments on a series of models and datasets, confirming that standard shuffling approaches fail to prevent SIAs and that, in contrast, our proposed method reduces the attack's accuracy to the level of random guessing.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
800. HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy
💬 21
⭐ 6.00
📍 applications to robotics, autonomy, planning
🏷️ Vision-language-action models Robot manipulation
Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
801. What Layers When: Learning to Skip Compute in LLMs with Residual Gates
💬 13
⭐ 6.00
📍 other topics in machine learning (i.e., none of the above)
🏷️ decoder-only language models large language models layer skipping adaptive compute efficient inference LLM
We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that compresses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining >90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
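A gate of that form can be read as a sigmoid-scaled, learned projection applied to each branch output before it re-enters the residual stream, with the gate value reused at inference to rank tokens against a budget. A PyTorch sketch; the gate placement and the per-layer budget rule are simplified assumptions:

```python
import torch
import torch.nn as nn

class GatedBranch(nn.Module):
    """Wrap an Attention/MLP branch with a sigmoid-linear gate on its output."""
    def __init__(self, branch, dim):
        super().__init__()
        self.branch = branch
        self.gate = nn.Linear(dim, 1)

    def forward(self, x, keep_ratio=1.0):
        h = self.branch(x)                               # (batch, seq, dim)
        g = torch.sigmoid(self.gate(h))                  # per-token importance in [0, 1]
        if keep_ratio < 1.0:                             # inference-time budget
            k = max(1, int(keep_ratio * h.size(1)))
            thresh = g.squeeze(-1).topk(k, dim=1).values[:, -1:].unsqueeze(-1)
            g = g * (g >= thresh)                        # zero out low-importance tokens
        return x + g * h                                 # gated contribution to the residual

block = GatedBranch(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)), dim=64)
out = block(torch.randn(2, 10, 64), keep_ratio=0.5)
# Note: a real deployment would route the skipped tokens around the branch entirely
# so that the saved compute is actually realized; zeroing here is only illustrative.
```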
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
802. Asynchronous Matching with Dynamic Sampling for Multimodal Dataset Distillation
๐ฌ 10
โญ 6.00
๐ optimization
๐ท๏ธ Trajectory Matching Dataset Distillation
Multimodal Dataset Distillation (MDD) has emerged as a vital paradigm for enabling efficient training of vision-language models (VLMs) in the era of multimodal data proliferation. Unlike traditional dataset distillation methods that focus on single-modal tasks, MDD presents distinct challenges: (i) the effective distillation of heterogeneous multimodal knowledge, complicated by feature space misalignment and asynchronous optimization dynamics; and (ii) the lack of discrete class guidance, which hinders the distribution coverage and representativeness of synthetic data due to the vastness and continuity of the semantic space. To address these challenges, this paper proposes an Asynchronous Matching with Dynamic sampling (AMD) framework. AMD enables asynchronous trajectory matching by decoupling the selection of starting points for image and text trajectories. Additionally, a Semantics-Aware Prototype Mining module is introduced, which replaces random initialization by leveraging feature-space clustering to identify representative prototypes, enhancing the coverage and representativeness of the distilled samples. Extensive experiments demonstrate that AMD achieves superior distillation performance on Flickr30k and COCO (e.g., IR@1, IR@5, and IR@10 \textbf{gains of 4.5\%, 9.6\%, and 10.9\%}, respectively, on Flickr30k 200 pairs.) with negligible computational overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
803. SpikePingpong: Spike Vision-based Fast-Slow Pingpong Robot System
๐ฌ 14
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotics Imitation Learning
Learning to control high-speed objects in dynamic environments represents a fundamental challenge in robotics. Table tennis serves as an ideal testbed for advancing robotic capabilities in dynamic environments. This task presents two fundamental challenges: it requires a high-precision vision system capable of accurately predicting ball trajectories under complex dynamics, and it necessitates intelligent control strategies to ensure precise ball striking to target regions. High-speed object manipulation typically demands advanced visual perception hardware capable of capturing rapid motion with exceptional temporal resolution.
Drawing inspiration from Kahneman's dual-system theory, where fast intuitive processing complements slower deliberate reasoning, there exists an opportunity to develop more robust perception architectures that can handle high-speed dynamics while maintaining accuracy.
To this end, we present \textit{\textbf{SpikePingpong}}, a novel system that integrates spike-based vision with imitation learning for high-precision robotic table tennis. We develop a cognitive-inspired Fast-Slow system architecture where System 1 provides rapid ball detection and preliminary trajectory prediction with millisecond-level responses, while System 2 employs spike-oriented neural calibration for precise hittable position corrections. For strategic ball striking, we introduce Imitation-based Motion Planning And Control Technology, which learns optimal robotic arm striking policies through demonstration-based learning.
Experimental results demonstrate that \textit{\textbf{SpikePingpong}} achieves a remarkable 92\% success rate for 30 cm accuracy zones and 70\% in the more challenging 20 cm precision targeting. This work demonstrates the potential of cognitive-inspired architectures for advancing robotic capabilities in time-critical manipulation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
804. Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Continuous Normalizing Flow Diffusion Model RLAIF Explainable AI
As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or state-of-the-art RLHF and RLAIF baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
805. Covariate-Guided Clusterwise Linear Regression for Generalization to Unseen Data
๐ฌ 19
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Clusterwise Linear Regression (CLR) Covariate-Guided Assignment Proxy Network Vector Quantization Convergence Guarantee
In many tabular regression tasks, the relationships between covariates and response can often be approximated as linear only within localized regions of the input space; a single global linear model therefore fails to capture these local relationships. Conventional Clusterwise Linear Regression (CLR) mitigates this issue by learning $K$ local regressors. However, existing algorithms either optimize latent binary indicators, (i) providing no explicit rule for assigning an $\textit{unseen}$ covariate vector to a cluster at test time, or rely on heuristic mixture of experts approaches, (ii) lacking convergence guarantees. To address these limitations, we propose $\textit{covariate-guided}$ CLR, an end-to-end framework that jointly learns an assignment function and $K$ linear regressors within a single gradient-based optimization loop. During training, a proxy network iteratively predicts coefficient vectors for inputs, and hard vector quantization assigns samples to their nearest codebook regressors. This alternating minimization procedure yields monotone descent of the empirical risk, converges under mild assumptions, and enjoys a PAC-style excess-risk bound. By treating the covariate data from all clusters as a single concatenated design matrix, we derive an $F$-test statistic from a nested linear model, quantitatively characterizing the effective model complexity. As $K$ varies, our method spans the spectrum from a single global linear model to instance-wise fits. Experimental results show that our method exactly reconstructs synthetic piecewise-linear surfaces, achieves accuracy comparable to strong black-box models on standard tabular benchmarks, and consistently outperforms existing CLR and mixture-of-experts approaches.
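As a point of reference, here is a minimal NumPy sketch of the classic alternating-minimization loop behind clusterwise linear regression: assign each sample to the regressor with the smallest residual, then refit each regressor on its cluster. The residual-based assignment is a simplification; the paper's covariate-guided variant instead learns an assignment function (a proxy network with hard vector quantization) so unseen covariates can be routed at test time.

```python
import numpy as np

def clusterwise_linear_regression(X, y, K=3, iters=20, seed=0):
    """Alternating minimization for K local linear regressors (simplified CLR).

    Assignment uses each regressor's residual directly, which is only valid
    when the response is observed; it stands in for the learned assignment
    function of the covariate-guided method.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])           # add an intercept column
    assign = rng.integers(0, K, size=n)            # random initial clusters
    coefs = np.zeros((K, d + 1))

    for _ in range(iters):
        # M-step: least-squares fit within each cluster
        for k in range(K):
            idx = np.where(assign == k)[0]
            if len(idx) >= d + 1:
                coefs[k], *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        # E-step: reassign each sample to the regressor with the smallest residual
        resid = (Xb @ coefs.T - y[:, None]) ** 2   # (n, K) squared errors
        assign = resid.argmin(axis=1)
    return coefs, assign

# Toy piecewise-linear data: slope +2 for x < 0, slope -1 for x >= 0.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(400, 1))
y = np.where(x[:, 0] < 0, 2 * x[:, 0], -x[:, 0]) + 0.01 * rng.normal(size=400)
coefs, assign = clusterwise_linear_regression(x, y, K=2)
print(np.round(coefs, 2))   # roughly recovers the two local slopes and intercepts
```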
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
806. Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
๐ฌ 10
โญ 6.00
๐ optimization
๐ท๏ธ asynchronous sgd optimal time complexity nonconvex optimization parallel methods stochastic optimization unified framework
We propose a new unifying framework, Birch SGD, for analyzing and designing distributed SGD methods. The central idea is to represent each method as a weighted directed tree, referred to as a computation tree. Leveraging this representation, we introduce a general theoretical result that reduces convergence analysis to studying the geometry of these trees. This perspective yields a purely graph-based interpretation of optimization dynamics, offering a new and intuitive foundation for method development. Using Birch SGD, we design eight new methods and analyze them alongside previously known ones, with at least six of the new methods shown to have optimal computational time complexity. Our research leads to two key insights: (i) all methods share the same iteration rate of $\mathcal{O}\left(\frac{(R + 1) L \Delta}{\varepsilon} + \frac{\sigma^2 L \Delta}{\varepsilon^2}\right)$, where $R$ is the maximum ``tree distance'' along the main branch of a tree; and (ii) different methods exhibit different trade-offs---for example, some update iterates more frequently, improving practical performance, while others are more communication-efficient or focus on other aspects. Birch SGD serves as a unifying framework for navigating these trade-offs. We believe these results provide a unified foundation for understanding, analyzing, and designing efficient asynchronous and parallel optimization methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
807. Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
๐ฌ 10
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Generative AI Evaluation Diffusion models Synthetic Imagery Cultural Bias in AI Historical Representation
As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work provides an initial step toward building more historically accurate TTI models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
808. Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation
๐ฌ 12
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ uncertainty natural language generation evaluation large language models elo judge
Hallucinations are a common issue that undermines the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise due to predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions have substantial disagreement between each other and, consequently, in the ranking of the uncertainty estimation (UE) methods. This allows one to inflate the apparent performance of uncertainty estimation methods. We propose using several alternative risk indicators for risk correlation experiments that improve the robustness of the empirical assessment of UE algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants reduces evaluation biases. Furthermore, we explore structured tasks as well as out-of-distribution and perturbation detection tasks, which provide robust and controllable risk indicators. Finally, we propose to use an Elo rating of uncertainty estimation methods to provide an objective summary across extensive evaluation settings.
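A minimal sketch of how an Elo rating could summarize pairwise comparisons of uncertainty estimation methods across evaluation settings is shown below; the K-factor, match definition, and tie handling are illustrative assumptions rather than the paper's protocol.

```python
import numpy as np
from itertools import combinations

def elo_ratings(method_scores, k_factor=32, base=1000.0, seed=0):
    """Summarize per-setting scores of UE methods as Elo ratings.

    method_scores: dict mapping method name -> list of scores, one per
    evaluation setting (higher is better, e.g. a risk-correlation metric).
    In each setting every pair of methods plays one "match"; the higher
    score wins, equal scores count as a draw. Match order is shuffled.
    """
    rng = np.random.default_rng(seed)
    ratings = {m: base for m in method_scores}
    names = list(method_scores)
    n_settings = len(next(iter(method_scores.values())))

    matches = [(s, a, b) for s in range(n_settings) for a, b in combinations(names, 2)]
    rng.shuffle(matches)
    for s, a, b in matches:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        sa, sb = method_scores[a][s], method_scores[b][s]
        result_a = 1.0 if sa > sb else 0.0 if sa < sb else 0.5
        ratings[a] += k_factor * (result_a - expected_a)
        ratings[b] += k_factor * ((1.0 - result_a) - (1.0 - expected_a))
    return ratings

# Toy example: three hypothetical UE methods evaluated on four settings.
scores = {"semantic_entropy": [0.61, 0.58, 0.70, 0.66],
          "max_token_prob":   [0.55, 0.60, 0.62, 0.59],
          "verbalized_conf":  [0.50, 0.52, 0.55, 0.51]}
print(elo_ratings(scores))
```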
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
809. BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
๐ฌ 17
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Reasoning Models Factual Alignment Knowledge Boundary
Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL-training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results suggest that our pilot study is an encouraging step toward building more reliable and factual System 2 LRMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
810. Model-Guided Microstimulation Steers Primate Visual Behavior
๐ฌ 19
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ causal interventions topographic deep artificial neural networks brain modeling
Brain stimulation is a powerful tool for understanding cortical function and holds the promise of therapeutic interventions to treat neuropsychiatric disorders such as impaired vision. Prototypical approaches to visual prosthetics apply patterns of electric microstimulation to the early visual cortex and can evoke percepts of simple symbols such as letters. However, these approaches are limited by the number of electrodes that can be implanted in early visual regions. Instead, higher-level visual regions are known to underlie the representations of complex visual objects such as faces and scenes and thus constitute a promising target for stimulating the cortex to elicit more complex visual experience. We developed a computational framework composed of two main components to address the challenge of stimulating cortex in high-dimensional object space spanned by higher-level visual cortex: 1. a causally predictive model that predicts primate behavior from image and stimulation input via topographic models and perturbation modules. 2. a mapping procedure that translates optimal model stimulation sites to monkey cortex. Testing our approach in two macaque monkeys that perform a visual recognition task, our results suggest that model-guided microstimulation is a promising approach to steer complex visual behavior. This proof-of-principle establishes a foundation for next-generation visual prosthetics that could restore complex visual experiences by stimulating higher-level visual cortex.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
811. DES-LOC: Desynced Low Communication Adaptive Optimizers for Foundation Models
๐ฌ 4
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Distributed Training Foundation Models Large Language Models Optimizers Communication Efficiency Federated Learning Distributed Systems Optimization Theory Scaling Robustness
Scaling foundation model training with Distributed Data Parallel~(DDP) methods is bandwidth-limited.
Existing infrequent communication methods like Local SGD were designed to synchronize model parameters only and cannot be trivially applied to adaptive optimizers due to additional optimizer states.
Heuristic approaches that keep states local or reset them lack guarantees and can be unstable in compute-efficient batch regimes; conversely, Local Adam synchronizes all states uniformly and is provably convergent but triples communication costs.
We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Our theoretical analysis shows that while parameter synchronization dominates the asymptotic rate in expectation, high-probability convergence guarantees require at least infrequent synchronization of the second momentum. Furthermore, we prove that more frequent momentum sync permits larger stable step sizes. Experiments on language models of up to 1.7B show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local Adam, enabling 1.3x-2.1x wall-clock speedups over DDP for 1-13B models on 100Gb/s links. Furthermore, unlike previous heuristic methods, DES-LOC is robust to worker failures, offering a scalable, efficient, and fault-tolerant solution for foundation model training.
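The following toy NumPy simulation illustrates the desynchronization idea on a noisy quadratic: each worker runs local Adam-style steps, and the parameters and the two moments are averaged across workers on independent periods. The local update, the specific periods, and the objective are illustrative assumptions, not the paper's algorithm or experimental setup.

```python
import numpy as np

def des_loc(grad_fn, n_workers=4, steps=200, lr=0.05,
            h_param=8, h_m1=32, h_m2=64, dim=10, seed=0):
    """Toy DES-LOC-style loop: each optimizer state has its own sync period.

    grad_fn(x, rng) returns a stochastic gradient at x for one worker.
    h_param, h_m1, h_m2 are the averaging periods for the parameters and
    the first/second Adam moments, respectively.
    """
    rng = np.random.default_rng(seed)
    x = [np.zeros(dim) for _ in range(n_workers)]    # per-worker parameters
    m1 = [np.zeros(dim) for _ in range(n_workers)]   # per-worker first moments
    m2 = [np.zeros(dim) for _ in range(n_workers)]   # per-worker second moments
    b1, b2, eps = 0.9, 0.999, 1e-8

    def average(states):
        mean = sum(states) / n_workers               # stands in for an all-reduce
        return [mean.copy() for _ in range(n_workers)]

    for t in range(1, steps + 1):
        for w in range(n_workers):                   # local Adam step on each worker
            g = grad_fn(x[w], rng)
            m1[w] = b1 * m1[w] + (1 - b1) * g
            m2[w] = b2 * m2[w] + (1 - b2) * g * g
            x[w] = x[w] - lr * m1[w] / (np.sqrt(m2[w]) + eps)
        if t % h_param == 0:                         # desynced communication:
            x = average(x)                           #   parameters sync most often,
        if t % h_m1 == 0:
            m1 = average(m1)                         #   first moment less often,
        if t % h_m2 == 0:
            m2 = average(m2)                         #   second moment least often
    return sum(x) / n_workers

# Toy quadratic objective with noisy gradients; the minimizer is `target`.
target = np.linspace(-1.0, 1.0, 10)
noisy_grad = lambda x, rng: (x - target) + 0.1 * rng.normal(size=x.shape)
print(np.round(des_loc(noisy_grad), 2))
```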
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
812. Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement learning General-utility Markov decision processes convex reinforcement learning markov decision processes
In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
813. LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Storytelling Multi-Image Sequence Generation Story Planning Visual Logic Consistency Causal Reasoning Narrative Coherence
Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time.
To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory.
The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization.
Furthermore, to evaluate generation capacity, we construct LogicTale, a benchmark of richly annotated stories that emphasizes causal reasoning and visual logic interpretability.
We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality.
Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
814. Thompson Sampling via Fine-Tuning of LLMs
๐ฌ 16
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian optimization Thompson Sampling discrete domain variational Bayesian optimistic sampling cumulative regret theory large language model fine-tuning probability of maximality probability of optimality
Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, *Thompson Sampling via Fine-Tuning* (ToSFiT), leverages the prior knowledge embedded in prompt-conditioned large language models and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality, a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. Within a collection of methods covering Bayesian optimization, reinforcement learning, and evolutionary search, ToSFiT exhibits both state-of-the-art sample efficiency and computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
815. Monotone Near-Zero-Sum Games
๐ฌ 18
โญ 6.00
๐ optimization
๐ท๏ธ Non-zero-sum games; monotone games
Zero-sum and non-zero-sum (aka general-sum) games are relevant in a wide range of applications.
While general non-zero-sum games are computationally hard, researchers have focused on the special class of monotone games, which is amenable to gradient-based algorithms. However, there is a substantial gap between the gradient complexity of monotone zero-sum and monotone general-sum games. Moreover, in many practical game scenarios the zero-sum assumption needs to be relaxed. To address these issues, we define a new intermediate class of monotone near-zero-sum games that contains monotone zero-sum games as a special case. Then, we present a novel algorithm that transforms near-zero-sum games into a sequence of zero-sum subproblems, improving the gradient-based complexity for this class. Finally, we demonstrate the applicability of the new class by modeling practical game scenarios motivated by the literature.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
816. ChronoEdit: Towards Temporal Reasoning for In-Context Image Editing and World Simulation
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ image editing generative models
Recent advances in large generative models have substantially improved image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world-simulation-related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
817. When More is Less: Understanding Chain-of-Thought Length in LLMs
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Chain-of-Thought reasoning Simplicity bias Test-time scaling Reasoning length calibration
Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to solve complex problems. Contrary to the common belief that longer CoTs always improve performance, we demonstrate that **longer is not always better**. Across both real-world LLMs and theoretical models, task accuracy follows an inverted U-shaped curve with respect to CoT length: performance rises initially but declines once reasoning chains become too long. Through controlled experiments, we uncover **scaling behaviors of the optimal CoT length**: it increases with task difficulty but decreases with model capability. This exposes a significant mismatch with current practice, where supervised training often reuses the same CoT data across models and tasks without adaptivity. We further show that Reinforcement Learning (RL) can mitigate this gap by dynamically calibrating CoT length, thereby improving accuracy and offering a new perspective on differences between supervised fine-tuning and RL training. To explain these phenomena, we introduce an error-accumulation analysis that characterizes how reasoning errors propagate across steps and derives the scaling behaviors of CoT length observed empirically. Building on these insights, we show that training with optimally sized CoTs and applying length-aware filtering during inference yields substantial improvements in performance. Taken together, these findings establish a principled explanation of the ''overthinking'' effect and yield practical guidelines for calibrating CoT length in accordance with task complexity and model capability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
818. SiMO: Single-Modality-Operable Multimodal Collaborative Perception
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ collaborative perception multimodal modal failure modality competition
Collaborative perception integrates multi-agent perspectives to enhance the sensing range and overcome occlusion issues. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure, especially when a key sensor like LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing **Si**ngle-**M**odality-**O**perable Multimodal Collaborative Perception (**SiMO**). By adopting the proposed **L**ength-**A**daptive **M**ulti-**M**od**a**l Fusion (**LAMMA**), SiMO can adaptively handle remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses the issue of modality competition (generally overlooked by existing methods), ensuring the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while simultaneously preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
819. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
๐ฌ 14
โญ 6.00
๐ optimization
๐ท๏ธ Vector Quantization KV Cache Compression Nearest Neighbor Search Similarity Search Acceleration Online Compression Algorithms
Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
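A minimal NumPy sketch of the two-stage structure described above: rotate the input with a random orthogonal matrix, apply a simple per-coordinate scalar quantizer, and encode the residual with one sign bit per coordinate plus a shared scale. The uniform grid and the sign/scale residual code are simplifications standing in for the optimal scalar quantizers and the QJL step in the paper.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Random orthogonal matrix via QR of a Gaussian (illustrative rotation)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))                   # sign fix for a uniform rotation

def two_stage_quantize(x, rotation, bits=3):
    """Coarse per-coordinate quantization of the rotated vector, plus a 1-bit
    residual code (a simplification of the MSE-optimal + QJL stages)."""
    z = rotation @ x                                  # spread energy across coordinates
    levels = 2 ** bits
    scale = np.max(np.abs(z)) + 1e-12
    step = 2 * scale / (levels - 1)
    codes = np.clip(np.round((z + scale) / step), 0, levels - 1)   # coarse codes
    resid = z - (codes * step - scale)
    sign_bits = np.sign(resid)                        # one extra bit per coordinate
    resid_scale = np.mean(np.abs(resid))              # one shared scale for the residual
    return codes, sign_bits, resid_scale, scale, step

def dequantize(codes, sign_bits, resid_scale, scale, step, rotation):
    z_hat = codes * step - scale + sign_bits * resid_scale
    return rotation.T @ z_hat                         # undo the rotation

d = 64
R = random_rotation(d)
x = np.random.default_rng(1).normal(size=d)
x_hat = dequantize(*two_stage_quantize(x, R), R)
print("relative MSE:", float(np.mean((x - x_hat) ** 2) / np.mean(x ** 2)))
```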
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
820. SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
๐ฌ 15
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Reinforcement Learning Reasoning
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (\textit{e.g.}, MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 $\times$ more parameters. All code, models, and datasets will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
821. Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
๐ฌ 18
โญ 6.00
๐ generative models
๐ท๏ธ generation then reconstruction acceleration masked autoregressive model image synthesis
Masked Autoregressive (MAR) models promise better efficiency in visual generation than continuous autoregressive (AR) models thanks to their ability to generate tokens in parallel, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation, which establishes the global semantic scaffolding, followed by detail reconstruction, which efficiently completes the remaining tokens. On the assumption that creating an image from scratch is harder than completing it from a basic structural scaffold, GtR achieves acceleration by running the reconstruction stage quickly while preserving generation quality by running the generation stage slowly. Moreover, observing that tokens on image details often carry more semantic information than tokens in salient regions, we further propose Frequency-Weighted Token Selection (FTS) to allocate more computation budget to detail tokens, which are localized based on the energy of their high-frequency content. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate a 3.72$\times$ speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our code is provided in the supplementary materials and will be released on GitHub.
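A minimal NumPy sketch of the frequency-weighted selection idea: score each token (patch) by its high-frequency energy and keep the top fraction under a compute budget. The patch size, radial cutoff, and budget below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def high_freq_energy_per_token(image, patch=16):
    """Per-token (per-patch) high-frequency energy, used to rank detail tokens.

    image: (H, W) grayscale array whose sides are multiples of `patch`.
    The radial cutoff below is an illustrative choice.
    """
    H, W = image.shape
    energies = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            tile = image[i:i + patch, j:j + patch]
            spec = np.fft.fftshift(np.fft.fft2(tile))
            fy, fx = np.meshgrid(np.arange(patch) - patch // 2,
                                 np.arange(patch) - patch // 2, indexing="ij")
            high = np.sqrt(fy ** 2 + fx ** 2) > patch // 4   # keep the outer frequency band
            energies.append(np.sum(np.abs(spec[high]) ** 2))
    return np.array(energies)                                # one value per token

def frequency_weighted_selection(image, budget=0.25, patch=16):
    """Pick the fraction `budget` of tokens with the most high-frequency detail."""
    e = high_freq_energy_per_token(image, patch)
    k = max(1, int(budget * e.size))
    return np.argsort(-e)[:k]                                # indices of detail tokens

# Toy usage: a 64x64 image with texture only in the lower-right quadrant.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[32:, 32:] = rng.normal(size=(32, 32))
print(frequency_weighted_selection(img, budget=0.25))        # tokens in that quadrant
```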
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
822. Tricks or Traps? A Deep Dive into RL for LLM Reasoning
๐ฌ 5
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reasoning; Reinforcement Learning; Reasoning
Reinforcement learning (RL) for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for applying RL techniques and a fragmented understanding of their underlying mechanisms. In addition, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups and provide a reliable roadmap for practitioners navigating the RL for the LLM domain. Finally, we show that a minimalist combination of two techniques can unlock the learning capability of critic-free policies with a vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies such as GRPO and DAPO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
823. ClarifyVC: Clarifying Ambiguous Commands in Vehicle Control with a Hybrid Data Augmentation Pipeline
๐ฌ 22
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Interactive Control Systems Clarification-First Dialogue Ambiguity Resolution Hybrid Data Augmentation Function-Calling Language Models Human Validation and Robustness
Natural language interfaces for vehicle control must contend with vague commands, evolving dialogue context, and strict protocol constraints.
We introduce ClarifyVC, a unified framework that integrates a hybrid data-augmentation pipeline (ClarifyVC-Data), reference models trained on the data (ClarifyVC-Models), and an evaluation protocol (ClarifyVC-Eval).
The agent-orchestrated pipeline generates diverse, ambiguity-rich dialogues from real-world seeded queries under schema and safety constraints, while the evaluation protocol systematically probes single-turn parsing, conservative clarification under extreme fuzziness, and multi-turn grounding.
Fine-tuning on ClarifyVC-Data yields consistent gains (up to 15\% higher parsing accuracy, 20\% stronger ambiguity resolution, and 98\% protocol compliance) across realistic in-cabin scenarios, with human-in-the-loop assessments confirming high realism, coherence, and applicability.
ClarifyVC thus advances beyond simulation-only datasets by tightly coupling real-world grounding with scalable generation and standardized evaluation, and provides a generalizable pipeline for broader interactive control domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
824. Topological Causal Effects
๐ฌ 11
โญ 6.00
๐ causal reasoning
๐ท๏ธ topological data analysis causal inference doubly robust estimator persistence landscape Silhouettes persistent homology
Estimating causal effects becomes particularly challenging when outcomes possess complex, non-Euclidean structures, where conventional approaches often fail to capture meaningful structural variation. We introduce a novel framework for topological causal inference, defining treatment effects through changes in the underlying topological structure of outcomes. In our framework, intervention-driven topological shifts across homology are summarized via power-weighted silhouettes. We propose a doubly robust estimator, derive its asymptotic properties, and develop a formal test for the null hypothesis of no topological effect. Empirical studies demonstrate that our approach reliably quantifies treatment effects and remains robust across diverse, complex outcome spaces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
825. Unifying Complexity-Theoretic Perspectives on Provable Explanations
๐ฌ 10
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ explainability interpretability
Previous work has explored the computational complexity of deriving two fundamental types of explanations for ML model predictions: (1) *sufficient reasons*, which are subsets of input features that, when fixed, determine a prediction, and (2) *contrastive reasons*, which are subsets of input features that, when modified, alter a prediction. Prior studies have examined these explanations in different contexts, such as non-probabilistic versus probabilistic frameworks and local versus global settings. In this study, we introduce a unified framework for analyzing these explanations, demonstrating that they can all be characterized through the minimization of a unified probabilistic value function. We then prove that the complexity of these computations is influenced by three key properties of the value function: (1) *monotonicity*, (2) *submodularity*, and (3) *supermodularity*. Our findings uncover some counterintuitive results regarding the nature of these properties within the explanation settings examined. For instance, although the *local* value functions do not exhibit monotonicity or submodularity/supermodularity whatsoever, we demonstrate that the *global* value functions do possess these properties. This distinction enables us to prove a series of novel polynomial-time results for computing various explanations with provable guarantees in the global explainability setting, across a range of ML models that span the interpretability spectrum, such as neural networks, decision trees, and tree ensembles. In contrast, we show that even highly simplified versions of these explanations become NP-hard to compute in the corresponding local explainability setting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
826. Towards Strategic Persuasion with Language Models
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Strategic Behavior Information Design
Large language models (LLMs) have demonstrated strong persuasive capabilities comparable to those of humans, offering promising benefits while raising societal concerns about their deployment. However, systematically evaluating persuasive capabilities is inherently challenging, as the effectiveness of persuasion among humans varies significantly across different domains. In this paper, we take a theory-driven approach to provide a scalable and principled framework to measure the persuasive capabilities of LLMs in strategic interactions. Grounded in the Bayesian Persuasion (BP) framework, we repurpose existing human-human persuasion datasets to construct environments for evaluating and training LLMs in strategic persuasion. Our results reveal that frontier models can consistently achieve high persuasion gains and exhibit sophisticated persuasion strategies that align with theoretical predictions. Building on this, we use reinforcement learning to train LLMs for strategic persuasion in our environments. Our results also demonstrate that even small LLMs can obtain significantly higher persuasion gains through reinforcement learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
827. Semi-Supervised Preference Optimization with Limited Feedback
๐ฌ 18
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Preference Optimization Semi-Supervised Learning
The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to considerable resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO), in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Mistral-7B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
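A minimal sketch of the reward-threshold pseudo-labeling step described above, assuming access to scalar reward scores for both the small labeled set and the unpaired pool; the midpoint threshold and the abstention margin are illustrative choices, not the paper's theoretically optimal threshold.

```python
import numpy as np

def estimate_threshold(labeled_win_rewards, labeled_lose_rewards):
    """Midpoint between class means, standing in for the optimal reward threshold."""
    return 0.5 * (np.mean(labeled_win_rewards) + np.mean(labeled_lose_rewards))

def pseudo_label_unpaired(rewards, threshold, margin=0.1):
    """Split unpaired responses into pseudo-winners / pseudo-losers.

    Responses within `margin` of the threshold stay unlabeled to avoid noisy
    pairs; the margin is an assumption, not part of the paper's method.
    """
    winners = np.where(rewards > threshold + margin)[0]
    losers = np.where(rewards < threshold - margin)[0]
    return winners, losers

# Toy numbers: a handful of labeled pairs and a pool of unpaired reward scores.
rng = np.random.default_rng(0)
win_r = rng.normal(1.0, 0.3, size=20)      # rewards of labeled winning responses
lose_r = rng.normal(-0.5, 0.3, size=20)    # rewards of labeled losing responses
unpaired = rng.normal(0.2, 1.0, size=200)  # rewards of the unpaired pool
tau = estimate_threshold(win_r, lose_r)
w_idx, l_idx = pseudo_label_unpaired(unpaired, tau)
print(f"threshold={tau:.2f}, pseudo-winners={len(w_idx)}, pseudo-losers={len(l_idx)}")
```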
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
828. P3D: Highly Scalable 3D Neural Surrogates for Physics Simulations with Global Context
๐ฌ 14
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ neural surrogates physics simulations transformers 3D
We present a scalable framework for learning deterministic and probabilistic neural surrogates for high-resolution 3D physics simulations. We introduce P3D, a hybrid CNN-Transformer backbone architecture targeted for 3D physics simulations, which significantly outperforms existing architectures in terms of speed and accuracy. Our proposed network can be pretrained on small patches of the simulation domain, which can be fused to obtain a global solution, optionally guided via a scalable sequence-to-sequence model to include long-range dependencies. This setup allows for training large-scale models with reduced memory and compute requirements for high-resolution datasets. We evaluate our backbone architecture against a large set of baseline methods with the objective to simultaneously learn 14 different types of PDE dynamics in 3D. We demonstrate how to scale our model to high-resolution isotropic turbulence with spatial resolutions of up to $512^3$. Finally, we show the versatility of our architecture by training it as a diffusion model to produce probabilistic samples of highly turbulent 3D channel flows across varying Reynolds numbers, accurately capturing the underlying flow statistics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
829. Score Distillation Beyond Acceleration: Generative Modeling from Corrupted Data
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ Generative model diffusion distillation
Learning generative models directly from corrupted observations is a long-standing challenge across natural and scientific domains. We introduce *Distillation from Corrupted Data (DCD)*, a unified framework for learning high-fidelity, one-step generative models using **only** degraded data of the form $y = \mathcal{A}(x) + \sigma \varepsilon$, with $x\sim p_X$ and $\varepsilon\sim \mathcal{N}(0,I_m)$, where the mapping $\mathcal{A}$ may be the identity or a non-invertible corruption operator (e.g., blur, masking, subsampling, Fourier acquisition). DCD first pretrains a *corruption-aware diffusion teacher* on the observed measurements, then *distills* it into an efficient one-step generator whose samples are statistically closer to the clean distribution $p_X$. The framework subsumes identity corruption (denoising task) as a special case of our general formulation.
Empirically, DCD consistently reduces Fréchet Inception Distance (FID) relative to corruption-aware diffusion teachers across noisy generation (*CIFAR-10*, *FFHQ*, *CelebA-HQ*, *AFHQ-v2*), image restoration (Gaussian deblurring, random inpainting, super-resolution, and mixtures with additive noise), and multi-coil MRI, *without access to any clean images*. The distilled generator inherits one-step sampling efficiency, yielding up to $30\times$ speedups over multi-step diffusion while surpassing the teachers after substantially fewer training iterations. These results establish score distillation as a practical tool for generative modeling from corrupted data, *not merely for acceleration*. We also provide theoretical support for the use of distillation in enhancing generation quality in the Appendix.
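To make the measurement model concrete, here is a small NumPy sketch of generating corrupted observations $y = \mathcal{A}(x) + \sigma \varepsilon$ with a random masking operator; the operator and noise level are illustrative stand-ins for the corruptions considered in the paper.

```python
import numpy as np

def corrupt(x, mask_prob=0.5, sigma=0.1, seed=0):
    """y = A(x) + sigma * eps with A a random masking operator (illustrative)."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) > mask_prob).astype(float)  # A: zero out random pixels
    eps = rng.normal(size=x.shape)
    return mask * x + sigma * eps, mask

# Toy usage on an 8x8 "image": DCD would only ever see y (and know A), never x.
x = np.linspace(0.0, 1.0, 64).reshape(8, 8)
y, mask = corrupt(x)
print(y.shape, float(mask.mean()))
```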
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
830. Aurora: Towards Universal Generative Multimodal Time Series Forecasting
๐ฌ 37
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting Multimodality
Cross-domain generalization is crucial in time series forecasting because similar historical information may lead to distinct future trends due to domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities such as text, the former fails to exploit it explicitly, which hinders performance; the latter is tailored to end-to-end scenarios and does not support zero-shot inference in cross-domain settings. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model that supports multimodal inputs and zero-shot inference. Pretrained on a Cross-domain Multimodal Time Series Corpus, Aurora adaptively extracts and focuses on key domain knowledge contained in the corresponding text or image modalities, giving it strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora extracts multimodal domain knowledge as guidance and then uses a Modality-Guided Multi-head Self-Attention to inject it into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching scheme for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench, and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora in both unimodal and multimodal scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
831. Diffusion Transformers with Representation Autoencoders
๐ฌ 20
โญ 6.00
๐ generative models
๐ท๏ธ Generative Models Diffusion Models Representation Learning High-dimensional Diffusion
Latent generative modeling has become the standard strategy for Diffusion Transformers (DiTs), but the autoencoder has barely evolved. Most DiTs still use the legacy VAE encoder, which introduces several limitations: large convolutional backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations resulting from purely reconstruction-based training. In this work, we investigate replacing the VAE encoder-decoder with pretrained representation encoders (e.g., DINO, SigLIP, MAE) combined with trained decoders, forming what we call \emph{Representation Autoencoders} (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. A key challenge arises in enabling diffusion transformers to operate effectively within these high-dimensional representations. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant with a lightweight wide DDT-head, we demonstrate state-of-the-art image generation performance, reaching FIDs of 1.18 @256 resolution and 1.13 @512 on ImageNet.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
832. Breaking Barriers: Do Reinforcement Fine-tuning Gains Transfer To Unseen Domains?
๐ฌ 20
โญ 6.00
๐ reinforcement learning
๐ท๏ธ large language models reinforcement learning supervised fine-tuning generalizability
Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: we compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
833. Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise
๐ฌ 11
โญ 6.00
๐ optimization
๐ท๏ธ Constrained Optimization Min-max Optimization Augmented Lagrangian Method Optimistic Gradient Method
Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.
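The toy NumPy sketch below runs one common form of the scheme on a simple constrained problem: gradient descent on the primal and an optimistic (PI-style) ascent step on the multiplier using the extrapolated constraint value. The problem, step sizes, and the exact optimistic update are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def pi_lagrangian(steps=2000, lr_x=0.05, lr_lam=0.05, kappa=1.0):
    """Toy constrained problem: minimize x^2 subject to 1 - x <= 0 (so x >= 1).

    Primal: gradient descent on the Lagrangian L(x, lam) = x^2 + lam * (1 - x).
    Dual: optimistic ascent (a PI-style update) on lam, driven by the
    extrapolated constraint value g_t + kappa * (g_t - g_{t-1}).
    """
    f_grad = lambda x: 2.0 * x     # gradient of the objective x^2
    g = lambda x: 1.0 - x          # constraint function, feasible when g(x) <= 0
    g_grad = -1.0                  # gradient of the constraint in x

    x, lam, g_prev = 0.0, 0.0, g(0.0)
    for _ in range(steps):
        x -= lr_x * (f_grad(x) + lam * g_grad)                  # primal descent step
        g_now = g(x)
        lam = max(0.0, lam + lr_lam * (g_now + kappa * (g_now - g_prev)))  # optimistic dual step
        g_prev = g_now
    return x, lam

x_star, lam_star = pi_lagrangian()
print(f"x ~ {x_star:.3f} (KKT solution 1.0), lambda ~ {lam_star:.3f} (KKT multiplier 2.0)")
```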
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
834. How reinforcement learning after next-token prediction facilitates learning
๐ฌ 8
โญ 6.00
๐ learning theory
๐ท๏ธ large language models reinforcement learning length increase theory
Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next-token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long "chain-of-thought" sequences encoding a single task. In particular, when the task consists of predicting the parity of $d$ bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of $d$ bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension $d$.
Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
835. Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Masked Diffusion Model Unified Multi-modal model
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
836. Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Retrieval LLM Reranking
In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance).
For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts the ''yes'' (resp. ''no'') token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs.
This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
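For intuition, the following NumPy sketch contrasts the two objectives on toy scores: an in-batch contrastive loss that pushes the relevant document above negatives, versus a yes/no token prediction loss per query-document pair. The score parameterizations are illustrative; they are not the training setup used in the paper.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                      # numerically stable log-softmax
    return z - np.log(np.exp(z).sum())

def contrastive_loss(scores, pos_index=0):
    """CL objective: push the relevant document's score above in-batch negatives."""
    return -log_softmax(scores)[pos_index]

def sft_yes_no_loss(yes_logit, no_logit, relevant):
    """SFT objective: predict the 'yes' vs. 'no' token for a single pair."""
    logp = log_softmax(np.array([yes_logit, no_logit]))
    return -logp[0] if relevant else -logp[1]

# Toy scores for one query and four candidate documents (index 0 is relevant).
scores = np.array([2.1, 0.3, -0.5, 0.9])
print("CL loss:", round(float(contrastive_loss(scores)), 3))
print("SFT loss, relevant pair:", round(float(sft_yes_no_loss(2.1, -1.0, True)), 3))
print("SFT loss, irrelevant pair:", round(float(sft_yes_no_loss(0.3, 1.5, False)), 3))
```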
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
837. Type-Compliant Adaptation Cascades
๐ฌ 9
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ language model adaptation probabilistic programming reasoning
Reliably composing Large Language Models (LLMs) for complex, multi-step workflows remains a significant challenge. The dominant paradigm---optimizing discrete prompts in a pipeline---is notoriously brittle and struggles to enforce the formal compliance required for structured tasks. We introduce Type-Compliant Adaptation Cascades (TACs), a framework that recasts workflow adaptation as learning typed probabilistic programs. TACs treat the entire workflow, which is composed of parameter-efficiently adapted LLMs and deterministic logic, as an unnormalized joint distribution. This enables principled, gradient-based training even with latent intermediate structures. We provide theoretical justification for our tractable optimization objective, proving that the optimization bias vanishes as the model learns type compliance. Empirically, TACs significantly outperform state-of-the-art prompt-optimization baselines. Gains are particularly pronounced on structured tasks, improving FinQA from $12.0\%$ to $24.7\%$ for a Qwen 3 8B model, MGSM-SymPy from $57.1\%$ to $75.9\%$ for a Gemma 2 27B model, MGSM from $1.6\%$ to $27.3\%$, and MuSR from $36.5\%$ to $62.6\%$ for a Gemma 7B model. TACs offer a robust and theoretically grounded paradigm for developing reliable, task-compliant LLM systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
838. Terminal Velocity Matching
๐ฌ 15
โญ 6.00
๐ generative models
๐ท๏ธ one-step generative model from scratch diffusion flow matching
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
839. Guided Policy Optimization under Partial Observability
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning teacher-student learning policy distillation POMDPs policy gradient
Reinforcement Learning (RL) in partially observable environments poses significant challenges due to the complexity of learning under uncertainty. While additional information, such as that available in simulations, can enhance training, effectively leveraging it remains an open problem. To address this, we introduce Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner. The guider takes advantage of privileged information while ensuring alignment with the learner's policy that is primarily trained via imitation learning. We theoretically demonstrate that this learning scheme achieves optimality comparable to direct RL, thereby overcoming key limitations inherent in existing approaches. Empirical evaluations show strong performance of GPO across various tasks, including continuous control with partial observability and noise, and memory-based challenges, significantly outperforming existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
840. Setting up for failure: automatic discovery of the neural mechanisms of cognitive errors
๐ฌ 15
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ neuroscience working memory recurrent neural networks diffusion models behavioral modeling
Discovering the neural mechanisms underpinning cognition is one of the grand challenges of neuroscience. Addressing this challenge greatly benefits from specific hypotheses about the underlying neural network dynamics. However, previous approaches bridging neural network dynamics and cognitive behaviour required iterative refinement of network architectures and/or objectives for normative task optimization, resulting in a long, and mostly heuristic, human-in-the-loop design process. Here, we offer an alternative approach that automates this process by explicitly training recurrent neural networks (RNNs) to reproduce behaviour, including the same characteristic errors, that humans and animals produce in a cognitive task. Achieving this required two main innovations. First, as the amount of behavioural data that can be collected in experiments is often too limited to suffice for training RNNs, we use a non-parametric generative model of behavioural responses to produce surrogate data for training RNNs. Second, to capture all relevant statistical moments in the data, rather than a limited number of hand-picked low-order moments as in previous moment matching-based approaches, we developed a novel diffusion model-based approach for training RNNs. We chose a visual working memory (VWM) task as our test-bed, as behaviour in this task is well known to produce response distributions that are patently multimodal (due to so-called swap errors). The resulting network dynamics correctly predicted previously reported qualitative features of neural data recorded in macaques. Importantly, this was only the case when RNNs were trained using our approach, fitting the full richness of behavioural data -- and not when only a limited set of behavioural signatures were fitted, nor when RNNs were trained for task optimality instead of reproducing behaviour (as has been typical for RNNs used to generate dynamical hypotheses). Our model also makes novel predictions about the mechanism of swap errors, which can be readily tested in experiments. These results suggest that fitting rich patterns of behaviour provides a powerful way for the automatic discovery of neural network dynamics supporting important cognitive functions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
841. QuoKA: Query-Oriented KV Selection for Efficient LLM Prefill
๐ฌ 8
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient LLM Inference LLM Prefill Acceleration Sparse Attention KV Cache Subselection Training-Free
We present QuoKA: Query-oriented KV selection for efficient attention, a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While many queries focus on a smaller group of keys in the attention operator, we observe that queries with low cosine similarity with respect to the mean query interact more strongly with more keys and contribute the most to the final attention logits. By prioritizing these low-cosine-similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QuoKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselecting the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that QuoKA achieves near-baseline accuracy while using 88% fewer key-value pairs per attention evaluation, realizing a 3× reduction in time-to-first-token, a 5× attention speedup on Nvidia GPUs, and up to nearly a 7× speedup on Intel Xeon CPUs.
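A rough sketch of this selection rule, assuming single-head query/key matrices; the keep ratios and the max-alignment key score are illustrative choices, not QuoKA's exact procedure.

```python
# Minimal sketch of query-oriented key subselection (illustrative, not QuoKA itself).
import torch
import torch.nn.functional as F

def select_sparse_kv(Q, K, query_keep=0.25, key_keep=0.12):
    """Q, K: (seq_len, head_dim). Returns indices of retained keys."""
    q_mean = Q.mean(dim=0, keepdim=True)
    # Keep queries with LOW cosine similarity to the mean query, since these
    # are observed to interact with more keys.
    cos = F.cosine_similarity(Q, q_mean, dim=-1)
    n_q = max(1, int(query_keep * Q.shape[0]))
    rep_q = Q[torch.topk(-cos, n_q).indices]        # representative queries
    # Keep the keys most aligned with the representative queries.
    key_scores = (rep_q @ K.T).max(dim=0).values    # best alignment per key
    n_k = max(1, int(key_keep * K.shape[0]))
    return torch.topk(key_scores, n_k).indices

Q, K = torch.randn(1024, 64), torch.randn(1024, 64)
kept_keys = select_sparse_kv(Q, K)
print(kept_keys.shape)  # roughly 12% of the 1024 keys retained
```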
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
842. VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models
๐ฌ 19
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ adversarial attack vision-encoder-only large vision language models downstream-agnostic
Large Vision-Language Models (LVLMs) have demonstrated capabilities in multimodal understanding, yet their vulnerability to adversarial attacks raises significant concerns. Toward practical attacks, this paper aims at efficient and transferable untargeted attacks under limited perturbation sizes. Under this objective, white-box attacks require full-model gradients and task-specific labels, making costs scale with tasks, while black-box attacks rely on proxy models, typically requiring large perturbation sizes and elaborate transfer strategies. Given the centrality and widespread reuse of the vision encoder in LVLMs, we adopt a gray-box setting that targets the vision encoder alone for efficient yet effective attacks. We theoretically establish the feasibility of vision-encoder-only attacks, laying the foundation for our gray-box setting. Based on this analysis, we propose perturbing patch tokens rather than the class token, informed by both theoretical and empirical insights. We generate adversarial examples by minimizing the cosine similarity between clean and perturbed visual features, without accessing the subsequent models, tasks, or labels. This significantly reduces computational overhead while eliminating task and label dependence. VEAttack achieves a performance degradation of 94.5% on the image captioning task and 75.7% on the visual question answering task. We also reveal key observations that provide insights into LVLM attack/defense: 1) hidden-layer variations of the LLM, 2) token attention differentials, 3) a Möbius band in transfer attacks, and 4) low sensitivity to attack steps.
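A hedged PGD-style sketch of the core idea: perturb the image so that the frozen vision encoder's features lose cosine similarity with the clean ones. The encoder handle, step sizes, and feature shapes are assumptions, not the paper's exact attack.

```python
# Illustrative gray-box feature attack on a frozen vision encoder (not the paper's code).
import torch
import torch.nn.functional as F

def ve_attack(vision_encoder, image, eps=8 / 255, alpha=1 / 255, steps=20):
    clean_feats = vision_encoder(image).detach()          # clean visual features
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_feats = vision_encoder((image + delta).clamp(0, 1))
        # Minimize cosine similarity between clean and perturbed features.
        loss = F.cosine_similarity(adv_feats, clean_feats, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()             # descend the similarity
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

# Toy stand-in encoder, purely to show the call pattern.
toy_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
adv = ve_attack(toy_encoder, torch.rand(1, 3, 32, 32))
print(adv.shape)
```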
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
843. In-Context Algorithm Emulation in Fixed-Weight Transformers
๐ฌ 23
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ In-Context Learning Attention Mechanism In-Context Gradient Descent Transformer Universal Approximation
We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the *task-specific mode*, for any continuous function $f: \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the *prompt-programmable mode*, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via only prompting. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. Numerical results corroborate our theory. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable interpreters of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
844. Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Vision centric Agents Deep Reasoning VLMs Tool Use Evaluation
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
845. Scalable Second-order Riemannian Optimization for $K$-means Clustering
๐ฌ 14
โญ 6.00
๐ optimization
๐ท๏ธ K-means clustering manifold optimization Newton's method nonconvex
Clustering is a hard discrete optimization problem. Nonconvex approaches such as low-rank semidefinite programming (SDP) have recently demonstrated promising statistical and local algorithmic guarantees for cluster recovery. Due to the combinatorial structure of the $K$-means clustering problem, current relaxation algorithms struggle to balance their constraint feasibility and objective optimality, presenting tremendous challenges in computing the second-order critical points with rigorous guarantees. In this paper, we provide a new formulation of the $K$-means problem as a smooth unconstrained optimization over a submanifold and characterize its Riemannian structures to allow it to be solved using a second-order cubic-regularized Riemannian Newton algorithm. By factorizing the $K$-means manifold into a product manifold, we show how each Newton subproblem can be solved in linear time. Our numerical experiments show that the proposed method converges significantly faster than the state-of-the-art first-order nonnegative low-rank factorization method, while achieving similarly optimal statistical accuracy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
846. Faster Gradient Methods for Highly-smooth Stochastic Bilevel Optimization
๐ฌ 9
โญ 6.00
๐ optimization
๐ท๏ธ bilevel optimization stochastic acceleration
This paper studies the complexity of finding an $\epsilon$-stationary point for stochastic bilevel optimization when the upper-level problem is nonconvex and the lower-level problem is strongly convex.
Recent work proposed the first-order method, F${}^2$SA, achieving the $\tilde{\mathcal{O}}(\epsilon^{-6})$ upper complexity bound for first-order smooth problems. This is slower than the optimal $\Omega(\epsilon^{-4})$ complexity lower bound in its single-level counterpart.
In this work, we show that faster rates are achievable for higher-order smooth problems. We first reformulate F$^2$SA as approximating the hyper-gradient with a forward difference. Based on this observation, we propose a class of methods F${}^2$SA-$p$ that uses a $p$th-order finite difference for hyper-gradient approximation and improves the upper bound to $\tilde{\mathcal{O}}(p \epsilon^{-4-2/p})$ for $p$th-order smooth problems. Finally, we demonstrate that the $\Omega(\epsilon^{-4})$ lower bound also holds for stochastic bilevel problems when the high-order smoothness holds for the lower-level variable, indicating that the upper bound of F${}^2$SA-$p$ is nearly optimal in the highly smooth region $p = \Omega( \log \epsilon^{-1} / \log \log \epsilon^{-1})$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
847. DUET: Optimizing Training Data Mixtures via Coarse, Noisy Feedback from Unseen Evaluation Tasks
๐ฌ 15
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian Optimization Data Mixture Optimization Optimization from feedback
The performance of an LLM depends heavily on the relevance of its training data to the downstream evaluation task. However, in practice, we do not have fine-grained knowledge of the data in the evaluation task (e.g., conversations between an LLM and a user are end-to-end encrypted). Hence, it is unclear what data is relevant for fine-tuning the LLM.
Instead, we can only deploy the LLM on the unseen task to gather multiple rounds of coarse, noisy feedback on how well the model performs (e.g., user ratings). Our paper presents DUET, a novel global-to-local algorithm that optimizes training data mixtures by interleaving data selection with Bayesian optimization to exploit coarse and noisy feedback from a downstream evaluation task. DUET is flexible enough to incorporate different data selection methods, each with different performance-compute tradeoffs. By analyzing DUET's cumulative regret, we theoretically show that DUET converges to the optimal training data mixture even without any fine-grained data information from an unseen task. Finally, our experiments across a variety of language tasks demonstrate that DUET attains substantial performance improvement over existing data selection and mixing methods in the unseen-task setting. Our anonymized code can be found at https://github.com/pmsdapfmbf/DUET.
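A toy sketch of the outer loop described above: propose a mixture over data domains, receive a coarse noisy score, and fit a simple surrogate to pick the next mixture. The Gaussian-process surrogate, the UCB rule, and the synthetic feedback function are illustrative placeholders, not DUET itself.

```python
# Illustrative mixture-optimization loop from noisy feedback (placeholders throughout).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def train_and_get_feedback(w):
    # Stand-in for "fine-tune on mixture w, deploy, observe coarse noisy feedback".
    target = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
    return float(-np.sum((w - target) ** 2) + 0.01 * rng.normal())

def propose_mixture(gp, candidates, beta=1.0):
    mu, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mu + beta * std)]   # UCB acquisition

history_w, history_y = [], []
for _ in range(10):
    cands = rng.dirichlet(np.ones(5), size=256)     # candidate mixtures over 5 domains
    if history_w:
        gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(history_w), np.array(history_y))
        w = propose_mixture(gp, cands)
    else:
        w = cands[0]
    history_w.append(w)
    history_y.append(train_and_get_feedback(w))

print(history_w[int(np.argmax(history_y))])          # best mixture found so far
```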
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
848. DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
๐ฌ 17
โญ 6.00
๐ generative models
๐ท๏ธ text-to-image semantic leakage computer vision automatic evaluation multimodal
Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce **DeLeaker**, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce **SLIM** (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
849. LLM Unlearning with LLM Beliefs
๐ฌ 26
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Model Unlearning
Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs.
Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses.
However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets.
We refer to this as the ***squeezing effect***, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success.
To address this, we propose a ***bootstrapping*** (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its ***model beliefs***.
Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect.
By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility.
Extensive experiments on diverse benchmarks confirm the effectiveness of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
850. Hot PATE: Private Aggregation of Distributions for Diverse Tasks
๐ฌ 16
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Differential Privacy Sequential Text Generation Coordinated Ensembles
The Private Aggregation of Teacher Ensembles (PATE) framework enables privacy-preserving machine learning by aggregating responses from disjoint subsets of sensitive data. Adaptations of PATE to tasks with inherent output diversity such as text generation, where the desired output is a sample from a distribution, face a core tension: as diversity increases, samples from different teachers are less likely to agree, but lower agreement results in reduced utility for the same privacy requirements. Yet suppressing diversity to artificially increase agreement is undesirable, as it distorts the output of the underlying model, and thus reduces output quality.
We propose Hot PATE, a variant of PATE designed for diverse generative settings. We formalize the notion of a *diversity-preserving ensemble sampler* and introduce an efficient sampler that provably transfers diversity without incurring additional privacy cost. Hot PATE requires only API access to proprietary models and can be used as a drop-in replacement for existing "cold" PATE samplers. Our empirical results corroborate the theoretical guarantees, showing that Hot PATE achieves orders-of-magnitude improvements in utility per privacy budget on in-context learning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
851. Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
๐ฌ 27
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied Reasoning Vision-Language Model Embodied AI Reinforcement Learning Zero-shot Generalization
Generalization in embodied AI is hindered by the "seeing-to-doing gap", stemming from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. Then we train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
852. Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models
๐ฌ 14
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning diffusion models catastrophic forgetting image generation elastic weight consolidation generative replay
Catastrophic forgetting remains a central obstacle for continual learning in neural models.
Popular approaches---replay and elastic weight consolidation (EWC)---have limitations: replay requires a strong generator and is prone to distributional drift, while EWC implicitly assumes a shared optimum across tasks and typically uses a diagonal Fisher approximation.
In this work, we study the gradient geometry of diffusion models, which can already produce high-quality replay data.
We provide theoretical and empirical evidence that, in the low signal-to-noise ratio (SNR) regime, per-sample gradients become strongly collinear, yielding an empirical Fisher that is effectively rank-1 and aligned with the mean gradient.
Leveraging this structure, we propose a rank-1 variant of EWC that is as cheap as the diagonal approximation yet captures the dominant curvature direction.
We pair this penalty with a replay-based approach to encourage parameter sharing across tasks while mitigating drift.
On class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k), our method consistently improves average FID and reduces forgetting relative to replay-only and diagonal-EWC baselines. In particular, forgetting is nearly eliminated on MNIST and FashionMNIST and is roughly halved on ImageNet-1k.
These results suggest that diffusion models admit an approximately rank-1 Fisher.
With a better Fisher estimate, EWC becomes a strong complement to replay: replay encourages parameter sharing across tasks, while EWC effectively constrains replay-induced drift.
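A minimal sketch of the resulting penalty, assuming one has snapshotted the flattened parameters and the mean per-sample gradient `g_bar` after the previous task; this is an illustrative form of the rank-1 quadratic penalty, not the authors' exact code.

```python
# Rank-1 EWC penalty sketch: penalize movement along the dominant curvature direction.
import torch

def rank1_ewc_penalty(model, old_params, g_bar, lam=1.0):
    """old_params and g_bar are flat tensors snapshotted after the previous task."""
    theta = torch.nn.utils.parameters_to_vector(model.parameters())
    delta = theta - old_params
    # (g_bar^T delta)^2 = delta^T (g_bar g_bar^T) delta: a rank-1 quadratic form,
    # at the same O(d) cost as the diagonal Fisher approximation.
    return lam * (g_bar @ delta) ** 2

# Toy usage with a stand-in model and a random "mean gradient".
model = torch.nn.Linear(10, 2)
old = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
g_bar = torch.randn_like(old)   # placeholder for the stored mean per-sample gradient
print(rank1_ewc_penalty(model, old, g_bar))
```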
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
853. Demystifying Deep Search: A Holistic Evaluation with Hint-free Multi-Hop Questions and Factorised Metrics
๐ฌ 15
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Deep Search Agent
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviors into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilization, and refusal behavior. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilization despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
854. From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
๐ฌ 21
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Spatial VLMs General Robotic Manipulation VLM Reasoning Spatial Chain-of-Thought
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data construction pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD's capabilities in both "seeing" and "doing", achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 40.6% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
855. Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning
๐ฌ 12
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ PU Learning Non-contrastive representation Learning
Positive-Unlabeled (PU) learning aims to train a binary classifier (positive vs. negative) where only limited positive data and abundant unlabeled data are available. While widely applicable, state-of-the-art PU learning methods substantially underperform their supervised counterparts on complex datasets, especially without auxiliary negatives or pre-estimated parameters (e.g., a 14.26% gap on CIFAR-100 dataset). We identify the primary bottleneck as the challenge of learning discriminative representations under unreliable supervision. To tackle this challenge, we propose NcPU, a non-contrastive PU learning framework that requires no auxiliary information. NcPU combines a noisy-pair robust supervised non-contrastive loss (NoiSNCL), which aligns intra-class representations despite unreliable supervision, with a phantom label disambiguation (PLD) scheme that supplies conservative negative supervision via regret-based label updates. Theoretically, NoiSNCL and PLD can iteratively benefit each other from the perspective of the Expectation-Maximization framework. Empirically, extensive experiments demonstrate that: (1) NoiSNCL enables simple PU methods to achieve competitive performance; and (2) NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging datasets on post-disaster building damage mapping, highlighting its promise for real-world applications. Code: https://github.com/ICLR2026-285/NcPU.git.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
856. Gradient Descent Dynamics of Rank-One Matrix Denoising
๐ฌ 20
โญ 6.00
๐ learning theory
๐ท๏ธ Random Matrix Theory High-Dimensional Statistics Matrix Denoising Gradient Flow
Matrix denoising is a crucial component in machine learning, offering valuable insights into the behavior of learning algorithms (Bishop and Nasrabadi, 2006). This paper focuses on the rectangular matrix denoising problem, which involves estimating the left and right singular vectors of a rank-one matrix that is corrupted by additive noise. Traditional algorithms for this problem often exhibit high computational complexity, leading to the widespread use of gradient descent (GD)-based estimation methods with a quadratic cost function. However, the learning dynamics of these GD-based methods, particularly the analytical solutions that describe their exact trajectories, have been largely overlooked in existing literature. To fill this gap, we investigate the learning dynamics in detail, providing convergence proofs and asymptotic analysis. By leveraging tools from large random matrix theory, we derive a closed-form solution for the learning dynamics, characterized by the inner products of the estimates and the ground truth vectors. We rigorously prove the almost sure convergence of these dynamics as the signal dimensions tend to infinity. Additionally, we analyze the asymptotic behavior of the learning dynamics in the large-time limit, which aligns with the well-known Baik-Ben Arous-Péché phase transition phenomenon (Baik et al., 2005). Experimental results support our theoretical findings, demonstrating that when the signal-to-noise ratio (SNR) surpasses a critical threshold, learning converges rapidly from an initial value close to the stationary point. In contrast, estimation becomes infeasible when the ratio of the inner products between the initial left and right vectors and their corresponding ground truth vectors reaches a specific value, which depends on both the SNR and the data dimensions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
857. PSP: Prompt-Guided Self-Training Sampling Policy for Active Prompt Learning
๐ฌ 16
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Active Prompt Learning Reinforcement Learning CLIP
Active Prompt Learning (APL) using vision-language models (e.g., CLIP) has attracted considerable attention for mitigating the dependence on fully labeled datasets in downstream task adaptation. However, existing methods fail to explicitly leverage prompts to guide sample selection, resulting in the selected samples being ineffective in facilitating the prompt template's downstream task adaptation, while also overlooking valuable complementary information in the unselected samples. To fill this gap, we propose a novel Prompt-Guided Self-Training Sampling Policy (PSP) for APL, which integrates Soft Actor-Critic with a customized real-pseudo hybrid reward and vectorized critics to incorporate prompts in guiding sample selection toward those that facilitate the optimization of the prompt template, by jointly considering both selected and unselected samples. Specifically, PSP comprises two prominent components: a Vectorized Soft Actor-Critic Sampling Policy (VSSP) and an Uncertainty Augmented Self-Training (UST) mechanism. VSSP customizes a real-pseudo hybrid reward based on learned prompts and image features, which is fed into vectorized critics to estimate a Q-value for each sample and compute gradients that optimize the actor, allowing it to refine its sampling policy in an end-to-end manner to identify the most informative samples for prompt learning. Moreover, UST leverages the CLIP model from the previous round to generate reliable pseudo-labeled data based on the uncertainty and confidence of average predictions, thereby deepening the understanding of the overall data. Extensive experiments conducted on diverse real-world datasets validate the effectiveness of our PSP.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
858. Learning Survival Distributions with Individually Calibrated Asymmetric Laplace Distribution
๐ฌ 16
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Machine Learning; Probabilistic methods; Survival Analysis; Asymmetric Laplace Distribution; Calibration
Survival analysis plays a critical role in modeling time-to-event outcomes across various domains.
Although recent advances have focused on improving _predictive accuracy_ and _concordance_, fine-grained _calibration_ remains comparatively underexplored.
In this paper, we propose a survival modeling framework based on the Individually Calibrated Asymmetric Laplace Distribution (ICALD), which unifies _parametric_ and _nonparametric_ approaches based on the ALD.
We begin by revisiting the probabilistic foundation of the widely used _pinball_ loss in _quantile regression_ and its reparameterization as the _asymmetry form_ of the ALD.
This reparameterization enables a principled shift to _parametric_ modeling while preserving the flexibility of _nonparametric_ methods.
Furthermore, we show theoretically that ICALD, with the _quantile regression_ loss, is probably approximately individually calibrated.
Then we design an extended ICALD framework that supports both _pre-calibration_ and _post-calibration_ strategies.
Extensive experiments on 14 synthetic and 7 real-world datasets demonstrate that our method achieves competitive performance in terms of _predictive accuracy_, _concordance_, and _calibration_, while outperforming 12 existing baselines including recent _pre-calibration_ and _post-calibration_ methods.
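As a small illustration of the pinball-loss/ALD link the paper builds on (generic parameter names, not the ICALD model), the pinball loss at level tau is, up to constants, the negative log-likelihood of an asymmetric Laplace distribution.

```python
# Pinball loss and its reparameterization as an ALD negative log-likelihood.
import torch

def pinball_loss(y, mu, tau):
    u = y - mu
    return torch.maximum(tau * u, (tau - 1.0) * u)

def ald_nll(y, mu, sigma, tau):
    """Negative log-likelihood of ALD(mu, sigma, tau); minimizing it in mu with
    sigma fixed recovers quantile regression at level tau."""
    return pinball_loss(y, mu, tau) / sigma + torch.log(sigma / (tau * (1.0 - tau)))

y, mu = torch.tensor([2.0]), torch.tensor([1.5])
print(pinball_loss(y, mu, tau=0.9), ald_nll(y, mu, torch.tensor([0.5]), tau=0.9))
```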
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
859. True Self-Supervised Novel View Synthesis is Transferable
๐ฌ 30
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Novel View Synthesis Self-Supervised Unsupervised Representation Learning
In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
860. Learning to Answer from Correct Demonstrations
๐ฌ 12
โญ 6.00
๐ learning theory
๐ท๏ธ Prompt-Completion Imitation Learning Likelihood Maximization
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead suggest an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from demonstrations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
861. FastVMT: Eliminating Redundancy in Video Motion Transfer
๐ฌ 7
โญ 6.00
๐ generative models
๐ท๏ธ Video Motion Transfer; Efficiency; Diffusion model;
Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: **motion redundancy** arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; **gradient redundancy** occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood so that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43× speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
862. Conformal Prediction for Long-Tailed Classification
๐ฌ 14
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ conformal prediction uncertainty quantification long tail class imbalance fine-grained image classification
Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii) be a reasonable size, allowing users to easily verify candidate labels. Unfortunately, existing conformal prediction methods, when applied to the long-tailed setting, force practitioners to make a binary choice between small sets with poor class-conditional coverage or sets with very good class-conditional coverage but that are extremely large. We propose methods with guaranteed marginal coverage that smoothly trade off between set size and class-conditional coverage. First, we introduce a new conformal score function, coined prevalence-adjusted softmax, that targets macro-coverage, a relaxed notion of class-conditional coverage. Second, we propose a label-weighted conformal prediction method that allows us to interpolate between marginal and class-conditional conformal prediction. We demonstrate our methods on Pl@ntNet-300K and iNaturalist-2018, two long-tailed image datasets with 1,081 and 8,142 classes, respectively.
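For context, a generic split-conformal construction looks like the sketch below; the paper's prevalence-adjusted softmax and label-weighted procedures would replace the plain "1 minus softmax probability" score used here purely for illustration.

```python
# Generic split-conformal prediction sets (illustrative score, not the paper's).
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
    return np.quantile(scores, level)

def prediction_sets(test_probs, q):
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)   # stand-in calibration softmax outputs
cal_labels = rng.integers(0, 10, size=500)
q = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(rng.dirichlet(np.ones(10), size=3), q)
print([len(s) for s in sets])                       # set sizes for 3 test points
```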
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
863. Efficient Spatially-Variant Convolution via Differentiable Sparse Kernel Complex
๐ฌ 14
โญ 6.00
๐ optimization
๐ท๏ธ Kernel Approximation Differentiable Filtering Spatially-Varying Convolution Efficient Image Processing
Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices.
Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels.
We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining or additional runtime overhead.
Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
864. Condition Matters in Full-head 3D GANs
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ 3D Head Synthesis 3D Avatar 3D-aware GANs
Conditioning is crucial for stable training of full-head 3D-aware GANs. Without any conditioning signal, the model suffers from severe mode collapse, making it impractical to train. However, previous full-head 3D-aware GANs conventionally choose the view angle as the conditioning input, which leads to a bias in the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use a *view-invariant semantic feature* as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset. We leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The image CLIP feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads. Moreover, as GANs often experience slower improvements in diversity once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
865. ProxyThinker: Test-Time Guidance through Small Visual Reasoners
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ decoding-time algorithms visual reasoning
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as demonstrated by emergent behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multidisciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://anonymous.4open.science/r/ProxyThinker-FAAF.
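A hedged sketch of the decoding-time arithmetic, assuming Hugging Face-style causal models; the guidance weight `alpha` and greedy decoding are illustrative simplifications, not the paper's full pipeline.

```python
# Steer a large base model with the logit difference between a small RFT reasoner
# and its own base model (illustrative sketch).
import torch

@torch.no_grad()
def proxy_logits(large_base, small_rft, small_base, input_ids, alpha=1.0):
    z_large = large_base(input_ids).logits[:, -1]
    z_rft = small_rft(input_ids).logits[:, -1]
    z_small = small_base(input_ids).logits[:, -1]
    # Add the "reasoning direction" learned by RFT to the large model's logits.
    return z_large + alpha * (z_rft - z_small)

def decode_step(large_base, small_rft, small_base, input_ids):
    logits = proxy_logits(large_base, small_rft, small_base, input_ids)
    next_id = logits.argmax(dim=-1, keepdim=True)   # greedy step for illustration
    return torch.cat([input_ids, next_id], dim=-1)
```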
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
866. Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning
๐ฌ 11
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning Generative Models Flow Matching
Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the *Single-Step Completion Policy* (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL (GCRL), enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and GCRL benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
867. Motion-Aligned Word Embeddings for Text-to-Motion Generation
๐ฌ 4
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ text-to-motion generation large language model fine-tuning word embeddings
Existing text-to-motion (T2M) generation models typically rely on pretrained large language models to encode textual inputs. However, these models, trained on generic text corpora, lack explicit alignment between motion-related words (e.g., "clockwise", "quickly") and human skeletal movements. This misalignment, fundamentally rooted in the word embedding layers, severely limits the ability of T2M models to understand and generalize fine-grained motion semantics. To tackle this issue, we propose Motion-Aligned Text Encoding (MATE), a novel framework that explicitly incorporates motion semantics into the word embedding layers of large language models to enhance text-motion alignment for motion generation. To address the challenge of inherent semantic entanglement in motion sequences, MATE introduces two key components: 1) a motion localization strategy that establishes localized correspondences between sub-texts and motion segments, enabling soft attention guidance for semantic localization; and 2) a motion disentanglement module that isolates word-specific motion semantics via contrastive kinematic prototypes, ensuring word-level alignment between linguistic and kinematic representations. Remarkably, language models enhanced with MATE can be seamlessly integrated into existing T2M methods, significantly surpassing state-of-the-art performance on two standard benchmarks with minimal modifications. Code and pretrained models will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
868. Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure
๐ฌ 8
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ llm nlg uncertainty estimation uncertainty measures single-sequence measures
Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Leading uncertainty estimation methods generate and analyze multiple output sequences, which is computationally expensive and impractical at scale. In this work, we inspect the theoretical foundations of these methods and explore new directions to enhance computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically principled uncertainty measure. To approximate this alternative measure, we propose G-NLL, obtained using a single output sequence from greedy decoding. This approach streamlines uncertainty estimation while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various scenarios. Our work lays the theoretical foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of the prevalent methods that are more complex and resource-intensive.
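A minimal sketch of computing the single-sequence measure with a Hugging Face-style causal LM; the API calls are assumed and this is not the authors' code.

```python
# G-NLL sketch: negative log-likelihood of the greedy completion.
import torch

@torch.no_grad()
def g_nll(model, tokenizer, prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens,
                         return_dict_in_generate=True, output_scores=True)
    gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
    # Sum the log-probabilities of the greedily chosen tokens, step by step.
    logps = [torch.log_softmax(s[0], dim=-1)[t] for s, t in zip(out.scores, gen_ids)]
    return -torch.stack(logps).sum()   # larger value = more uncertain answer
```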
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
869. PLANETALIGN: A Comprehensive Python Library for Benchmarking Network Alignment
๐ฌ 4
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Network Alignment Graph Machine Learning
Network alignment (NA) aims to identify node correspondence across different networks and serves as a critical cornerstone behind various downstream multi-network learning tasks. Despite growing research in NA, there lacks a comprehensive library that facilitates the systematic development and benchmarking of NA methods. In this work, we introduce PLANETALIGN, a comprehensive Python library for network alignment that features a rich collection of built-in datasets, methods, and evaluation pipelines with easy-to-use APIs. Specifically, PLANETALIGN integrates 18 datasets and 14 NA methods with extensible APIs for easy use and development of NA methods. Our standardized evaluation pipeline encompasses a wide range of metrics, enabling a systematic assessment of the effectiveness, scalability, and robustness of NA methods. Through extensive comparative studies, we reveal practical insights into the strengths and limitations of existing NA methods. We hope that PLANETALIGN can foster a deeper understanding of the NA problem and facilitate the development and benchmarking of more effective, scalable, and robust methods in the future. The The source code of PLANETALIGN is available at https://anonymous.4open.science/r/PlanetAlign-E9BA
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
870. From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
๐ฌ 24
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Compression Human and Machine Cognition Information Theory Concepts
Humans organize knowledge into compact categories that balance compression with semantic meaning preservation. Large Language Models (LLMs) demonstrate striking linguistic abilities, yet whether they achieve this same balance remains unclear. We apply the Information Bottleneck principle to quantitatively compare how LLMs and humans navigate this compression-meaning trade-off.
Analyzing embeddings from 40+ LLMs against classic human categorization benchmarks, we uncover three key findings. First, LLMs broadly align with human categories but miss fine-grained semantic distinctions crucial for human understanding. Second, LLMs demonstrate aggressive statistical compression, achieving "optimal" information-theoretic efficiency, while humans prioritize contextual richness and adaptive flexibility. Third, encoder models surprisingly outperform decoder models in human alignment, suggesting that generation and understanding rely on distinct mechanisms in current architectures. In addition, training dynamics analysis reveals that conceptual structure develops in distinct phases: rapid initial formation followed by architectural reorganization, with semantic processing migrating from deeper to mid-network layers as models discover more efficient encoding. These divergent strategies, where LLMs optimize for compression and humans for adaptive utility, reveal fundamental differences between artificial and biological intelligence, guiding development toward more human-aligned AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
871. Energy-Based Transformers are Scalable Learners and Thinkers
๐ฌ 22
โญ 6.00
๐ generative models
๐ท๏ธ Energy-Based Models System 2 Thinking Reasoning Verification Scaling Transformers Generative Modeling
Inference-time computation, analogous to human System 2 Thinking, has recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" We find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs)---a new class of Energy-Based Models (EBMs)---to assign an energy value to every input and candidate-prediction, enabling predictions through energy minimization until convergence. To support this approach, we introduce several key techniques for stable and parallelizable training, which enable the emergence of strong System 2 Thinking capabilities and scalable EBMs. Across discrete and continuous modalities, we find EBTs outperform the Transformer++ approach, scaling up to 35% faster during pretraining, and improving inference-time performance by up to 29%. EBTs also surpass Diffusion Transformers on image denoising while requiring 99% fewer forward passes. Moreover, System 2 Thinking with EBTs yields larger performance gains on data that is farther out-of-distribution, and EBTs achieve better results than existing models on most downstream tasks despite achieving the same or worse pretraining performance, enabling EBTs to generalize better than existing approaches. Consequently, EBTs are a flexible and exciting new approach for scaling both the learning and thinking capabilities of models.
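A toy sketch of the prediction-as-energy-minimization view: start from a candidate output and descend the learned energy E(x, y) in y. A tiny MLP stands in for an Energy-Based Transformer; dimensions and step counts are arbitrary.

```python
# Prediction by iterative energy minimization (illustrative, not the paper's model).
import torch

energy = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.GELU(), torch.nn.Linear(64, 1))

def think(x, steps=16, lr=0.1):
    y = torch.zeros(x.shape[0], 4, requires_grad=True)      # candidate prediction
    for _ in range(steps):
        e = energy(torch.cat([x, y], dim=-1)).sum()          # verifier-style energy
        (g,) = torch.autograd.grad(e, y)
        y = (y - lr * g).detach().requires_grad_(True)       # one "thinking" step
    return y.detach()

x = torch.randn(2, 4)
print(think(x).shape)
```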
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
872. Less is more: Clustered Cross-Covariance Control for Offline RL
๐ฌ 31
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning; offline RL; OOD area; Clustering-based RL
A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data or datasets dominated by out-of-distribution (OOD) areas exacerbate this issue. Our theoretical analysis and experiments show that the standard squared error objective induces a harmful TD cross-covariance. This effect amplifies in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies. The first, partitioned buffer sampling, restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions, yielding a scheme that is easy to integrate with existing implementations, namely Clustered Cross-Covariance Control for TD ($C^4$). The second is an explicit gradient-based corrective penalty that cancels the covariance-induced bias within each update. We prove that buffer partitioning preserves the lower bound property of the maximization objective, and that these constraints mitigate excessive conservatism in extreme OOD areas without altering the core behavior of policy-constrained offline reinforcement learning. Empirically, our method showcases higher stability and up to 30% improvement in returns over prior methods, especially with small datasets and splits that emphasize OOD areas.
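A minimal sketch of the partitioned buffer sampling idea, assuming transitions are clustered by state and each update draws from a single partition (partition count and clustering key are illustrative):

```python
# Minimal sketch of partitioned buffer sampling: cluster transitions by state,
# then draw each minibatch from a single partition so updates stay localized.
# The clustering key and number of partitions are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
states = rng.normal(size=(5000, 8))          # stand-in for logged offline states
partition_ids = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(states)
partitions = [np.flatnonzero(partition_ids == c) for c in range(16)]

def sample_minibatch(batch_size=256):
    c = rng.integers(len(partitions))              # pick one partition per update
    idx = rng.choice(partitions[c], size=batch_size, replace=True)
    return idx                                     # indices into the replay buffer

batch = sample_minibatch()
print(batch.shape, np.unique(partition_ids[batch]))  # all from one cluster
```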
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
873. Pose-RFT: Aligning MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human Pose Estimation Multimodal Large Language Model Reinforcement Fine-Tuning
Generating 3D human poses from multimodal inputs such as text or images requires models to capture both rich semantic and spatial correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise, their supervised fine-tuning (SFT) paradigm struggles to resolve the task's inherent ambiguity. Its reliance on objectives like SMPL parameter regression creates a critical alignment gap, compromising the model's ability to achieve the required semantic and spatial fidelity. To close the gap, we propose Pose-RFT, a framework that shifts the learning paradigm from supervised imitation to reward-driven reinforcement fine-tuning (RFT). We address the core technical challenge of this task: a
hybrid action space requiring joint optimization of discrete language and continuous pose outputs. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that enables stable optimization by performing group-wise reward normalization over sampled responses. Pose-RFT incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation.
Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of our approach in closing the alignment gap for 3D pose generation.
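A minimal sketch of group-wise reward normalization over sampled responses, the core ingredient attributed to HyGRPO above; the reward values and group size are placeholders:

```python
# Minimal sketch of group-wise reward normalization: for each prompt, several
# responses are sampled and each reward is standardized within its own group
# before being used as an advantage signal. Rewards here are placeholders.
import torch

def group_normalize(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -> normalized advantages, same shape."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.9, 0.4, 0.1],    # group sampled for prompt 1
                        [0.7, 0.7, 0.8, 0.6]])   # group sampled for prompt 2
advantages = group_normalize(rewards)
print(advantages)
```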
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
874. Accelerated Learning with Linear Temporal Logic using Differentiable Simulation
๐ฌ 11
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning temporal logic differentiable simulation
Ensuring that reinforcement learning (RL) controllers satisfy safety and reliability constraints in real-world settings remains challenging: state-avoidance and constrained Markov decision processes often fail to capture trajectory-level requirements or induce overly conservative behavior. Formal specification languages such as linear temporal logic (LTL) offer correct-by-construction objectives, yet their rewards are typically sparse, and heuristic shaping can undermine correctness. We introduce, to our knowledge, the first end-to-end framework that integrates LTL with differentiable simulators, enabling efficient gradient-based learning directly from formal specifications. Our method relaxes discrete automaton transitions via soft labeling of states, yielding differentiable rewards and state representations that mitigate the sparsity issue intrinsic to LTL while preserving objective soundness. We provide theoretical guarantees connecting Büchi acceptance to both discrete and differentiable LTL returns and derive a tunable bound on their discrepancy in deterministic and stochastic settings. Empirically, across complex, nonlinear, contact-rich continuous-control tasks, our approach substantially accelerates training and achieves up to twice the returns of discrete baselines. We further demonstrate compatibility with reward machines, thereby covering co-safe LTL and LTLf without modification. By rendering automaton-based rewards differentiable, our work bridges formal methods and deep RL, enabling safe, specification-driven learning in continuous domains.
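A minimal sketch of the soft-labeling idea: a hard atomic proposition becomes a sigmoid, and a toy two-state "eventually goal" automaton is propagated as a probability vector so the acceptance signal is differentiable (the predicate, automaton, and temperature are assumptions, not the paper's construction):

```python
# Minimal sketch of differentiable automaton rewards: hard predicates become
# sigmoid "soft labels", and the automaton state is a probability vector updated
# by soft transitions. The two-state automaton and predicate are toy examples.
import torch

def soft_in_goal(state, center, radius, temp=10.0):
    # Soft version of the atomic proposition "state is inside the goal region".
    return torch.sigmoid(temp * (radius - torch.linalg.norm(state - center, dim=-1)))

def step_automaton(q, p_goal):
    # Two-state automaton for "eventually goal": q = [P(not yet), P(accepted)].
    not_yet, accepted = q[..., 0], q[..., 1]
    new_accepted = accepted + not_yet * p_goal
    new_not_yet = not_yet * (1.0 - p_goal)
    return torch.stack([new_not_yet, new_accepted], dim=-1)

center, radius = torch.tensor([1.0, 1.0]), 0.3
q = torch.tensor([1.0, 0.0])                      # start in the "not yet" state
trajectory = torch.tensor([[0.0, 0.0], [0.5, 0.6], [0.95, 1.05]], requires_grad=True)
for s in trajectory:
    q = step_automaton(q, soft_in_goal(s, center, radius))
reward = q[1]                                     # differentiable acceptance probability
reward.backward()                                 # gradients flow back to the trajectory
print(float(reward), trajectory.grad.shape)
```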
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
875. Improving Attributed Long-form Question Answering with Intent Awareness
๐ฌ 17
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ deep research long form question answering attributed question answering RAG supervised fine-tuning
Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model's intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents enhance both zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
876. Differentially Private Equilibrium Finding in Polymatrix Games
๐ฌ 12
โญ 6.00
๐ learning theory
๐ท๏ธ Polymatrix Game Differential Privacy
We study equilibrium finding in polymatrix games under differential privacy constraints. Prior work in this area fails to achieve both high-accuracy equilibria and a low privacy budget. To better understand the fundamental limitations of differential privacy in games, we show hardness results establishing that no algorithm can simultaneously obtain high accuracy and a vanishing privacy budget as the number of players tends to infinity. This impossibility holds in two regimes: (i) we seek equilibrium approximation guarantees in terms of Euclidean \emph{distance} to the equilibrium set, and (ii) the adversary has access to all communication channels. We then consider the more realistic setting in which the adversary can access only a bounded number of channels and propose a new distributed algorithm that recovers strategies with a simultaneously vanishing \emph{Nash gap} (in expected utility, also referred to as \emph{exploitability}) and \emph{privacy budget} as the number of players increases. Our approach leverages structural properties of polymatrix games. To our knowledge, this is the first work to achieve this combination in equilibrium computation. Finally, we also provide numerical results to justify our algorithm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
877. On the Spectral Differences Between NTK and CNTK and Their Implications for Point Cloud Recognition
๐ฌ 7
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Neural Tangent Kernel Interpretability of neural networks
The Convolutional Neural Tangent Kernel (CNTK) offers a principled framework for understanding convolutional architectures in the infinite-width regime. However, a comprehensive spectral comparison between CNTK and the classical Neural Tangent Kernel (NTK) remains underexplored. In this work, we present a detailed analysis of the spectral properties of CNTK and NTK, revealing that point cloud data exhibits a stronger alignment with the spectral bias of CNTK than images. This finding suggests that convolutional structures are inherently more suited to such geometric and irregular data formats. Based on this insight, we implement CNTK-based kernel regression for point cloud recognition tasks and demonstrate that it significantly outperforms NTK and other kernel baselines, especially in low-data settings. Furthermore, we derive a closed-form expression that connects CNTK with NTK in hybrid architectures. In particular, we introduce a closed form for CNTK followed by NTK which, while not the main focus, achieves strong empirical performance when applied to point-cloud tasks. Our study not only provides new theoretical understanding of spectral behaviors in neural tangent kernels but also shows that these insights can guide the practical design of CNTK-based regression for structured data such as point clouds.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
878. On the Limits of Sparse Autoencoders: A Theoretical Framework and Reweighted Remedy
๐ฌ 13
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ sparse autoencoder SAE theoretical understanding
Sparse autoencoders (SAEs) have recently emerged as a powerful tool for interpreting the features learned by large language models (LLMs). By reconstructing features with sparsely activated networks, SAEs aim to recover complex superposed polysemantic features into interpretable monosemantic ones. Despite their wide applications, it remains unclear under what conditions SAEs can fully recover the ground truth monosemantic features from the superposed polysemantic ones. In this paper, we provide the first theoretical analysis with a closed-form solution for SAEs, revealing that they generally fail to fully recover the ground truth monosemantic features unless the ground truth features are extremely sparse. To improve the feature recovery of SAEs in general cases, we propose a reweighting strategy aimed at enhancing the reconstruction of the ground truth monosemantic features instead of the observed polysemantic ones. We further establish a theoretical weight selection principle for our proposed weighted SAE (WSAE). Experiments across multiple settings validate our theoretical findings and demonstrate that our WSAE significantly improves feature monosemanticity and interpretability.
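A minimal sketch of a reweighted SAE objective; the uniform placeholder weights stand in for the paper's principled weight selection rule:

```python
# Minimal sketch of a reweighted SAE objective: the reconstruction error is
# weighted per example instead of uniformly. The weighting scheme below is an
# illustrative placeholder, not the paper's selection rule.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_model=32, d_hidden=128):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))
        return self.dec(z), z

def weighted_sae_loss(x, x_hat, z, weights, l1_coef=1e-3):
    recon = (weights * (x - x_hat).pow(2).mean(dim=-1)).mean()  # reweighted reconstruction
    sparsity = z.abs().sum(dim=-1).mean()                       # usual L1 sparsity term
    return recon + l1_coef * sparsity

sae = TinySAE()
x = torch.randn(64, 32)
weights = torch.ones(64)          # placeholder; the paper derives a principled choice
x_hat, z = sae(x)
print(float(weighted_sae_loss(x, x_hat, z, weights)))
```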
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
879. Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
๐ฌ 19
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models; Reasoning; Reinforcement Learning; Supervised Fine-Tuning
Recent advances in large language model (LLM) reasoning have shown that reasoning ability can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce \textbf{ReLIFT} (\textbf{Re}inforcement \textbf{L}earning \textbf{I}nterleaved with Online \textbf{F}ine-\textbf{T}uning), a novel training strategy. ReLIFT employs RL for general training, but interleaves it with targeted SFT on challenging questions for which high-quality solutions are collected online. By alternating between RL and SFT, ReLIFT addresses model weaknesses as they emerge. Empirically, ReLIFT outperforms previous RLVR methods by an average of +6.7 points across a suite of six benchmarks (five math reasoning and one out-of-distribution). More importantly, ReLIFT surpasses baselines such as individual RL, individual SFT, and various hybrid approaches while reducing the required training time. These results provide compelling evidence that ReLIFT is a powerful and resource-efficient paradigm for developing capable reasoning models. The code is available at \href{https://anonymous.4open.science/r/Learning-What-Reinforcement-Learning-Can-t-6AFF/}{here}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
880. FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Animation Gaussian Avatar Feedforward Gaussian Model
Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose FastGHA, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
881. Rodrigues Network for Learning Robot Actions
๐ฌ 11
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot learning Action understanding Neural architecture
Understanding and predicting articulated actions is important in robot learning. However, common architectures such as MLPs and Transformers lack inductive biases that reflect the underlying kinematic structure of articulated systems. To this end, we propose the **Neural Rodrigues Operator**, a learnable generalization of the classical forward kinematics operation, designed to inject kinematics-aware inductive bias into neural computation. Building on this operator, we design the **Rodrigues Network (RodriNet)**, a novel neural architecture specialized for processing actions. We evaluate the expressivity of our network on two synthetic tasks on kinematics and motion prediction, showing significant improvements compared to standard backbones. We further demonstrate its effectiveness in two realistic applications: (i) imitation learning on robotic benchmarks with the Diffusion Policy, and (ii) single-image 3D hand reconstruction. Our results suggest that integrating structured kinematic priors into the network architecture improves action learning in various domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
882. Contact Wasserstein Geodesics for Non-Conservative Schrödinger Bridges
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ Schrödinger Bridge Generative Models Hamiltonian Contact Hamiltonian Differential Geometry Wasserstein metric
The Schrödinger Bridge provides a principled framework for modeling stochastic processes between distributions; however, existing methods are limited by energy-conservation assumptions, which constrain the bridge's shape and prevent it from modeling varying-energy phenomena. To overcome this, we introduce the non-conservative generalized Schrödinger bridge (NCGSB), a novel, energy-varying reformulation based on contact Hamiltonian mechanics. By allowing energy to change over time, the NCGSB captures a broader class of real-world stochastic processes, with richer and more faithful intermediate dynamics. By parameterizing the Wasserstein manifold, we lift the bridge problem to a tractable geodesic computation in a finite-dimensional space. Unlike computationally expensive iterative solutions, our contact Wasserstein geodesic (CWG) is naturally implemented via a ResNet architecture and relies on a non-iterative solver with near-linear complexity. Furthermore, CWG supports guided generation by modulating a task-specific distance metric. We validate our framework on tasks including manifold navigation, molecular dynamics predictions, and image generation, demonstrating its practical benefits and versatility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
883. Cautious Optimizers: Improving Training with One Line of Code
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Optimizer AdamW
AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a \textbf{single-line modification in PyTorch} to any momentum-based optimizer, which we rename the cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing not only a consistent speed-up on LLM pretraining and post-training tasks, but also better results in MAE pretraining, with minimal extra hyperparameter tuning.
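A minimal sketch of the cautious masking idea applied to plain SGD with momentum; the exact normalization and scaling in the paper's C-AdamW/C-Lion may differ:

```python
# Minimal sketch of the "cautious" masking idea on an SGD-momentum update:
# components of the proposed update that disagree in sign with the current
# gradient are zeroed out. Scaling details in the paper may differ.
import torch

def cautious_momentum_step(param, grad, momentum, lr=1e-2, beta=0.9):
    momentum.mul_(beta).add_(grad)                   # standard momentum accumulation
    mask = (momentum * grad > 0).to(momentum.dtype)  # the one-line "cautious" mask
    param.add_(momentum * mask, alpha=-lr)
    return param, momentum

param = torch.zeros(5)
momentum = torch.zeros(5)
for _ in range(3):
    grad = torch.randn(5)
    param, momentum = cautious_momentum_step(param, grad, momentum)
print(param)
```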
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
884. Towards a Sharp Analysis of Learning Offline $f$-Divergence-Regularized Contextual Bandits
๐ฌ 9
โญ 6.00
๐ learning theory
๐ท๏ธ contextual bandits policy optimization $f$-divergence regularization
Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $\tilde{\Theta}(\epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we achieve an
$\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis,
surpassing the existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and the $\tilde{O}(\epsilon^{-2})$ bound under
single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for $f$-divergences with strongly convex $f$, to which reverse KL *does not* belong, we show that the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with $f$-divergence regularization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
885. Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
๐ฌ 12
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ In-Context Learning Transformer Approximation Theory Kernel Regression on Manifold
While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. This paper initiates a theoretical study of ICL for regression of H\"older functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through their interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for H\"older functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of H\"older functions on manifolds, which scales exponentially with respect to the prompt length with the exponent depending on the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.
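A minimal sketch of the attention-kernel correspondence: softmax attention over in-context (x, y) pairs behaves like a Nadaraya-Watson smoother at the query, here compared against a Gaussian kernel on synthetic data (all values illustrative):

```python
# Minimal sketch of the attention-kernel connection: softmax attention over
# in-context (x, y) pairs acts like a Nadaraya-Watson kernel smoother at the
# query, mirroring the Gaussian-kernel correlation noted above.
import numpy as np

rng = np.random.default_rng(0)
x_prompt = rng.uniform(-1, 1, size=(32, 4))              # in-context inputs
y_prompt = np.sin(x_prompt.sum(axis=1))                  # their labels
x_query = rng.uniform(-1, 1, size=(1, 4))

# Attention-style prediction: scores from inner products, softmax over the prompt.
scores = x_query @ x_prompt.T / np.sqrt(4)
attn = np.exp(scores) / np.exp(scores).sum()
y_attn = (attn * y_prompt).sum()

# Nadaraya-Watson prediction with a Gaussian kernel on distances.
sq_dist = ((x_query - x_prompt) ** 2).sum(axis=1)
kernel = np.exp(-sq_dist / (2 * 0.5 ** 2))
y_kernel = (kernel * y_prompt).sum() / kernel.sum()

print(y_attn, y_kernel)
```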
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
886. Using maximal information auxiliary variables to improve synthetic data generation based on TabPFN foundation models
๐ฌ 17
โญ 6.00
๐ generative models
๐ท๏ธ tabular synthetic data generation in-context learning tabular foundation models
Synthetic data generation for tabular datasets is shifting toward the use of large, general-purpose foundation models. TabPFN, a state-of-the-art example, uses in-context learning to generate probabilistic predictions conditioned on observed examples in a single forward pass. However, when variables are only weakly associated with others, the model's ability to generate realistic synthetic data deteriorates, as the context examples provide little predictive signal. To address this, we introduce the maximal information auxiliary variable (MIAV) strategy, which increases context information with auxiliary variables constructed by rank-matching random noise variables to real data. We establish theoretical properties of the approach which explain its good performance for weakly associated variables. Additional practical advantages of the MIAV approach include improved computational efficiency and invariance to variable order during the synthetic data generation process. Empirical evaluations, on simulated and real datasets, illustrate how the MIAV strategy improves data generation when compared to direct application of TabPFN, and is competitive against other baselines. To illustrate the generality of the MIAV approach we also present an implementation based on the TabICL model (a more scalable tabular foundation model restricted to classification tasks) for performing synthetic data generation on categorical datasets. Overall, MIAV offers an effective foundation model-based alternative to bespoke synthetic data generators.
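A minimal sketch of rank-matching random noise to a real column to build an auxiliary variable, the core of the MIAV construction as described; the data is a synthetic stand-in:

```python
# Minimal sketch of rank-matching an auxiliary noise variable to a real column:
# sorted Gaussian noise is assigned so that its ranks match the ranks of the
# real values, giving an auxiliary variable perfectly rank-correlated with it.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
real_column = rng.exponential(size=1000)      # stand-in for a weakly associated variable
noise = rng.normal(size=real_column.shape[0])
ranks = real_column.argsort().argsort()       # rank of each real value
aux = np.sort(noise)[ranks]                   # noise value with the matching rank

rho, _ = spearmanr(real_column, aux)          # 1.0 by construction (up to ties)
print(rho)
```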
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
887. Declarative Audio Editing with Audio Language Model
๐ฌ 25
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Audio editing Latent diffusion model Audio language model
Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media.
However, recent generative audio editing models depend on template-like instruction formats and are restricted to mono-channel audio.
These models fail to deal with declarative audio editing, where the user declares what the desired outcome should be, while leaving the details of editing operations to the system.
We introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capability of audio language models with the generative power of latent diffusion.
Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating events.
These operations are then executed by a diffusion model trained to manipulate stereo audio.
To support this, we design a data synthesis pipeline that produces paired examples of high-level instructions, atomic edit operations, and audio before and after each edit operation.
Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods.
Demos are provided in the supplementary file. Code and data will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
888. Permutation-Consistent Variational Encoding for Incomplete Multi-View Multi-Label Classification
๐ฌ 4
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multi-Label Classification Multi-View Learning Information bottleneck
Incomplete multi-view multi-label learning is fundamentally an information integration problem under simultaneous view and label incompleteness. We introduce Permutation-Consistent Variational Encoding framework (PCVE) with an information bottleneck strategy, which learns variational representations capable of aggregating shared semantics across views while remaining robust to incompleteness. PCVE formulates a principled objective that maximizes a variational evidence lower bound to retain task-relevant information, and introduces a permutation-consistent regularization to encourage distributional consistency among representations that encode the same target semantics from different views. This regularization acts as an information alignment mechanism that suppresses view-private redundancy and mitigates over-alignment, thereby improving both sufficiency and consistency of the learned representations. To address missing labels, PCVE further incorporates a masked multi-label learning objective that leverages available supervision while modeling label dependencies. Extensive experiments across diverse benchmarks and missing ratios demonstrate consistent gains over state-of-the-art methods in multi-label classification, while enabling reliable inference of missing views without explicit imputation. Analyses corroborate that the proposed information-theoretic formulation improves cross-view semantic cohesion and preserves discriminative capacity, underscoring the effectiveness and generality of PCVE for incomplete multi-view multi-label learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
889. Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
๐ฌ 21
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ long term time series forecasting linear model characteristic roots modes noise robustness rank reduction root purge
Time series forecasting remains a critical challenge across numerous domains, yet the effectiveness of complex models often varies unpredictably across datasets. Recent studies highlight the surprising competitiveness of simple linear models, suggesting that their robustness and interpretability warrant deeper theoretical investigation. This paper presents a systematic study of linear models for time series forecasting, with a focus on the role of characteristic roots in temporal dynamics. We begin by analyzing the noise-free setting, where we show that characteristic roots govern long-term behavior and explain how design choices such as instance normalization and channel independence affect model capabilities. We then extend our analysis to the noisy regime, revealing that models tend to produce spurious roots. This leads to the identification of a key data-scaling property: mitigating the influence of noise requires disproportionately large training data, highlighting the need for structural regularization. To address these challenges, we propose two complementary strategies for robust root restructuring. The first uses rank reduction techniques, including Reduced-Rank Regression and Direct Weight Rank Reduction, to recover the low-dimensional latent dynamics. The second, a novel adaptive method called Root Purge, encourages the model to learn a noise-suppressing null space during training. Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings. Our findings underscore the potential of integrating classical theories for linear systems with modern learning techniques to build robust, interpretable, and data-efficient forecasting models.
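A minimal sketch of reading off characteristic roots from a fitted linear forecaster: the learned lag weights define a companion matrix whose eigenvalues govern long-term behavior (synthetic data; the rank-reduction and Root Purge steps are not shown):

```python
# Minimal sketch of inspecting the characteristic roots of a fitted linear
# forecaster: the learned lag weights define a companion matrix whose
# eigenvalues are the roots governing long-term behavior. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=t.size)  # noisy seasonal signal

lookback = 48
X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
y = series[lookback:]
w, *_ = np.linalg.lstsq(X, y, rcond=None)        # linear model over the lookback window

# Companion matrix of the difference equation y_t = c_1 y_{t-1} + ... + c_L y_{t-L}.
L = lookback
companion = np.zeros((L, L))
companion[0] = w[::-1]                           # most recent lag first
companion[1:, :-1] = np.eye(L - 1)
roots = np.linalg.eigvals(companion)
print(np.sort(np.abs(roots))[-5:])               # largest root magnitudes
```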
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
890. PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ ReFlow Flow matching Rectified flow
We introduce $\texttt{PairFlow}$, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. $\texttt{PairFlow}$ addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher.
At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source-target samples. Despite its extremely low cost, taking only up to 1.7\% of the compute needed for full model training, $\texttt{PairFlow}$ matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
891. Pretrain-Test Task Alignment Governs Generalization in In-Context Learning
๐ฌ 16
โญ 6.00
๐ learning theory
๐ท๏ธ In-Context Learning Task Alignment Spectral Bias Pretraining Linear Attention
In-context learning (ICL) is a central capability of Transformer models, but the structures in data that enable its emergence and govern its robustness remain poorly understood. In this work, we study how the structure of pretraining tasks governs generalization in ICL. Using a solvable model for ICL of linear regression by linear attention, we derive an exact expression for ICL generalization error in high dimensions under arbitrary pretraining-testing task covariance mismatch. This leads to a new alignment measure that quantifies how much information about the pretraining task distribution is useful for inference at test time. We show that this measure directly predicts ICL performance not only in the solvable model but also in nonlinear Transformers. Our analysis further reveals a tradeoff between specialization and generalization in ICL: depending on task distribution alignment, increasing pretraining task diversity can either improve or harm test performance. Together, these results identify train-test task alignment as a key determinant of generalization in ICL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
892. Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation
๐ฌ 11
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ multi-label classification dual incomplete multi-view multi-label classification representation learning label correlations multi-view consistent representation
Although multi-view multi-label learning has been extensively studied, research on the dual-missing scenario, where both views and labels are incomplete, remains largely unexplored. Existing methods mainly rely on contrastive learning or information bottleneck theory to learn consistent representations under missing-view conditions, but relying solely on loss-based constraints limits the ability to capture stable and discriminative shared semantics. To address this issue, we introduce a more structured mechanism for consistent representation learning: we learn discrete consistent representations through a multi-view shared codebook and cross-view reconstruction, which naturally align different views within the limited shared codebook embeddings and reduce redundant features. At the decision level, we design a weight estimation method that evaluates the ability of each view to preserve label correlation structures, assigning weights accordingly to enhance the quality of the fused prediction. In addition, we introduce a fused-teacher self-distillation framework, where the fused prediction guides the training of view-specific classifiers and feeds the global knowledge back into the single-view branches, thereby enhancing the generalization ability of the model under missing-label conditions. The effectiveness of our proposed method is thoroughly demonstrated through extensive comparative experiments with advanced methods on five benchmark datasets.
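A minimal sketch of the shared-codebook idea: features from different views are quantized to the nearest entry of one codebook, so views of the same instance land on the same discrete codes (sizes and the plain nearest-neighbor assignment are assumptions):

```python
# Minimal sketch of a shared codebook for cross-view consistency: features from
# different views are quantized to their nearest entry in one shared codebook,
# so views describing the same instance land on the same discrete codes.
import torch

codebook = torch.randn(64, 16)                    # 64 shared code embeddings

def quantize(features):                           # features: (batch, 16)
    dists = torch.cdist(features, codebook)       # (batch, 64)
    codes = dists.argmin(dim=1)
    return codebook[codes], codes

view_a = torch.randn(8, 16)
view_b = view_a + 0.05 * torch.randn(8, 16)       # second view of the same instances
_, codes_a = quantize(view_a)
_, codes_b = quantize(view_b)
print((codes_a == codes_b).float().mean())        # mostly identical codes
```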
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
893. RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning
๐ฌ 16
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Large language models reasoning reinforcement learning
Despite rapid advancements in large language models (LLMs), the token-level autoregressive nature constrains their complex reasoning capabilities.
To enhance LLM reasoning, inference-time techniques, including Chain/Tree/Graph-of-Thought(s), successfully improve the performance, as they are fairly cost-effective by guiding reasoning through external logical structures without modifying LLMs' parameters.
However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks, lacking adaptability.
To improve this, we propose **RL-of-Thoughts (RLoT)**, where we train a lightweight navigator model with reinforcement learning (RL) to generate task-adaptive logical structures at inference time, enhancing LLM reasoning.
Specifically, we design five basic logic blocks from the perspective of human cognition.
During the reasoning process, the trained RL navigator dynamically selects the suitable logic blocks and combines them into task-specific logical structures according to problem characteristics.
Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) illustrate that RLoT outperforms established inference-time techniques in most cases and improves performance by up to 13.4% in challenging situations.
Remarkably, with less than 3K parameters, our RL navigator is able to make sub-10B LLMs comparable to 100B-scale counterparts.
Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks.
Our code is open-source at https://anonymous.4open.science/r/RL-LLM-Reasoning-1A30.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
894. ViMo: A Generative Visual GUI World Model for App Agents
๐ฌ 22
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ World Model GUI Generation App Agent
App agents, which autonomously operate mobile Apps through GUIs, have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal actions for complex tasks with longer steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effective agent planning. However, existing world models primarily focus on generating only textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first Visual world Model designed to generate future App observations as images. For the challenge of generating text in image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), to overlay text content with symbolic placeholders while preserving graphics. With this design, ViMo employs a STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text. Moreover, we deploy ViMo to enhance agent-focused tasks by predicting the outcome of actions. Experiments show that ViMo establishes visual world models as a compelling alternative to language-based approaches, producing visually plausible and functionally effective GUIs that empower App agents with more informed decisions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
895. Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data
๐ฌ 18
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Distributed Learning Heterogeneous and Unlabeled Data Self-Supervised Learning Federated Learning Decentralized Learning
Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous data than Contrastive Learning (CL), and that the robustness of decentralized SSL increases with average network connectivity, implying that federated learning (FL) is no less robust than decentralized learning (DecL). These findings provide a solid theoretical foundation for guiding the design of future D-SSL algorithms. To further illustrate the practical implications of our theory, we introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization. Extensive experiments across model architectures and distributed settings validate our theoretical insights, and additionally confirm the effectiveness of MAR loss as an application of our analysis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
896. ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
๐ฌ 21
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark Agent Simulation Personalization Proactivity
As large language models (LLMs) become increasingly integrated into daily life, there is growing demand for AI assistants that are not only reactive but also proactive and personalized. While recent advances have pushed forward proactivity and personalization individually, their combination remains underexplored. To bridge this gap, we introduce ProPerSim, a new task and simulation framework for developing assistants capable of making timely, personalized recommendations in realistic home scenarios. In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context. The assistant's goal is to use these ratings to learn and adapt to achieve higher scores over time. Built on ProPerSim, we propose ProPerAssistant, a retrieval-augmented, preference-aligned assistant that continually learns and adapts through user feedback. Experiments across 32 diverse personas show that ProPerAssistant adapts its strategy and steadily improves user satisfaction, highlighting the promise of uniting proactivity and personalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
897. AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
๐ฌ 16
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ memory agent long-context
Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in the training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization.
AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency.
Comprehensive metrics based on structured data guide both assessment and optimization of assistants.
Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies.
By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
898. Fast Frank-Wolfe Algorithms with Adaptive Bregman Step-Size for Weakly Convex Functions
๐ฌ 14
โญ 6.00
๐ optimization
๐ท๏ธ Optimization First-order method Convex optimization Nonconvex optimization
We propose a Frank-Wolfe (FW) algorithm with an adaptive Bregman step-size strategy for smooth adaptable (also called: relatively smooth) (weakly-) convex functions. This means that the gradient of the objective function is not necessarily Lipschitz continuous, and we only require the smooth adaptable property. Compared to existing FW algorithms, our assumptions are less restrictive. We establish convergence guarantees in various settings, such as sublinear to linear convergence rates, depending on the assumptions for convex and nonconvex objective functions. Assuming that the objective function is weakly convex and satisfies the local quadratic growth condition, we provide both local sublinear and local linear convergence regarding the primal gap. We also propose a variant of the away-step FW algorithm using Bregman distances over polytopes. We establish global faster (up to linear) convergence for convex optimization under the Hölder error bound condition and its local linear convergence for nonconvex optimization under the local quadratic growth condition. Numerical experiments demonstrate that our proposed FW algorithms outperform existing methods.
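A minimal sketch of a Frank-Wolfe iteration over the probability simplex for a smooth quadratic; the classical 2/(k+2) step replaces the paper's adaptive Bregman step-size rule:

```python
# Minimal sketch of a Frank-Wolfe iteration over the probability simplex for a
# smooth quadratic objective. The step size here is the classical 2/(k+2) rule;
# the paper's adaptive Bregman step-size strategy is more sophisticated.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x = np.full(10, 0.1)                        # start in the interior of the simplex
for k in range(200):
    g = grad(x)
    vertex = np.zeros(10)
    vertex[np.argmin(g)] = 1.0              # linear minimization oracle over the simplex
    gamma = 2.0 / (k + 2.0)
    x = (1 - gamma) * x + gamma * vertex    # convex combination stays feasible
print(f(x))
```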
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
899. Derandomized Online-to-Non-convex Conversion for Stochastic Weakly Convex Optimization
๐ฌ 4
โญ 6.00
๐ optimization
๐ท๏ธ Non-smooth optimization Non-convex optimization Stochastic gradient descent with momentum online learning neural networks
Online-to-non-convex conversion (O2NC) is an online updates learning framework for producing Goldstein $(\delta,\epsilon)$-stationary points of non-smooth non-convex functions with optimal oracle complexity $\mathcal{O}(\delta^{-1} \epsilon^{-3})$. Subject to auxiliary \emph{random interpolation or scaling}, O2NC recapitulates the stochastic gradient descent with momentum (SGDM) algorithm popularly used for training neural networks. Randomization, however, introduces deviations from practical SGDM. So a natural question arises: Can we derandomize O2NC to achieve the same optimal guarantees while resembling SGDM? On the negative side, the general answer is \emph{no} due to the impossibility results of~\citet{jordan23deterministic}, showing that no dimension-free rate can be achieved by deterministic algorithms. On the positive side, as the primary contribution of the present work, we show that O2NC can be naturally derandomized for \emph{weakly convex} functions. Remarkably, our deterministic algorithm converges at an optimal rate as long as the weak convexity parameter is no larger than $\mathcal{O}(\delta^{-1}\epsilon^{-1})$. In other words, the stronger stationarity is expected, the higher non-convexity can be tolerated by our optimizer. Additionally, we develop a periodically restarted variant of our method to allow for more progressive update when the iterates are far from stationary. The resulting algorithm, which corresponds to a momentum-restarted version of SGDM, has been empirically shown to be effective and efficient for training ResNet and ViT networks.
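A minimal sketch of the periodically restarted variant: SGD with momentum whose buffer is reset every fixed number of steps (restart period and toy objective are assumptions):

```python
# Minimal sketch of the momentum-restart idea: standard SGD with momentum whose
# momentum buffer is periodically reset, mirroring the "periodically restarted"
# variant described above. The restart period and toy objective are assumptions.
import torch

def sgdm_with_restarts(params, grads_fn, steps=300, lr=0.05, beta=0.9, restart_every=50):
    buf = torch.zeros_like(params)
    for t in range(steps):
        if t % restart_every == 0:
            buf.zero_()                      # periodic momentum restart
        g = grads_fn(params)
        buf = beta * buf + g
        params = params - lr * buf
    return params

target = torch.tensor([3.0, -2.0])
grads_fn = lambda p: p - target              # gradient of 0.5 * ||p - target||^2
print(sgdm_with_restarts(torch.zeros(2), grads_fn))
```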
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
900. Revisiting Weight Regularization for Low-Rank Continual Learning
๐ฌ 16
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning Class-incremental Learning Weight Regularization Elastic Weight Consolidation
Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter-efficient continual learning (PECL), where task interference is typically mitigated by assigning a task-specific module during training, such as low-rank adapters. However, weight regularization techniques, such as Elastic Weight Consolidation (EWC), a key strategy in CL, remain underexplored in this new paradigm. In this paper, we revisit weight regularization in low-rank CL as a new perspective for mitigating task interference in PECL. Unlike existing low-rank CL methods, we mitigate task interference by regularizing a shared low-rank update through EWC, thereby keeping the storage requirement constant regardless of the number of tasks. Moreover, we provide the first systematic investigation of EWC in low-rank CL, showing that it achieves a better stability-plasticity trade-off than other low-rank methods and enables competitive performance across a wide range of trade-off points. Building on these insights, we propose EWC-LoRA, which leverages a low-rank representation to estimate parameter importance over the full-dimensional space. This design offers a practical, computational- and memory-efficient solution for CL with PTMs, and provides insights that may inform the broader application of regularization techniques within PECL. Extensive experiments on various benchmarks demonstrate the effectiveness of EWC-LoRA. On average, EWC-LoRA improves over vanilla LoRA by 8.92\% and achieves comparable or even superior performance to other state-of-the-art low-rank CL methods.
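A minimal sketch of an EWC-style quadratic penalty on shared LoRA factors; the diagonal squared-gradient importance here is a generic stand-in for the paper's full-dimensional importance estimation:

```python
# Minimal sketch of EWC on a shared low-rank update: diagonal importances are
# accumulated for the LoRA factors after a task, and the next task is trained
# with a quadratic penalty anchoring them. Shapes and the toy loss are assumptions.
import torch

d, r = 64, 8
lora_A = torch.randn(r, d, requires_grad=True)
lora_B = torch.randn(d, r, requires_grad=True)
fisher_A, fisher_B = torch.zeros_like(lora_A), torch.zeros_like(lora_B)
anchor_A, anchor_B = lora_A.detach().clone(), lora_B.detach().clone()

def ewc_penalty(lmbda=10.0):
    return lmbda * ((fisher_A * (lora_A - anchor_A) ** 2).sum()
                    + (fisher_B * (lora_B - anchor_B) ** 2).sum())

def consolidate(task_loss):
    # Diagonal importance estimate from squared gradients, then re-anchor.
    gA, gB = torch.autograd.grad(task_loss, (lora_A, lora_B))
    fisher_A.add_(gA ** 2)
    fisher_B.add_(gB ** 2)
    anchor_A.copy_(lora_A.detach())
    anchor_B.copy_(lora_B.detach())

x = torch.randn(16, d)
task_loss = (x @ (lora_B @ lora_A).T).pow(2).mean()   # stand-in for a task loss
consolidate(task_loss)
next_task_loss = (x @ (lora_B @ lora_A).T).pow(2).mean() + ewc_penalty()
next_task_loss.backward()                             # gradients include the EWC term
print(lora_A.grad.norm())
```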
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
901. Locality-Attending Vision Transformer
๐ฌ 13
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Vision Transformer Semantic Segmentation Attention Mechanism Global Average Pooling
Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance the segmentation performance of vision transformers after being trained using the usual image-level classification objective. More specifically, we present a simple yet effective add-on for vision transformers that improves their performance on segmentation tasks while retaining their image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications ensure meaningful representations at spatial positions and encourage tokens to focus on local surroundings, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://anonymous.4open.science/r/LocAtViTRepo/.
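A minimal sketch of biasing self-attention toward neighboring patches with a learnable Gaussian kernel over patch-grid distances; the grid size, single head, and additive-bias form are illustrative assumptions:

```python
# Minimal sketch of a locality bias in self-attention: a learnable Gaussian
# kernel over patch-grid distances is added to the attention logits. The grid
# size, single head, and bias form are illustrative assumptions.
import torch
import torch.nn.functional as F

grid = 14                                    # 14x14 patch grid
coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij"), -1)
coords = coords.reshape(-1, 2).float()       # (196, 2) patch positions
sq_dist = torch.cdist(coords, coords) ** 2   # pairwise squared grid distances

log_sigma = torch.nn.Parameter(torch.tensor(1.0))      # learnable kernel width

def locality_attention(q, k, v):
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    bias = -sq_dist / (2 * torch.exp(log_sigma) ** 2)   # log of a Gaussian kernel
    return F.softmax(logits + bias, dim=-1) @ v

q = k = v = torch.randn(196, 64)
out = locality_attention(q, k, v)
print(out.shape)  # torch.Size([196, 64])
```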
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
902. On Natural Ways to Generate and Their Provable Power
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ masked diffusion models autoregressive models expressiveness
Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Beyond next-token generation, they are more efficient and flexible by enabling parallel and any-order token generation. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal with decoding steps matching the optimal parallel time complexity in PRAM. However, when controlling for other factors, MDM's flexibility to generate in any-order does not expand what ARM can already solve. To address this, we propose a new form of generation called any-process generation, which extends MDM with capabilities to remask, insert and delete tokens. Theoretically and empirically, we demonstrate these capabilities enable scalability to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. Additionally, they prove essential for generation tasks where objects naturally evolve through non-sequential processes, crucial for extending current LLMs beyond natural language to domains such as coding and science.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
903. Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds
๐ฌ 11
โญ 6.00
๐ learning theory
๐ท๏ธ Fisher information information geometry Hutchinson's trick deep learning theory
The high dimensional parameter space of modern deep neural networks, the neuromanifold, is endowed with a unique metric tensor defined by the Fisher information, estimating which is crucial for both theory and practical methods in deep learning. To analyze this tensor for classification networks, we return to a low dimensional space of probability distributions, the core space, and carefully analyze the spectrum of its Riemannian metric. We extend our discoveries there into deterministic bounds of the metric tensor on the neuromanifold. We introduce an unbiased random estimate of the metric tensor and its bounds based on Hutchinson's trace estimator. It can be evaluated efficiently through a single backward pass, with a standard deviation bounded by the true value up to scaling.
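A minimal sketch of a single-backward-pass Monte Carlo estimate of the Fisher metric along a probe direction, in the spirit of the Hutchinson-style estimator described above (tiny classifier and data are placeholders):

```python
# Minimal sketch: Monte Carlo estimate of the Fisher metric quadratic form
# v^T F v along a probe direction v, using labels sampled from the model's own
# predictive distribution and a single backward pass. Model and data are toys.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 4))
x = torch.randn(256, 10)
params = list(model.parameters())
v = [torch.randn_like(p) for p in params]            # probe direction on the neuromanifold

logits = model(x)
with torch.no_grad():
    y = torch.distributions.Categorical(logits=logits).sample()  # labels drawn from the model
log_lik = F.log_softmax(logits, dim=-1).gather(1, y[:, None]).mean()
grads = torch.autograd.grad(log_lik, params)          # one backward pass

directional = sum((g * vi).sum() for g, vi in zip(grads, v))
# Squaring the directional derivative and rescaling by the batch size gives a
# Monte Carlo estimate of the quadratic form v^T F v of the Fisher metric.
estimate = x.shape[0] * directional ** 2
print(float(estimate))
```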
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
904. Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach
๐ฌ 18
โญ 6.00
๐ reinforcement learning
๐ท๏ธ hierarchical reinforcement learning preference based learning
Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on higher-level learning. To address infeasible subgoals, DIPPER incorporates lower-level value function regularization that encourages the higher-level policy to propose achievable subgoals. We introduce two novel metrics to quantitatively verify that DIPPER mitigates non-stationarity and infeasible subgoal generation issues in HRL. Empirical evaluation on challenging robotic navigation and manipulation benchmarks shows that DIPPER achieves up to 40% improvement over state-of-the-art baselines on challenging sparse-reward scenarios, highlighting the potential of preference-based learning for addressing longstanding HRL limitations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
905. Seq vs Seq: An Open Suite of Paired Encoders and Decoders
๐ฌ 8
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ encoders pretraining objective mlm ntp retrieval
The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
906. Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ log KV cache generation
While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts often fail to retain reasoning from previous tasks and apply it in future contexts.
We introduce **L**og-**A**ugmented **G**eneration (LAG), a novel framework that *directly reuses* prior computation and reasoning from past logs at test time, enabling models to learn from previous tasks and perform better on new, unseen challenges, without sacrificing the system's efficiency or scalability.
Our approach represents task logs as key-value (KV) caches that encode the full reasoning context of prior tasks, while storing KV values for only a selected subset of tokens. When a new task arises, LAG retrieves KV values from relevant logs to augment generation.
Unlike reflection-based memory mechanisms, which require additional extraction or distillation steps, LAG reuses prior reasoning verbatim.
Moreover, it extends beyond existing KV caching techniques, which have primarily targeted efficiency, by explicitly improving accuracy through log reuse.
Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
907. Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement
๐ฌ 17
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ training dynamics representation learning lazy/rich regime neural collapse grokking kernel methods
Dynamic feature transformation (the rich regime) does not always align with predictive performance (better representation), yet accuracy is often used as a proxy for richness, limiting analysis of their relationship. We propose a computationally efficient, performance-independent metric of richness grounded in the low-rank bias of rich dynamics, which recovers neural collapse as a special case. The metric is empirically more stable than existing alternatives and captures known lazy-to-rich transitions (e.g., grokking) without relying on accuracy. We further use it to examine how training factors (e.g., learning rate) relate to richness, confirming recognized assumptions and highlighting new observations (e.g., batch normalization promotes rich dynamics). An eigendecomposition-based visualization is also introduced to support interpretability, together providing a diagnostic tool for studying the relationship between training factors, dynamics, and representations.
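The paper's exact metric is not reproduced here; as a rough illustration of a performance-independent measure in the spirit of the low-rank bias it describes, the sketch below computes the spectral-entropy effective rank of the change in a layer's features between initialization and the current checkpoint. All names and the "lower effective rank = richer" reading are assumptions.

```python
# Illustrative effective-rank proxy for "richness" (assumption: a lower effective rank
# of the feature change indicates richer, more structured feature learning).
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """Spectral-entropy effective rank: exp(H(p)) with p = normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
feats_init = rng.normal(size=(512, 64))            # features at initialization
low_rank_update = rng.normal(size=(512, 3)) @ rng.normal(size=(3, 64))
feats_now = feats_init + low_rank_update            # rich dynamics ~ low-rank change

delta = feats_now - feats_init
print("effective rank of feature change:", effective_rank(delta))  # close to 3, not 64
```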
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
908. AdS-GNN - a Conformally Equivariant Graph Neural Network
๐ฌ 11
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ equivariance; conformal group; scale equivariance; ising model
Conformal symmetries, i.e.\ coordinate transformations that preserve angles, play a key role in many fields, including physics, mathematics, computer vision and (geometric) machine learning. Here we build a neural network that is equivariant under general conformal transformations. To achieve this, we lift data from flat Euclidean space to Anti de Sitter (AdS) space. This allows us to exploit a known correspondence between conformal transformations of flat space and isometric transformations on the Anti de Sitter space. We then build upon the fact that such isometric transformations have been extensively studied on general geometries in the geometric deep learning literature. In particular, we employ message-passing layers conditioned on the proper distance, yielding a computationally efficient framework. We validate our model on tasks from computer vision and statistical physics, demonstrating strong performance, improved generalization capacities, and the ability to extract conformal data such as scaling dimensions from the trained network.
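A toy sketch of the lift-and-condition idea under the upper-half-space model, where the proper distance reduces to the standard hyperbolic distance formula; the choice of lifting coordinate, the Gaussian weighting, and the single aggregation step are illustrative assumptions, not the paper's layer.

```python
# Toy sketch: lift 2D points to the upper half-space (x, y, z > 0) and run one
# distance-conditioned message-passing step. The extra coordinate z is an assumed
# per-point scale; messages are weighted by a Gaussian of the proper distance.
import numpy as np

def proper_distance(p, q):
    """Hyperbolic distance in the upper half-space model (z is the last coordinate)."""
    diff2 = np.sum((p - q) ** 2)
    return np.arccosh(1.0 + diff2 / (2.0 * p[-1] * q[-1]))

rng = np.random.default_rng(1)
xy = rng.normal(size=(6, 2))
z = np.full((6, 1), 0.5)                 # assumed lifting scale
points = np.concatenate([xy, z], axis=1)
features = rng.normal(size=(6, 4))

messages = np.zeros_like(features)
for i in range(len(points)):
    weights = np.array([np.exp(-proper_distance(points[i], points[j]) ** 2)
                        for j in range(len(points)) if j != i])
    neighbors = np.array([features[j] for j in range(len(points)) if j != i])
    messages[i] = (weights[:, None] * neighbors).sum(0) / (weights.sum() + 1e-12)

print(messages.shape)  # (6, 4): aggregated, distance-conditioned messages
```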
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
909. CauKer: Classification Time Series Foundation Models Can Be Pretrained on Synthetic Data
๐ฌ 15
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Foundation Model Time Series Classification
Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.
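A rough sketch of the two ingredients named in the abstract, GP kernel composition and propagation through a small structural causal model; the specific kernels, the X -> Y -> Z graph, and the link functions are illustrative assumptions.

```python
# Illustrative CauKer-style generator (assumptions: RBF + periodic kernel sum for the
# root series, a two-edge SCM X -> Y -> Z with simple nonlinear links).
import numpy as np

def rbf(t, lengthscale=20.0):
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def periodic(t, period=50.0, lengthscale=1.0):
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(0)
t = np.arange(200.0)
K = rbf(t) + 0.5 * periodic(t)                       # kernel composition
K += 1e-6 * np.eye(len(t))                           # jitter for the Cholesky factor
x = np.linalg.cholesky(K) @ rng.normal(size=len(t))  # root cause: GP sample

# Structural causal model: each child is a nonlinear function of its parent plus noise.
y = np.tanh(x) + 0.1 * rng.normal(size=len(t))
z = 0.8 * y ** 2 + 0.1 * rng.normal(size=len(t))

series = np.stack([x, y, z])   # one synthetic multivariate training example
print(series.shape)            # (3, 200)
```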
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
910. Learning Dynamic Causal Graphs Under Parametric Uncertainty via Polynomial Chaos Expansions
๐ฌ 4
โญ 6.00
๐ causal reasoning
๐ท๏ธ Causal Discovery Polynomial Chaos Expansion Parametric Uncertainty Functional Causal Models Uncertainty Quantification
Existing causal discovery methods are fundamentally limited by the assumption of a static causal graph, a constraint that fails in real-world systems where causal relationships dynamically vary with underlying system parameters. This discrepancy prevents the application of causal discovery in critical domains such as industrial process control, where understanding how causal effects change is essential. We address this gap by proposing a new paradigm that moves beyond static graphs to learn functional causal representations. We introduce a framework that models each causal link not as a static weight but as a function of measurable system parameters. By representing these functions using Polynomial Chaos Expansions (PCE), we develop a tractable method to learn the complete parametric causal structure from observational data. We provide theoretical proofs for the identifiability of these functional models and introduce a novel, provably convergent learning algorithm. On a large-scale chemical reactor dataset, our method learns the dynamic causal structure with a 90.9% F1-score, nearly doubling the performance of state-of-the-art baselines and providing an interpretable model of how causal mechanisms evolve.
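A minimal sketch of the core idea, representing an edge strength as a polynomial expansion in a measurable system parameter; a Legendre basis and an ordinary least-squares fit stand in for the paper's PCE machinery, and the "true" weight function below is an invented example.

```python
# Sketch: an edge weight w(theta) modeled as a low-order Legendre expansion in a
# normalized system parameter theta (e.g., reactor temperature).
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
theta = rng.uniform(-1.0, 1.0, size=200)              # normalized system parameter
true_weight = 0.5 + 1.2 * theta - 0.7 * theta ** 2     # how the causal effect varies
observed = true_weight + 0.05 * rng.normal(size=theta.size)

coeffs = legendre.legfit(theta, observed, deg=3)        # expansion coefficients
print("fitted edge weight at theta=0.3:", legendre.legval(0.3, coeffs))
print("true   edge weight at theta=0.3:", 0.5 + 1.2 * 0.3 - 0.7 * 0.3 ** 2)
```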
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
911. RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-modal Embodied Agent Unified Generative Model Auto-Regressive World Model
Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy.
Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG.
To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments. It thus exhibits more than $17\times$ sample efficiency improvements and generalization in comparison with previous works.
During inference, RIG first reasons about the next action, produces a potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions.
Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
912. HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
๐ฌ 25
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark Large Language Models Combinatorial Optimization Code Generation Agent Automatic Heuristic Generation
While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on various problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
913. Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
๐ฌ 14
โญ 6.00
๐ optimization
๐ท๏ธ grokking optimization generalization acceleration
Grokking is the phenomenon whereby, unlike the training performance, which peaks very early on during training, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually near-perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all principal directions evolve at exactly the same speed. We establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we show empirically that on classical arithmetic problems such as modular addition and the sparse parity problem, where this stagnation has been widely observed and intensively studied, our proposed method removes the plateaus.
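A minimal sketch of the normalization described above, equalizing the update speed along all principal directions of a weight-matrix gradient by flattening its singular values; the toy regression loss and step size are assumptions, not the authors' exact update.

```python
# Egalitarian-style gradient step (sketch): replace the gradient's singular values by 1
# so every principal direction moves at the same speed. Near the optimum a decaying
# step size would be needed, since the direction is scale-free.
import numpy as np

def egalitarian_direction(grad):
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt   # same singular directions, all singular values set to 1

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
X = rng.normal(size=(32, 8))
Y = X @ rng.normal(size=(8, 8))

for _ in range(100):
    grad = 2.0 * X.T @ (X @ W - Y) / len(X)    # gradient of the squared-error loss
    W -= 0.05 * egalitarian_direction(grad)     # equalized step along all directions

print("final loss:", float(np.mean((X @ W - Y) ** 2)))
```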
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
914. Online Prediction of Stochastic Sequences with High Probability Regret Bounds
๐ฌ 13
โญ 6.00
๐ learning theory
๐ท๏ธ online prediction learning theory high-probability bound regret stochastic sequences
We revisit the classical problem of universal prediction of stochastic sequences with a finite time horizon $T$ known to the learner. The question we investigate is whether it is possible to derive vanishing regret bounds that hold with high probability, complementing existing bounds from the literature that hold in expectation. We propose such high-probability bounds, which have a form very similar to the prior in-expectation bounds. For the case of universal prediction of a stochastic process over a countable alphabet, our bound states a convergence rate of $\mathcal{O}(T^{-1/2} \delta^{-1/2})$ with probability at least $1-\delta$, compared to prior known in-expectation bounds of the order $\mathcal{O}(T^{-1/2})$. We also propose an impossibility result which proves that it is not possible to improve the exponent of $\delta$ in a bound of the same form without making additional assumptions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
915. Batch and Sequential Unlearning for Neural Networks
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ machine unlearning second-order unlearning
With the increasing deployment of machine learning models trained on personal data, machine unlearning has become crucial for data owners to exercise their "right to be forgotten" and protect their privacy. While model owners can retrain the models without the erased data to achieve this goal, this process is often prohibitively expensive. Previous works have shown that Newton's method can be applied to linear models to unlearn multiple data points in batch (batch unlearning) with minimal iterations. However, adapting this method to non-linear models, such as neural networks, poses significant challenges due to the presence of degenerate Hessians. This problem becomes more pronounced when unlearning is performed sequentially (sequential unlearning). Existing techniques that tried to tackle this degeneracy often 1) incur unlearning updates with excessively large norm that yield unsatisfactory unlearning performance and 2) may require manual tuning of regularization hyperparameters. In this work, we propose new unlearning algorithms that leverage cubic regularization for Newton's method to address both challenges. We discuss the theoretical benefits of our method and empirically show that our algorithms can efficiently achieve competitive performance in both batch and sequential unlearning on real-world datasets.
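As background only, a small sketch of a generic cubic-regularized Newton step (Nesterov-Polyak style), the building block the abstract refers to; the bisection tolerance and the regularization constant M are assumptions, and this is not the paper's unlearning algorithm.

```python
# Generic cubic-regularized Newton step (sketch): minimize the model
#   g.d + 0.5 d'Hd + (M/6)||d||^3
# by solving d = -(H + lam*I)^{-1} g with lam = (M/2)||d||, found via bisection.
# Degenerate/indefinite Hessians are handled because lam stays above -lambda_min(H).
import numpy as np

def cubic_newton_step(grad, hessian, M=1.0, iters=60):
    eigmin = float(np.linalg.eigvalsh(hessian).min())
    lo = max(0.0, -eigmin) + 1e-10
    hi = lo + 1.0
    phi = lambda lam: np.linalg.norm(
        np.linalg.solve(hessian + lam * np.eye(len(grad)), -grad)) - 2.0 * lam / M
    while phi(hi) > 0:          # grow the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(iters):      # bisection on the monotone secular equation
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return np.linalg.solve(hessian + lam * np.eye(len(grad)), -grad)

H = np.array([[2.0, 0.0], [0.0, -0.5]])   # indefinite, like a degenerate NN Hessian
g = np.array([1.0, 1.0])
print(cubic_newton_step(g, H, M=2.0))     # bounded update despite indefiniteness
```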
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
916. Hierarchical Encoding Tree with Modality Mixup for Cross-modal Hashing
๐ฌ 21
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Cross-modal Hashing Unsupervised Hash Retrieval Cross-modal Retrieval
Cross-modal retrieval is a significant task that aims to learn the semantic correspondence between visual and textual modalities. Unsupervised hashing methods can efficiently manage large-scale data and can be effectively applied to cross-modal retrieval studies. However, existing methods typically fail to fully exploit the hierarchical structure between text and image data. Moreover, the commonly used direct modal alignment cannot effectively bridge the semantic gap between these two modalities. To address these issues, we introduce a novel Hierarchical Encoding Tree with Modality Mixup (HINT) method, which achieves effective cross-modal retrieval by extracting hierarchical cross-modal relations. HINT constructs a cross-modal encoding tree guided by hierarchical structural entropy and generates proxy samples of text and image modalities for each instance from the encoding tree. Through the curriculum-based mixup of proxy samples, HINT achieves progressive modal alignment and effective cross-modal retrieval. Furthermore, we conduct cross-modal consistency learning to achieve global-view semantic alignment between text and image representations. Extensive experiments on a range of cross-modal retrieval datasets demonstrate the superiority of HINT over state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
917. Multi-LLM Adaptive Conformal Inference for Reliable LLM Response
๐ฌ 24
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ LLM Response Factuality Conformal Inference Multi-LLM
Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our anonymized repository is available at https://github.com/Anonymous2026conf/MACI.git.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
918. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
๐ฌ 14
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Attention Superposition Sparse Dictionary Learning Circuit Analysis
We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers that disentangles the original Multi-Head Self-Attention (MHSA) into individually comprehensible components.
Lorsa is designed to address the challenge of \textit{attention superposition}, i.e., to understand attention-mediated interaction between features at different token positions.
Lorsa helps find cleaner and finer-grained versions of previously discovered MHSA behaviors such as induction heads, successor heads, attention sinks, and a comprehensive family of arithmetic-specific Lorsa heads. Interestingly, we identify a novel head type called \emph{subtoken induction heads} that function at the character level rather than the token level.
Automated interpretability analysis indicates that Lorsa achieves parity with SAEs in interpretability while exhibiting superior circuit discovery properties.
We also conduct extensive experiments on architectural design ablations, correlation to original MHSA heads, and error analysis.
Our early attempt to fully sparsify a toy Transformer succeeds in revealing clean global circuits. Ultimately, we hope Lorsa will help us better understand attention computation and enable full sparsification of model computation along with its MLP counterparts.
Lorsa is open-sourced at https://anonymous.4open.science/r/Lorsa-5686/.
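As a rough illustration of the "low-rank plus sparse" design space only, the sketch below builds a generic rank-r attention head with an L1 penalty on its activations; the module name, rank, and penalty are assumptions and this is not the authors' exact parameterization.

```python
# Generic "low-rank + sparse" attention head (illustrative, not Lorsa's exact design):
# rank-r query/key and value/output maps, with an L1 term on head activations that a
# training loop could add to encourage sparsely active heads.
import torch

class LowRankSparseHead(torch.nn.Module):
    def __init__(self, d_model=64, rank=4):
        super().__init__()
        self.q = torch.nn.Linear(d_model, rank, bias=False)
        self.k = torch.nn.Linear(d_model, rank, bias=False)
        self.v = torch.nn.Linear(d_model, rank, bias=False)
        self.o = torch.nn.Linear(rank, d_model, bias=False)

    def forward(self, x):
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / self.q.out_features ** 0.5
        attn = torch.softmax(scores, dim=-1)
        activation = attn @ self.v(x)             # low-rank head activation
        return self.o(activation), activation.abs().mean()   # output, sparsity penalty

x = torch.randn(2, 16, 64)                         # (batch, tokens, d_model)
head = LowRankSparseHead()
out, l1 = head(x)
print(out.shape, float(l1))                        # torch.Size([2, 16, 64]), penalty value
```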
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
919. Diagnosing Failures in Generalization from Task-Relevant Representational Geometry
๐ฌ 22
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Representational geometry Out of distribution generalization Image classification
Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventional approaches often take a bottom-up mechanistic route by reverse-engineering interpretable features or circuits to build explanatory models. However, they provide little top-down guidance such as system-level measurements that predict and prevent failures. Here we propose a complementary diagnostic paradigm for studying generalization failures. Rather than mapping out detailed internal mechanisms, we use task-relevant measures to probe structure-function links, identify prognostic indicators, and test predictions in real-world settings. In image classification, we find that task-relevant geometric properties of in-distribution (ID) object manifolds consistently signal poor out-of-distribution (OOD) generalization. In particular, reductions in two geometric measures, effective manifold dimensionality and utility, predict weaker OOD performance across diverse architectures, optimizers, and datasets. We apply this finding to transfer learning with ImageNet-pretrained models, each available with multiple weight variants. We consistently find that the same geometric patterns predict OOD transfer performance more reliably than ID accuracy. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
920. The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
๐ฌ 15
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Chain-of-Thought Knowledge Distillation Large Language Models Benchmarking Data Augmentation Data Selection Data Mixing
Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark to systematically assess the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The anonymous codebase can be accessed at https://anonymous.4open.science/r/DC-COT-FF4C/
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
921. SVD Provably Denoises Nearest Neighbor Data
๐ฌ 9
โญ 6.00
๐ optimization
๐ท๏ธ nearest neighbor planted models
We study the Nearest Neighbor Search (NNS) problem in a high-dimensional setting where data originates from a low-dimensional subspace and is corrupted by Gaussian noise. Specifically, we consider a semi-random model where $n$ points from an unknown $k$-dimensional subspace of $\mathbb{R}^d$ ($k \ll d$) are perturbed by zero-mean $d$-dimensional Gaussian noise with variance $\sigma^2$ on each coordinate. Without loss of generality, we may assume the nearest neighbor is at distance $1$ from the query, and that all other points are at distance at least $1+\varepsilon$. We assume we are given only the noisy data and are required to find NN of the uncorrupted data. We prove the following results:
1. For $\sigma \in O(1/k^{1/4})$, we show that simply performing SVD denoises the data; namely, we provably recover accurate NN of uncorrupted data (Theorem 1.1).
2. For $\sigma \gg 1/k^{1/4}$, NN in uncorrupted data is not even {\bf identifiable} from the noisy data in general. This is a matching lower bound on $\sigma$ with the above result, demonstrating the necessity of this threshold for NNS (Lemma 3.1).
3. For $\sigma \gg 1/\sqrt k$, the noise magnitude ($\sigma \sqrt{d}$) significantly exceeds the inter-point distances in the unperturbed data. Moreover, the NN in the noisy data is in general different from the NN in the uncorrupted data.
Note that (1) and (3) together imply SVD identifies correct NN in uncorrupted data even in a regime
where it is different from NN in noisy data. This was not the case in existing literature (see e.g. (Abdullah et al., 2014)). Another comparison with (Abdullah et al., 2014) is that it requires $\sigma$ to be at least an inverse polynomial in the ambient dimension $d$. The proof of (1) above uses upper bounds on perturbations of singular spaces of matrices as well as concentration and spherical symmetry of Gaussians. We thus give theoretical justification for the performance of spectral methods in practice. We also provide empirical results on real datasets to corroborate our findings.
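A small numerical illustration of result (1) under an assumed configuration: points from a low-dimensional subspace are corrupted with Gaussian noise, projected onto the top singular subspace of the noisy matrix, and the planted nearest neighbor is recovered from the denoised data. The dimensions, noise level, and planted-gap construction are arbitrary choices.

```python
# Demo of SVD denoising for NN search: points lie in a k-dim subspace of R^d, are
# corrupted by isotropic Gaussian noise, then projected onto the top-k right singular
# subspace of the noisy matrix. The planted NN (index 0, at distance 1 from the query;
# all others at distance >= 1.5) is recovered from the denoised data at this noise level.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 400, 200, 5, 0.05
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]

q_low = rng.normal(size=k)
dirs = rng.normal(size=(n - 1, k))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
low = np.vstack([q_low + np.eye(k)[0],                          # planted NN, distance 1
                 q_low + rng.uniform(1.5, 3.0, (n - 1, 1)) * dirs])
clean, query = low @ basis.T, q_low @ basis.T

noisy = clean + sigma * rng.normal(size=clean.shape)
noisy_query = query + sigma * rng.normal(size=d)

_, _, vt = np.linalg.svd(noisy, full_matrices=False)
project = lambda a: a @ vt[:k].T @ vt[:k]                        # SVD denoising
denoised, denoised_query = project(noisy), project(noisy_query)

print("NN index after SVD denoising:",
      int(np.argmin(np.linalg.norm(denoised - denoised_query, axis=1))))   # -> 0
```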
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
922. Unveiling the Cognitive Compass: Theory-of-MindโGuided Multimodal Emotion Reasoning
๐ฌ 13
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multimodal Affective Computing Multimodal Understanding and Reasoning Reinforcement Learning
Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
923. CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ benchmark LLM reasoning long-context reasoning Cognitive Load Theory CLT synthetic benchmark natural language benchmark intrinsic difficulty extraneous load needle-in-a-haystack
Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
924. pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
๐ฌ 9
โญ 6.00
๐ generative models
๐ท๏ธ diffusion models flow models few-step generation distillation
Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality--diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow).
$\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration without extra network evaluations.
To match the policy's ODE trajectory to the teacher's,
we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss.
By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality--diversity trade-off.
On ImageNet $256\times 256$, it attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture.
On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art DMD models, while maintaining teacher-level quality.
Code and models will be released publicly.
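A generic sketch of the imitation-distillation objective described above: roll out the student's ODE from noise and regress the student velocity onto the frozen teacher's velocity at the visited points with an $\ell_2$ loss. The toy MLPs, Euler integrator, and time grid are assumptions, and the network-free policy parameterization is omitted.

```python
# Generic trajectory-level flow-matching distillation (sketch, not pi-Flow's exact loss):
# integrate the student's ODE from noise; at each visited (x, t), match the student
# velocity to the frozen teacher velocity with an l2 loss.
import torch

def mlp(dim):
    return torch.nn.Sequential(torch.nn.Linear(dim + 1, 64), torch.nn.SiLU(),
                               torch.nn.Linear(64, dim))

dim, steps = 2, 4
teacher, student = mlp(dim), mlp(dim)
teacher.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(10):                                   # a few illustrative updates
    x = torch.randn(32, dim)                          # start from noise
    loss = 0.0
    for t in torch.linspace(0.0, 1.0, steps + 1)[:-1]:
        tt = torch.full((x.size(0), 1), float(t))
        v_student = student(torch.cat([x, tt], dim=1))
        with torch.no_grad():
            v_teacher = teacher(torch.cat([x, tt], dim=1))
        loss = loss + ((v_student - v_teacher) ** 2).mean()
        x = (x + (1.0 / steps) * v_student).detach()  # Euler step along student ODE
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final distillation loss:", float(loss))
```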
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
925. Combinatorial Bandit Bayesian Optimization for Tensor Outputs
๐ฌ 22
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Tensor data Non-separable kernels Gaussian process Bayesian optimization Combinatorial multi-arm bandit Upper confidence bound
Bayesian optimization (BO) has been widely used to optimize expensive and black-box functions across various domains. Existing BO methods have not addressed tensor-output functions. To fill this gap, we propose a novel tensor-output BO method. Specifically, we first introduce a tensor-output Gaussian process (TOGP) with two classes of tensor-output kernels as a surrogate model of the tensor-output function, which can effectively capture the structural dependencies within the tensor. Based on it, we develop an upper confidence bound (UCB) acquisition function to select the queried points. Furthermore, we introduce a more complex and practical problem setting, named combinatorial bandit Bayesian optimization (CBBO), where only a subset of the outputs can be selected to contribute to the objective function. To tackle this, we propose a tensor-output CBBO method, which extends TOGP to handle partially observed outputs, and accordingly design a novel CMAB-UCB2 criterion to sequentially select both the queried points and the optimal output subset. Theoretical regret bounds for the two methods are established, ensuring their sublinear performance. Extensive synthetic and real-world experiments demonstrate their superiority.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
926. SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents
๐ฌ 23
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ smart home simulator language model language agent benchmark
Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more.
The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them.
To address this, we introduce $\textbf{SimuHome}$, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables.
By building the simulator on the Matter protocol, the global industry standard for smart home communication, SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation.
We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities.
Our evaluation of 16 agents under a unified ReAct framework reveals distinct capabilities and limitations across models. Models under 7B parameters exhibited negligible performance across all query types. Even GPT-4.1, the best-performing standard model, struggled with implicit intent inference, state verification, and particularly temporal scheduling. While reasoning models such as GPT-5.1 consistently outperformed standard models on every query type, they required over three times the average inference time, which can be prohibitive for real-time smart home applications. This highlights a critical trade-off between task performance and real-world practicality. We will release our code and dataset upon publication of the paper.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
927. Inference-Time Scaling of Discrete Diffusion Models via Importance Weighting and Optimal Proposal Design
๐ฌ 22
โญ 6.00
๐ generative models
๐ท๏ธ discrete diffusion test-time scaling reward aligntment
Discrete diffusion models have become highly effective across various domains. However, real-world applications often require the generative process to adhere to certain constraints.
To this end, we propose a Sequential Monte Carlo (SMC) framework that enables scalable inference-time control of discrete diffusion models through principled importance weighting and optimal proposal construction.
Specifically, our approach derives tractable importance weights for a range of intermediate targets and characterises the optimal proposal, for which we develop two practical approximations: a first-order gradient-based approximation and an amortised proposal trained to minimise the log-variance of the importance weights.
Empirical results across synthetic tasks, language modelling, biology design, and text-to-image generation demonstrate that our framework enhances controllability and sample quality, highlighting the effectiveness of SMC as a versatile recipe for scaling discrete diffusion models at inference time.
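As background, a minimal generic SMC loop on a toy discrete state space: extend particles with a proposal, reweight by target over proposal, and resample systematically. The vocabulary, proposal, and target below are made up; these are standard textbook components, not the paper's constructions.

```python
# Generic SMC sketch: propagate particles, importance-reweight, resample systematically.
import numpy as np

rng = np.random.default_rng(0)
vocab, n_particles, length = 3, 64, 5
proposal = np.array([0.5, 0.3, 0.2])        # how particles are extended
target = np.array([0.2, 0.3, 0.5])          # what we actually want to sample from

def systematic_resample(weights):
    positions = (rng.random() + np.arange(len(weights))) / len(weights)
    return np.searchsorted(np.cumsum(weights), positions)

particles = np.zeros((n_particles, 0), dtype=int)
log_w = np.zeros(n_particles)
for _ in range(length):
    symbols = rng.choice(vocab, size=n_particles, p=proposal)
    log_w += np.log(target[symbols]) - np.log(proposal[symbols])   # importance weights
    particles = np.column_stack([particles, symbols])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = systematic_resample(w)                                    # resample particles
    particles, log_w = particles[idx], np.zeros(n_particles)

print("empirical symbol frequencies:",
      np.bincount(particles.ravel(), minlength=vocab) / particles.size)   # ~ target
```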
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
928. SpatialHand: Generative Object Manipulation from 3D Perspective
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ AIGC Application; Image Editing
We introduce SpatialHand, a novel framework for generative object insertion with precise 3D control. Current generative object manipulation methods primarily operate within the 2D image plane, but often fail to grasp 3D scene complexities, leading to ambiguities in an object's 3D position, orientation, and occlusion relations. SpatialHand addresses this by conceptualizing object insertion from a true ``3D perspective,'' enabling manipulation with complete 6 Degrees-of-Freedom (6DoF) controllability. Specifically, our solution naturally and implicitly encodes the 6DoF pose condition by decomposing it into 2D location (via masked image), depth (via composited depth map), and 3D orientation (embedded into latent features). To overcome the scarcity of paired training data, we develop an automated data construction pipeline using synthetic 3D assets, rendering, and subject-driven generation, complemented by visual foundation models for pose estimation. We further design a multi-stage training scheme to progressively drive SpatialHand to robustly follow multiple complex conditions. Extensive experiments reveal our approach's superiority over existing alternatives and its great potential for enabling more versatile and intuitive AR/VR-like object manipulation within images.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
929. Identifiability Challenges in Sparse Linear Ordinary Differential Equations
๐ฌ 13
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ dynamical systems identifiability sparsity
Dynamical systems modeling is a core pillar of scientific inquiry across natural and life sciences. Increasingly, dynamical system models are learned from data, rendering identifiability a paramount concept. For systems that are not identifiable from data, no guarantees can be given about their behavior under new conditions and inputs, or about possible control mechanisms to steer the system. It is known in the community that "linear ordinary differential equations (ODE) are almost surely identifiable from a single trajectory." However, this only holds for dense matrices. The sparse regime remains underexplored, despite its practical relevance with sparsity arising naturally in many biological, social, and physical systems.
In this work, we address this gap by characterizing the identifiability of sparse linear ODEs. In contrast to the dense case, we show that sparse systems are unidentifiable with positive probability in practically relevant sparsity regimes, and we provide lower bounds for this probability. We further study empirically how this theoretical unidentifiability manifests in state-of-the-art methods for estimating linear ODEs from data. Our results corroborate that sparse systems are also practically unidentifiable; these theoretical limitations are not resolved by inductive biases or optimization dynamics. Our findings call for rethinking what can be expected from data-driven dynamical system modeling and allow for quantitative assessments of how much to trust a learned linear ODE.
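A tiny numerical illustration of the phenomenon under an assumed setup: two different sparse system matrices produce identical trajectories from an initial condition whose second coordinate stays at zero, so this trajectory cannot distinguish them.

```python
# Two sparse linear ODEs dx/dt = A x that are indistinguishable from this trajectory:
# starting at x0 = (1, 0), the second coordinate stays zero, so the second column of A
# never influences the dynamics and cannot be identified from the data.
import numpy as np
from scipy.linalg import expm

A1 = np.array([[-1.0, 0.0], [0.0, -2.0]])
A2 = np.array([[-1.0, 0.0], [0.0, -5.0]])   # differs only in the inert second column
x0 = np.array([1.0, 0.0])

ts = np.linspace(0.0, 3.0, 7)
traj1 = np.array([expm(A1 * t) @ x0 for t in ts])
traj2 = np.array([expm(A2 * t) @ x0 for t in ts])
print("max trajectory difference:", np.abs(traj1 - traj2).max())   # 0.0
```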
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
930. PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ articulated object; reconstruction; digital twin;
Articulated objects are ubiquitous and important in robotics, AR/VR, and digital twins. Most self-supervised methods for articulated object modeling reconstruct discrete interaction states and relate them via cross-state geometric consistency, yielding representational fragmentation and drift that hinder smooth control of articulated configurations. We introduce PD$^{2}$GS, a novel framework that learns a shared canonical Gaussian field and models the arbitrary interaction state as its continuous deformation, jointly encoding geometry and kinematics. By associating each interaction state with a latent code and refining part boundaries using generic vision priors, PD$^{2}$GS enables accurate and reliable part-level decoupling while enforcing mutual exclusivity between parts and preserving scene-level coherence. This unified formulation supports part-aware reconstruction, fine-grained continuous control, and accurate kinematic modeling, all without manual supervision. To assess realism and generalization, we release RS-Art, a real-to-sim RGB-D dataset aligned with reverse-engineered 3D models, supporting real-world evaluation. Extensive experiments demonstrate that PD$^{2}$GS surpasses prior methods in geometric and kinematic accuracy, and in consistency under continuous control, both on synthetic and real data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
931. Towards One-step Causal Video Generation via Adversarial Self-Distillation
๐ฌ 14
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion Distillation Causal Text to Video Generation
Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising, but their sequential, iterative nature leads to error accumulation and long inference times. In this work, we propose a distillation-based framework for efficient causal video generation that enables high-quality synthesis with extremely limited denoising steps. Our approach builds upon the Distribution Matching Distillation (DMD) framework and proposes a novel Adversarial Self-Distillation (ASD) strategy, which aligns the outputs of the student model's $n$-step denoising process with its $(n+1)$-step version at the distribution level. This design provides smoother supervision by bridging small intra-student gaps and more informative guidance by combining teacher knowledge with locally consistent student behavior, substantially improving training stability and generation quality in extremely few-step scenarios. In addition, we present a First-Frame Enhancement (FFE) strategy, which allocates more denoising steps to the initial frames to mitigate error propagation while applying larger skipping steps to later frames. Extensive experiments on VBench demonstrate that our method surpasses state-of-the-art approaches in both one-step and two-step video generation. Notably, our framework produces a single distilled model that flexibly supports multiple inference-step settings, eliminating the need for repeated re-distillation and enabling efficient, high-quality video synthesis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
932. Discounted Online Convex Optimization: Uniform Regret Across a Continuous Interval
๐ฌ 10
โญ 6.00
๐ optimization
๐ท๏ธ Online Convex Optimization Discounted Online Learning Adaptive Algorithms
Reflecting the greater significance of recent history over the distant past in non-stationary environments, $\lambda$-discounted regret has been introduced in online convex optimization (OCO) to gracefully forget past data as new information arrives. When the discount factor $\lambda$ is given, online gradient descent with an appropriate step size achieves an $O(1/\sqrt{1-\lambda})$ discounted regret. However, the value of $\lambda$ is often not predetermined in real-world scenarios. This gives rise to a significant \emph{open question}: is it possible to develop a discounted algorithm that adapts to an unknown discount factor? In this paper, we affirmatively answer this question by providing a novel analysis to demonstrate that smoothed OGD (SOGD) achieves a uniform $O(\sqrt{\log T/(1-\lambda)})$ discounted regret, holding for all values of $\lambda$ across a continuous interval simultaneously. The basic idea is to maintain multiple OGD instances to handle different discount factors, and to aggregate their outputs sequentially by an online prediction algorithm named Discounted-Normal-Predictor (DNP). Our analysis reveals that DNP can combine the decisions of two experts, even when they operate on discounted regret with different discount factors.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
933. Taming Curvature: Architecture Warm-up for Stable Transformer Training
๐ฌ 19
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Curvature transformers
Training billion-parameter Transformers is often brittle, with transient loss spikes and divergence that waste compute. Even though the recently developed Edge of Stability (EoS) theory provides a powerful tool to understand and control the stability of optimization methods via the (preconditioned) curvature, these curvature-controlling methods are not popular in large-scale Transformer training due to the complexity of curvature estimation. To this end, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., curvature) based on a warm-started variant of power iteration with Hessian-vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to carefully control the preconditioned Hessian and stabilize training. Experiments on large Transformers validate that our approach enables efficient curvature tracking and reduces instabilities compared to existing state-of-the-art stabilization techniques without slowing down convergence.
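A small sketch of the building block named above: power iteration on the Hessian using Hessian-vector products from double backpropagation, warm-started from the previous eigenvector estimate. The toy model, loss, and iteration counts are assumptions, and the preconditioning step is omitted.

```python
# Warm-started power iteration for the top Hessian eigenvalue via Hessian-vector
# products (double backprop). Reusing the previous eigenvector estimate lets a handful
# of iterations per training step track the curvature online.
import torch

model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
params = [p for p in model.parameters() if p.requires_grad]
v = [torch.randn_like(p) for p in params]                    # warm-started across calls

def top_eigenvalue(v, iters=5):
    loss = torch.nn.functional.mse_loss(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    def hvp(vec):
        return torch.autograd.grad(
            sum((g * vi).sum() for g, vi in zip(grads, vec)), params, retain_graph=True)
    for _ in range(iters):
        hv = hvp(v)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    hv = hvp(v)
    eig = sum((h * vi).sum() for h, vi in zip(hv, v))         # Rayleigh quotient at v
    return float(eig), [vi.detach() for vi in v]

for step in range(3):                                         # per-iteration tracking
    eig, v = top_eigenvalue(v)
    print(f"step {step}: top Hessian eigenvalue estimate = {eig:.4f}")
```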
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
934. DexMove: Learning Tactile-Guided Non-Prehensile Manipulation with Dexterous Hands
๐ฌ 10
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ tactile robotics dexterous hand manipulation
Non-prehensile manipulation offers a robust alternative to traditional pick-and-place methods for object repositioning. However, learning such skills with dexterous, multi-fingered hands remains largely unexplored, leaving their potential for stable and efficient manipulation underutilized. Progress has been limited by the lack of large-scale, contact-aware non-prehensile datasets for dexterous hands and the absence of wrist-finger control policies. To bridge these gaps, we present DexMove, a tactile-guided non-prehensile manipulation framework for dexterous hands. DexMove combines a scalable simulation pipeline that generates physically plausible wrist-finger trajectories with a wearable device, which captures multi-finger contact data from human demonstrations using vision-based tactile sensors. Using these data, we train a flow-based policy that enables real-time, synergistic wrist-finger control for robust non-prehensile manipulation of diverse tabletop objects. In real-world experiments, DexMove successfully manipulated six objects of varying shapes and materials, achieving a 77.8\% success rate. Our method outperforms ablated baselines by 36.6\% and improves efficiency by nearly 300\%. Furthermore, the learned policy generalizes to language-conditioned, long-horizon tasks such as object sorting and desktop tidying.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
935. How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
๐ฌ 25
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Code LLMs;Benchmark;Evaluation;Test Case
Code evaluation and reinforcement learning rely critically on test cases. However, collecting golden test cases is hard and expensive, motivating the use of LLMs for automatic test case generation.
This, in turn, raises a pivotal challenge: how can we rigorously evaluate the quality of the generated test cases?
Existing benchmarks often evaluate the exclusion ratio on large, unstructured collections of wrong codes, leading to high computational costs and severe score inflation.
Furthermore, they inadvertently reward generators that detect common, trivial bugs, while failing to penalize their inability to identify rare yet critical faults.
In this work, we connect two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them?
We introduce a novel framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix, where rows represent wrong codes and columns represent test case results.
The rank of this matrix plays a dual role. It specifies the minimal number of independent error patterns, which determines the size of wrong codes. It also provides a tight upper bound on the number of test cases required for complete fault coverage.
Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity, which is defined as the average pairwise Jaccard similarity of the codes' failure signatures (i.e., the matrix rows).
To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm combining pre-filtering and random-restart local search to select maximally diverse wrong codes.
Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60\% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power and highlighting substantial room for future improvement.
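A toy sketch of the selection idea: take the rank of the binary code-by-test matrix as the basis size, then greedily pick that many rows with maximally dissimilar failure signatures. Greedy minimization of average pairwise Jaccard similarity stands in for the paper's pre-filtering plus random-restart local search, and the random matrix is purely illustrative.

```python
# Toy WrongSelect-style selection (illustrative): basis size = matrix rank, rows chosen
# greedily so their failure signatures stay as dissimilar as possible (low Jaccard).
import numpy as np

def jaccard(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

rng = np.random.default_rng(0)
M = (rng.random((30, 12)) < 0.3).astype(int)       # wrong codes x test cases (1 = fails)
basis_size = np.linalg.matrix_rank(M)

selected = [int(M.sum(axis=1).argmax())]           # seed with the most-failing code
while len(selected) < basis_size:
    remaining = [i for i in range(len(M)) if i not in selected]
    # pick the row with the smallest average similarity to the rows already selected
    scores = [np.mean([jaccard(M[i], M[j]) for j in selected]) for i in remaining]
    selected.append(remaining[int(np.argmin(scores))])

print("matrix rank / basis size:", basis_size)
print("selected wrong-code indices:", selected)
```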
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
936. Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy
๐ฌ 13
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ differential privacy deep learning privacy auditing
Training machine learning models with differential privacy (DP) limits an adversary's ability to infer sensitive information about the training data. It can be interpreted as a bound on the adversary's capability to distinguish two adjacent datasets according to the chosen adjacency relation. In practice, most DP implementations use the add/remove adjacency relation, where two datasets are adjacent if one can be obtained from the other by adding or removing a single record, thereby protecting membership. In many ML applications, however, the goal is to protect attributes of individual records (e.g., labels used in supervised fine-tuning). We show that privacy accounting under add/remove overstates attribute privacy compared to accounting under the substitute adjacency relation, which permits substituting one record. To demonstrate this gap, we develop novel attacks to audit DP under substitute adjacency, and show empirically that audit results are inconsistent with DP guarantees reported under add/remove, yet remain consistent with the budget accounted under the substitute adjacency relation. Our results highlight that the choice of adjacency when reporting DP guarantees is critical when the protection target is per-record attributes rather than membership.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
937. FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
๐ฌ 15
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Agent Benchmark Financial Search Financial Reasoning
Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks, Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation, closely reproducing real-world financial analyst workflows. To ensure difficulty and reliability, we engage $70$ professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes $635$ questions spanning global and Greater China markets, and we evaluate $21$ models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools impacts performance significantly. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
938. Sobolev Gradient Ascent for Optimal Transport: Barycenter Optimization and Convergence Analysis
๐ฌ 12
โญ 6.00
๐ optimization
๐ท๏ธ optimal transport; Wasserstein barycenter; concave dual; gradient ascent;
This paper introduces a new constraint-free concave dual formulation for the Wasserstein barycenter. Tailoring the vanilla dual gradient ascent algorithm to the Sobolev geometry, we derive a scalable Sobolev gradient ascent (SGA) algorithm to compute the barycenter for input distributions supported on a regular grid. Despite the algorithmic simplicity, we provide a global convergence analysis that achieves the same rate as the classical subgradient descent methods for minimizing nonsmooth convex functions in the Euclidean space. A central feature of our SGA algorithm is that the computationally expensive $c$-concavity projection operator enforced on the Kantorovich dual potentials is unnecessary to guarantee convergence, leading to significant algorithmic and theoretical simplifications over all existing primal and dual methods for computing the exact barycenter. Our numerical experiments demonstrate the superior empirical performance of SGA over the existing optimal transport barycenter solvers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
939. BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
๐ฌ 15
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ AI for biology foundation models synthetic captions
This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We fill this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BioCAP (i.e., BioCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
940. floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
๐ฌ 20
โญ 6.00
๐ reinforcement learning
๐ท๏ธ offline RL online fine-tuning flow-matching TD-learning
A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically, they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it with techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
941. TINKER: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion Model 3D Editing
We introduce TINKER, a novel framework for high-fidelity 3D editing without any per-scene finetuning, where only a single edited image (one-shot) or a few edited images (few-shot) are required as input. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, TINKER delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Multi-view consistent editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video scene completion model : Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, TINKER significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks, while also demonstrating strong potential for 4D editing. We believe that TINKER represents a key step towards truly scalable, zero-shot 3D and 4D editing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
942. What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision Language Model Negation Understanding Affirmative Bias Described Object Detection Chain-of-Thought Reasoning Token Merging
State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
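To make the token-merging idea behind NegToMe concrete, here is a minimal sketch that binds a negation cue to the token that follows it by pooling their embeddings. The function name, the mean-pooling rule, and the tiny negation list are illustrative assumptions, not the paper's actual module.

```python
import torch

def merge_negation_tokens(tokens, embeddings, negations=("not", "no", "without")):
    """Bind a negation cue with the token that follows it into one pooled unit.

    tokens: list[str] of length T; embeddings: (T, D) tensor of token embeddings.
    Returns merged token strings and pooled embeddings. Illustrative sketch only.
    """
    merged_tokens, merged_embs = [], []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in negations and i + 1 < len(tokens):
            span = [i, i + 1]                       # e.g. "not" + "holding"
            merged_tokens.append(" ".join(tokens[j] for j in span))
            merged_embs.append(embeddings[span].mean(dim=0))
            i += 2
        else:
            merged_tokens.append(tokens[i])
            merged_embs.append(embeddings[i])
            i += 1
    return merged_tokens, torch.stack(merged_embs)

tokens = ["a", "girl", "not", "holding", "an", "umbrella"]
embs = torch.randn(len(tokens), 16)
print(merge_negation_tokens(tokens, embs)[0])
# ['a', 'girl', 'not holding', 'an', 'umbrella']
```

The point of the sketch is only that the negation survives as part of a phrase-level unit rather than as an isolated token that downstream attention can ignore.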
943. YuE: Scaling Open Foundation Models for Long-Form Music Generation
๐ฌ 9
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ lyrics2song song generation long-form foundation model music generation
We tackle the task of long-form music generation, particularly the challenging **lyrics-to-song** problem, by introducing **YuE (乐)**, a family of open-source music generation foundation models. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through **track-decoupled next-token prediction** to overcome dense mixture signals, and **structural progressive conditioning** for long-context lyrical alignment. In addition, we redesign the **in-context learning** technique for music generation, enabling bidirectional content creation, style cloning, and improved musicality. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility (as of 2025-01). We strongly encourage readers to **listen to our demo** (https://yue-anonymous.github.io).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
944. IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning
๐ฌ 19
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Vision Language Model Token Pruning
Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into how LVLMs process spatial reasoning. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as implicit visual coordinates (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose IVC-Prune, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate the identity matrix or the $90^\circ$ rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50\% while maintaining $\geq$ 99\% of the original performance and even achieving improvements on several benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
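The criterion IVC-Prune uses can be illustrated with the 2x2 rotation blocks of RoPE: positions whose rotation angle lies near 0 (identity) or near a quarter turn are the kind of positions the abstract describes as implicit visual coordinates. The single frequency and the 0.2 rad tolerance below are illustrative choices, not the paper's actual identification procedure, which analyzes the full set of RoPE frequencies.

```python
import numpy as np

def rope_block(rel_pos, freq):
    """The 2x2 rotation block RoPE applies for a given relative position and frequency."""
    a = rel_pos * freq
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

freq = 1.0 / 10000 ** (8 / 64)          # one mid-band RoPE frequency for head_dim 64 (illustrative)
for rel_pos in range(0, 40, 4):
    angle = (rel_pos * freq) % (2 * np.pi)
    near_identity = min(angle, 2 * np.pi - angle) < 0.2
    near_quarter = min(abs(angle - np.pi / 2), abs(angle - 3 * np.pi / 2)) < 0.2
    print(f"rel_pos={rel_pos:3d}  angle={angle:5.2f}  near_identity={near_identity}  near_90deg={near_quarter}")
```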
945. LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
๐ฌ 10
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Music Audio Multimodal learning Representation Learning Transformer
Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models.
This paper introduces \textit{LadderSym}, a novel Transformer-based method for music error detection. \textit{LadderSym} is guided by two key observations about state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes.
To address these limitations, \textit{LadderSym} introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores.
We evaluate our method on the \textit{MAESTRO-E} and \textit{CocoChorales-E} datasets by measuring the F1 score for each note category.
Compared to the previous state of the art, \textit{LadderSym} more than doubles F1 for missed notes on \textit{MAESTRO-E} (26.8\%~$\rightarrow$~56.3\%) and improves extra note detection by 14.4 points (72.0\%~$\rightarrow$~86.4\%).
Similar gains are observed on \textit{CocoChorales-E}. Furthermore, we also evaluate our models on real data we curated. This work introduces insights about comparison models that could inform sequence evaluation tasks for reinforcement learning, human skill assessment, and model evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
946. Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
๐ฌ 32
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Language Models Semantic Entropy Self-Consistency Reinforcement Learning
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work reveals a critical phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten later. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
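As a rough illustration of the Temporal Self-Consistency Voting idea, the sketch below runs a weighted majority vote over answers decoded at intermediate denoising steps. The uniform weights and the toy trace are assumptions for illustration; the paper's actual aggregation rule may differ.

```python
from collections import Counter

def temporal_self_consistency_vote(step_outputs, weights=None):
    """Pick the answer that recurs most consistently across denoising steps.

    step_outputs: list of decoded answers, one per intermediate denoising step.
    weights: optional per-step weights (e.g., favoring later steps).
    A minimal stand-in for the voting rule described in the abstract.
    """
    weights = weights or [1.0] * len(step_outputs)
    scores = Counter()
    for ans, w in zip(step_outputs, weights):
        scores[ans] += w
    return scores.most_common(1)[0][0]

# Toy trace with temporal oscillation: the correct "42" appears mid-trajectory
# but is overwritten at the final step; voting still recovers it.
trace = ["17", "42", "42", "42", "39"]
print(temporal_self_consistency_vote(trace))  # 42
```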
947. Modeling Interference for Treatment Effect Estimation in Network Dynamic Environment
๐ฌ 27
โญ 6.00
๐ causal reasoning
๐ท๏ธ Dynamic Network Interference Causality
In recent years, estimating causal effects of treatment on the outcome variable in network environments has attracted growing interest. The intrinsic interconnectedness of networks and the attendant violation of the SUTVA assumption have prompted a wave of treatment effect estimation methods tailored to network settings, yielding considerable progress such as capturing hidden confounders by leveraging auxiliary network structure. Nevertheless, despite these advances, the existing methods: (i) mainly focus on static networks, overlooking the dynamic nature of many real-world networks and confounders that evolve over time; and (ii) assume the absence of dynamic network interference, in which one unit's treatment can affect its neighbors' outcomes. To address these two limitations, we first define a new estimand of treatment effects accounting for interference in a dynamic network environment, i.e., CATE-ID, and establish its identifiability under such an environment. We then propose DSPNET, a framework tailored specifically for treatment effect estimation in dynamic network environments, which leverages historical information and network structure to capture time-varying confounders and model dynamic interference. Extensive experiments demonstrate the superiority of our proposed method compared to state-of-the-art approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
948. When Shift Happens - Confounding Is to Blame
๐ฌ 19
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Explainability OOD Generalization Confounding shifts
Distribution shifts introduce uncertainty that undermines the robustness and generalization capabilities of machine learning models. While conventional wisdom suggests that learning causal-invariant representations enhances robustness to such shifts, recent empirical studies present a counterintuitive finding: (i) empirical risk minimization (ERM) can rival or even outperform state-of-the-art out-of-distribution (OOD) generalization methods, and (ii) OOD generalization performance improves when all available covariates, including non-causal ones, are utilized. We present theoretical and empirical explanations that attribute this phenomenon to hidden confounding. Shifts in hidden confounding induce changes in data distributions that violate assumptions commonly made by existing approaches. Under such conditions, we prove that generalization requires learning environment-specific relationships, rather than relying solely on invariant ones. Furthermore, we explain why models augmented with non-causal but informative covariates can mitigate the challenges posed by hidden confounding shifts. These findings offer new theoretical insights and practical guidance, serving as a roadmap for future research on OOD generalization and principled covariate-selection strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
949. SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
๐ฌ 14
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ LLM Reasoning Agents Controlled Evaluation RAG
Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent *knowledge advantage gap*, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
950. Score-Based Density Estimation from Pairwise Comparisons
๐ฌ 15
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ score-based methods pairwise comparisons density estimation elicitation random utility models tempering
We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner's score via score-matching. This allows estimating the target by `de-tempering' the estimated winner density's score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
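The de-tempering step can be written schematically as below, where $\lambda(x)$ stands for the position-dependent tempering field mentioned in the abstract; the exact form of $\lambda$ and its Bradley-Terry estimator are given in the paper, so this is only a sketch of the stated collinearity.

```latex
% Schematic of the collinearity between the target ("belief") score and the
% winner-density score, with lambda(x) the position-dependent tempering field.
\[
  \nabla_x \log p_{\mathrm{win}}(x) \;=\; \lambda(x)\, \nabla_x \log p_{\mathrm{belief}}(x),
  \qquad\text{so}\qquad
  \nabla_x \log p_{\mathrm{belief}}(x) \;=\; \frac{1}{\lambda(x)}\, \nabla_x \log p_{\mathrm{win}}(x).
\]
```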
951. Provable Separations between Memorization and Generalization in Diffusion Models
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ Memorization and Generalization Diffusion Models Statistical Estimation Network Approximation
Diffusion models have achieved remarkable success across diverse domains, but they remain vulnerable to memorization---reproducing training data rather than generating novel outputs. This not only limits their creative potential but also raises concerns about privacy and safety. While empirical studies have explored mitigation strategies, theoretical understanding of memorization remains limited. We address this gap by developing a dual-separation result via two complementary perspectives: statistical estimation and network approximation. From the estimation side, we show that the ground-truth score function does not minimize the empirical denoising loss, creating a separation that drives memorization. From the approximation side, we prove that implementing the empirical score function requires network size to scale with sample size, establishing a separation from the more compact network representation of the ground-truth score function. Guided by these insights, we develop a pruning-based method that reduces memorization while maintaining generation quality in diffusion transformers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
952. Smooth Calibration Error: Uniform Convergence and Functional Gradient Analysis
๐ฌ 12
โญ 6.00
๐ learning theory
๐ท๏ธ calibration smooth calibration gradient boosting ece generalization uniform convergence
Calibration is a critical requirement for reliable probabilistic prediction, especially in high-risk applications. However, the theoretical understanding of which learning algorithms can simultaneously achieve high accuracy and good calibration remains limited, and many existing studies provide empirical validation or a theoretical guarantee in restrictive settings. To address this issue, in this work, we focus on the smooth calibration error (CE) and provide a uniform convergence bound, showing that the smooth CE is bounded by the sum of the smooth CE over the training dataset and a generalization gap. We further prove that the functional gradient of the loss function can effectively control the training smooth CE. Based on this framework, we analyze three representative algorithms: gradient boosting trees, kernel boosting, and two-layer neural networks. For each, we derive conditions under which both classification and calibration performances are simultaneously guaranteed. Our results offer new theoretical insights and practical guidance for designing reliable probabilistic models with provable calibration guarantees.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
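For reference, one common formulation of the smooth calibration error referred to in the abstract is the following; the paper may use a variant or a different normalization.

```latex
% Smooth calibration error of a predictor f with labels Y in {0,1}, in one common
% formulation from this literature; the paper's definition may differ in detail.
\[
  \mathrm{smCE}(f)
  \;=\;
  \sup_{\substack{w:\,[0,1]\to[-1,1]\\ w\ 1\text{-Lipschitz}}}
  \ \mathbb{E}\big[\, w\!\big(f(X)\big)\,\big(Y - f(X)\big) \,\big],
\]
% The training-set analogue replaces the expectation with an empirical mean; the
% paper's bound controls the population quantity by this empirical term plus a
% generalization gap.
```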
953. PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Large Multimodal Models Scientific document understanding evaluation benchmark
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of \emph{choice-only shortcuts} in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1–54.2\%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants. We provide the source code and dataset viewer in the appendix, and will release the full source code, dataset, and annotation tool publicly upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
954. Early Signs of Steganographic Capabilities in Frontier LLMs
๐ฌ 11
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI Safety Alignment CoT Monitoring AI Control Encoded Reasoning Steganography Evaluations LLMs
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities of frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances like using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
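As a trivial illustration of the kind of pre-defined encoding scheme mentioned (hexadecimal), the snippet below encodes and decodes a short message. The paper's settings additionally require hiding such content from a monitor inside a benign-looking output, which this sketch does not attempt.

```python
msg = "meet at dawn"
hex_encoded = msg.encode().hex()             # hide the message as hex digits
print(hex_encoded)                           # 6d656574206174206461776e
print(bytes.fromhex(hex_encoded).decode())   # meet at dawn
```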
955. Corner Gradient Descent
๐ฌ 10
โญ 6.00
๐ optimization
๐ท๏ธ mini-batch stochastic gradient descent momentum sampling noise convergence rates acceleration power laws phase diagram contour integration rational approximations asymptotic methods MNIST frequency response function
We consider SGD-type optimization on infinite-dimensional quadratic problems with power law spectral conditions. It is well-known that on such problems deterministic GD has loss convergence rates $L_t=O(t^{-\zeta})$, which can be improved to $L_t=O(t^{-2\zeta})$ by using Heavy Ball with a non-stationary Jacobi-based schedule (and the latter rate is optimal among fixed schedules). However, in the mini-batch Stochastic GD setting, the sampling noise causes the Jacobi HB to diverge; accordingly no $O(t^{-2\zeta})$ algorithm is known. In this paper we show that rates up to $O(t^{-2\zeta})$ can be achieved by a generalized stationary SGD with infinite memory. We start by identifying generalized (S)GD algorithms with contours in the complex plane. We then show that contours that have a corner with external angle $\theta\pi$ accelerate the plain GD rate $O(t^{-\zeta})$ to $O(t^{-\theta\zeta})$. For deterministic GD, increasing $\theta$ allows to achieve rates arbitrarily close to $O(t^{-2\zeta})$. However, in Stochastic GD, increasing $\theta$ also amplifies the sampling noise, so in general $\theta$ needs to be optimized by balancing the acceleration and noise effects. We prove that the optimal rate is given by $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$, where $\nu,\zeta$ are the exponents appearing in the capacity and source spectral conditions. Furthermore, using fast rational approximations of the power functions, we show that ideal corner algorithms can be efficiently approximated by practical finite-memory algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
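A worked instance of the rate formula quoted in the abstract, using illustrative exponents $\nu=4$ and $\zeta=1$:

```latex
% Optimal corner angle and resulting rate for nu = 4, zeta = 1 (illustrative values).
\[
  \theta_{\max} \;=\; \min\!\Big(2,\ \nu,\ \tfrac{2}{\zeta + 1/\nu}\Big)
  \;=\; \min\!\big(2,\ 4,\ \tfrac{2}{1.25}\big) \;=\; 1.6,
  \qquad
  L_t \;=\; O\!\big(t^{-\theta_{\max}\zeta}\big) \;=\; O\!\big(t^{-1.6}\big),
\]
% compared with the plain (S)GD rate O(t^{-zeta}) = O(t^{-1}) at zeta = 1.
```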
956. Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
๐ฌ 19
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Neuroscience Functional Magnetic Resonance Imaging Image reconstruction Reconstruction
Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli (essentially images) from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pre-trained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively.
We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text-image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object-centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute-relationship search module that automatically identifies key attributes and relationships that best align with the neural activity.
Extensive experiments on real-world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
957. Gogo: Group-wise granularity-ordered codec for stable and efficient speech generation
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ speech codec speech language model speech generation text-to-speech synthesis
Current speech language models require their core component, the speech codec, to discretize continuous speech signals into tokens that not only capture high-level cues for autoregressive modeling but also preserve sufficient acoustic details for perceptual quality. To address this need, we propose Gogo, a group-wise granularity-ordered codec that quantizes each group of frames into tokens arranged from coarse to fine, where coarse tokens encode high-level abstractions and fine tokens progressively recover low-level details. Building on the granularity-ordering property of Gogo, we introduce GogoSpeech, a two-stage speech language model that performs speech generation by first constructing a coarse speech backbone at an extremely low token rate and then enriching the backbone with fine-grained acoustic details. Considering the inherently non-uniform information distribution in speech signals, we further design a Group Relative Policy Optimization (GRPO)-trained token allocator that adaptively allocates token budgets to groups based on group-wise complexity. Experimental results demonstrate that Gogo delivers state-of-the-art reconstruction performance across most metrics at a token rate of 47. Moreover, evaluations on zero-shot text-to-speech tasks show that GogoSpeech enables efficient generation by adaptively reducing the average token rate, and attains state-of-the-art results in long-form speech generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
958. PrismAudio: Decomposed Chain-of-Thought and Multi-dimensional Rewards for Video-to-Audio Generation
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Chain-of-Thought Reinforcement Learning Video-to-Audio Generation
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce **PrismAudio**, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables **multidimensional RL optimization** that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose **Fast-GRPO**, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce **AudioCanvas**, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at~\url{https://PrismAudio.github.io}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
959. Intention-Conditioned Flow Occupancy Models
๐ฌ 16
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning flow matching latent variable model pre-training and fine-tuning
Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across *time* is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method **intention-conditioned flow occupancy models (InFOM)**. Compared with alternative methods for pre-training, our experiments on $36$ state-based and $4$ image-based benchmark tasks demonstrate that the proposed method achieves $1.8 \times$ median improvement in returns and increases success rates by $36\%$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
960. Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers
๐ฌ 8
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Sparse Autoencoders SAEs LLMs interpretability
Sparse autoencoders (SAEs) are widely used to extract sparse, interpretable latents from transformer activations. We test whether commonly used SAE quality metrics and automatic explanation pipelines can distinguish trained transformers from randomly initialized ones (e.g., where parameters are sampled i.i.d. from a Gaussian). Over a wide range of Pythia model sizes and multiple randomization schemes, we find that, in many settings, SAEs trained on randomly initialized transformers produce auto-interpretability scores and reconstruction metrics that are similar to those from trained models. These results show that high aggregate auto-interpretability scores do not, by themselves, guarantee that learned, computationally relevant features have been recovered. We therefore recommend treating common SAE metrics as useful but insufficient proxies for mechanistic interpretability and argue for routine randomized baselines and targeted measures of feature 'abstractness'.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
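To fix intuitions about what is being measured here, below is a minimal sparse autoencoder with a reconstruction-plus-L1 objective of the kind commonly trained on transformer activations. The dimensions, penalty weight, and random activations are placeholders; the paper's point is that the same pipeline and its aggregate metrics can be run on activations from either a trained or a randomly initialized model, and the scores alone may not tell the two apart.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """A minimal sparse autoencoder over d-dimensional activations (illustrative)."""
    def __init__(self, d, n_latents, l1=1e-3):
        super().__init__()
        self.enc = nn.Linear(d, n_latents)
        self.dec = nn.Linear(n_latents, d)
        self.l1 = l1

    def forward(self, x):
        z = torch.relu(self.enc(x))
        x_hat = self.dec(z)
        loss = ((x - x_hat) ** 2).mean() + self.l1 * z.abs().mean()
        return x_hat, z, loss

acts = torch.randn(4096, 128)          # stand-in for residual-stream activations
sae = TinySAE(d=128, n_latents=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    _, _, loss = sae(acts)
    loss.backward()
    opt.step()
print(float(loss))                     # reconstruction + sparsity objective after training
```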
961. TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Natural Language Processing AI/NLP for Science Large Language Models Vision Language Models Reinforcement Learning Code Generation Representation Learning
Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
962. Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
๐ฌ 11
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ sequential recommendation systems generative recommendation production-scale data user interaction history
Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance model performance.
The advent of large language models and sequential modeling techniques, particularly transformer architectures, has led to significant advancements (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges in latency, queries per second (QPS), and GPU cost for industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely \emph{VIrtual Sequential Target Attention} (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization into a few hundred tokens, followed by (2) candidate item attention to those tokens. These summarization token embeddings are then cached in a storage system and utilized as sequence features for downstream model training and inference. This design enables VISTA to scale to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed, which is essential in industry.
Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industrial platform serving billions of users.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
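A schematic of the two-stage decomposition VISTA describes is sketched below: a learnable set of query tokens summarizes a very long history once (and could be cached), and a candidate item then attends only to those summary tokens. The layer sizes, number of summary tokens, and toy scoring rule are assumptions; the production architecture, token counts, and caching system differ.

```python
import torch
import torch.nn as nn

class HistorySummarizer(nn.Module):
    """Stage 1: compress a long user-history sequence into a few summary tokens."""
    def __init__(self, d=64, n_summary=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_summary, d))
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, history):                         # history: (B, T, d), T may be huge
        q = self.queries.unsqueeze(0).expand(history.size(0), -1, -1)
        summary, _ = self.attn(q, history, history)     # (B, n_summary, d)
        return summary                                  # cached offline as sequence features

class CandidateScorer(nn.Module):
    """Stage 2: a candidate item attends only to the cached summary tokens."""
    def __init__(self, d=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, candidate, summary):              # candidate: (B, 1, d)
        out, _ = self.attn(candidate, summary, summary)
        return (out.squeeze(1) * candidate.squeeze(1)).sum(-1)  # toy relevance score

history = torch.randn(2, 10_000, 64)                    # stand-in for a long user history
summary = HistorySummarizer()(history)
score = CandidateScorer()(torch.randn(2, 1, 64), summary)
print(summary.shape, score.shape)                       # (2, 8, 64), (2,)
```

The key property the sketch illustrates is that downstream cost depends only on the fixed number of summary tokens, not on the raw history length.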
963. KANO: Kolmogorov-Arnold Neural Operator
๐ฌ 9
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Operator Operator Network KAN SciML AI4Science Interpretable AI
We introduce Kolmogorov-Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of the Fourier Neural Operator (FNO): KANO remains expressive for generic position-dependent dynamics on any physical input, whereas FNO is practical only for spectrally sparse operators and strictly requires a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, on which KANO generalizes robustly while FNO fails. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients, and attains $\approx6\times10^{-6}$ state infidelity from projective measurement data, outperforming by orders of magnitude the FNO trained with ideal full wave function data ($\approx1.5\times10^{-2}$).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
964. GenDR: Lighten Generative Detail Restoration
๐ฌ 18
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Super-Resolution Score distillation
Although recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable progress, the misalignment of their targets leads to a suboptimal trade-off between inference speed and detail fidelity. Specifically, the T2I task requires multiple inference steps to synthesize images matching the prompts and reduces the latent dimension to lower the difficulty of generation. In contrast, SR can restore high-frequency details in fewer inference steps, but it requires a more reliable variational auto-encoder (VAE) to preserve input information. However, most diffusion-based SR models are multi-step and use 4-channel VAEs, while existing models with 16-channel VAEs are overqualified diffusion transformers, e.g., FLUX (12B). To align the targets, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with a larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand the latent space without increasing the model size. Regarding step distillation, we propose consistent score identity distillation (CiD), which incorporates an SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
965. LatentQA: Teaching LLMs to Decode Activations Into Natural Language
๐ฌ 12
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI Safety Activation Engineering Top-Down Transparency of Language Models
Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language and perform LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a pseudo-labeled dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder's fidelity by assessing its ability to read and steer model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size, which is promising given how easily our approach can generate additional pseudo-labels.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
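One way to picture the pseudo-labeling step described for LatentQA is sketched below: run the model on a prompt whose property is known by construction, cache an activation, and pair it with a question-answer pair about that property. The layer index, prompt, and QA template here are hypothetical, and the paper's pipeline is considerably more elaborate; the `output_hidden_states` interface is the standard Hugging Face transformers one.

```python
# Minimal sketch of building (activation, question, answer) triples for
# LatentQA-style training. Prompts, layer choice, and templates are hypothetical.
import torch

def collect_example(model, tokenizer, prompt, layer, question, answer):
    with torch.no_grad():
        out = model(**tokenizer(prompt, return_tensors="pt"), output_hidden_states=True)
    activation = out.hidden_states[layer][0, -1]   # last-token activation at the chosen layer
    return {"activation": activation, "question": question, "answer": answer}

# Example (not executed here): a prompt with a known persona lets us auto-label
# a question about that persona.
# collect_example(model, tok, "System: You are a pirate.\nUser: hi", layer=16,
#                 question="What persona is the model adopting?", answer="A pirate.")
```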
966. Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents
๐ฌ 27
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Earth observation Earth-Agent Earth-Bench
Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols.
To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
967. MoRA: Mobility as the Backbone for Geospatial Representation Learning at Scale
๐ฌ 11
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ GeoAI spatial representation learning location embedding multi-modal contrastive learning
Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence, with increasingly diverging philosophies and techniques. While Earth observation paradigms excel at depicting locations in their physical states, we propose that a location's full characterization requires grounding in both its physical attributes and its internal human activity pattern, the latter being particularly crucial for understanding its human-centric functions. We present MoRA, a human-centric geospatial framework that leverages a mobility graph as its core backbone to fuse various data modalities, aiming to learn embeddings that represent the socio-economic context and functional role of a location. MoRA achieves this through the integration of spatial tokenization, GNNs, and asymmetric contrastive learning to align 100M+ POIs, massive remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph, ensuring the three auxiliary modalities are interpreted through the lens of fundamental human dynamics. To rigorously evaluate the effectiveness of MoRA, we construct a benchmark dataset composed of 9 downstream prediction tasks across social and economic domains. Experiments show that MoRA, with four input modalities and a compact 128-dimensional representation space, outperforms state-of-the-art models by an average of 12.9\%. Echoing LLM scaling laws, we further demonstrate scaling behavior in geospatial representation learning. We open-source code and pretrained models at: https://anonymous.4open.science/r/MoRA-.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
968. Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
๐ฌ 17
โญ 6.00
๐ optimization
๐ท๏ธ Adam implicit bias separable data adaptive algorithms mini-batch
Adam [Kingma & Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and show that its bias can deviate from the full-batch behavior. As an extreme example, we construct datasets on which incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we characterize its bias using a proxy algorithm for the $\beta_2 \to 1$ limit. This proxy maximizes a data-adaptive Mahalanobis-norm margin, whose associated covariance matrix is determined by a data-dependent dual fixed-point formulation. We further present concrete datasets where this bias reduces to the standard $\ell_2$- and $\ell_\infty$-max-margin classifiers. As a counterpoint, we prove that Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
969. Pretraining Scaling Laws for Generative Evaluations of Language Models
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ language models large language models scaling laws evaluations generative evaluations sampling
Neural scaling laws have driven the field's ever-expanding exponential growth in parameters, data and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored.
We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using cheaper models.
Our three scaling laws differ in the covariates used: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log likelihoods of gold reference solutions.
First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, $k$) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance.
Second, we identify a stark difference in parameter stability: while the compute and parameters+tokens laws stabilize for only the last $\mathord{\sim}1.5\mathord{-}2.5$ orders of magnitude, the gold reference likelihood law is uniquely stable, converging across $\mathord{\sim}5$ orders. Third, in terms of predictive performance, we find all three scaling laws perform comparably, although the compute law predicts slightly worse for small $k$ and the gold reference law predicts slightly worse for large $k$. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance, accelerating progress toward models that can reason, solve, and create.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
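Two ingredients from this abstract can be made concrete: the standard unbiased pass-at-$k$ estimator (Chen et al., 2021) and a saturating power-law fit against pretraining compute. The asymptote, functional form, and toy data points below are illustrative assumptions, not the paper's fitted laws.

```python
import numpy as np
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy compute-based fit: pass@k(C) ~ A - B * C**(-alpha). Purely illustrative of the
# "covariate = pretraining compute" variant; the paper's functional forms may differ.
compute = np.array([1e19, 1e20, 1e21, 1e22])   # hypothetical training FLOPs
scores = np.array([0.05, 0.15, 0.32, 0.45])    # hypothetical pass@k values
A = 0.6                                         # assumed asymptote for the sketch
slope, intercept = np.polyfit(np.log(compute), np.log(A - scores), 1)
alpha, B = -slope, np.exp(intercept)
print(pass_at_k(n=100, c=7, k=10), alpha, B)
```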
970. Low-Latency Neural LiDAR Compression with 2D Context Models
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Data Compression
Context modeling is fundamental to LiDAR point cloud compression. Existing methods rely on computationally intensive 3D contexts, such as voxel and octree, which struggle to balance the compression efficiency and coding speed. In this work, we propose a neural LiDAR compressor based on 2D context models that simultaneously supports high-efficiency compression, fast coding, and universal geometry-intensity compression. The 2D context structure significantly reduces the coding latency. We further develop a comprehensive context model that integrates spatial latents, temporal references, and cross-modal camera context in the 2D domain to enhance the compression performance. Specifically, we first represent the point cloud as a range image and propose a multi-scale spatial context model to capture the intra-frame dependencies. Furthermore, we design an optical-flow-based temporal context model for inter-frame prediction. Moreover, we incorporate a deformable attention module and a context refinement strategy to predict LiDAR scans from camera images. In addition, we develop a backbone for joint geometry and intensity compression, which unifies the compression of both modalities while minimizing redundant computation. Experiments demonstrate significant improvements in both rate-distortion performance and coding speed. The code will be released upon the acceptance of the paper.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
971. StreamingVLM: Real-Time Understanding for Infinite Video Streams
๐ฌ 8
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Machine learning Vision Language Model ML System
Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage.
Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation.
In this paper, we introduce **StreamingVLM**, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference.
During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens.
This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts.
For evaluation, we build **Inf-Streams-Eval**, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text.
On Inf-Streams-Eval, **StreamingVLM** achieves a **66.18%** win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.
Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96.
Code will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
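The KV-cache policy StreamingVLM describes (attention sinks plus separate recent windows for vision and text tokens) can be sketched as a simple retention rule. The window sizes and data layout below are assumptions for illustration, not the released implementation.

```python
def retain_kv(entries, n_sink=4, vision_window=256, text_window=512):
    """Decide which cached tokens to keep for streaming inference.

    entries: list of dicts with keys 'idx' (position) and 'type' ('vision' | 'text').
    Keeps the first n_sink tokens (attention sinks), the most recent vision tokens,
    and a longer window of recent text tokens. A schematic of the policy described
    in the abstract.
    """
    keep = {e["idx"] for e in entries[:n_sink]}
    vision = [e["idx"] for e in entries if e["type"] == "vision"]
    text = [e["idx"] for e in entries if e["type"] == "text"]
    keep.update(vision[-vision_window:])
    keep.update(text[-text_window:])
    return sorted(keep)

stream = [{"idx": i, "type": "vision" if i % 3 else "text"} for i in range(10_000)]
print(len(retain_kv(stream)), "of", len(stream), "tokens retained")
```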
972. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
๐ฌ 13
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ LLM Agents; Agents with Memory; Memory Agents Benchmark; Evaluation for Memory
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory, encompassing how agents memorize, update, and retrieve long-term information, is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmarks cover all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality.
We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
973. IC-Custom: Diverse Image Customization via In-Context Learning
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ image customization image generation image editing diffusion model diffusion transformer
Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to effectively handle diverse tasks and distinguish between inputs in polyptych configurations. To address the data gap, we curated a 12K identity-consistent dataset with 8K real-world and 4K high-quality synthetic samples, avoiding the overly glossy, oversaturated look typical of synthetic data. IC-Custom supports various industrial applications, including try-on, image insertion, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves about 73\% higher human preference across identity consistency, harmony, and text alignment metrics, while training only 0.4\% of the original model parameters.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
974. RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Visual culture understanding Cultural benchmark Multimodal retrieval-augmented generation
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 unique Wikipedia documents curated and ranked by human annotators. Through extensive evaluation of seven multimodal retrievers and fifteen VLMs, RAVENEA reveals several previously unreported findings: (i) In general, cultural grounding annotations can enhance multimodal retrieval and corresponding downstream tasks. (ii) Lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% on cVQA and 6.2% on cIC). (iii) Performance varies widely across countries, with culture-aware retrieval-augmented VLMs showing more stable results in Korean and Chinese contexts than in other countries. These findings highlight the critical limitations of current multimodal retrievers and VLMs, and underscore the need to enhance RAG for visual culture understanding. Our RAVENEA can serve as a foundational tool for advancing the study of RAG-based visual culture understanding in multimodal AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
975. MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
๐ฌ 18
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Data Generation for Robot Learning Bimanual Mobile Manipulation Imitation Learning for Robotics
Imitation learning from large-scale, diverse human demonstrations has proven effective for training robots, but collecting such data is costly and time-consuming. This challenge is amplified for multi-step bimanual mobile manipulation, where humans must teleoperate both a mobile base and two high-degree-of-freedom arms. Prior automated data generation frameworks have addressed static bimanual manipulation by augmenting a few human demonstrations in simulation, but they fall short for mobile settings due to two key challenges: (1) determining base placement to ensure reachability, and (2) positioning the camera to provide sufficient visibility for visuomotor policies. To address these issues, we introduce MoMaGen, which formulates data generation as a constrained optimization problem that enforces hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes prior approaches and provides a principled foundation for future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and show that it generates significantly more diverse datasets than existing methods. Leveraging this diversity, MoMaGen can train successful imitation learning policies from a single source demonstration, and these policies can be fine-tuned with as few as 40 real-world demonstrations to achieve deployment on physical robotic hardware. More details are available at our project page: momagen-iclr2026.github.io.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
976. MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
💬 23
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ MLLM Reasoning
While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematical and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
977. Continuous Space-Time Video Super-Resolution with 3D Fourier Fields
💬 14
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ continuous space-time video super-resolution arbitrary-scale super-resolution low-level vision
We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). That representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it is able to simultaneously capture fine spatial detail and smooth temporal dynamics; and (3) it offers the possibility to include an analytical, Gaussian point spread function in the sampling to ensure aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed, Fourier-like sinusoidal basis are predicted with a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art for multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Code will be published upon acceptance.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
978. TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
💬 9
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ Remote Sensing Foundation Model Geospatial
Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models will be publicly released.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
979. CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
💬 19
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Multimodal Large Language Model Knowledge Distillation
Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a solution to their high computational complexity, making them more practical for real-world applications. In this regard, the knowledge distillation (KD) approach has emerged as a promising alternative, which transfers the rich visual and linguistic knowledge from a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities while maintaining strong performance on visual question answering tasks, as done in existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
980. From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization
💬 16
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Large Language Model Subtitle Translation Preference Optimization
The rapid development of Large Language Models (LLMs) has significantly enhanced the general capabilities of machine translation. However, as application scenarios become more complex, the limitations of LLMs in vertical-domain translation are gradually becoming apparent. In this study, we focus on constructing translation LLMs that meet the needs of domain customization. Taking visual-media subtitle translation as our topic, we explore how to train expressive and vivid translation LLMs. We examine subtitle translation alongside other literal and liberal translation domains, verifying the reliability of LLMs as reward models and evaluators for translation. Additionally, to train an expressive translation LLM, we construct and release a multidirectional subtitle parallel corpus dataset and propose the Adaptive Local Preference Optimization (ALPO) method to address fine-grained preference alignment. Experimental results demonstrate that ALPO achieves outstanding performance in multidimensional evaluation of translation quality.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
981. AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining
💬 8
⭐ 6.00
📍 datasets and benchmarks
🏷️ Alpha Mining LLM Benchmark LLM Agent Data Science and Engineering
Formulaic alpha factor mining (FAFM) is a central problem in quantitative investment, where interpretable formulas are designed to extract predictive signals from historical financial series. With the emergence of large language models (LLMs), recent studies have begun to explore their roles in FAFM, yet their capabilities across different tasks and configurations remain unclear. In this work, we introduce AlphaBench, the first systematic benchmark for evaluating LLMs in FAFM. AlphaBench covers three core tasks, including factor generation, factor evaluation, and factor searching, which are all popular tasks integrated in the workflow of quantitative researchers. Beyond task-level evaluation, we further analyze how different LLM settings, including model type, prompting paradigm, and reasoning strategy, influence performance. Our experiments on a range of open-source and closed-source models reveal that LLMs hold strong potential in automating factor mining, while also facing persistent challenges in robustness, search efficiency, and practical usability.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
982. WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
💬 12
⭐ 6.00
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ audio-visual embeddings multimodal LLMs video retrieval
While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
983. NetArena: Dynamically Generated LLM Benchmarks for Network Applications
💬 11
⭐ 6.00
📍 datasets and benchmarks
🏷️ LLM for Network Systems Dynamic Benchmark
As large language models (LLMs) expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to static design, show high statistical variance from limited dataset size, and fail to reflect the complexity of production environments. We introduce NetArena, a dynamic benchmark generation framework for network applications. NetArena features a novel abstraction and unified interface that generalizes across applications, effectively addressing the challenges of dynamic benchmarking posed by the diversity of network tasks. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to provide execution-time feedback on correctness, safety, and latency. We demonstrate NetArena on three representative applications and find that (1) it significantly improves statistical reliability among LLM agents (confidence interval overlap reduced from 85% to 0), (2) agents achieve only 13-38% average performance (as low as 3%) for large-scale, realistic queries, and (3) it reveals finer-grained behaviors missed by static, correctness-only benchmarks. NetArena also enables use cases such as SFT and RL fine-tuning on network system tasks. Code is available anonymously at https://anonymous.4open.science/r/netarena_iclr2026-BE94/README.md
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
984. Mordal: Automated Pretrained Model Selection for Vision Language Models
💬 10
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Multimodal Model Vision Language Model Model Selection
Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks.
Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility.
Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models.
We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention.
Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate.
Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$–$11.6\times$ fewer GPU hours than grid search.
We also find that Mordal achieves about 69% higher weighted Kendall's $\tau$ on average than the state-of-the-art model selection method across diverse tasks.
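As a side note on the reported metric: a minimal sketch of how a (weighted) Kendall's tau between a cheap proxy ranking of candidate VLMs and the ground-truth ranking can be computed, using scipy's generic implementations. The scores below are hypothetical and this is not Mordal's evaluation code or protocol.

```python
# Illustrative only: rank correlation between a proxy model ranking and the
# ground-truth ranking. Scores are made up; not Mordal's actual pipeline.
import numpy as np
from scipy.stats import kendalltau, weightedtau

# Hypothetical fine-tuned accuracies (ground truth) and proxy selection scores
# for six candidate VLMs.
true_scores = np.array([0.71, 0.64, 0.69, 0.55, 0.62, 0.58])
proxy_scores = np.array([0.68, 0.61, 0.70, 0.50, 0.65, 0.54])

tau, _ = kendalltau(true_scores, proxy_scores)    # plain Kendall's tau
wtau, _ = weightedtau(true_scores, proxy_scores)  # top-weighted variant
print(f"Kendall tau: {tau:.3f}, weighted tau: {wtau:.3f}")
```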
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
985. OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation
💬 11
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ diffusion model text image manipulation scene text editing
Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
986. GmNet: Revisiting Gating Mechanisms From A Frequency View
💬 18
⭐ 6.00
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ Efficient model lightweight network design
Lightweight neural networks, essential for on-device applications, often suffer from a low-frequency bias due to their constrained capacity and depth.
This limits their ability to capture the fine-grained, high-frequency details (e.g., textures, edges) that are crucial for complex computer vision tasks. To address this fundamental limitation, we perform the first systematic analysis of gating mechanisms from a frequency perspective.
Inspired by the convolution theorem, we show how the interplay between element-wise multiplication and non-linear activation functions within Gated Linear Units (GLUs) provides a powerful mechanism to selectively amplify high-frequency signals, thereby enriching the model's feature representations.
Based on these findings, we introduce the Gating Mechanism Network (GmNet), a simple yet highly effective architecture that incorporates our frequency-aware gating principles into a standard lightweight backbone. The efficacy of our approach is remarkable: without relying on complex training strategies or architectural search, GmNet achieves a new state-of-the-art for efficient models.
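A minimal numpy check of the convolution-theorem intuition behind gating that the abstract invokes: a pointwise product of two band-limited signals contains frequency content at the sum of their frequencies, so multiplicative gates can inject higher frequencies than either input carries. This sketches the frequency argument only; it is not GmNet's architecture or code.

```python
# Pointwise multiplication in the spatial domain is convolution in the
# frequency domain: cos(2*pi*5*t) * cos(2*pi*7*t) has components at 2 and 12
# cycles, i.e. higher frequency than either factor.
import numpy as np

n = 1024
t = np.arange(n) / n
x = np.cos(2 * np.pi * 5 * t)   # low-frequency "feature" (5 cycles)
g = np.cos(2 * np.pi * 7 * t)   # low-frequency "gate"    (7 cycles)

def dominant_freqs(signal, k):
    """Return the k strongest non-negative FFT bins (in cycles per window)."""
    spec = np.abs(np.fft.rfft(signal))
    return sorted(int(i) for i in np.argsort(spec)[-k:])

print("frequencies in x     :", dominant_freqs(x, 1))      # [5]
print("frequencies in g     :", dominant_freqs(g, 1))      # [7]
print("frequencies in x * g :", dominant_freqs(x * g, 2))  # [2, 12]
```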
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
987. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
💬 13
⭐ 6.00
📍 datasets and benchmarks
🏷️ Tool-using Agent; Real-World Tasks; Model Context Protocol
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Also, tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows; these capabilities are not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, and trajectory-level planning and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
988. Efficient Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection
💬 25
⭐ 6.00
📍 applications to neuroscience & cognitive science
🏷️ Functional Connectivity Benchmark Core-set Selection Network Modeling Structure-aware Sampling
Benchmarking the hundreds of available functional connectivity (FC) models on large fMRI datasets is critical for reproducible neuroscience, but is often computationally infeasible, with full-scale comparisons requiring months of compute time. This creates a critical bottleneck, hindering data-driven model selection. To break this bottleneck, we address the challenge of FC benchmarking by introducing a pre-analytical step: selecting a small, representative core-set whose sole purpose is to preserve the relative performance ranking of FC models. We formulate this as a ranking recommendation problem and propose Structure-aware Contrastive Learning for Core-set Selection (SCLCS), a self-supervised framework to select these core-sets. SCLCS first uses an adaptive Transformer to learn each sample's unique FC structure. It then introduces a novel Structural Perturbation Score (SPS) to quantify the stability of these learned structures during training, identifying samples that represent foundational connectivity archetypes. Finally, it combines this stability-based ranking with a density-aware sampling strategy to ensure the selected core-set is both robust and diverse. On the large-scale REST-meta-MDD dataset, SCLCS preserves the ground-truth model ranking with just 10% of the data, outperforming state-of-the-art (SOTA) selection methods by up to 23.2% in ranking consistency (nDCG@k). To our knowledge, this is the first work to formalize core-set selection for FC model benchmarking, making previously intractable large-scale model comparisons feasible.
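For readers unfamiliar with the ranking-consistency metric reported above, a minimal sketch of nDCG@k over a model ranking: treat the full-data performance of each FC model as its relevance and score the ranking induced by the core-set. The numbers are hypothetical; this is not the SCLCS evaluation code.

```python
# Generic nDCG@k for ranking consistency between a core-set-induced model
# ranking and the full-data ranking. Hypothetical relevance values only.
import numpy as np

def ndcg_at_k(relevance_in_predicted_order, k):
    """relevance_in_predicted_order[i] = true relevance of the model that the
    core-set ranking places at position i (higher = better model)."""
    rel = np.asarray(relevance_in_predicted_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum(rel * discounts)
    ideal = np.sort(np.asarray(relevance_in_predicted_order, dtype=float))[::-1][:k]
    idcg = np.sum(ideal * discounts[: ideal.size])
    return dcg / idcg

# Six FC models, relevance = full-dataset rank (best model = 6), listed in the
# order the core-set benchmark ranked them.
core_set_order_relevance = [6, 4, 5, 3, 1, 2]
print(f"nDCG@5 = {ndcg_at_k(core_set_order_relevance, k=5):.3f}")
```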
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
989. Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning
💬 13
⭐ 6.00
📍 optimization
🏷️ Block-Coordinate Optimization SignSGD Large Language Models (LLMs) Memory-Efficient Fine-Tuning
We propose ABSignSGD, a block-coordinate variant of sign-based descent with flexible block selection that enables memory- and runtime-efficient full-parameter fine-tuning of large language models. We present a unified convergence analysis under mild conditions, covering both the base method and a majority-vote extension for distributed training. The latter improves communication efficiency by aggregating only gradient signs rather than averaging full gradients. Experiments on Qwen3-8B, Llama3-8B, and Qwen3-32B, spanning mathematical reasoning and general instruction-following tasks, show that ABSignSGD converges faster per iteration and delivers superior downstream performance while reducing both runtime and memory usage compared to existing methods. Ablation studies further indicate that the memoryless sign-based update naturally complements block-wise updates, explaining the method's strong empirical performance.
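A minimal sketch of the block-coordinate sign update the abstract describes, on a toy quadratic: at each step a single block of parameters is moved by the sign of its gradient. The cyclic block order and constant stepsize below are placeholder assumptions, not ABSignSGD's actual block-selection rule or schedule.

```python
# Block-coordinate Sign-SGD on f(x) = 0.5 * x^T H x, with a cyclic block order
# and a fixed stepsize (both placeholder choices for illustration).
import numpy as np

rng = np.random.default_rng(0)
d, n_blocks, steps, lr = 64, 4, 400, 1e-2
A = rng.standard_normal((d, d))
H = A.T @ A + np.eye(d)          # positive-definite quadratic
x = rng.standard_normal(d)
blocks = np.array_split(np.arange(d), n_blocks)

for t in range(steps):
    block = blocks[t % n_blocks]            # cycle through parameter blocks
    grad_block = (H @ x)[block]             # gradient restricted to the block
    x[block] -= lr * np.sign(grad_block)    # sign-only, memoryless update

print("final objective:", 0.5 * x @ H @ x)
```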
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
990. MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
💬 27
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Gaussian Splatting Dynamic Scene Reconstruction
Recent advances in dynamic scene reconstruction have significantly benefited from 3D Gaussian Splatting, yet existing methods show inconsistent performance across diverse scenes, indicating no single approach effectively handles all dynamic challenges. To overcome these limitations, we propose Mixture of Experts for Dynamic Gaussian Splatting (MoE-GS), a unified framework integrating multiple specialized experts via a novel Volume-aware Pixel Router. Unlike sparsity-oriented MoE architectures in large language models, MoE-GS is designed to improve dynamic novel view synthesis quality by combining heterogeneous deformation priors, rather than to reduce training or inference-time FLOPs. Our router adaptively blends expert outputs by projecting volumetric Gaussian-level weights into pixel space through differentiable weight splatting, ensuring spatially and temporally coherent results. Although MoE-GS improves rendering quality, the increased model capacity and reduced FPS are inherent to the MoE architecture. To mitigate this, we explore two complementary directions: (1) single-pass multi-expert rendering and gate-aware Gaussian pruning, which improve efficiency within the MoE framework, and (2) a distillation strategy that transfers MoE performance to individual experts, enabling lightweight deployment without architectural changes. To the best of our knowledge, MoE-GS is the first approach incorporating Mixture-of-Experts techniques into dynamic Gaussian splatting. Extensive experiments on the N3V and Technicolor datasets demonstrate that MoE-GS consistently outperforms state-of-the-art methods with improved efficiency. Video demonstrations are available at https://huggingface.co/spaces/moegs/MoE-GS.
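A minimal sketch of the final blending step such a pixel router implies: given per-pixel routing weights over several expert renderings, the output is a per-pixel convex combination of the experts. How MoE-GS actually obtains the weights (differentiable weight splatting of Gaussian-level scores) is not reproduced here; the softmax blend is an illustrative assumption.

```python
# Per-pixel softmax over experts, then a convex combination of expert renders.
# Hypothetical shapes; not the MoE-GS implementation.
import numpy as np

H, W, C, n_experts = 4, 4, 3, 3
rng = np.random.default_rng(0)
expert_renders = rng.random((n_experts, H, W, C))    # one RGB image per expert
router_logits = rng.standard_normal((n_experts, H, W))

w = np.exp(router_logits - router_logits.max(axis=0, keepdims=True))
w /= w.sum(axis=0, keepdims=True)                    # (n_experts, H, W)
blended = (w[..., None] * expert_renders).sum(axis=0)  # (H, W, C)
print(blended.shape)
```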
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
991. SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models
💬 16
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Efficient Reasoning Large Multimodal Models
Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead in terms of higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train the SketchJudge reward model, which explicitly evaluates the model's thinking process and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking Reinforcement Learning under the supervision of SketchJudge to further generalize the sketch-style reasoning ability. Experimental evaluation on four benchmarks reveals that SketchThinker-R1 achieves over a 64% reduction in reasoning token cost without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
992. Training Deep Normalization-Free Spiking Neural Networks with Lateral Inhibition.
💬 25
⭐ 6.00
📍 applications to neuroscience & cognitive science
🏷️ Spiking Neural Networks Normalization Excitation-Inhibition Balance Lateral Inhibition
Spiking neural networks (SNNs) have garnered significant attention as a central paradigm in neuromorphic computing, owing to their energy efficiency and biological plausibility. However, training deep SNNs has critically depended on explicit normalization schemes, leading to a trade-off between performance and biological realism. To resolve this conflict, we propose a normalization-free learning framework that incorporates lateral inhibition inspired by cortical circuits. Our framework replaces the traditional feedforward SNN layer with a circuit of distinct excitatory (E) and inhibitory (I) neurons that captures the features of the canonical architecture of cortical E-I circuits. The circuit dynamically regulates neuronal activity through subtractive and divisive inhibition, which respectively control the activity and the gain of excitatory neurons. To enable and stabilize end-to-end training of the biologically constrained SNN, we propose two key techniques: E-I Init and E-I Prop. E-I Init is a dynamic parameter initialization scheme that balances excitatory and inhibitory inputs while performing gain control. E-I Prop decouples the backpropagation of the E-I circuits from the forward pass and regulates gradient flow.
Experiments across multiple datasets and network architectures demonstrate that our framework enables stable training of deep normalization-free SNNs with biological realism and achieves competitive performance without resorting to explicit normalization schemes. Therefore, our work not only provides a solution to training deep SNNs but also serves as a computational platform for further exploring the functions of E-I interactions in large-scale cortical computation.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
993. PixelCraft: A Multi-Agent system for High-Fidelity Visual Reasoning on Structured Images
💬 15
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ chart understanding multi-agent system visual reasoning
Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and rigid, linear reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents.
Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism.
Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
994. Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
💬 14
⭐ 6.00
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ Imbalance Diffusion Models
While diffusion models have achieved remarkable performance in image generation, they often struggle with the imbalanced datasets frequently encountered in real-world applications, resulting in significant performance degradation on minority classes. In this paper, we identify model capacity allocation as a key and previously underexplored factor contributing to this issue, providing a perspective that is orthogonal to existing research. Our empirical experiments and theoretical analysis reveal that majority classes monopolize an unnecessarily large portion of the model's capacity, thereby restricting the representation of minority classes. To address this, we propose Capacity Manipulation (CM), which explicitly reserves model capacity for minority classes. Our approach leverages a low-rank decomposition of model parameters and introduces a capacity manipulation loss to allocate appropriate capacity for capturing minority knowledge, thus enhancing minority class representation. Extensive experiments demonstrate that CM consistently and significantly improves the robustness of diffusion models on imbalanced datasets, and when combined with existing methods, further boosts overall performance.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
995. ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
💬 22
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ Token-level Policy Gradients Reshape; Tool-use Large Language Model; Entropy-aware; Reinforcement Learning; Reasoning Model
Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and ignores the particular structure of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training.
To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability in tool-use tasks, revealing that structured, low-entropy tokens are the primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT outperforms strong baselines, surpassing prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks. Code is available at https://anonymous.4open.science/r/ResT_Tool_use_LLM-F11B.
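A minimal PyTorch sketch of the entropy-informed token reweighting idea: per-token predictive entropy determines a weight on the policy-gradient loss, annealed over training so that high-entropy "reasoning" tokens are progressively upweighted. The specific weighting function and schedule used in ResT are not reproduced; the linear ramp below is a placeholder assumption.

```python
# Entropy-weighted token-level policy-gradient loss (illustrative only).
import torch
import torch.nn.functional as F

def reweighted_pg_loss(logits, actions, advantages, progress):
    """logits: (B, T, V); actions: (B, T) token ids; advantages: (B, T);
    progress: training progress in [0, 1]."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (B, T)
    entropy = -(logp.exp() * logp).sum(-1)                            # (B, T)
    norm_ent = entropy / entropy.amax(dim=-1, keepdim=True).clamp_min(1e-8)
    # Early on, weight all tokens equally (structural correctness matters);
    # later, upweight high-entropy reasoning tokens.
    weights = (1.0 - progress) + progress * norm_ent
    return -(weights * advantages * token_logp).mean()

loss = reweighted_pg_loss(torch.randn(2, 8, 100),
                          torch.randint(0, 100, (2, 8)),
                          torch.randn(2, 8), progress=0.5)
print(loss)
```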
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
996. LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion
💬 14
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Live Photo Reference-based Image Restoration Conditional Image Generation Motion Alignment
Live Photo captures both a high-quality key photo and a short video clip to preserve the precious dynamics around the captured moment.
While users may choose alternative frames as the key photo to capture better expressions or timing, these frames often exhibit noticeable quality degradation, as the photo capture ISP pipeline delivers significantly higher image quality than the video pipeline. This quality gap highlights the need for dedicated restoration techniques to enhance the reselected key photo. To this end, we propose LiveMoments, a reference-guided image restoration framework tailored for the reselected key photo in Live Photos. Our method employs a two-branch neural network: a reference branch that extracts structural and textural information from the original high-quality key photo, and a main branch that restores the reselected frame using the guidance provided by the reference branch. Furthermore, we introduce a unified Motion Alignment module that incorporates motion guidance for spatial alignment at both the latent and image levels. Experiments on real and synthetic Live Photos demonstrate that LiveMoments significantly improves perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
997. TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
💬 10
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ temporal search long video understanding reinforcement learning large video language model
Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Many existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose **TimeSearch-R**, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves substantial improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, long-form video understanding benchmarks like VideoMME, MLVU, and LongVideoBench, as well as video reasoning benchmarks such as Video-Holmes, consistently and significantly outperforming other existing temporal search approaches and text-only reasoning models. All the code, models, and data will be released soon.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
998. TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix
💬 16
⭐ 6.00
📍 infrastructure, software libraries, hardware, systems, etc.
🏷️ Large language models inference multi-head latent attention shared prefix
Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While the naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of the absorb implementations prohibits performance benefits from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of attention calculations, while reducing the bandwidth requirements for non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3× and 3.24× on NPU and GPUs, and boosts end-to-end throughput by up to 1.48× in tokens per second, with only a 3% overhead in HBM size.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
999. When MLLMs Meets Compression Distortion: A Coding Paradigm Tailored to MLLMs
💬 18
⭐ 6.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Image Coding Image Compression Multimodal Large Language Models
The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques to efficiently transmit signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity to serve the Human Visual System (HVS) and are ill-suited for MLLMs, in which diverse downstream tasks are jointly considered. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that compression distortion unevenly impacts image features at different levels, leading to varying effects on MLLMs' downstream tasks depending on their feature-level reliance. Motivated by this discovery, we propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. The encoder leverages CLIP's shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure faithful reconstruction of both low-level details and high-level semantic context for robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99% bitrate savings while maintaining the same performance on MLLM tasks, outperforming previous SOTA neural codecs.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1000. Behavior Learning
💬 20
⭐ 6.00
📍 interpretability and explainable AI
🏷️ Utility Maximization Intrinsic Interpretability Identifiability Performance-Interpretability Trade-off Statistical Consistency Counterfactual Prediction Energy-Based Models (EBMs)
Interpretable machine learning is increasingly vital for scientific research, yet the performance-interpretability trade-off, insufficient alignment with scientific theory, and non-identifiability limit its scientific credibility. Grounded in behavioral science, we propose Behavior Learning (BL), a novel general-purpose machine learning framework that unifies predictive performance, intrinsic interpretability, and identifiability for scientifically credible modeling. BL discovers interpretable and identifiable optimization structures from data. It does so by parameterizing a compositional utility function built from intrinsically interpretable modular blocks, which induces a data distribution for prediction and generation. Each block represents and can be written in symbolic form as a utility maximization problem (UMP), a foundational paradigm in behavioral science and a universal framework of optimization. BL supports architectures ranging from a single UMP to hierarchical compositions, the latter modeling hierarchical optimization systems that offer both expressiveness and structural transparency. Its smooth and monotone variant (IBL) guarantees identifiability under mild conditions. Theoretically, we establish the universal approximation property of both BL and IBL, and analyze the M-estimation properties of IBL. Empirically, BL demonstrates strong predictive performance, intrinsic interpretability and scalability to high-dimensional data.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1001. Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting
💬 6
⭐ 6.00
📍 generative models
🏷️ Diffusion Model Probabilistic Time Series Forecasting Conditional Generation
Probabilistic forecasting of multivariate time series is challenging due to non-stationarity, inter-variable dependencies, and distribution shifts. While recent diffusion and flow matching models have shown promise, they often ignore informative priors such as conditional means and covariances. In this work, we propose Conditionally Whitened Generative Models (CW-Gen), a framework that incorporates prior information through conditional whitening. Theoretically, we establish sufficient conditions under which replacing the traditional terminal distribution of diffusion models, namely the standard multivariate normal, with a multivariate normal distribution parameterized by estimators of the conditional mean and covariance improves sample quality. Guided by this analysis, we design a novel Joint Mean-Covariance Estimator (JMCE) that simultaneously learns the conditional mean and sliding-window covariance. Building on JMCE, we introduce Conditionally Whitened Diffusion Models (CW-Diff) and extend them to Conditionally Whitened Flow Matching (CW-Flow). Experiments on five real-world datasets with six state-of-the-art generative models demonstrate that CW-Gen consistently enhances predictive performance, capturing non-stationary dynamics and inter-variable correlations more effectively than prior-free approaches. Empirical results further demonstrate that CW-Gen can effectively mitigate the effects of distribution shift.
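A minimal sketch of the conditional whitening step implied by replacing the standard-normal terminal distribution with an estimated conditional Gaussian: given a conditional mean and covariance, map targets into an approximately standard-normal space and back. Plain sample statistics below stand in for the paper's JMCE estimator; this is not CW-Gen's code.

```python
# Whiten with the Cholesky factor of an estimated covariance, then de-whiten.
import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[2.0, -1.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]], size=5000)

mu = x.mean(axis=0)                    # stand-in for the conditional mean
sigma = np.cov(x, rowvar=False)        # stand-in for the conditional covariance
L = np.linalg.cholesky(sigma)

z = np.linalg.solve(L, (x - mu).T).T   # whiten:    z = L^{-1} (x - mu)
x_rec = z @ L.T + mu                   # de-whiten: x = L z + mu

print(np.round(z.mean(axis=0), 2), np.round(np.cov(z, rowvar=False), 2))
print("reconstruction error:", np.abs(x - x_rec).max())
```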
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1002. TROLL: Trust Regions Improve Reinforcement Learning for Large Language Models
💬 16
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ RL from verifiable rewards Finetuning LLMs Trust Regions
Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs).
Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched.
Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance.
We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints.
The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness.
Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior.
Across mathematical reasoning and code generation tasks, model families, as well as advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
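A minimal sketch of the quantity such a token-level trust region constrains: the KL divergence between the old and new next-token distributions, restricted to the old policy's top-k logits and renormalized. TROLL's differentiable projection that actually enforces the constraint is the paper's contribution and is not reproduced here.

```python
# Per-token KL(old || new) computed on a sparse top-k subset of the vocabulary.
import torch
import torch.nn.functional as F

def sparse_token_kl(old_logits, new_logits, k=64):
    """old_logits, new_logits: (B, T, V). Returns per-token KL values of shape
    (B, T), restricted to the old policy's top-k vocabulary entries."""
    topk = old_logits.topk(k, dim=-1).indices            # (B, T, k)
    old_k = old_logits.gather(-1, topk)
    new_k = new_logits.gather(-1, topk)
    p_old = F.softmax(old_k, dim=-1)                      # renormalized over top-k
    return (p_old * (F.log_softmax(old_k, -1) - F.log_softmax(new_k, -1))).sum(-1)

old = torch.randn(2, 5, 32000)
new = old + 0.1 * torch.randn_like(old)
print(sparse_token_kl(old, new).shape)                    # torch.Size([2, 5])
```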
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1003. Revisiting [CLS] and Patch Token Interaction in Vision Transformers
💬 10
⭐ 6.00
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ representation vision transformer SSL attention specialization architecture interpretability DINO DINOv2 CLIP DEIT
Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite their distinct nature, both token types are processed identically throughout the model.
In this work, we investigate the friction between global and local feature learning under different pre-training strategies by analyzing the interactions between class and patch tokens.
Our analysis reveals that standard normalization layers introduce an implicit differentiation between these token types. Building on this insight, we propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens, particularly within normalization layers and early query-key-value projections.
This targeted specialization leads to significantly improved patch representation quality for dense prediction tasks. Our experiments demonstrate segmentation performance gains of over 2 mIoU points on standard benchmarks, while maintaining strong classification accuracy. The proposed modifications introduce only an 8% increase in parameters, with no additional computational overhead.
Through comprehensive ablations, we provide insights into which architectural components benefit most from specialization and how our approach generalizes across model scales and learning frameworks.
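A minimal PyTorch sketch of one form of the token-type specialization discussed above: separate LayerNorm parameters for the [CLS] token and for patch tokens. Which layers the paper actually specializes (and how it handles the early query-key-value projections) follows the paper, not this sketch.

```python
# LayerNorm with separate affine parameters for [CLS] (index 0) vs patch tokens.
import torch
import torch.nn as nn

class TokenTypeLayerNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm_cls = nn.LayerNorm(dim)     # parameters for the class token
        self.norm_patch = nn.LayerNorm(dim)   # parameters for patch tokens

    def forward(self, tokens):
        # tokens: (B, 1 + N, dim), with the [CLS] token prepended at index 0.
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        return torch.cat([self.norm_cls(cls_tok), self.norm_patch(patches)], dim=1)

x = torch.randn(2, 197, 768)             # ViT-B/16: 1 CLS token + 196 patches
print(TokenTypeLayerNorm(768)(x).shape)  # torch.Size([2, 197, 768])
```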
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1004. Sign-SGD via Parameter-Free Optimization
💬 18
⭐ 6.00
📍 optimization
🏷️ Parameter-free optimization Sign descent Convex optimization Stochastic optimization
Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields approximately $1.5\times$ end-to-end speedup compared to runs with grid-searched stepsizes.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1005. Scaling Generalist Data-Analytic Agents
💬 4
⭐ 6.00
📍 foundation or frontier models, including LLMs
🏷️ Data Analysis LLM Agents Agent Training
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle with the diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art results with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K along with DataMind-7B and DataMind-14B for the community's future research.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1006. Multi-Domain Transferable Graph Gluing for Building Graph Foundation Models
💬 26
⭐ 6.00
📍 learning on graphs and other geometries & topologies
🏷️ Multi-domain graph pre-training graph neural network graph foundation model Riemannian geometry
Multi-domain graph pre-training integrates knowledge from diverse domains to enhance performance in the target domains, which is crucial for building graph foundation models. Despite initial success, existing solutions often fall short of answering a fundamental question: how is knowledge integrated or transferred across domains? This theoretical limitation motivates us to rethink the consistency and transferability between the pre-trained model and target domains. In this paper, we propose a fresh differential geometry perspective, whose core idea is to merge any graph dataset into a unified, smooth Riemannian manifold, enabling a systematic understanding of knowledge integration and transfer. To achieve this, our key contribution is the theoretical establishment of neural manifold gluing, which first characterizes local geometry using an adaptive orthogonal frame and then "glues" the local pieces together into a coherent whole. Building on this theory, we present the GraphGlue framework, which supports batched pre-training with EMA prototyping and provides a transferability measure based on geometric consistency. Extensive experiments demonstrate its superior performance across diverse graph domains. Moreover, we empirically validate GraphGlue's geometric scaling law, showing that larger quantities of data improve model transferability by producing a smoother manifold.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1007. Nasty Adversarial Training: A Probability Sparsity Perspective for Robustness Enhancement
💬 7
⭐ 6.00
📍 alignment, fairness, safety, privacy, and societal considerations
🏷️ adversarial training adversarial robustness
The vulnerability of deep neural networks to adversarial examples poses significant challenges to their reliable deployment. Among existing empirical defenses, adversarial training and robust distillation have proven the most effective. In this paper, we identify a property originally associated with model intellectual property, i.e., probability sparsity induced by nasty training, and demonstrate that it can also provide interpretable improvements to adversarial robustness.
We begin by analyzing how nasty training induces sparse probability distributions and qualitatively explore the spatial metric preferences this sparsity introduces to the model. Building on these insights, we propose a simple yet effective adversarial training method, nasty adversarial training (NAT), which incorporates probability sparsity as a regularization mechanism to boost adversarial robustness. Both theoretical analysis and experimental results validate the effectiveness of NAT, highlighting its potential to enhance the adversarial robustness of deep neural networks in an interpretable manner.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1008. MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
💬 22
⭐ 6.00
📍 generative models
🏷️ Character Animation Diffusion Model Video Generation
Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic 4D motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.
🔗 OpenReview
📄 PDF
🤖 LLM-Analysis
1009. Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation
💬 13
⭐ 6.00
📍 generative models
🏷️ Vectorized Timesteps Flow Matching Temporal Modeling Video Generation
The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods like Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension, all without task-specific training. Meanwhile, it keeps the T2V capability from the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to the vectorized timestep. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1010. Catalog-Native LLM: Speaking Item-ID dialect with Less Entanglement for Recommendation
๐ฌ 11
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Recommender Systems Large Language Models Mixture of Experts
While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Natural-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities.
IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
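A minimal sketch of the kind of split the abstract describes, with one text expert and one item expert selected by token type; the layer sizes and the hard gating rule are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class TwoExpertFFN(nn.Module):
    """FFN block with a text expert and an item expert, routed by token type."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.text_expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.item_expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, hidden, is_item_token):
        # hidden: (B, T, d_model); is_item_token: (B, T) bool mask marking item-ID tokens.
        text_out = self.text_expert(hidden)
        item_out = self.item_expert(hidden)
        mask = is_item_token.unsqueeze(-1)          # route each token to exactly one expert
        return torch.where(mask, item_out, text_out)
```

Routing by token type rather than a learned router keeps item-ID tokens from perturbing the pretrained text pathway, which is one way to limit the entanglement the title alludes to.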
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1011. Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Search;Thinking-with-images;Reinforcement Learning;
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning, spanning tens of steps, and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
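A hedged sketch of the over-turn masking idea: rollouts that merely hit the turn budget are dropped from the policy update rather than penalized. The loss form and tensor shapes below are illustrative assumptions:

```python
import torch

def masked_policy_loss(logprobs, advantages, hit_turn_limit):
    """logprobs, advantages: (N,) per-trajectory; hit_turn_limit: (N,) bool for rollouts that ran out of turns."""
    keep = ~hit_turn_limit                        # exclude over-turn rollouts from the update
    if keep.sum() == 0:
        return logprobs.sum() * 0.0               # nothing usable in this batch
    return -(advantages[keep] * logprobs[keep]).mean()
```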
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1012. Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement
๐ฌ 3
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Neural Networks Long-range dependency
Long-range dependencies are critical for effective graph representation learning, yet most existing datasets focus on small graphs tailored to inductive tasks, offering limited insight into long-range interactions. Current evaluations primarily compare models employing global attention (e.g., graph transformers) with those using local neighborhood aggregation (e.g., message-passing neural networks) without a direct measurement of long-range dependency. In this work, we introduce $\texttt{City-Networks}$, a novel large-scale transductive learning dataset derived from real-world city road networks. This dataset features graphs with over $10^5$ nodes and significantly larger diameters than those in existing benchmarks, naturally embodying long-range information. We annotate the graphs based on local node eccentricities, ensuring that the classification task inherently requires information from distant nodes. Furthermore, we propose a generic measurement based on the Jacobians of neighbors from distant hops, offering a principled quantification of long-range dependencies. Finally, we provide theoretical justifications for both our dataset design and the proposed measurement, particularly by focusing on over-smoothing and influence score dilution, which establishes a robust foundation for further exploration of long-range interactions in graph neural networks.
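A rough sketch of a Jacobian-based influence measurement of the kind proposed: the fraction of the gradient of one node's embedding that is attributable to node features exactly $k$ hops away. The GNN interface and the reduction to a single ratio are assumptions:

```python
import torch

def khop_influence(model, x, edge_index, v, khop_nodes):
    """x: (N, F) node features; v: target node index; khop_nodes: indices of nodes exactly k hops from v."""
    x = x.clone().requires_grad_(True)
    out = model(x, edge_index)                 # (N, D) node embeddings
    out[v].sum().backward()                    # one backward pass gives d out[v] / d x
    grad = x.grad                              # (N, F)
    influence = grad[khop_nodes].abs().sum()
    total = grad.abs().sum().clamp_min(1e-12)
    return (influence / total).item()          # share of influence coming from the k-hop shell
```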
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1013. Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors
๐ฌ 17
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Representation Learning Few-Shot Anomaly Detection Applications of Foundation Models
Few-shot anomaly detection streamlines and simplifies industrial safety inspection. However, limited samples make accurate differentiation between normal and abnormal features challenging, and even more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images.
We observe that the anomaly amount in an image directly correlates with the difference in the learnt embeddings and utilize this to design a few-shot anomaly detector termed FoundAD.
This is done by learning a nonlinear projection operator onto the natural image manifold.
The simple operator acts as an effective tool for anomaly detection to characterize and identify out-of-distribution regions in an image.
Extensive experiments show that our approach supports multi-class detection and achieves competitive performance compared to other approaches, while using a smaller model and offering higher inference efficiency.
Backed by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection. Our code will be made public.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1014. QeRL: Beyond Efficiency - Quantization-enhanced Reinforcement Learning for LLMs
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Quantization RL LLMs
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise throughout training. Experiments demonstrate that QeRL delivers around a 1.2×-1.5× speedup compared to BF16 LoRA in end-to-end RL training while drastically reducing memory usage, and a 1.5×-2.0× speedup compared to QLoRA. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1015. Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation
๐ฌ 14
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ CAPTCHA multimodal models spatial reasoning robustness evaluation benchmark
Online services rely on CAPTCHAs as a first line of defense against automated abuse, yet recent advances in multi-modal large language models (MLLMs) have eroded the effectiveness of conventional designs that focus on text recognition or 2D image understanding. To address this challenge, we present **Spatial CAPTCHA**, a novel human-verification framework that leverages fundamental differences in spatial reasoning between humans and MLLMs. Unlike existing CAPTCHAs that rely on low-level perception tasks vulnerable to modern AI, Spatial CAPTCHA generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation, skills intuitive for humans but difficult for current AI systems. The system employs a procedural generation pipeline with constraint-based difficulty control, automated correctness verification, and human-in-the-loop validation to ensure scalability, robustness, and adaptability. Evaluation on a corresponding benchmark, **Spatial-CAPTCHA-Bench**, demonstrates that humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0\% Pass@1 accuracy. Result comparison with Google reCAPTCHA further confirms the effectiveness of Spatial CAPTCHA as both a security mechanism and a diagnostic tool for spatial reasoning in AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1016. Sample Reward Soups: Query-efficient Multi-Reward Guidance for Text-to-Image Diffusion Models
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion model Text to Image Sample Reward Soups Training-free Black-box alignment
Recent advances in inference-time alignment of diffusion models have shown reduced susceptibility to reward over-optimization. However, when aligning with multiple black-box reward functions, the number of required queries grows exponentially with the number of reward functions, making the alignment process highly inefficient. To address the challenge, we propose the first inference-time soup strategy, named Sample Reward Soups (SRSoup), for Pareto-optimal sampling across the entire space of preferences. Specifically, at each denoising step, we independently steer multiple denoising distributions using reward-guided search gradients (one for each reward function) and then linearly interpolate their search gradients. This design is effective because sample rewards can be shared when two denoising distributions are close, particularly during the early stages of the denoising process. As a result, SRSoup significantly reduces the number of queries required in the early stages without sacrificing performance. Extensive experiments demonstrate the effectiveness of SRSoup in aligning T2I models with diverse reward functions, establishing a practical and scalable solution.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1017. Squeeze the Soaked Sponge: Efficient Off-policy RFT for Large Language Model
๐ฌ 26
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Finetuning Large Language Model Reasoning
Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs), yet most existing Reinforcement Finetuning (RFT) methods are inherently \textit{on-policy} RL, failing to reuse historical data and thus preventing efficient scaling. In this work, we explore the potential of \textit{off-policy} RL to leverage historical data for rollout-efficient RFT. Specifically, we propose \textbf{Re}incarnating \textbf{Mix}-policy Proximal Policy Gradient (\textbf{ReMix}), which enables on-policy RFT methods to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio that utilizes the data from both current and past policies for efficient training; (2) a KL-Convex policy constraint that combines the KL constraints on the base and precedent models to balance stability and flexibility; (3) Policy reincarnation, which replaces the base model with the mix-policy RFT model midway through training and restarts on-policy training, achieving a seamless transition from early efficiency to steady convergence. In our experiments, we train a series of ReMix models based on PPO and GRPO from 1.5B and 7B base models. On five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500), ReMix achieves an average Pass@1 accuracy of \textbf{52.10\%} (with \textbf{0.079M rollouts}) and \textbf{64.39\%} (with \textbf{0.011M rollouts}) on 1.5B and 7B models, respectively. Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over \textbf{30x to 450x reduction in training cost in terms of rollout data volume}, demonstrating superior training efficiency. Additionally, our multifaceted analysis reveals insightful findings, including the implicit preference for shorter responses of off-policy RFT, the collapse mode of self-reflection under severe off-policyness, etc.
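As a small illustration of component (2), a KL-convex constraint can be realized as a convex combination of KL penalties toward the base model and toward the precedent policy; the weight and the distribution-level KL estimates below are assumptions, not ReMix's exact formulation:

```python
import torch
import torch.nn.functional as F

def kl_convex_penalty(logits, base_logits, prec_logits, alpha=0.5):
    """All logits: (B, T, V). Returns a scalar penalty combining KL(pi || base) and KL(pi || precedent)."""
    logp = F.log_softmax(logits, dim=-1)
    kl_base = F.kl_div(F.log_softmax(base_logits, dim=-1), logp, log_target=True, reduction="batchmean")
    kl_prec = F.kl_div(F.log_softmax(prec_logits, dim=-1), logp, log_target=True, reduction="batchmean")
    return alpha * kl_base + (1.0 - alpha) * kl_prec   # convex mix of the two constraints
```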
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1018. Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset
๐ฌ 6
โญ 6.00
๐ generative models
๐ท๏ธ Video Generation Multimodal Generation
Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce \textbf{Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset}, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1019. A Memory-Efficient Hierarchical Algorithm for Large-scale Optimal Transport Problems
๐ฌ 16
โญ 6.00
๐ optimization
๐ท๏ธ optimal transport linear programming multiscale framework first-order methods
In this paper we propose a memory-efficient hierarchical algorithm for solving large-scale optimal transport (OT) problems with squared Euclidean cost.
The core of our proposed approach is the combination of multiscale hierarchical representation of the OT problem and a GPU-implemented Primal-Dual Hybrid Gradient (PDHG) method.
Moreover, an active pruning technique is applied to further reduce computational complexity.
Theoretically, we establish a scale-independent iteration-complexity upper bound for the refinement phase, which is consistent with our numerical observations.
Numerically, experiments on image dataset DOTmark and point cloud dataset ModelNet10 demonstrate that the proposed algorithm effectively addresses the memory and scalability bottlenecks.
Compared to state-of-the-art baselines, our method demonstrates significant advantages: for images with $n=1024^2$ pixels, it achieves an $8.9\times$ speedup and $70.5$\% reduction in memory usage under comparable accuracy; for 3D point clouds at scale $n=2^{18}$, it achieves a $1.84\times$ speedup and an $83.2$\% reduction in memory usage with $24.9$\% lower transport cost.
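For reference, a generic PDHG iteration for the discrete OT linear program $\min_{P \ge 0} \langle C, P\rangle$ subject to $P\mathbf{1}=a$ and $P^{\top}\mathbf{1}=b$ looks as follows; this is a textbook sketch of the solver family the paper GPU-implements, not its exact multiscale scheme, and the step sizes $\tau, \sigma$ must satisfy the usual PDHG condition:

```latex
\begin{align*}
P^{k+1} &= \max\bigl(P^{k} - \tau\,(C - u^{k}\mathbf{1}^{\top} - \mathbf{1}(v^{k})^{\top}),\, 0\bigr), \\
\bar{P}^{k+1} &= 2P^{k+1} - P^{k}, \\
u^{k+1} &= u^{k} + \sigma\,(a - \bar{P}^{k+1}\mathbf{1}), \qquad
v^{k+1} = v^{k} + \sigma\,\bigl(b - (\bar{P}^{k+1})^{\top}\mathbf{1}\bigr).
\end{align*}
```

The hierarchical part of the method then works coarse-to-fine over this problem, warm-starting each refinement level and pruning inactive entries of $P$.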
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1020. Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild
๐ฌ 11
โญ 6.00
๐ generative models
๐ท๏ธ single-view 3D human reconstruction image-to-3D multi-view diffusion model alignment post training
Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing a direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1021. DeepAFL: Deep Analytic Federated Learning
๐ฌ 20
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Analytic Learning Federated Learning Continual Learning Lifelong Learning Incremental Learning Representation Learning Data Heterogeneity
Federated Learning (FL) is a popular distributed learning paradigm for breaking down data silos. Traditional FL approaches largely rely on gradient-based updates, facing significant issues with heterogeneity, scalability, convergence, and overhead. Recently, some analytic-learning-based work has attempted to handle these issues by eliminating gradient-based updates via analytical (i.e., closed-form) solutions. Despite achieving superior invariance to data heterogeneity, these approaches are fundamentally limited by their single-layer linear model with a frozen pre-trained backbone. As a result, they can only achieve suboptimal performance due to their lack of representation learning capabilities. In this paper, to endow analytic models with representation learning while preserving the ideal invariance to data heterogeneity for FL, we propose our Deep Analytic Federated Learning approach, named DeepAFL. Drawing inspiration from the great success of ResNet in gradient-based learning, we design gradient-free residual blocks in our DeepAFL with analytical solutions. We further introduce an efficient layer-wise protocol for training our deep analytic models layer by layer in FL through least squares. Both theoretical analyses and empirical evaluations validate our DeepAFL's superior performance with its dual advantages in heterogeneity invariance and representation learning, outperforming state-of-the-art baselines by up to 5.68%-8.42% across three benchmark datasets. The code will be open-sourced upon acceptance of this paper.
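A hedged sketch of why least-squares layers suit federation: each client shares only Gram-type statistics, and the server recovers the closed-form solution exactly, independent of client order. The function names and the regularized form are assumptions, not DeepAFL's protocol:

```python
import numpy as np

def client_stats(H, Y):
    """H: (n_k, d) client features, Y: (n_k, c) one-hot labels; only these statistics leave the client."""
    return H.T @ H, H.T @ Y                          # (d, d) Gram matrix and (d, c) cross-term

def server_solve(stats, reg=1e-3):
    """Aggregate client statistics and solve the ridge regression in closed form."""
    A = sum(s[0] for s in stats)
    B = sum(s[1] for s in stats)
    d = A.shape[0]
    return np.linalg.solve(A + reg * np.eye(d), B)   # exact and independent of client order
```

Stacking such analytically trained blocks layer by layer is what restores representation learning without reintroducing gradient-based updates.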
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1022. UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking
๐ฌ 10
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multi-object tracking graph representation learning differentiable optimization end-to-end learning identity preservation spatio-temporal modeling flow networks unified loss functions video understanding deep learning
We present UniTrack, a plug-and-play graph-theoretic loss function designed to significantly enhance multi-object tracking (MOT) performance by directly optimizing tracking-specific objectives through unified differentiable learning. Unlike prior graph-based MOT methods that redesign tracking architectures, UniTrack provides a universal training objective that integrates detection accuracy, identity preservation, and spatiotemporal consistency into a single end-to-end trainable loss function, enabling seamless integration with existing MOT systems without architectural modifications. Through differentiable graph representation learning, UniTrack enables networks to learn holistic representations of motion continuity and identity relationships across frames. We validate UniTrack across diverse tracking models and multiple challenging benchmarks, demonstrating consistent improvements across all tested architectures and datasets including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE. Extensive evaluations show up to 53\% reduction in identity switches and 12\% IDF1 improvements across challenging benchmarks, with GTR achieving peak performance gains of 9.7\% MOTA on SportsMOT.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1023. A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments
๐ฌ 19
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Agents Agentic AI Behavior Choices Alignment Safety Benchmark
Environments built for people are increasingly operated by a new class of economic actors: LLM-powered software agents making decisions on our behalf. These decisions range from our purchases to travel plans to medical treatment selection. Current evaluations of these agents largely focus on task competence, but we argue for a deeper assessment: how these agents choose when faced with realistic decisions. We introduce ABxLab, a framework for systematically probing agentic choice through controlled manipulations of option attributes and persuasive cues. We apply this to a realistic web-based shopping environment, where we vary prices, ratings, and psychological nudges, all of which are factors long known to shape human choice. We find that agent decisions shift predictably and substantially in response, revealing that agents are strongly biased choosers even without being subject to the cognitive constraints that shape human biases. This susceptibility reveals both risk and opportunity: risk, because agentic consumers may inherit and amplify human biases; opportunity, because consumer choice provides a powerful testbed for a behavioral science of AI agents, just as it has for the study of human behavior. We release our framework as an open benchmark for rigorous, scalable evaluation of agent decision-making.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1024. LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs RL Long-form generation
Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on ''teaching'', which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1025. LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals
๐ฌ 14
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Hallucination detection Retrieval-augmented generation Reliability of LLM
Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge, and several approaches have attempted to quantify these signals for hallucination detection. However, existing methods require extensive hyperparameter tuning, limiting their generalizability. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context-knowledge signals: external context utilization is quantified via distributional distance, while internal knowledge utilization is measured by tracking how predicted tokens evolve across transformer layers. We further introduce a framework for statistically validating these measurements. Experiments on common RAG hallucination benchmarks and four open-source LLMs show that LUMINA achieves consistently high AUROC and AUPRC scores, outperforming prior utilization-based methods by up to +13% AUROC on HalluRAG. Moreover, LUMINA remains robust under relaxed assumptions about retrieval quality and model matching, offering both effectiveness and practicality.
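A rough sketch of one way the external-context signal could be quantified: a distributional distance between the model's next-token distributions with and without the retrieved documents; the choice of Jensen-Shannon divergence here is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def context_utilization(logits_with_ctx, logits_no_ctx):
    """Both inputs: (V,) next-token logits. Returns the Jensen-Shannon divergence between the two distributions."""
    p = F.softmax(logits_with_ctx, dim=-1)
    q = F.softmax(logits_no_ctx, dim=-1)
    m = 0.5 * (p + q)
    js = 0.5 * (p * (p / m).log()).sum() + 0.5 * (q * (q / m).log()).sum()
    return js.item()   # larger values suggest the context is actually shaping the prediction
```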
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1026. Tackling Time-Series Forecasting Generalization via Mitigating Concept Drift
๐ฌ 7
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time-Series Forecasting Distribution Shift Concept Drift
Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is important for time-series forecasting models to handle potential distribution shifts over time. In this paper, we initially identify two types of distribution shifts in time series: concept drift and temporal shift. We acknowledge that while existing studies primarily focus on addressing temporal shift issues in time series forecasting, designing proper concept drift methods for time series forecasting has received comparatively less attention.
Motivated by the need to address potential concept drift, and noting that conventional invariant-learning-based concept drift methods face certain challenges in time-series forecasting, we propose a soft attention mechanism that finds invariant patterns in both the lookback and horizon time series. Additionally, we emphasize the critical importance of mitigating temporal shifts as a prerequisite for addressing concept drift. In this context, we introduce ShifTS, a method-agnostic framework designed to tackle temporal shift first and then concept drift within a unified approach. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of agnostic models across multiple datasets, and outperforming existing concept drift, temporal shift, and combined baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1027. Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ MLLM Self-improvement Unification
Although unified MLLMs aim to unify generation and understanding, they are considered to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm the widespread non-unification of MLLMs, and demonstrate that it indeed stems from weak generation rather than misunderstanding. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework, which mitigates internal gaps by leveraging stronger understanding to guide weaker generation without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt-aligned. To explain this effect, we extend learning dynamics theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self-improvement: progressively enhanced understanding and generation revisit samples underutilized by pre-trained MLLMs, dynamically expanding post-training data and leading to improved performance and unification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1028. VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip
๐ฌ 10
โญ 6.00
๐ generative models
๐ท๏ธ Image Generation Diffusion Models Negative Guidance
We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step (1-8 steps) diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only a small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo and Flux Schnell, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results on our proposed dataset NegGenBench show that VSF significantly improves negative prompt adherence (reaching a negative score of 0.420 in the quality setting and 0.545 in the strong setting) compared to prior methods in few-step models (which scored 0.320-0.380) and even CFG in non-few-step models (which scored 0.300), while maintaining competitive image quality and positive prompt adherence. Our method can also serve as an alternative to a generate-then-edit pipeline, while having a much faster runtime. Code, the ComfyUI node, and the dataset will be released. Generated videos are included in the Supplementary Material.
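A minimal sketch of the core operation: inside an attention call, value vectors coming from negative-prompt tokens get their sign flipped so that attending to them pushes the output away from that content. The interface and the optional scale are illustrative assumptions:

```python
import torch

def attention_with_vsf(q, k, v, neg_token_mask, neg_scale=1.0):
    """q: (B, Tq, D); k, v: (B, Tk, D); neg_token_mask: (Tk,) bool marking negative-prompt tokens."""
    v = v.clone()
    v[:, neg_token_mask, :] *= -neg_scale                 # flip (and optionally rescale) negative values
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                        # standard attention otherwise
```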
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1029. UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models
๐ฌ 22
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ diffusion language model; long context LLM
Diffusion LLMs have attracted growing interest, with plenty of recent work emphasizing their great potential in various downstream tasks; yet the long-context behavior of diffusion LLMs remains largely uncharted. We present a case study of post-training techniques for extending the context window of diffusion LLMs (i.e., LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post-training and analyze their impact on optimization stability and long-range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K-token context window that, in our empirical evaluation on long-context tasks, significantly outperforms training-free baselines. Our experimental results highlight the special positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K-scale context via efficient post-training.
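The abstract does not spell out the exact RoPE modification; as background, the lever such extensions adjust is the rotary base frequency. A standard NTK-aware rescaling, shown purely as an illustrative assumption and not as the paper's method, looks like this:

```python
import torch

def rope_frequencies(head_dim, base=10000.0, scale=8.0):
    """Rescale the rotary base so positions up to `scale` times the original window remain well-behaved."""
    adjusted_base = base * scale ** (head_dim / (head_dim - 2))   # NTK-aware base rescaling
    idx = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return 1.0 / (adjusted_base ** (idx / head_dim))              # per-pair inverse frequencies
```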
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1030. On Smoothness Bounds for Non-Clairvoyant Scheduling with Predictions
๐ฌ 17
โญ 6.00
๐ optimization
๐ท๏ธ Algorithms with predictions Smoothness Scheduling
Algorithms with predictions leverage predictions for unknown inputs in online decision-making. These algorithms are analyzed by consistency, i.e., the competitive ratio under perfect predictions, and robustness, i.e., the competitive ratio under worst-case predictions. Smooth degradation of performance as the prediction error increases is also desirable. This paper refines the notion of smoothness, a function of the prediction error, defined as the competitive ratio over the problem instances where predictions are guaranteed to provide additional information.
With our refined smoothness metric, we establish smoothness bounds for a few scheduling problems, including online total completion time minimization and makespan minimization. For a single machine to minimize the total completion time, we show a lower bound of $\eta$ and an $\eta^2$-smooth algorithm, where $\eta$ is the prediction error ($\eta \geq 1$); the bound holds for small errors. For parallel identical machines to minimize the makespan, we show a lower bound of $2 - O(\eta^{-2})$ and present an $O(\eta^2)$-smooth algorithm for small errors. Both bounds are tighter than the existing ones. For uniformly-related machines to minimize the makespan, we show a tight lower bound of $\lceil \log \eta \rceil$, matched by an $O(\log \eta)$-smooth algorithm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1031. Boolean Satisfiability via Imitation Learning
๐ฌ 14
โญ 6.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Boolean Satisfiability Imitation Learning Autoregressive Modeling Branching Heuristics
We propose ImitSAT, a branching policy for conflict-driven clause learning (CDCL) solvers based on imitation learning for the Boolean satisfiability problem (SAT). Unlike previous methods that predict instance-level signals to improve CDCL branching indirectly, or rely on reinforcement learning and insufficient CDCL information to enhance branching, ImitSAT learns from an expert KeyTrace that collapses a full run into the sequence of surviving decisions. Replaying a KeyTrace on the same instance is nearly conflict-free, providing dense decision-level supervision and directly reducing propagations, the dominant contributor to wall-clock time. This prefix-conditioned supervision enables ImitSAT to reproduce high-quality branches without exploration, yielding faster convergence, stable training, and seamless integration into CDCL. Extensive experiments demonstrate that ImitSAT reduces propagation counts and runtime, outperforming state-of-the-art learned approaches. We will release the code, trained model, and CDCL integration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1032. FedMC: Federated Manifold Calibration
๐ฌ 19
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Learning Distribution Calibrations Geometric Knowledge
Data heterogeneity in Federated Learning (FL) leads to significant bias in local training. While recent efforts to introduce distributional statistics as priors have shown progress, they universally rely on a flawed global linearity assumption, failing to capture the nonlinear manifold structures prevalent in real-world data. This model-reality mismatch causes the calibration process to generate out-of-distribution (OOD) samples, which fundamentally misleads the model. To address this, we introduce a paradigm shift. We propose Federated Manifold Calibration (FedMC), a novel framework that learns and leverages the local, nonlinear geometry of data. FedMC employs local kernel PCA on the client side to learn fine-grained local geometries, and constructs a global "geometry dictionary" on the server side to aggregate and distribute this knowledge. Clients then utilize this dictionary to perform context-aware, on-manifold calibration. We validate our proposed method by integrating it with a wide range of existing FL algorithms. Experimental results show that by explicitly modeling nonlinear manifolds, FedMC consistently and significantly enhances the performance of these state-of-the-art methods across multiple benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1033. Conjuring Semantic Similarity
๐ฌ 9
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Meaning Representation Semantic Similarity Diffusion Model
The semantic similarity between sample expressions measures the distance between their latent `meaning'. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or 'conjure.' We show that by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
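In symbols, the quantity being estimated is the Jeffreys divergence between the two prompt-conditioned reverse-time SDEs; for score-based SDEs sharing the same diffusion coefficient, each KL term reduces to an integral of score differences that Monte-Carlo sampling over denoising trajectories can estimate. The constants and drift form below are a sketch under standard assumptions, not the paper's derivation:

```latex
J(c_1, c_2) = \mathrm{KL}\bigl(p_{c_1}\,\|\,p_{c_2}\bigr) + \mathrm{KL}\bigl(p_{c_2}\,\|\,p_{c_1}\bigr),
\qquad
\mathrm{KL}\bigl(p_{c_1}\,\|\,p_{c_2}\bigr) \approx \tfrac{1}{2}\int_{0}^{T} g(t)^2\,
\mathbb{E}_{x_t \sim p_{c_1}}\bigl[\lVert s_\theta(x_t, t, c_1) - s_\theta(x_t, t, c_2)\rVert^2\bigr]\,dt.
```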
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1034. Reducing information dependency does not cause training data privacy. Adversarially non-robust features do.
๐ฌ 25
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Privacy model inversion adversarial examples memorization
In this paper, we show that the prevailing view that information dependency (including rote memorization) drives training data exposure to image reconstruction attacks is incomplete. We find that extensive exposure can persist without rote memorization, driven instead by a tunable connection to adversarial robustness. We begin by presenting three surprising results: (1) recent defenses that inhibit reconstruction by Model Inversion Attacks (MIAs), which evaluate leakage under an idealized attacker, do *not* reduce standard measures of information dependency (HSIC); (2) models that maximally memorize their training datasets remain robust to MIA reconstruction; and (3) models trained without seeing 97% of the training pixels, where recent information-theoretic bounds give arbitrarily strong privacy guarantees under standard assumptions, can still be devastatingly reconstructed by MIA. To explain these findings, we provide causal evidence that privacy under MIA arises from what the adversarial examples literature calls ``non-robust'' features (generalizable but imperceptible and unstable features). We further show that recent MIA defenses obtain their privacy improvements by unintentionally shifting models toward such features. We leverage this mechanism to introduce Anti-Adversarial Training (AT-AT), a training regime that intentionally learns non-robust features to obtain both superior reconstruction defense and higher accuracy than state-of-the-art defenses. Our results revise the prevailing understanding of training data exposure and reveal a new privacy-robustness tradeoff.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1035. Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Text-to-Image Generation Reasoning Benchmark
Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred; these two requirements correspond to two core capabilities: \textbf{\textit{composition}} and \textbf{\textit{reasoning}}. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in evaluation. They not only fail to provide comprehensive coverage across and within both capabilities, but also largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose \textbf{\textsc{T2I-CoReBench}}, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (\textit{instance}, \textit{attribute}, and \textit{relation}) and reasoning around the philosophical framework of inference (\textit{deductive}, \textit{inductive}, and \textit{abductive}), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent real-world complexities, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we also pair each evaluation prompt with a checklist that specifies individual \textit{yes/no} questions to assess each intended element independently. In total, our benchmark comprises $1,080$ challenging prompts and around $13,500$ checklist questions. Experiments across 28 current T2I models reveal that their composition capability still remains limited in highly compositional scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1036. On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
๐ฌ 15
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Supervised Fine-Tuning Large Language Model Reinforcement Learning
In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model compared to RL. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of that token. With just a single-line change, the method outperforms standard SFT on multiple difficult benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, DFT achieves competitive results in offline RL settings, and further boosts the effectiveness of subsequent RL training, providing an effective yet streamlined alternative. The experiments further demonstrate that DFT not only strengthens SFT performance but also consistently improves the effectiveness of subsequent RL training. By bridging theoretical insights with practical solutions, this work advances the state of SFT. The source code will be publicly released.
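A hedged sketch of the described single-line change: the per-token cross-entropy is reweighted by that token's probability, with the weight treated as a constant (stop-gradient). The normalization details are assumptions:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, targets, ignore_index=-100):
    """logits: (B, T, V); targets: (B, T) with ignore_index for padding."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.clamp_min(0).unsqueeze(-1)).squeeze(-1)   # (B, T) log p(token)
    mask = (targets != ignore_index).float()
    weight = tok_logp.detach().exp()                 # p(token), no gradient flows through the weight
    return -(weight * tok_logp * mask).sum() / mask.sum().clamp_min(1.0)
```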
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1037. Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
๐ฌ 8
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computational photography
Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects. Code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1038. Retain and Adapt: Auto-Balanced Model Editing for Open-Vocabulary Object Detection under Domain Shifts
๐ฌ 19
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Open-Vocabulary Object Detection Model Editing Continual Learning Knowledge Injection Few-Shot Learning Catastrophic Forgetting
Recent advances in Open Vocabulary Object Detection (OVOD) have shown strong performance on standard benchmarks, but performance drops sharply under out-of-distribution (OOD) shifts. Continual learning offers a potential remedy by sequentially integrating new tasks, yet existing methods often struggle to balance retaining the pre-trained model capabilities with adapting to new tasks, and usually require retraining under specific task orders. To address these limitations, we observe that model editing naturally lends itself to this setting, as it enables efficient knowledge injection while retaining prior capabilities. Building on this insight, we introduce $\textbf{A}$utomatically $\textbf{B}$alanced $\textbf{M}$odel $\textbf{E}$diting ($\textbf{ABME}$), which injects new task knowledge into powerful OVOD models while preserving the model's original abilities. We first store compact key-value representations with storage cost independent of task volume. Then we leverage the stored KV matrices to automatically balance new and old knowledge across varying learning scenarios, supporting order-agnostic task insertion or removal without additional retraining. Experiments show that ABME consistently achieves a better trade-off between maintaining pre-trained performance and adapting to diverse OOD tasks compared to existing continual learning approaches for open-vocabulary object detection, and generalizes seamlessly across different models and task scales.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1039. CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multi-modal Large Language Agent Medical Visual Question Answering Visually Grounded Reasoning Reinforcement Learning with Verifiable Reward
Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce **CARE**, advancing **C**linical **A**ccountability in multi-modal medical **R**easoning with an **E**vidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our **CARE-Flow** (coordinator-free) improves average accuracy by **10.9%** over the same-size (10B) state-of-the-art (SOTA). With dynamic planning and answer review, our **CARE-Coord** yields a further gain, outperforming the heavily pre-trained SOTA by **5.2%**. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1040. MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Anomaly detection Zero-shot anomaly detection Memory retrieval CLIP
Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose the Memory-Retrieval Anomaly Detection (MRAD) method, a unified framework that replaces parametric fitting with direct memory retrieval. The training-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Based on MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomalous samples; (ii) MRAD-CLIP injects the normal and anomalous region priors from MRAD-FT as dynamic biases into CLIP's learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both training-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. Code will be released.
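A minimal sketch of retrieval-based scoring: patch features are matched against a stored bank of (feature, label) pairs, and the anomaly score is a similarity-weighted vote over the nearest entries. The value of k and the softmax weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def retrieval_anomaly_score(patch_feats, mem_feats, mem_labels, k=5):
    """patch_feats: (P, D); mem_feats: (M, D); mem_labels: (M,) with 1 = anomalous, 0 = normal."""
    sims = F.normalize(patch_feats, dim=-1) @ F.normalize(mem_feats, dim=-1).T   # (P, M) cosine similarities
    topk = sims.topk(k, dim=-1)
    weights = torch.softmax(topk.values, dim=-1)                                  # (P, k) retrieval weights
    labels = mem_labels.float()[topk.indices]                                     # (P, k) retrieved labels
    return (weights * labels).sum(-1)                                             # per-patch anomaly score
```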
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1041. Object-Centric Refinement for Enhanced Zero-Shot Segmentation
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Zero-Shot Learning Vision-Language models Semantic Segmentation Computer Vision
Zero-shot semantic segmentation aims to recognize, pixel-wise, unseen categories without annotated masks, typically by leveraging vision-language models such as CLIP. However, the patch representations obtained by CLIP's vision encoder lack object-centric structure, making it difficult to localize coherent semantic regions.
This hinders the performance of the segmentation decoder, especially for unseen categories. To mitigate this issue, we propose object-centric zero-shot segmentation (OC-ZSS) that enhances patch representations using object-level information.
To extract object features for patch refinement, we introduce self-supervision-guided object prompts into the encoder. These prompts attend to coarse object regions using attention masks derived from unsupervised clustering of features from a pretrained self-supervised~(SSL) model. Although these prompts offer a structured initialization of the object-level context, the extracted features remain coarse due to the unsupervised nature of clustering. To further refine the object features and effectively enrich patch representations, we develop a dual-stage Object Refinement Attention (ORA) module that iteratively updates both object and patch features through cross-attention. Last, to make the refinement more robust and sensitive to objects of varying spatial scales, we incorporate a lightweight granular attention mechanism that operates over multiple receptive fields. OC-ZSS achieves state-of-the-art performance on standard zero-shot segmentation benchmarks across inductive, transductive, and cross-domain settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1042. Causal Discovery in the Wild: A Voting-Theoretic Ensemble Approach
๐ฌ 19
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Causal Discovery Ensemble Learning
Causal discovery is a critical yet persistently challenging task across scientific domains. Despite years of significant algorithmic advances, existing methods still struggle with inconsistent outcomes due to reliance on untestable assumptions, sensitivity to data perturbations, and optimization constraints. To this end, ensemble-based causal discovery has been actively pursued, aiming to aggregate multiple structural predictions for increased stability and uncertainty estimation. However, current aggregation methods are largely heuristic, lacking theoretical guarantees and guidance on how ensemble design choices affect performance. This work addresses these fundamental limitations. We introduce a principled voting-based framework for structural ensembling, establishing conditions under which the aggregated structure recovers the true causal graph. Our analysis yields a theoretically justified weighted voting mechanism that informs optimal choices regarding the number, competency, and diversity of causal discovery experts in the ensemble. Extensive experiments on synthetic and real-world datasets verify the robustness and effectiveness of our approach, offering a rigorous alternative to existing heuristic ensemble methods.
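A hedged sketch of weighted structural voting: each expert's estimated adjacency matrix votes with a competency weight, and edges whose weighted vote share clears a threshold enter the aggregate graph. The weights and threshold here are placeholders; the paper derives the theoretically justified choices:

```python
import numpy as np

def weighted_edge_vote(adjacencies, weights, threshold=0.5):
    """adjacencies: list of (d, d) binary matrices from individual experts; weights: per-expert competencies."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                         # normalize competencies to vote shares
    votes = sum(wi * A for wi, A in zip(w, adjacencies))    # (d, d) weighted vote share per edge
    return (votes >= threshold).astype(int)                 # aggregated causal graph
```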
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1043. Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow
๐ฌ 17
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Scalable exploration high-dimensional continuous control
Controlling high-dimensional systems in biological and robotic applications is challenging due to expansive state-action spaces, where effective exploration is critical. Commonly used exploration strategies in reinforcement learning are largely undirected with sharp degradation as action dimensionality grows. Many existing methods resort to dimensionality reduction, which constrains policy expressiveness and forfeits system flexibility. We introduce Q-guided Flow Exploration (Qflex), a scalable reinforcement learning method that conducts exploration directly in the native high-dimensional action space. During training, Qflex traverses actions from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise. Our proposed method substantially outperforms representative online reinforcement learning baselines across diverse high-dimensional continuous-control benchmarks. Qflex also successfully controls a full-body human musculoskeletal model to perform agile, complex movements, demonstrating superior scalability and sample efficiency in very high-dimensional settings. Our results indicate that value-guided flows offer a principled and practical route to exploration at scale.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1044. Multiplayer Nash Preference Optimization
๐ฌ 21
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Preference Optimization RLHF
Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley–Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures.
In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model.
Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1045. Convex Efficient Coding
๐ฌ 15
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Neuroscience Representation Identifiability
Why do neurons encode information the way they do? Normative answers to this question model neural activity as the solution to an optimisation problem; for example, the celebrated efficient coding hypothesis frames neural activity as the optimal encoding of information under efficiency constraints. Successful normative theories have varied dramatically in complexity, from simple linear models (Atick & Redlich, 1990) to complex deep neural networks (Lindsay, 2021). What complex models gain in flexibility, they lose in tractability and often understandability. Here, we split the difference by constructing a set of tractable but flexible normative representational theories. Instead of optimising the neural activities directly, we follow (Sengupta et al. 2018) and optimise the representational similarity, a matrix formed from the dot products of each pair of neural responses. Using this, we show that a large family of interesting optimisation problems are convex. This includes problems corresponding to linear and some non-linear neural networks, and problems from the literature not previously recognised as convex, such as modified versions of semi-nonnegative matrix factorisation or nonnegative sparse coding. We put these findings to work in two ways. First, we extend previous results on modularity and mixed selectivity in neural activity; in so doing we provide the first necessary and sufficient identifiability result for a form of semi-nonnegative matrix factorisation. Second, we seek to understand the meaningfulness of single neural tuning curves as compared to neural representations. In particular, we derive an identifiability result stating that, for an optimal representational similarity matrix, if neural tunings are 'different enough', then they are uniquely linked to the optimal representational similarity, partially justifying the use of single neuron tuning analysis in neuroscience. In sum, we identify an interesting space of convex problems, and use that to derive neural coding results.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1046. DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ deep research generative search engines NLP audit framework sociotechnical evaluation large language models
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80\% across systems.
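As an illustration of how a factual-support matrix can feed one of the audit dimensions, the sketch below computes the fraction of answer statements unsupported by any of their own cited sources; the threshold and matrix construction are hypothetical, not DeepTRACE's exact definitions:

```python
import numpy as np

def unsupported_fraction(support_matrix, cited, support_threshold=0.5):
    """Fraction of statements not supported by any of their cited sources.

    support_matrix: (n_statements, n_sources) scores, e.g., from an LLM judge,
                    where entry (i, j) rates how well source j supports statement i.
    cited:          (n_statements, n_sources) 0/1 citation matrix.
    """
    support_matrix = np.asarray(support_matrix, dtype=float)
    cited = np.asarray(cited, dtype=float)
    # A statement counts as supported if at least one cited source clears the threshold.
    supported = ((support_matrix >= support_threshold) & (cited > 0)).any(axis=1)
    return 1.0 - supported.mean()

# Toy example: statement 0 is backed by its citation, statement 1 is not.
scores = [[0.9, 0.1], [0.2, 0.8]]
cites  = [[1,   0  ], [1,   0  ]]
print(unsupported_fraction(scores, cites))  # 0.5
```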
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1047. Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
๐ฌ 37
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Image Editing Image Generation Unified Multimodal Model Multimodal
In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce *Draw-In-Mind* (DIM), a dataset comprising two complementary subsets: (**i**) DIM-T2I, containing 14M long-context image–text pairs to enhance complex instruction comprehension; and (**ii**) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1048. Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
๐ฌ 13
โญ 6.00
๐ reinforcement learning
๐ท๏ธ DeepResearch Reasoning agentic reasoning
Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a ~5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category while closely rivaling proprietary closed systems, and also demonstrates strong performance on general reasoning benchmarks: HLE, AIME-25, GPQA-Diamond, and MedQA.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1049. Diffusion Bridge Variational Inference for Deep Gaussian Processes
๐ฌ 14
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Deep Gaussian Processes Diffusion Bridge Variational Inference
Deep Gaussian processes (DGPs) enable expressive hierarchical Bayesian modeling but pose substantial challenges for posterior inference, especially over inducing variables. Denoising diffusion variational inference (DDVI) addresses this by modeling the posterior as a time-reversed diffusion from a simple Gaussian prior. However, DDVI's fixed unconditional starting distribution remains far from the complex true posterior, resulting in inefficient inference trajectories and slow convergence. In this work, we propose Diffusion Bridge Variational Inference (DBVI), a principled extension of DDVI that initiates the reverse diffusion from a learnable, data-dependent initial distribution. This initialization is parameterized via an amortized neural network and progressively adapted using gradients from the ELBO objective, reducing the posterior gap and improving sample efficiency. To enable scalable amortization, we design the network to operate on the inducing inputs $\mathbf{Z}^{(l)}$, which serve as structured, low-dimensional summaries of the dataset and naturally align with the inducing variables' shape. DBVI retains the mathematical elegance of DDVI, including Girsanov-based ELBOs and reverse-time SDEs, while reinterpreting the prior via a Doob-bridged diffusion process. We derive a tractable training objective under this formulation and implement DBVI for scalable inference in large-scale DGPs. Across regression, classification, and image reconstruction tasks, DBVI consistently outperforms DDVI and other variational baselines in predictive accuracy, convergence speed, and posterior quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1050. A Training-Free Framework for Long Video Understanding via Video-Query-Options Similarity
๐ฌ 9
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ long video understanding multimodal large language model
Multimodal Large Language Models (MLLMs) have achieved remarkable success in image and short video understanding tasks, but their performance on hour-long videos remains limited due to the constraint of input token capacity. Existing approaches often require costly training procedures, hindering their adaptability to rapidly evolving MLLM architectures. In this paper, we propose a training-free framework for long video understanding, integrating three key innovations: Adaptive Frame Sampling (AFS), Dynamic Resolution Allocation (DRA), and Video-Query-Options Similarity (VQOS). AFS adaptively increases frame sampling density in highly relevant video segments to preserve critical temporal details, while DRA reduces spatial resolution in less relevant segments to suppress redundant information. VQOS enhances similarity calculation by prompting MLLMs to generate candidate answer options, fusing queries with options to refine relevance estimation. Mirroring human cognitive processes (hypothesis generation → focused verification → irrelevance filtering), our framework effectively improves model accuracy without fine-tuning. The method is implemented on both LLaVA-Video and Qwen2.5-VL, and experimental results show that it achieves state-of-the-art performance on 5 mainstream benchmarks. More visualization results and code are available in the Appendix.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1051. When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
๐ฌ 12
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LLM Ensemble probability-level ensemble speculative decoding
Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining the ensembling positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose $\textbf{SAFE}$ ($\textbf{S}$table $\textbf{A}$nd $\textbf{F}$ast LLM $\textbf{E}$nsembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we apply a probability sharpening strategy when the ensemble distribution becomes overly smooth, enabling the selection of more confident tokens during ensembling. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1\% of tokens.
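A minimal sketch of selective token-level ensembling in the spirit of SAFE, assuming (unrealistically) a shared vocabulary; the boundary check, agreement score, thresholds, and sharpening rule are illustrative stand-ins for the paper's criteria:

```python
import torch

def safe_step(p_main, p_other, boundary_match, agree_threshold=0.6, sharpen_temp=0.5):
    """One decoding step of selective next-token ensembling (simplified sketch).

    p_main, p_other: next-token distributions assumed to live over a shared vocabulary.
    boundary_match:  whether both models sit on the same token boundary at this step.
    """
    if not boundary_match:
        return p_main                                 # tokenization mismatch: skip ensembling
    agreement = torch.minimum(p_main, p_other).sum()  # overlap between the two distributions
    if agreement < agree_threshold:
        return p_main                                 # low consensus: trust the main model
    p = 0.5 * (p_main + p_other)
    if p.max() < 0.5:                                 # overly smooth ensemble: sharpen it
        p = p.pow(1.0 / sharpen_temp)
        p = p / p.sum()
    return p

# Toy usage over a 4-token vocabulary.
p1 = torch.tensor([0.7, 0.1, 0.1, 0.1])
p2 = torch.tensor([0.6, 0.2, 0.1, 0.1])
print(safe_step(p1, p2, boundary_match=True))
```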
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1052. On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Robustness Vision-Language-Action Models
In Vision–Language–Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visually robust VLAs do not gain robustness in other modalities, and (3) $\pi_0$ demonstrates superior robustness. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow-matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate our RobustVLA delivers absolute gains over baselines of 12.6\% on the $\pi_0$ backbone and 10.4\% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust BYOVLA that requires external LLMs, and a 10.4\% gain under mixed perturbations. On the real-world FR5 robot, under four types of multimodal perturbations, RobustVLA shows strong low-data performance, outperforming $\pi_0$ by $65.6\%$ in success rate with 25 demonstrations. Even with abundant demonstrations, our method still outperforms $\pi_0$ by 30\% in success rate. Code and demo videos are available at \url{https://anonymous.4open.science/r/RobustVLA-283D}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1053. Universal Beta Splatting
๐ฌ 17
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Radiance Field Splatting
We introduce Universal Beta Splatting (UBS), a unified framework that generalizes 3D Gaussian Splatting to N-dimensional anisotropic Beta kernels for explicit radiance field rendering. Unlike fixed Gaussian primitives, Beta kernels enable controllable dependency modeling across spatial, angular, and temporal dimensions within a single representation. Our unified approach captures complex light transport effects, handles anisotropic view-dependent appearance, and models scene dynamics without requiring auxiliary networks or specific color encodings. UBS maintains backward compatibility by recovering Gaussian Splatting as a special case, guaranteeing plug-in usability and lower performance bounds. The learned Beta parameters naturally decompose scene properties into interpretable components without explicit supervision: spatial (surface vs. texture), angular (diffuse vs. specular), and temporal (static vs. dynamic). Our CUDA-accelerated implementation achieves real-time rendering while consistently outperforming existing methods across static, view-dependent, and dynamic benchmarks, establishing Beta kernels as a scalable universal primitive for radiance field rendering.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1054. OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ generalist agent; GUI agent; embodied agent; MoE
Multimodal large language models are progressively advancing toward multimodal agents that can proactively execute tasks. Existing research on multimodal agents primarily targets either GUI or embodied scenarios, corresponding to interactions within the 2D virtual world and the 3D physical world, respectively. However, many real-world tasks inherently require agents to interleave interactions across both types of environments. We initially mix GUI and embodied data to train models, but find performance degradation caused by data conflicts. Further analysis reveals that GUI and embodied data exhibit synergy at shallow layers but conflict at deep layers, resembling the cerebrum-cerebellum mechanism in the human brain. To this end, we introduce OmniActor, a high-performance generalist agent designed from both structural and data perspectives. First, we propose a Layer-heterogeneous MoE that separates parameters at deep layers to eliminate conflict, while sharing parameters at shallow layers to leverage synergy. This design enables OmniActor to outperform agents trained solely on GUI or embodied data in their respective tasks. Furthermore, we unify the action spaces of GUI and embodied tasks and collect large-scale datasets from diverse sources for training. This substantially enhances the performance of OmniActor across various scenarios, especially in GUI tasks. The code will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1055. MobileKGQA: On-Device KGQA System on Dynamic Mobile Environments
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Knowledge Graph Question Answering Large Language Model
Developing a mobile system capable of generating responses based on stored user data is a crucial challenge. Since user data is stored in the form of Knowledge Graphs, the field of knowledge graph question answering (KGQA) presents a promising avenue towards addressing this problem. However, existing KGQA systems face two critical limitations that preclude their on-device deployment: resource constraints and the inability to handle data accumulation. Therefore, we propose MobileKGQA, the first on-device KGQA system capable of adapting to evolving databases with minimal resource demands. MobileKGQA significantly reduces computational overhead through embedding hashing. Moreover, it successfully adapts to evolving databases under resource constraints through a novel annotation generation method. Its mobile applicability is validated on the NVIDIA Jetson Orin Nano edge-device platform, achieving 20.3% higher performance while using only 30.4% of the energy consumed by the SOTA (state of the art). On standard KGQA benchmarks, using just 7.2% of the computation and 9% of the parameters, MobileKGQA demonstrates performance that is empirically indistinguishable from the SOTA and outperforms baselines under distribution shift scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1056. Separable Neural Networks: Approximation Theory, NTK Regime, and Preconditioned Gradient Descent
๐ฌ 19
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Separable Neural Networks Approximation Theory Preconditioned Gradient Descent Neural Tangent Kernel
Separable neural networks (SepNNs) are emerging neural architectures that significantly reduce computational costs by factorizing a multivariate function into linear combinations of univariate functions, benefiting downstream applications such as implicit neural representations (INRs) and physics-informed neural networks (PINNs). However, fundamental theoretical analysis for SepNN, including detailed representation capacity and spectral bias characterization \& alleviation, remains unexplored. This work makes three key contributions to theoretically understanding and improving SepNN. First, using Weierstrass-based approximation and universal approximation theory, we prove that SepNN can approximate any multivariate function with arbitrary precision, confirming its representation completeness. Second, we derive the neural tangent kernel (NTK) regimes for SepNN, showing that the NTK of infinite-width SepNN converges to a deterministic (or random) kernel under infinite (or fixed) decomposition rank, with corresponding convergence and spectral bias characterization. Third, we propose an efficient separable preconditioned gradient descent (SepPGD) for optimizing SepNN, which alleviates the spectral bias of SepNN by provably adjusting its NTK spectrum. The SepPGD enjoys an efficient $\mathcal{O}(nD)$ complexity for $n^D$ training samples, which is much more efficient than previous neural network PGD methods. Extensive experiments for kernel ridge regression, image and surface representation using INRs, and numerical PDEs using PINNs validate the efficiency of SepNN and the effectiveness of SepPGD for alleviating spectral bias.
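For readers unfamiliar with separable architectures, the sketch below shows one common rank-$R$ separable parameterization, $f(x) \approx \sum_r \prod_d g_{r,d}(x_d)$, with one small univariate MLP per input dimension; this is an illustrative form and the paper's exact SepNN architecture may differ:

```python
import torch
import torch.nn as nn

class SeparableNet(nn.Module):
    """f(x_1, ..., x_D) ~= sum_r prod_d g_{r,d}(x_d): one univariate MLP per input
    dimension, each emitting R rank components combined multiplicatively across dims."""

    def __init__(self, dim, rank, hidden=32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, rank))
            for _ in range(dim)
        )

    def forward(self, x):                                # x: (batch, dim)
        feats = [branch(x[:, d:d + 1]) for d, branch in enumerate(self.branches)]
        prod = torch.stack(feats, dim=0).prod(dim=0)     # (batch, rank)
        return prod.sum(dim=-1)                          # (batch,)

# Toy usage: evaluate the factorized form on random inputs.
net = SeparableNet(dim=3, rank=8)
print(net(torch.randn(4, 3)).shape)                      # torch.Size([4])
```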
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1057. Fine-Grained Privacy Extraction from Retrieval-Augmented Generation Systems by Exploiting Knowledge Asymmetry
๐ฌ 24
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ RAG knowledge asymmetry privacy extraction cross-domain generalization
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by incorporating external knowledge bases, significantly improving their factual accuracy and contextual relevance. However, this integration also introduces new privacy vulnerabilities. Existing privacy attacks on RAG systems may trigger data leakage, but they often fail to accurately isolate knowledge base-derived content within mixed responses and perform poorly in multi-domain settings. In this paper, we propose a novel black-box attack framework that exploits knowledge asymmetry between RAG systems and standard LLMs to enable fine-grained privacy extraction across heterogeneous knowledge domains. Our approach decomposes adversarial queries to maximize information divergence between the models, then applies semantic relationship scoring to resolve lexical and syntactic ambiguities. These features are used to train a neural classifier capable of precisely identifying response segments that contain private or sensitive information. Unlike prior methods, our framework generalizes to unseen domains through iterative refinement without requiring prior knowledge of the corpus. Experimental results show that our method achieves over 90\% extraction accuracy in single-domain scenarios and 80\% in multi-domain settings, outperforming baselines by over 30\% in key evaluation metrics. These results represent the first systematic solution for fine-grained privacy localization in RAG systems, exposing critical security vulnerabilities and paving the way for stronger, more resilient defenses.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1058. FlexHiNM-GP: Flexible Hierarchical Pruning via Region Allocation and Channel Permutation
๐ฌ 6
โญ 6.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Model Pruning N:M sparsity
N:M sparsity has emerged as a hardware-friendly pruning strategy, notably supported by NVIDIA's Sparse Tensor Cores. While efficient, its fixed sparsity ratio restricts flexibility, making it difficult to adapt pruning granularity to varying weight importance across layers and architectures.
To overcome this limitation, we propose FlexHiNM, a hybrid framework that adaptively partitions each layer into three regions: dense, vector-pruned, and N:M sparse, enabling finer-grained control while preserving hardware compatibility. To better preserve salient weights, we extend this to FlexHiNM-GP, which incorporates Gyro-Permutation, an iterative channel-rearrangement algorithm. Through successive sampling, clustering, and assignment, Gyro-Permutation aligns high-importance weights with structured sparsity patterns and mitigates suboptimal configurations in multi-level pruning.
During gradual pruning, FlexHiNM-GP further employs a differentiable masking mechanism based on the Hard Concrete distribution, enabling gradient-based mask learning and preventing over-aggressive early pruning. Experiments on vision and language benchmarks demonstrate that FlexHiNM-GP consistently surpasses strong structured baselines and approaches the performance of unstructured pruning, validating the effectiveness of combining hybrid sparsity with learned masks and permutation strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1059. Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning
๐ฌ 10
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning Monte-Carlo Tree Search
Offline reinforcement learning (RL) is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, many different MDPs can behave identically on the offline dataset, and dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation. The proposed algorithm significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three challenging, stochastic tokamak control tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1060. Geometry-aware 4D Video Generation for Robot Manipulation
๐ฌ 11
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Video Generation Robot Manipulation 3D Perception
Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1061. AsyncBEV: Cross-modal flow alignment in Asynchronous 3D Object Detection
๐ฌ 18
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ multi-modal 3D object detection autonomous driving asynchronous fusion
In autonomous driving, multi-modal perception tasks like 3D object detection typically rely on well-synchronized sensors, both at training and inference. However, despite the use of hardware- or software-based synchronization algorithms, perfect synchrony is rarely guaranteed: sensors may operate at different frequencies, and real-world factors such as network latency, hardware failures, or processing bottlenecks often introduce time offsets between sensors. Such asynchrony degrades perception performance, especially for dynamic objects. To address this challenge, we propose AsyncBEV, a trainable, lightweight, and generic module to improve the robustness of 3D Bird's Eye View (BEV) object detection models against sensor asynchrony. Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow from the BEV features of two different sensor modalities, taking into account the known time offset between these sensor measurements. The predicted feature flow is then used to warp and spatially align the feature maps, which we show can easily be integrated into different current BEV detector architectures (e.g., BEV grid-based and token-based). Extensive experiments demonstrate AsyncBEV improves robustness against both small and large asynchrony between LiDAR or camera sensors in both the token-based CMT and grid-based UniBEV, especially for dynamic objects. We significantly outperform the ego-motion-compensated CMT and UniBEV baselines, notably by $16.6$\% and $11.9$\% NDS on dynamic objects in the worst-case scenario of a $0.5 s$ time offset. Code will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1062. Mitigating Hallucination in Vision-Language Model with Depth and Spatial-aware Key-Value Refinement
๐ฌ 41
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Hallucination in Vision Language Model Depth and Spatial-aware key value Cache Refinement Key-Value Cache Manipulation Multi Modal
Large vision–language models (VLMs) deliver state-of-the-art results on a wide range of multimodal tasks, yet they remain prone to visual hallucinations, producing content that is not grounded in the input image.
Despite progress with visual supervision, reinforcement learning, and post-hoc attention reshaping, the representational origins of hallucinations remain unclear.
Our study reveals that successful grounding emerges when adjacent visual tokens exhibit coherent alignment, while hallucinations arise when key vectors scatter isotropically, weakening cross-modal attention and blurring object boundaries.
Building on this insight, we propose Depth and Spatial aware Cache Refinement (DSCR), a lightweight and training-free method that augments the Transformer's key-value (KV) cache with depth cues and 2D spatial proximity.
DSCR clusters vectors within objects and separates those across surfaces, guiding attention toward relevant regions without any fine-tuning.
Comprehensive evaluations show that DSCR consistently reduces hallucinations, delivering up to 23\% accuracy gains across MME, POPE, RePOPE, CHAIR, and a new depth-sensitive benchmark.
Our findings highlight KV-coherence as a core factor behind hallucinations and demonstrate a practical, model-agnostic solution for enhancing VLM reliability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1063. D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting Sparse View
Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework, D$^2$GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of the sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse-view conditions. The source code and trained models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1064. Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning
๐ฌ 11
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Vision-Language Model Few-shot Transfer Image Classification
Pretrained vision-language models (VLMs), such as CLIP, have shown remarkable potential in few-shot image classification and led to numerous effective transfer learning strategies. These methods leverage the pretrained knowledge of VLMs to enable effective domain adaptation while mitigating overfitting through parameter-efficient tuning or instance-based consistency constraints. However, such regularizations often neglect the geometric structure of the data distribution, which may lead to distortion of the overall semantic representation. To overcome this limitation, we propose a novel fine-tuning method, Manifold-Preserving and Sculpting Tuning (MPS-Tuning). Regarding the data distribution in feature space as a semantic manifold, MPS-Tuning explicitly constrains the intrinsic geometry of this manifold while further sculpting it to enhance class separability. Specifically, MPS-Tuning preserves both macroscopic and microscopic topological structures of the original manifold by aligning Gram matrices of features before and after fine-tuning. Theoretically, this constraint is shown to approximate an upper bound of the Gromov-Wasserstein distance. Furthermore, features from the image and text modalities are paired, and pairwise similarities are optimized to enhance the manifold's class discriminability. Extensive experiments demonstrate that MPS-Tuning significantly improves model performance while effectively preserving the structure of the semantic manifold. The code will be released.
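A minimal sketch of the Gram-matrix alignment idea: penalize changes in pairwise feature similarities between the tuned and frozen models during fine-tuning. The normalization, loss form, and weighting are assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def gram_alignment_loss(feats_tuned, feats_frozen):
    """Penalize drift in pairwise feature structure during fine-tuning.

    feats_tuned:  (batch, dim) features from the model being fine-tuned.
    feats_frozen: (batch, dim) features from the frozen pretrained model.
    Matching the two Gram matrices preserves the geometry of the feature manifold.
    """
    f_t = F.normalize(feats_tuned, dim=-1)
    f_0 = F.normalize(feats_frozen.detach(), dim=-1)
    return F.mse_loss(f_t @ f_t.T, f_0 @ f_0.T)

# Toy usage; in practice this term would be added to the few-shot classification loss,
# e.g. loss = ce_loss + lambda_manifold * gram_alignment_loss(z_tuned, z_frozen).
z_tuned, z_frozen = torch.randn(16, 512), torch.randn(16, 512)
print(gram_alignment_loss(z_tuned, z_frozen))
```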
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1065. A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond
๐ฌ 14
โญ 6.00
๐ learning theory
๐ท๏ธ Neural Networks Optimization Structure Discovery Compressibility Derandomization Multiple Index Model Johnson Lindenstrauss MAXCUT
Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge.
The work of (Mousavi-Hosseini et al., 2023) analyzes a multiple index teacher-student setting and shows that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer.
This structural property is known to reduce sample complexity of generalization.
Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions.
In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions, more specifically: we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP), e.g. perturbed gradient descent (PGD). At the core of our approach is a key $\textit{derandomization}$ lemma, which states that optimizing the function $E_{x} \left[g_{\theta}(Wx + b)\right]$ converges to a point where $W = 0$, under mild conditions. The fundamental nature of this lemma directly explains structure discovery and has immediate applications in other domains including an end-to-end approximation for MAXCUT, and computing Johnson-Lindenstrauss embeddings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1066. RoboOmni: Proactive Robot Manipulation in Omni-modal Context
๐ฌ 4
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Speech Robotic Manipulation Omni-Modal LLMs Proactive Intention Recognition
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision–Language–Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively.
In this work, we introduce *cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands.* To address this new setting, we present **RoboOmni**, a *Perceiver-Thinker-Talker-Executor* framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction.
To address the absence of training data for proactive intention recognition in robotic manipulation, we build **OmniAction**, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance. All datasets, code, and real-world demonstration videos will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1067. DeRaDiff: Denoising Time Realignment of Diffusion Models
๐ฌ 14
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ alignment diffusion models
Recent advances align diffusion models with human preferences to increase aesthetic appeal and mitigate artifacts and biases. Such methods aim to maximize a conditional output distribution aligned with higher rewards whilst not drifting far from a pretrained prior. This is commonly enforced by KL (Kullback–Leibler) regularization. As such, a central issue still remains: how does one choose the right regularization strength? Too high of a strength leads to limited alignment and too low of a strength leads to "reward hacking". This renders the task of choosing the correct regularization strength highly non-trivial. Existing approaches sweep over this hyperparameter by aligning a pretrained model at multiple regularization strengths and then choosing the best strength. Unfortunately, this is prohibitively expensive. We introduce _DeRaDiff_, a _denoising-time realignment_ procedure that, after aligning a pretrained model once, modulates the regularization strength _during sampling_ to emulate models trained at other regularization strengths, _without any additional training or fine-tuning_. Extending decoding-time realignment from language to diffusion models, DeRaDiff operates over iterative predictions of continuous latents by replacing the reverse-step reference distribution by a geometric mixture of an aligned and reference posterior, thus giving rise to a closed-form update under common schedulers and a single tunable parameter, $\lambda$, for on-the-fly control. Our experiments show that across multiple text–image alignment and image-quality metrics, our method consistently provides a strong approximation for models aligned entirely from scratch at different regularization strengths. Thus, by enabling very precise inference-time control of the regularization strength, our method yields an efficient way to search for the optimal strength, eliminating the need for expensive alignment sweeps and thereby substantially reducing computational costs.
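As a worked illustration of the geometric-mixture idea: for two Gaussian reverse-step posteriors with shared covariance, their geometric mixture is (after renormalization) again Gaussian, and its mean interpolates linearly between the aligned and reference means. So inference-time control can reduce to a one-line blend; the paper's closed-form update under specific schedulers may include additional terms:

```python
import torch

def blended_reverse_mean(mean_aligned, mean_reference, lam):
    """Blend the aligned and reference models' reverse-step means.

    Under a shared per-step covariance, the geometric mixture
    p_aligned^lam * p_reference^(1-lam) is Gaussian with this mean, so sampling
    can simply use the interpolated mean. lam = 1 recovers the aligned model,
    lam = 0 recovers the reference model, and intermediate values emulate
    models trained at other regularization strengths.
    """
    return lam * mean_aligned + (1.0 - lam) * mean_reference

# Toy usage at one reverse step.
mu_aligned, mu_ref = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)
print(blended_reverse_mean(mu_aligned, mu_ref, lam=0.7).shape)
```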
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1068. HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ diffusion models sampling classifier-free guidance
While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using a smaller number of neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances the quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256$\times$256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.
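A minimal sketch of history-guided sampling: keep a running average of past denoiser predictions and steer the current prediction along its difference from that average. The guidance scale, decay, and wrapper interface are illustrative assumptions, not the paper's settings:

```python
import torch

class HistoryGuidance:
    """Steer each denoising prediction using a running average of past predictions."""

    def __init__(self, scale=0.3, ema_decay=0.8):
        self.scale = scale          # strength of the momentum-style correction
        self.ema_decay = ema_decay  # decay of the weighted average over past predictions
        self.history = None

    def __call__(self, x0_pred):
        if self.history is None:
            self.history = x0_pred.detach()
            return x0_pred
        # Push the prediction along its difference from the history average.
        guided = x0_pred + self.scale * (x0_pred - self.history)
        self.history = self.ema_decay * self.history + (1 - self.ema_decay) * x0_pred.detach()
        return guided

# Inside a sampling loop, one would replace `x0_pred = model(x_t, t)` with
# `x0_pred = guidance(model(x_t, t))` and keep the rest of the sampler unchanged.
guidance = HistoryGuidance()
print(guidance(torch.randn(1, 3, 8, 8)).shape)
```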
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1069. 3DSMT: A Hybrid Spiking Mamba-Transformer for Point Cloud Analysis
๐ฌ 22
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Point Cloud Analysis Spiking neural network Spiking Local Offset Attention Spiking Mamba Block
The sparse unordered structure of point clouds causes unnecessary computation and energy consumption in deep models.
Conventionally, the Transformer architecture is leveraged to model global relationships in point clouds; however, its quadratic complexity restricts scalability. Although the Mamba architecture enables efficient global modeling with linear complexity, it lacks natural adaptability to unordered point clouds.
Spiking Neural Network (SNN) is an energy-efficient alternative to Artificial Neural Network (ANN), offering an ultra low-power event-driven paradigm.
The inherent sparsity and event-driven characteristics of SNN are highly compatible with the sparse distribution of point clouds. To balance efficiency and performance, we propose a hybrid spiking Mamba-Transformer (3DSMT) model for point cloud analysis. 3DSMT integrates a Spiking Local Offset Attention module to efficiently capture fine-grained local geometric features with a spiking Mamba block designed for unordered point clouds to achieve global feature integration with linear complexity. Experiments show that 3DSMT achieves state-of-the-art performance among SNN-based methods in shape classification, few-shot classification, and part segmentation tasks, significantly reducing computational energy consumption while also outperforming numerous ANN-based models.
Our source code is included in the supplementary material and will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1070. Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Differential Privacy Decentralized Learning Matrix Mechanism Gossip
Decentralized Learning (DL) enables users to collaboratively train models without sharing raw data by iteratively averaging local updates with neighbors in a network graph. This setting is increasingly popular for its scalability and its ability to keep data local under user control. Strong privacy guarantees in DL are typically achieved through Differential Privacy (DP), with results showing that DL can even amplify privacy by disseminating noise across peer-to-peer communications.
Yet in practice, the observed privacy-utility trade-off often appears worse than in centralized training, which may be due to limitations in current DP accounting methods for DL. In this paper, we show that recent advances in centralized DP accounting based on Matrix Factorization (MF) for analyzing temporal noise correlations can also be leveraged in DL. By generalizing existing MF results, we show how to cast both standard DL algorithms and common trust models into a unified formulation. This yields tighter privacy accounting for existing DP-DL algorithms and provides a principled way to develop new ones. To demonstrate the approach, we introduce MAFALDA-SGD, a gossip-based DL algorithm with user-level correlated noise that outperforms existing methods on synthetic and real-world graphs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1071. Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions
๐ฌ 11
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning reasoning
Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold (and, critically, when they fail) remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong *Model-Task Alignment*, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1072. ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning
๐ฌ 27
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Continual Pretrain Large Language Models Parameter-Efficient Training
Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, the uniform expansion and updates still entangle general and domain learning, undermining its effectiveness. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We then propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical domains show that ADEPT outperforms full-parameter CPT by up to 5.76% on the general benchmarks and 5.58% on the target domain benchmarks with only 15% of parameters tuned and less than 50% training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://anonymous.4open.science/r/ADEPT-F2E3
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1073. LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning
๐ฌ 20
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ parameter-efficient fine-tuning low-rank adaptation llms large models
Large pre-trained models are commonly adapted to downstream tasks using parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices instead of updating all weights. While LoRA dramatically reduces trainable parameters with little overhead, it can still underperform full fine-tuning in accuracy and often converges more slowly. We introduce LoFT, a novel low-rank adaptation method that behaves like full fine-tuning by aligning the optimizer's internal dynamics with those of updating all model weights. LoFT not only learns weight updates in a low-rank subspace (like LoRA) but also properly projects the optimizer's first and second moments (Adam's momentum and variance) into the same subspace, mirroring full-model updates. By aligning the low-rank update itself with the full update, LoFT eliminates the need to tune extra hyperparameters, e.g., the LoRA scaling $\alpha$. Empirically, this approach substantially narrows the performance gap between adapter-based tuning and full fine-tuning and consistently outperforms standard LoRA-style methods, all without increasing inference cost.
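A simplified sketch of the core idea: carry out Adam's moment updates on gradients projected into a fixed low-rank subspace, so the low-rank step mirrors a full-model Adam step. The one-sided orthonormal basis and hyperparameters here are assumptions, not LoFT's exact formulation:

```python
import torch

def subspace_adam_step(grad_full, basis, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step carried out inside a fixed low-rank subspace.

    grad_full: (d_out, d_in) gradient of the frozen weight as if fully fine-tuned.
    basis:     (d_out, r) orthonormal columns spanning the adaptation subspace.
    state:     dict holding subspace first/second moments and the step count.
    Returns the low-rank weight update lifted back to the full space.
    """
    g = basis.T @ grad_full                                # project gradient: (r, d_in)
    state["step"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g      # first moment in the subspace
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g  # second moment in the subspace
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    delta = -lr * m_hat / (v_hat.sqrt() + eps)             # Adam step on subspace coordinates
    return basis @ delta                                   # lift back: (d_out, d_in)

d_out, d_in, r = 64, 32, 4
basis, _ = torch.linalg.qr(torch.randn(d_out, r))
state = {"m": torch.zeros(r, d_in), "v": torch.zeros(r, d_in), "step": 0}
print(subspace_adam_step(torch.randn(d_out, d_in), basis, state).shape)  # (64, 32)
```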
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1074. TTT3R: 3D Reconstruction as Test-Time Training
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Reconstruction Structure from Motion Recurrent Neural Networks
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear complexity in the sequence length. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, enabling a balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code will be made publicly available.
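A minimal sketch of a confidence-gated memory update in the spirit of TTT3R: the per-token learning rate is derived from how well the incoming observation aligns with the current state. Here the alignment confidence is a sigmoid of query-key similarity, an assumption for illustration; the paper derives a closed-form rate:

```python
import torch

def gated_memory_update(memory, observation, query, key):
    """Confidence-gated recurrent memory update (simplified sketch).

    memory, observation: (tokens, dim) current state and incoming features.
    query, key:          (tokens, dim) used to score alignment confidence.
    A high confidence (alpha near 1) adapts to the new observation; a low
    confidence retains the historical state.
    """
    alpha = torch.sigmoid((query * key).sum(dim=-1, keepdim=True))  # (tokens, 1) in [0, 1]
    return (1.0 - alpha) * memory + alpha * observation

mem = torch.zeros(8, 16)
obs = torch.randn(8, 16)
q, k = torch.randn(8, 16), torch.randn(8, 16)
print(gated_memory_update(mem, obs, q, k).shape)  # torch.Size([8, 16])
```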
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1075. LiTo: Surface Light Field Tokenization
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ generative model 3D vision
We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages the fact that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation can reproduce view-dependent effects such as lighting reflections and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher reconstruction quality and better separation of geometry and appearance than existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1076. Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
๐ฌ 9
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ fMRI-to-Image Reconstruction Brain Decoding fMRI Decoding Multiple Brains
Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters \& subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1077. Rethinking Unsupervised Cross-modal Flow Estimation: Learning from Decoupled Optimization and Consistency Constraint
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Cross-modal flow estimation Unsupervised learning Multimodal and multi-spectral images
This work presents DCFlow, a novel self-supervised cross-modal flow estimation framework that integrates a decoupled optimization strategy and a cross-modal consistency constraint. Unlike previous unsupervised approaches that implicitly learn flow estimation solely from appearance similarity, we introduce a decoupled optimization strategy with task-specific supervision to address modality discrepancy and geometric misalignment distinctly. This is achieved by collaboratively training a modality transfer network and a flow estimation network. To enable reliable motion supervision without ground-truth flow, we propose a geometry-aware data synthesis pipeline combined with an outlier-robust loss. Additionally, we introduce a cross-modal consistency constraint to jointly optimize both networks, significantly improving flow prediction accuracy. For evaluation, we construct a comprehensive cross-modal flow benchmark by repurposing public datasets. Experimental results demonstrate that DCFlow can be integrated with various flow estimation networks and achieves state-of-the-art performance among unsupervised approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1078. VOGUE: Unified Understanding, Generation, and Editing for Videos
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ diffusion;multimodal generation
Unified multimodal understanding-generation models have shown promising results in image generation and editing, but remain largely constrained to the image domain. In this work, we present VOGUE, a versatile framework that extends unified modeling to the video domain. VOGUE adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, VOGUE unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that VOGUE matches or surpasses state-of-the-art task-specific baselines in visual understanding, text/image-to-video generation, in-context video editing and generation. Beyond these core capabilities, the unified design allows VOGUE to generalize to unseen free-form editing tasks, such as green-screening characters or novel task composition (e.g., editing + style transfer) in a single instruction. Notably, VOGUE is the first system to support visual-prompt-based video generation in a unified model, where the MLLM interprets visual prompts and guides the MMDiT in synthesis. To foster future research, our model and code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1079. Neural Compression of 3D Meshes using Sparse Implicit Representation
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ compression 3D mesh
The growing demand for high-quality 3D mesh models has fueled the need for efficient 3D mesh compression techniques. However, existing methods often exhibit suboptimal compression performance due to the inefficient representation of mesh data. To address this issue, we propose a novel neural mesh compression method based on Sparse Implicit Representation (SIR). Specifically, SIR records signed distance field (SDF) values only on regular grids near the surface, enabling high-resolution structured representation of arbitrary geometric data with a significantly lower memory cost, while still supporting precise surface recovery. Building on this representation, we construct a lightweight Sparse Neural Compression (SNC) network to extract compact embedded features from the SIR and encode them into a bitstream. Extensive experiments and ablation studies demonstrate that our method outperforms state-of-the-art mesh and point cloud compression approaches in both compression performance and computational efficiency across a variety of mesh models. The code is included in the Supplementary Material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1080. From Pixels to Semantics: Unified Facial Action Representation Learning for Micro-Expression Analysis
๐ฌ 21
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ micro-expression recognition micro-expression generation
Micro-expression recognition (MER) is highly challenging due to the subtle and rapid facial muscle movements and the scarcity of annotated data. Existing methods typically rely on pixel-level motion descriptors such as optical flow and frame difference, which tend to be sensitive to identity and lack generalization. In this work, we propose D-FACE, a Discrete Facial ACtion Encoding framework that leverages large-scale facial video data to pretrain an identity- and domain-invariant facial action tokenizer for MER. For the first time, MER is shifted from relying on pixel-level motion descriptors to leveraging semantic-level facial action tokens, providing compact and generalizable representations of facial dynamics. Empirical analyses reveal that these tokens exhibit position-dependent semantics, motivating sequential modeling. Building on this insight, we employ a Transformer with sparse attention pooling to selectively capture discriminative action cues. Furthermore, to explicitly bridge action tokens with human-understandable emotions, we introduce an emotion-description-guided CLIP (EDCLIP) alignment. EDCLIP leverages textual prompts as semantic anchors for representation learning, while enforcing that the "others" category, which lacks corresponding prompts due to its ambiguity, remains distant from all anchor prompts.
Extensive experiments on multiple datasets demonstrate that our method achieves not only state-of-the-art recognition accuracy but also high-quality cross-identity and even cross-domain micro-expression generation, suggesting a paradigm shift from pixel-level to generalizable semantic-level facial motion analysis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1081. One-Shot Exemplars for Class Grounding in Self-Supervised Learning
๐ฌ 18
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Self-supervised learning One-shot exemplar Representation learning
Self-Supervised Learning (SSL) has recently achieved remarkable progress by leveraging large-scale unlabeled data. However, SSL pretrains models without relying on human annotation, so it usually does not specify the class space. This inevitably weakens the effectiveness of the learned representation in most downstream tasks that have an intrinsic class structure. In this work, we introduce One-Shot Exemplar Self-Supervised Learning (OSESSL), a new and practical setting that requires only one annotated instance per class. By introducing this extremely sparse supervision, OSESSL provides the minimum class information needed to guide the exploration of unlabeled data, achieving significant performance boosts at negligible annotation cost (i.e., a complexity of $\mathcal{O}(1)$ w.r.t. the sample size). In this OSESSL setting, we propose a simple yet effective framework that leverages the single labeled exemplar to build a class-specific prototype for learning reliable representations from the large pool of unlabeled data. To this end, we also build a novel consistency regularization that extends the sparse exemplar supervision toward the decision boundaries, thus improving the robustness of the learned representation. Extensive experiments on real-world datasets clearly validate the reliability of this simple and practical setting. The proposed approach outperforms state-of-the-art methods, achieving gains of approximately 3\% and 6\% $k$-NN accuracy on CIFAR-100 and ImageNet-100, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1082. STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation
๐ฌ 15
โญ 6.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ spatiotemporal-learning;physics-informed;neural ODE;crowd simulation;
Accurate crowd simulation is crucial for public safety management, emergency evacuation planning, and intelligent transportation systems. However, existing methods, which typically model crowds as a collection of independent individual trajectories, are limited in their ability to capture macroscopic physical laws. This microscopic approach often leads to error accumulation and compromises simulation stability. Furthermore, deep learning-driven methods tend to suffer from low inference efficiency and high computational overhead, making them impractical for large-scale, efficient simulations. To address these challenges, we propose the Spatio-Temporal Decoupled Differential Equation Network (STDDN), a novel framework that guides microscopic trajectory prediction with macroscopic physics. We innovatively introduce the continuity equation from fluid dynamics as a strong physical constraint. A Neural Ordinary Differential Equation (Neural ODE) is employed to model the macroscopic density evolution driven by individual movements, thereby physically regularizing the microscopic trajectory prediction model. We design a density-velocity coupled dynamic graph learning module to formulate the derivative of the density field within the Neural ODE, effectively mitigating error accumulation. We also propose a differentiable density mapping module to eliminate discontinuous gradients caused by discretization and introduce a cross-grid detection module to accurately model the impact of individual cross-grid movements on local density changes. The proposed STDDN method has demonstrated significantly superior simulation performance compared to state-of-the-art methods on long-term tasks across four real-world datasets, as well as a major reduction in inference latency.
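As an illustration of using the continuity equation as a physics regularizer, the sketch below (assuming a regular grid discretization of the density and velocity fields; not the paper's implementation) computes a residual that can be added to the trajectory loss.

```python
import torch

def continuity_residual(rho, vx, vy, dt, dx=1.0, dy=1.0):
    """Discrete residual of the continuity equation d(rho)/dt + div(rho * v) = 0.

    rho    : (T, H, W) density fields over time
    vx, vy : (T, H, W) velocity fields along x and y
    Returns a scalar that can be added to the trajectory loss as a
    physics regularizer (illustrative, assumed grid discretization).
    """
    drho_dt = (rho[1:] - rho[:-1]) / dt
    flux_x = rho[:-1] * vx[:-1]
    flux_y = rho[:-1] * vy[:-1]
    # Divergence of the flux field via finite differences on the grid.
    div = (torch.gradient(flux_x, spacing=dx, dim=2)[0]
           + torch.gradient(flux_y, spacing=dy, dim=1)[0])
    return ((drho_dt + div) ** 2).mean()
```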
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1083. Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
๐ฌ 18
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-language-action Model Representation Learning
Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data, which lack accurate spatial awareness and hinder their ability to operate in the 3D physical world. Existing solutions attempt to incorporate explicit 3D sensor inputs such as depth maps or point clouds, but these approaches face challenges due to sensor noise, hardware heterogeneity, and incomplete depth coverage in existing datasets. Alternative methods that estimate 3D cues from 2D images also suffer from the limited performance of depth estimators. We propose Spatial Forcing (SF), a simple yet effective alignment strategy that implicitly forces VLA models to develop spatial comprehension capabilities without relying on explicit 3D inputs or depth estimators. SF aligns intermediate visual embeddings of VLAs with geometric representations produced by pretrained 3D foundation models. By enforcing alignment at intermediate layers, SF guides VLAs to encode richer spatial representations that enhance action precision. Extensive experiments in simulation and real-world environments demonstrate that SF achieves state-of-the-art results, surpassing both 2D- and 3D-based VLAs. SF further accelerates training by up to 3.8$\times$ and improves data efficiency across diverse robotic tasks.
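A minimal sketch of the alignment idea, assuming a cosine objective between projected intermediate VLA tokens and frozen 3D-foundation-model patch features; the projection head, layer choice, and loss form are assumptions rather than the paper's specification.

```python
import torch
import torch.nn.functional as F

def spatial_alignment_loss(vla_tokens, geo_feats, proj):
    """Cosine alignment between intermediate VLA visual tokens and frozen
    3D-foundation-model features (a sketch; `proj` and token matching are assumed).

    vla_tokens : (B, N, D_vla) intermediate visual embeddings from the VLA
    geo_feats  : (B, N, D_geo) patch features from a pretrained 3D model
    proj       : nn.Module mapping D_vla -> D_geo
    """
    pred = F.normalize(proj(vla_tokens), dim=-1)
    target = F.normalize(geo_feats.detach(), dim=-1)   # 3D model stays frozen
    return 1.0 - (pred * target).sum(dim=-1).mean()

# Typical use: total_loss = action_loss + lambda_sf * spatial_alignment_loss(h_l, geo, proj)
```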
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1084. Primal-Dual Policy Optimization for Adversarial Linear CMDPs
๐ฌ 16
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Safe Reinforcement Learning Adversarial Linear Constrained MDP Policy Optimization
Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon {adversarial} linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the \emph{first} to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components---periodic policy mixing and a regularized dual update---which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1085. Online Navigation Refinement: Achieving Lane-Level Guidance by Associating Standard-Definition and Online Perception Maps
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ online navigation refinement geographic information systems navigation standard definition map online perception map
Lane-level navigation is critical for geographic information systems and navigation-based tasks, offering finer-grained guidance than the road-level navigation provided by standard definition (SD) maps. However, it currently relies on expensive global HD maps that cannot adapt to dynamic road conditions. Recently, online perception (OP) maps have become a research hotspot, providing real-time geometry as an alternative, but they lack the global topology needed for navigation. To address these issues, we introduce Online Navigation Refinement (ONR), a new task that refines SD-map-based road-level routes into accurate lane-level navigation by associating SD maps with OP maps. This map-to-map association must handle many-to-one lane-to-road mappings under two key challenges: (1) no public dataset provides lane-to-road correspondences; (2) severe misalignment from spatial fluctuations, semantic disparities, and OP map noise invalidates traditional map matching. To address these challenges, we contribute: (1) the Online Map Association dataset (OMA), the first ONR benchmark with 30K scenarios and 2.6M annotated lane vectors; (2) MAT, a transformer with path-aware attention that aligns topology despite spatial fluctuations and semantic disparities, and spatial attention that integrates noisy OP features via global context; and (3) NR P-R, a metric evaluating geometric and semantic alignment. Experiments show that MAT outperforms existing methods at 34 ms latency, enabling low-cost and up-to-date lane-level navigation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1086. Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders
๐ฌ 11
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Generative Image Models Failure Modes Interpretability Sparse Autoencoders
Despite their impressive performance, generative image models trained on large-scale datasets frequently fail to produce images with seemingly simple concepts -- e.g., human hands or objects appearing in groups of four -- that are reasonably expected to appear in the training data. These failure modes have largely been documented anecdotally, leaving open the question of whether they reflect idiosyncratic anomalies or more structural limitations of these models. To address this, we introduce a systematic approach for identifying and characterizing "conceptual blindspots" -- concepts present in the training data but absent or misrepresented in a model's generations. Our method leverages sparse autoencoders (SAEs) to extract interpretable concept embeddings, enabling a quantitative comparison of concept prevalence between real and generated images. We train an archetypal SAE (RA-SAE) on DINOv2 features with 32,000 concepts -- the largest such SAE to date -- enabling fine-grained analysis of conceptual disparities. Applied to four popular generative models (Stable Diffusion 1.5/2.1, PixArt, and Kandinsky), our approach reveals specific suppressed blindspots (e.g., bird feeders, DVD discs, and whitespaces on documents) and exaggerated blindspots (e.g., wood background texture and palm trees). At the individual datapoint level, we further isolate memorization artifacts -- instances where models reproduce highly specific visual templates seen during training. Overall, we propose a theoretically grounded framework for systematically identifying conceptual blindspots in generative models by assessing their conceptual fidelity with respect to the underlying data-generating process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1087. Exploring Mode Connectivity in Krylov Subspace for Domain Generalization
๐ฌ 10
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Loss landscape mode connectivity Krylov space domain generalization
This paper explores the geometric characteristics of loss landscapes to enhance domain generalization (DG) in deep neural networks.
Existing methods mainly leverage the local flatness around minima for improved generalization. However, recent theoretical studies indicate that flatness does not universally guarantee better generalization. Instead, this paper investigates a global geometrical property for domain generalization, i.e., mode connectivity, the phenomenon where distinct local minima are connected by continuous low-loss pathways. Different from flatness, mode connectivity enables transitions from poor to superior generalization models without leaving low-loss regions.
To navigate these connected pathways effectively, this paper proposes a novel Billiard Optimization Algorithm (BOA), which discovers superior models by mimicking billiard dynamics.
During this process, BOA operates within a low-dimensional Krylov subspace, aiming to alleviate the curse of dimensionality caused by the high-dimensional parameter space of deep models.
Furthermore, this paper reveals that oracle test gradients strongly align with the Krylov subspace constructed from training gradients across diverse datasets and architectures.
This alignment offers a powerful tool to bridge training and test domains, enabling the efficient discovery of superior models with limited training domains.
Experiments on DomainBed demonstrate that BOA consistently outperforms existing sharpness-aware and DG methods across diverse datasets and architectures.
Impressively, BOA even surpasses sharpness-aware minimization by 3.6\% on VLCS when using a ViT-B/16 backbone.
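One plausible way to realize the low-dimensional Krylov subspace described above is to span {g, Hg, H²g, ...} using Hessian-vector products with Gram-Schmidt orthogonalization, as in the hedged sketch below; the paper's exact subspace construction may differ.

```python
import torch

def _flatten(tensors):
    return torch.cat([t.reshape(-1) for t in tensors])

def _unflatten(vec, params):
    chunks, i = [], 0
    for p in params:
        chunks.append(vec[i:i + p.numel()].view_as(p))
        i += p.numel()
    return chunks

def krylov_basis(loss_fn, params, k):
    """Orthonormal basis of span{g, Hg, ..., H^{k-1} g}, built from training
    gradients via Hessian-vector products (a sketch of one common construction)."""
    g = _flatten(torch.autograd.grad(loss_fn(), params)).detach()
    basis = [g / (g.norm() + 1e-12)]
    for _ in range(k - 1):
        v = _unflatten(basis[-1], params)
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        dot = sum((gi * vi).sum() for gi, vi in zip(grads, v))
        w = _flatten(torch.autograd.grad(dot, params)).detach()   # H @ basis[-1]
        for b in basis:                        # Gram-Schmidt re-orthogonalization
            w = w - (w @ b) * b
        if w.norm() < 1e-10:                   # subspace exhausted
            break
        basis.append(w / w.norm())
    return torch.stack(basis)                  # (m, num_params) with m <= k
```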
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1088. Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement
๐ฌ 15
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Identity preservation Facial reconstruction Multimodal Large Models Fashion Image Editing
Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks. However, a persistent and long-standing limitation is the decline in facial identity (ID) consistency during realistic portrait editing. Due to the human eye's high sensitivity to facial features, such inconsistency significantly hinders the practical deployment of these models.
Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP due to Cross-source Distribution Bias and Cross-source Feature Contamination.
To address these issues, we propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration. By systematically analyzing diffusion trajectories, sampler behaviors, and attention properties, we introduce three key components: 1) Adaptive mixing strategy that aligns cross-source latent representations throughout the diffusion process. 2) Hybrid solver that disentangles source-specific identity attributes and details. 3) Attentional gating mechanism that selectively entangles visual elements. Extensive experiments show that EditedID achieves state-of-the-art performance in preserving original facial ID and edited element IP consistency.
As a training-free and plug-and-play solution, it establishes a new benchmark for practical and reliable single/multi-person facial identity restoration in open-world settings, paving the way for the deployment of multimodal editing large models in real-person editing scenarios.
The code is available at https://anonymous.4open.science/r/EditedID.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1089. Variation-aware Flexible 3D Gaussian Editing
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3d editing 3d gaussian splatting knowledge distillation
Indirect editing methods for 3D Gaussian Splatting (3DGS) have recently witnessed significant advancements. These approaches operate by first applying edits in the rendered 2D space and subsequently projecting the modifications back into 3D. *However, this paradigm inevitably introduces cross-view inconsistencies and constrains both the flexibility and efficiency of the editing process*. To address these challenges, we present **VF-Editor**, which enables native editing of Gaussian primitives by predicting attribute variations in a feedforward manner. To accurately and efficiently estimate these variations, we design a novel variation predictor distilled from 2D editing knowledge. The predictor encodes the input to generate a variation field and employs two learnable, parallel decoding functions to iteratively infer attribute changes for each 3D Gaussian. Thanks to its unified design, VF-Editor can seamlessly distill editing knowledge from diverse 2D editors and strategies into a single predictor, allowing for flexible and effective knowledge transfer into the 3D domain. Extensive experiments on both public and private datasets reveal the inherent limitations of indirect editing pipelines and validate the effectiveness and flexibility of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1090. AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM
๐ฌ 19
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model; Knowledge augmentation; Knowledge graph;
Retrieval-augmented generation (RAG) has shown some success in augmenting large language models (LLMs) with external knowledge. However, as a non-parametric knowledge integration paradigm for LLMs, RAG methods heavily rely on external retrieval modules and the retrieved textual context prior. Especially for very large scale knowledge augmentation, they would introduce substantial inference latency due to expensive searches and much longer relevant context. In this paper, we propose a parametric knowledge integration method, called $\textbf{AtlasKV}$, a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g. 1B triples) using very little GPU memory cost (e.g. less than 20GB VRAM). In AtlasKV, we introduce KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity. It maintains strong knowledge grounding and generalization performance using the LLMs' inherent attention mechanism, and requires no external retrievers, long context priors, or retraining when adapting to new knowledge.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1091. NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language
๐ฌ 13
โญ 6.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Neuro-Symbolic Vision and Language Compositional Reasoning
Modern Vision-Language Models (VLMs) have achieved impressive performance in various tasks, yet they often struggle with compositional reasoning, the ability to decompose and recombine concepts to solve novel problems. While neuro-symbolic approaches offer a promising direction, they are typically constrained by crisp logical execution or predefined predicates, which limit flexibility. In this work, we introduce NePTune, a neuro-symbolic framework that overcomes these limitations through a hybrid execution model that integrates the perception capabilities of foundation vision models with the compositional expressiveness of symbolic reasoning. NePTune dynamically translates natural language queries into executable Python programs that blend imperative control flow with soft logic operators capable of reasoning over VLM-generated uncertainty. Operating in a training-free manner, NePTune, with a modular design, decouples perception from reasoning, yet its differentiable operations support fine-tuning. We evaluate NePTune on multiple visual reasoning benchmarks and various domains, utilizing adversarial tests, and demonstrate a significant improvement over strong base models, as well as its effective compositional generalization and adaptation capabilities in novel environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1092. Glance and Focus Reinforcement for Pan-cancer Screening
๐ฌ 19
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Pan-cancer screening AI for healthcare
Pan-cancer screening in large-scale CT scans remains challenging for existing AI methods, primarily due to the difficulty of localizing diverse types of tiny lesions in large CT volumes. The extreme foreground-background imbalance significantly hinders models from focusing on diseased regions, while redundant focus on healthy regions not only decreases the efficiency but also increases false positives. Inspired by radiologists' glance and focus diagnostic strategy, we introduce GF-Screen, a Glance and Focus reinforcement learning framework for pan-cancer screening. GF-Screen employs a Glance model to localize the diseased regions and a Focus model to precisely segment the lesions, where segmentation results of the Focus model are leveraged to reward the Glance model via Reinforcement Learning (RL). Specifically, the Glance model crops a group of sub-volumes from the entire CT volume and learns to select the sub-volumes with lesions for the Focus model to segment. Given that the selecting operation is non-differentiable for segmentation training, we propose to employ the segmentation results to reward the Glance model. To optimize the Glance model, we introduce a novel group relative learning paradigm, which employs group relative comparison to prioritize high-advantage predictions and discard low-advantage predictions within sub-volume groups, not only improving efficiency but also reducing false positives. In this way, for the first time, we effectively extend cutting-edge RL techniques to tackle the specific challenges in pan-cancer screening. We conduct training and validation on a large-scale pan-cancer dataset comprising 5,117 CT scans. Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrate the effectiveness of GF-Screen. Notably, GF-Screen leads the public validation leaderboard of MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD). In addition, through discarding redundant regions, GF-Screen reduces the computation costs by 5.7 times, significantly improving inference efficiency. The superior performance of GF-Screen marks a novel and practical breakthrough in pan-cancer screening. Codes will be available.
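The group relative learning paradigm can be illustrated with a GRPO-style advantage computation over sub-volume selections, where each selection's reward (e.g., the Focus model's Dice score) is normalized against the other members of its group; the grouping and reward definition below are assumptions.

```python
import torch

def group_relative_advantage(rewards):
    """Group-relative advantages for sampled sub-volume selections.

    rewards : (G, N) rewards for N sampled selections in each of G groups
    Returns : (G, N) advantages, positive for above-group-average selections.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-8
    return (rewards - mean) / std
```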
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1093. RayI2P: Learning Rays for Image-to-Point Cloud Registration
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image-to-Point Cloud Registration
Image-to-point cloud registration aims to estimate the 6-DoF camera pose of a query image relative to a 3D point cloud map. Existing methods fall into two categories: matching-free methods regress pose directly using geometric priors, but lack fine-grained supervision and struggle with precise alignment; matching-based methods construct dense 2D-3D correspondences for PnP-based pose estimation, but are fundamentally limited by projection ambiguity (where multiple geometrically distinct 3D points project to the same image patch, leading to ambiguous feature representations) and scale inconsistency (where fixed-size image patches correspond to 3D regions of varying physical size, causing misaligned receptive fields across modalities). To address these issues, we propose a novel ray-based registration framework that first predicts patch-wise 3D ray bundles connecting image patches to the 3D scene and then estimates camera pose via a differentiable ray-guided regression module, bypassing the need for explicit 2D-3D correspondences. This formulation naturally resolves projection ambiguity, provides scale-consistent geometry encoding, and enables fine-grained supervision for accurate pose estimation. Experiments on KITTI and nuScenes show that our approach achieves state-of-the-art registration accuracy, outperforming existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1094. Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals
๐ฌ 17
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Unsupervised Reinforcement Learning Meta-Reinforcement Learning Pre-training Curriculum Learning
Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1095. Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products
๐ฌ 20
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Efficient Vision Models Hadamard Product Neural Architecture Search Differentiable Sampling
Recent theoretical advances reveal that the Hadamard product induces nonlinear representations and implicit high-dimensional mappings in deep learning, yet its practical deployment in efficient vision models remains underdeveloped. To address this gap, we introduce the Adaptive Cross-Hadamard (ACH) module, a novel operator that embeds learnability through differentiable discrete sampling and dynamic softsign normalization. This enables parameter-free feature reuse while stabilizing gradient propagation. Integrated into Hadaptive-Net (Hadamard Adaptive Network) via neural architecture search, our approach achieves unprecedented efficiency. Comprehensive experiments demonstrate state-of-the-art accuracy/speed trade-offs on image classification tasks, establishing Hadamard operations as fundamental building blocks for efficient vision models.
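A hedged sketch of a cross-Hadamard feature-expansion module: pairwise elementwise products of channel pairs followed by softsign normalization, concatenated back to the input. The learnable differentiable pair sampling of ACH is replaced here by a fixed random pair selection for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossHadamard(nn.Module):
    """Expands C input channels with Hadamard (elementwise) products of channel
    pairs, followed by softsign normalization. Illustrative only: the learnable
    pair sampling of the paper is replaced by fixed random pairs."""
    def __init__(self, channels, n_pairs):
        super().__init__()
        idx = torch.randint(0, channels, (2, n_pairs))
        self.register_buffer("idx_a", idx[0])
        self.register_buffer("idx_b", idx[1])

    def forward(self, x):                             # x: (B, C, H, W)
        prod = x[:, self.idx_a] * x[:, self.idx_b]    # (B, n_pairs, H, W)
        prod = F.softsign(prod)                       # dynamic-range normalization
        return torch.cat([x, prod], dim=1)            # parameter-free feature reuse
```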
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1096. Sim2Real VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation
๐ฌ 18
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Zero-Shot Sim2Real Vision-Language-Action (VLA) Model Long-horizon Manipulation
Vision-Language-Action (VLA) models represent a critical milestone toward embodied intelligence in robotic manipulation. To support their training, recent research has developed high-performance simulation engines for data synthesis. However, their effectiveness is still significantly limited by the simulation-to-reality (Sim2Real) gap, as policies trained on synthetic data often fail to generalize reliably to the real world. To address this challenge, we present Sim2Real-VLA, a generalist robot control model trained exclusively on synthetic data, yet capable of transferring seamlessly to real-world manipulation tasks. Sim2Real-VLA features a dual-system architecture: a high-level planner that infers object-centered chains-of-affordances, and a low-level actor that executes and validates these plans in real time via a tokenized action space. This design filters out manipulation-irrelevant features and prioritizes motion-critical dynamics, thereby enhancing Sim2Real domain transfer. Besides, a notable advantage of Sim2Real-VLA lies in its tight integration with automated data generation for manipulation skills, eliminating the need for manual fine-tuning and enabling scalable, hands-free training. Empirical evaluations across bimanual, dexterous, and long-horizon tasks show that Sim2Real-VLA consistently outperforms previous VLA baselines under diverse real-world environments and domain shifts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1097. COSA: Context-aware Output-Space Adapter for Test-Time Adaptation in Time Series Forecasting
๐ฌ 22
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Test-Time Adaptation Time-Series Forecasting Simple Adapter
Deployed time-series forecasters suffer performance degradation under non-stationarity and distribution shifts.
Test-time adaptation (TTA) for time-series forecasting differs from vision TTA because ground truth becomes observable shortly after prediction.
Existing time-series TTA methods typically employ dual input/output adapters that indirectly modify data distributions, making their effect on the frozen model difficult to analyze.
We introduce the Context-aware Output-Space Adapter (COSA), a minimal, plug-and-play adapter that directly corrects predictions of a frozen base model.
COSA performs residual correction modulated by gating, utilizing the original prediction and a lightweight context vector that summarizes statistics from recently observed ground truth.
At test time, only the adapter parameters (linear layer and gating) are updated under a leakage-free protocol, using observed ground truth with an adaptive learning rate schedule for faster adaptation.
Across diverse scenarios, COSA demonstrates substantial performance gains versus baselines without TTA (13.91$\sim$17.03\%) and SOTA TTA methods (10.48$\sim$13.05\%), with particularly large improvements at long horizons, while adding a reasonable level of parameters and negligible computational overhead.
The simplicity of COSA makes it architecture-agnostic and deployment-friendly.
Source code: https://anonymous.4open.science/r/linear-adapter-A720
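A minimal sketch of an output-space adapter in the spirit of COSA: a gated residual correction of the frozen forecaster's prediction, conditioned on the prediction itself and a context vector summarizing recent ground-truth statistics. Layer shapes and the context definition are assumptions, not the official design.

```python
import torch
import torch.nn as nn

class GatedOutputAdapter(nn.Module):
    """Gated residual correction applied to a frozen forecaster's output.

    horizon     : forecast length H
    context_dim : size of the context vector built from recent ground truth
    """
    def __init__(self, horizon, context_dim):
        super().__init__()
        self.residual = nn.Linear(horizon + context_dim, horizon)
        self.gate = nn.Linear(horizon + context_dim, horizon)

    def forward(self, y_hat, context):          # y_hat: (B, H), context: (B, C)
        z = torch.cat([y_hat, context], dim=-1)
        g = torch.sigmoid(self.gate(z))          # how much correction to apply
        return y_hat + g * self.residual(z)      # corrected forecast
```

At test time only these adapter parameters would be updated once the corresponding ground truth becomes observable, leaving the base forecaster frozen.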
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1098. R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
๐ฌ 20
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Reasoning Models Long Horizon Reasoning
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks (+7.5 on AIME2024). These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
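Query composition can be approximated by chaining single-step problems so that each one references the previous answer, as in the illustrative helper below; the prompt format is hypothetical and not the benchmark's actual template.

```python
def compose_long_horizon(problems):
    """Compose single-step problems into one interdependent multi-step query.

    problems : list of dicts with keys "question" and "answer"
    """
    parts = []
    for i, p in enumerate(problems, start=1):
        q = p["question"]
        if i > 1:
            # Later problems explicitly depend on the previous answer.
            q = (f"Let X{i-1} be the answer to Problem {i-1}. "
                 f"Using X{i-1}, solve: {q}")
        parts.append(f"Problem {i}: {q}")
    parts.append("Answer every problem in order; later problems depend on earlier answers.")
    return "\n".join(parts)
```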
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1099. PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multimodal Datasets LLM-Inferred Behavior Traits Causality
Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features.
We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1100. RF-DETR: Neural Architecture Search for Real-Time Detection Transformers
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Real-Time Object Detection Neural Architecture Search Transfer Learning
Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavy-weight vision-language model (VLM) for new domains, we introduce RF-DETR, a light-weight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with weight-sharing neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without re-training. Further, we revisit the "tunable knobs" for NAS to improve the transferability of DETRs to diverse target domains. Notably, RF-DETR significantly improves on prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) by 1.2 AP on Roboflow100-VL while running 20 times as fast. To the best of our knowledge, RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1101. Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reasoning model tool-integrated reasoning self-evolved training information entropy
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to enhance their internal reasoning ability by integrating external tools. However, models with TIR often exhibit suboptimal behaviors, including insufficient tool calls, excessive tool calls, and overthinking after receiving tool call results. How to empower LLMs to perform TIR efficiently and accurately, while stabilizing the reasoning process, remains an open challenge.
In this paper, we first analyze the impact of tool calls on model reasoning from the perspective of information entropy. We find that when tool call results are provided, the information entropy of the subsequent reasoning content shows a clear trend of change, and the overall information entropy of the reasoning chain varies with the number of tool calls. Based on these observations, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Our framework consists of dataset construction and multi-stage fine-tuning. For dataset construction, we use the trained model for continuous self-evolved sampling, integrating two methods: vanilla sampling and entropy-guided sampling. At the same time, during the sampling process, we design strict criteria for selecting positive-negative pairs. For training, we introduce a two-stage method consisting of Supervised Fine-Tuning (SFT) followed by Self-Evolved Direct Preference Optimization (DPO).
Test results on 10 datasets reveal the effectiveness of Tool-Light, significantly improving the efficiency and accuracy of the model in completing TIR tasks.
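Entropy-guided sampling relies on per-token predictive entropy along the reasoning chain; the helper below shows the standard computation (how the entropy is thresholded and where it steers sampling are assumptions beyond the abstract).

```python
import torch

def token_entropy(logits):
    """Per-token predictive entropy along a reasoning chain.

    logits : (T, V) next-token logits
    Returns: (T,) entropy in nats, higher means more uncertain positions.
    """
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)
```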
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1102. MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning
๐ฌ 20
โญ 6.00
๐ generative models
๐ท๏ธ Image Generation Test-Time Latent Reasoning
Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic.
We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.
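The test-time search can be pictured as a REINFORCE-style update over a distribution of discrete latent tokens, rewarded by an image-quality critic; the decoder, critic, and mean-reward baseline below are placeholders rather than the paper's components.

```python
import torch

def latent_policy_gradient_step(latent_logits, decode_image, critic, optimizer, n_samples=4):
    """One test-time policy-gradient step over a latent token distribution.

    latent_logits : (L, V) learnable logits over discrete latent tokens
    decode_image  : callable mapping sampled token ids (L,) -> image tensor
    critic        : callable mapping an image -> scalar quality reward tensor
    """
    dist = torch.distributions.Categorical(logits=latent_logits)
    rewards, log_probs = [], []
    for _ in range(n_samples):
        tokens = dist.sample()
        with torch.no_grad():
            rewards.append(critic(decode_image(tokens)))
        log_probs.append(dist.log_prob(tokens).sum())
    rewards = torch.stack(rewards)
    baseline = rewards.mean()                      # simple variance reduction
    loss = -torch.stack([(r - baseline) * lp
                         for r, lp in zip(rewards, log_probs)]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Typical use: optimizer = torch.optim.Adam([latent_logits], lr=0.1)
```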
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1103. SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
๐ฌ 27
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Model Evaluation Benchmark Compositional Spatial Intelligence
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple atomic spatial capabilities to handle complex and dynamic tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs. We will release the code and benchmark soon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1104. IA2: Alignment with ICL Activations improves Supervised Fine-Tuning
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ In Context Learning ICL Supervised Fine Tuning SFT Adaptation
Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt.
ICL can offer better generalizability and more calibrated responses compared to SFT in data-scarce settings, at the cost of more inference compute. In this work, we ask the question: \textit{Can ICL's internal computations be used to improve the qualities of SFT?} We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.
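A hedged sketch of an activation-alignment objective, assuming a Hugging Face-style model that exposes hidden states and that the query tokens occupy the final positions of the ICL prompt; the layer selection and MSE distance are assumptions.

```python
import torch
import torch.nn.functional as F

def icl_activation_alignment_loss(model, query_ids, icl_ids, layer_ids):
    """Pull query-token hidden states of a bare-query forward pass toward the
    hidden states produced when demonstrations are in context.

    query_ids : (B, Q) token ids of the bare query
    icl_ids   : (B, P) token ids of demonstrations followed by the same query
    layer_ids : which hidden-state layers to align
    """
    with torch.no_grad():
        teacher_out = model(icl_ids, output_hidden_states=True)   # ICL run (frozen target)
    student_out = model(query_ids, output_hidden_states=True)      # bare-query run

    q_len = query_ids.shape[1]
    loss = 0.0
    for l in layer_ids:
        teacher = teacher_out.hidden_states[l][:, -q_len:]         # query positions in ICL run
        student = student_out.hidden_states[l]
        loss = loss + F.mse_loss(student, teacher)
    return loss / len(layer_ids)
```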
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1105. Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba
๐ฌ 15
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Brain-inspired computing Mamba Fine-tuning
State Space Models (SSMs) have emerged as powerful alternatives to attention-based Transformers, with Mamba demonstrating impressive efficiency and scalability. As these models grow increasingly larger, the need for Parameter-Efficient Fine-Tuning (PEFT) methods becomes critical to adapt pre-trained Mamba to downstream tasks without prohibitive computational costs. However, previous approaches simply apply traditional Transformer-tailored PEFT methods without addressing the unique temporal processing dynamics of SSMs. To address this limitation, we propose ***Memba***, a membrane-driven PEFT approach specifically designed for Mamba. ***Memba*** introduces Leaky Integrate Membrane (LIM) neurons as bio-inspired gating mechanisms that naturally accumulate membrane potentials over time, enhancing selective information retention. By strategically combining LIM neurons with Low-Rank Adaptations (LoRA) and cross-layer membrane transfer, our approach significantly improves Mamba's temporal modeling capabilities. Extensive experiments across language and vision tasks demonstrate that ***Memba*** achieves substantial improvements over existing PEFT methods.
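One way to picture membrane-driven gating combined with LoRA is the sketch below: a leaky-integrate membrane potential accumulated over the sequence gates a low-rank update. The decay constant, rank, and gating form are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class LeakyMembraneGatedLoRA(nn.Module):
    """LoRA branch whose output is gated by a leaky-integrate membrane potential."""
    def __init__(self, dim, rank=8, decay=0.9):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)            # standard LoRA zero-init
        self.decay = decay

    def forward(self, x):                         # x: (B, T, D)
        delta = self.up(self.down(x))             # low-rank update
        membrane = torch.zeros_like(x[:, 0])
        gates = []
        for t in range(x.shape[1]):               # leaky integration over time
            membrane = self.decay * membrane + x[:, t]
            gates.append(torch.sigmoid(membrane))
        gate = torch.stack(gates, dim=1)          # (B, T, D)
        return x + gate * delta                   # membrane-gated residual
```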
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1106. NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
๐ฌ 5
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Omnimodal Multimodal Learning Discrete Flow Matching
Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal understanding and generation benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we will release training details, data protocols, and open-source both the code and model checkpoints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1107. The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ novel view synthesis feed-forward scaling behavior
Recent advances in feed-forward Novel View Synthesis (NVS) have led to a divergence between two design philosophies: bias-driven methods, which rely on explicit 3D knowledge, such as handcrafted 3D representations (e.g., NeRF and 3DGS) and camera poses annotated by Structure-from-Motion algorithms, and data-centric methods, which learn to understand 3D structure implicitly from large-scale imagery data. This raises a fundamental question: which paradigm is more scalable in an era of ever-increasing data availability? In this work, we conduct a comprehensive analysis of existing methods and uncover a critical trend that the performance of methods requiring less 3D knowledge accelerates more as training data increases, eventually outperforming their 3D knowledge-driven counterparts, which we term "the less you depend, the more you learn." Guided by this finding, we design a feed-forward NVS framework that removes both explicit scene structure and pose annotation reliance. By eliminating these dependencies, our method leverages great scalability, learning implicit 3D awareness directly from vast quantities of 2D images, without any pose information for training or inference. Extensive experiments demonstrate that our model achieves state-of-the-art NVS performance, even outperforming methods relying on posed training data. The results validate not only the effectiveness of our data-centric paradigm but also the power of our scalability finding as a guiding principle.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1108. Otters: An Energy-Efficient Spiking Transformer via Optical Time-to-First-Spike Encoding
๐ฌ 16
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Spiking neural network Energy efficient Time-to-First-Spike Encoding Optoelectronic Synapse
Spiking neural networks (SNNs) promise high energy efficiency, particularly with time-to-first-spike (TTFS) encoding, which maximizes sparsity by emitting at most one spike per neuron. However, such energy advantage is often unrealized because inference requires evaluating a temporal decay function and subsequent multiplication with the synaptic weights.
This paper challenges this costly approach by repurposing a physical hardware `bug', namely, the natural signal decay in optoelectronic devices, as the core computation of TTFS. We fabricated a custom indium oxide optoelectronic synapse, showing how its natural physical decay directly implements the required temporal function. By treating the device's analog output as the fused product of the synaptic weight and temporal decay, optoelectronic synaptic TTFS (named Otters) eliminates these expensive digital operations.
To use the Otters paradigm in complex architectures like the transformer, which are challenging to train directly due to the sparsity issue, we introduce a novel quantized neural network-to-SNN conversion algorithm.
This complete hardware-software co-design enables our model to achieve state-of-the-art accuracy across seven GLUE benchmark datasets and demonstrates a 1.77$\times$ improvement in energy efficiency over previous leading SNNs, based on a comprehensive analysis of compute, data movement, and memory access costs using energy measurements from a commercial 22nm process.
Our work thus establishes a new paradigm for energy-efficient SNNs, translating fundamental device physics directly into powerful computational primitives. All codes and data are open source.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1109. Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding
๐ฌ 24
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ AI for Science Unified foundation model Interpretable reasoning
Weather modeling requires both accurate prediction and mechanistic interpretation, yet existing methods treat these goals in isolation, separating generation from understanding. To address this gap, we present Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding within a single architecture.
Omni-Weather integrates a radar encoder for weather generation tasks, followed by unified processing using a shared self-attention mechanism.
Moreover, we construct a Chain-of-Thought dataset for causal reasoning in weather generation, enabling interpretable outputs and improved perceptual quality.
Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding.
Our findings further indicate that generative and understanding tasks in the weather domain can mutually enhance each other.
Omni-Weather also demonstrates the feasibility and value of unifying weather generation and understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1110. Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models
๐ฌ 15
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ multimodal language models interpretability spatial reasoning
Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1111. On the Generalization Capacities of MLLMs for Spatial Intelligence
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ 3D Computer Vision Multimodal Large Language Model Spatial Intelligence Embodied AI
Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these ``RGB-only'' approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show this leads MLLMs to overfit to the training camera distribution, rather than learning true and generalizable 3D geometric principles. To address this, we propose Camera-Aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially-grounded tasks, indicating that camera-awareness is not only beneficial but also a prerequisite for robust and generalizable spatial intelligence in MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1112. FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring
๐ฌ 23
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Motion-Deblurring Diffusion Model
Recent advancements in image motion deblurring, driven by CNNs and transformers, have made significant progress. Large-scale pre-trained diffusion models, which are rich in real-world modeling, have shown great promise for high-quality image restoration tasks such as deblurring, demonstrating stronger generative capabilities than CNN and transformer-based methods. However, challenges such as unbearable inference time and compromised fidelity still limit the full potential of the diffusion models. To address this, we introduce FideDiff, a novel single-step diffusion model designed for high-fidelity deblurring. We reformulate motion deblurring as a diffusion-like process where each timestep represents a progressively blurred image, and we train a consistency model that aligns all timesteps to the same clean image. By reconstructing training data with matched blur trajectories, the model learns temporal consistency, enabling accurate one-step deblurring. We further enhance model performance by integrating Kernel ControlNet for blur kernel estimation and introducing adaptive timestep prediction. Our model achieves superior performance on full-reference metrics, surpassing previous diffusion-based methods and matching the performance of other state-of-the-art models. FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration tasks, establishing a robust baseline for further advancing diffusion models in real-world industrial applications. Our dataset and code will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1113. Federated Learning with Profile Mapping under Distribution Shifts and Drifts
๐ฌ 22
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ federated learning privacy distribution drifts distribution shifts data heterogeneity efficiency
Federated Learning (FL) enables decentralized model training across clients without sharing raw data, but its performance degrades under real-world data heterogeneity. Existing methods often fail to address distribution shift across clients and distribution drift over time, or they rely on unrealistic assumptions such as a known number of client clusters and data heterogeneity types, which limits their generalizability. We introduce **Feroma**, a novel FL framework that explicitly handles both distribution shift and drift without relying on client or cluster identity. **Feroma** builds on client distribution profiles (compact, privacy-preserving representations of local data) that guide model aggregation and test-time model assignment through adaptive similarity-based weighting. This design allows **Feroma** to dynamically select aggregation strategies during training, ranging from clustered to personalized, and to deploy suitable models to unseen and unlabeled test clients without retraining, online adaptation, or prior knowledge of clients' data. Extensive experiments show that, compared to 10 state-of-the-art methods, **Feroma** improves performance and stability under dynamic data heterogeneity conditions, with an average accuracy gain of up to 12 percentage points over the best baselines across 6 benchmarks, while maintaining computational and communication overhead comparable to FedAvg. These results highlight that distribution-profile-based aggregation offers a practical path toward robust FL under both data distribution shifts and drifts.
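A minimal sketch of profile-similarity aggregation in the spirit of the abstract; the cosine similarity, softmax temperature, and function names are illustrative assumptions rather than Feroma's actual weighting rule.

```python
# Hedged sketch: weight each client's update by the similarity of its
# distribution profile to a target profile (illustrative, not Feroma's rule).
import numpy as np

def aggregate_by_profile(updates, profiles, target_profile, temp=0.1):
    """updates: list of flat parameter arrays; profiles: one profile vector per
    client; target_profile: profile of the client/test distribution to serve."""
    t = np.asarray(target_profile, dtype=float)
    sims = np.array([
        float(np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t) + 1e-12))
        for p in profiles
    ])
    w = np.exp(sims / temp)            # soft, similarity-based weights
    w /= w.sum()
    return sum(wi * np.asarray(u, dtype=float) for wi, u in zip(w, updates))
```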
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1114. Gauge-invariant representation holonomy
๐ฌ 23
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Representation learning Gauge invariance Holonomy Geometric deep learning Robustness
Deep networks learn internal representations whose geometry (how features bend, rotate, and evolve) affects both generalization and robustness. Existing similarity measures such as CKA or SVCCA capture pointwise overlap between activation sets, but miss how representations change along input paths. Two models may appear nearly identical under these metrics yet respond very differently to perturbations or adversarial stress. We introduce representation holonomy, a gauge-invariant statistic that measures this path dependence. Conceptually, holonomy quantifies the "twist" accumulated when features are parallel-transported around a small loop in input space: flat representations yield zero holonomy, while nonzero values reveal hidden curvature. Our estimator fixes gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back to the full feature space. We prove invariance to orthogonal (and affine, post-whitening) transformations, establish a linear null for affine layers, and show that holonomy vanishes at small radii. Empirically, holonomy scales with loop radius and depth, separates models that appear similar under CKA, and correlates with adversarial and corruption robustness across training regimes. It also tracks training dynamics as features form and stabilize. Together, these results position representation holonomy as a practical and scalable diagnostic for probing the geometric structure of learned representations beyond pointwise similarity.
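A minimal numpy sketch of the loop-holonomy recipe the abstract outlines: whiten features to fix the gauge, align consecutive neighborhoods with rotation-only Procrustes, compose the rotations around a closed loop, and summarize the result. The $\|H - I\|_F$ summary and the paired-sample alignment are illustrative assumptions, not the paper's exact estimator.

```python
# Hedged sketch of loop holonomy over whitened features (assumptions noted above).
import numpy as np

def whiten(X, eps=1e-6):
    """Zero-mean, roughly identity-covariance features (fixes the gauge)."""
    X = X - X.mean(axis=0, keepdims=True)
    cov = X.T @ X / max(len(X) - 1, 1)
    vals, vecs = np.linalg.eigh(cov)
    return X @ vecs @ np.diag(1.0 / np.sqrt(np.clip(vals, 0.0, None) + eps)) @ vecs.T

def rotation(A, B):
    """Rotation-only Procrustes alignment of A onto B (paired rows)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    if np.linalg.det(R) < 0:          # enforce a proper rotation (no reflection)
        U[:, -1] *= -1
        R = U @ Vt
    return R

def loop_holonomy(feature_batches):
    """feature_batches: activations at loop points, first point repeated last.
    Returns ||H - I||_F; zero indicates a flat (path-independent) representation."""
    batches = [whiten(F) for F in feature_batches]
    H = np.eye(batches[0].shape[1])
    for A, B in zip(batches[:-1], batches[1:]):
        H = rotation(A, B) @ H        # compose transports along the loop
    return float(np.linalg.norm(H - np.eye(H.shape[0])))
```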
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1115. Pixel to Gaussian: Ultra-Fast Continuous Super-Resolution with 2D Gaussian Modeling
๐ฌ 21
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Continuous Super-Resolution; 2DGS; Fast Model
Arbitrary-scale super-resolution (ASSR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs with arbitrary upsampling factors using a single model, addressing the limitations of traditional SR methods constrained to fixed-scale factors (\textit{e.g.}, $\times$ 2). Recent advances leveraging implicit neural representation (INR) have achieved great progress by modeling coordinate-to-pixel mappings. However, the efficiency of these methods may suffer from repeated upsampling and decoding, while their reconstruction fidelity and quality are constrained by the intrinsic representational limitations of coordinate-based functions. To address these challenges, we propose a novel ContinuousSR framework with a Pixel-to-Gaussian paradigm, which explicitly reconstructs 2D continuous HR signals from LR images using Gaussian Splatting. This approach eliminates the need for time-consuming upsampling and decoding, enabling extremely fast ASSR. Once the Gaussian field is built in a single pass, ContinuousSR can perform arbitrary-scale rendering in just 1ms per scale. Our method introduces several key innovations. Through statistical analysis, we uncover the Deep Gaussian Prior (DGP) and propose DGP-Driven Covariance Weighting, which dynamically optimizes covariance via adaptive weighting. Additionally, we present Adaptive Position Drifting, which refines the positional distribution of the Gaussian space based on image content, further enhancing reconstruction quality. Extensive experiments on seven benchmarks demonstrate that our ContinuousSR delivers significant improvements in SR quality across all scales, with an impressive 19.5$\times$ speedup when continuously upsampling an image across forty scales.
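A hedged numpy sketch of the Pixel-to-Gaussian idea: once a 2D Gaussian field (means, covariances, colors) has been predicted from the LR image, a pixel at any continuous coordinate is a weighted blend of Gaussian responses, so rendering a new scale only means querying a denser grid. The normalized-weight blending and parameter names are assumptions, not the paper's renderer.

```python
# Hedged sketch: evaluate one continuous-coordinate pixel from a 2D Gaussian field.
import numpy as np

def render_pixel(xy, means, covs, colors):
    """xy: (2,) query coordinate; means: (n, 2); covs: (n, 2, 2); colors: (n, 3)."""
    out, norm = np.zeros(3), 0.0
    for mu, cov, col in zip(means, covs, colors):
        d = xy - mu
        w = float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))  # Gaussian response
        out += w * col
        norm += w
    return out / max(norm, 1e-8)
```

Rendering at $\times$2, $\times$3.7, or any other factor then amounts to evaluating this function over correspondingly denser coordinate grids on the same Gaussian field.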
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1116. Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion-LLM Quantization
Diffusion large language models (dLLMs), which offer bidirectional context and flexible masked-denoising generation, are emerging as a compelling alternative to autoregressive (AR) LLMs. However, like AR LLMs, their model sizes continue to grow, motivating weight compression for deployment. Although post-training quantization (PTQ) is effective for AR LLMs, directly transferring it to dLLMs at 2-bit leads to unsatisfactory performance. To tackle these challenges, we propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Since masked-denoising activations in dLLMs differ from the fully visible signals assumed by standard PTQ methods, we introduce Masked Calibration Simulation (MCS) to align calibration with the timestep-dependent masking, which yields more reliable calibrations. Moreover, we propose a Data-aware Any-order Quantizer (DAQ) that learns ultra-low-bit weight representations via an optimization algorithm. It performs iterative approximation guided by our simulated calibration data. In addition, under a strict 2-bit budget, we introduce Adaptive Blockwise Mixed Precision (ABMP), a sensitivity-based precision allocation scheme that adaptively assigns bit width across channel groups. When restricted to 2-bit precision, Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) AR-transfer PTQ methods on dLLMs. We will release the code and models soon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1117. Fast-dLLM v2: Efficient Block-Diffusion LLM
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion LLM Efficient AI
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only $\sim$1B tokens of fine-tuning. This represents a 500$\times$ reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5$\times$ speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs, marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1118. Hallucination Begins Where Saliency Drops
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LVLMs-Saliency; Saliency-Guided Rejection Sampling; Local Coherence Reinforcement; Hallucination
Recent studies have investigated attention dynamics in large vision language models (LVLMs), yet existing methods remain limited in reliably distinguishing hallucinated from correct outputs, primarily because they rely solely on forward-pass attention, ignoring gradient-based signals that reveal how token influence propagates through the model. To bridge this gap, we introduce \textbf{LVLMs-Saliency}, a \textit{gradient-aware diagnostic tool} that quantifies the grounding strength of each output token by fusing attention weights with their gradients. Through analysis, we identify a decisive pattern: \textit{Hallucinations occur when prior output tokens show low saliency to the next token prediction}, indicating a failure of contextual memory. Building on this insight, we propose a dual-mechanism inference-time framework: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during decoding by rejecting those with saliency below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight plug-and-play module that strengthens attention from the current token to its most recent outputs, actively counteracting the "forgetting" behavior identified by LVLMs-Saliency. Experimental results demonstrate that our method significantly reduces hallucinations across multiple LVLMs, offering a robust and interpretable solution to improve model reliability.
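A hedged PyTorch sketch of the gradient-aware saliency the abstract describes (attention fused with its gradient) and the rejection rule built on it; the elementwise $|a \cdot \partial s/\partial a|$ fusion and the adaptive threshold are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of gradient-aware token saliency and saliency-guided rejection.
import torch

def token_saliency(attn_to_prior, next_token_score):
    """attn_to_prior: 1D attention weights from the current step to prior output
    tokens (a non-leaf tensor kept in the autograd graph); next_token_score:
    scalar logit/log-prob of the candidate token. Returns one score per token."""
    (grad,) = torch.autograd.grad(next_token_score, attn_to_prior, retain_graph=True)
    return (attn_to_prior * grad).abs()   # fuse attention with its gradient

def accept_candidate(saliency, window=8, ratio=0.5):
    """Accept only if saliency over the most recent outputs stays above a
    context-adaptive fraction of the mean saliency over all prior tokens."""
    recent = saliency[-window:].mean()
    return bool(recent >= ratio * saliency.mean())
```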
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1119. RL's Razor: Why Online Reinforcement Learning Forgets Less
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Large Language Models Catastrophic Forgetting
Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle $\textit{RL's Razor}$: among all ways to solve a new task, RL prefers those closest in KL to the original model.
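A minimal sketch of the quantity the abstract centers on: the KL divergence between the fine-tuned and base policies evaluated on the new task. Hugging Face-style causal LMs and prompt positions as a cheap stand-in for on-policy samples are assumptions.

```python
# Hedged sketch: estimate token-level KL(pi_ft || pi_base) on new-task prompts.
import torch
import torch.nn.functional as F

@torch.no_grad()
def new_task_kl(ft_model, base_model, tokenizer, prompts, device="cpu"):
    """Average per-token KL between fine-tuned and base next-token distributions."""
    total, count = 0.0, 0
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids.to(device)
        logp_ft = F.log_softmax(ft_model(ids).logits, dim=-1)
        logp_base = F.log_softmax(base_model(ids).logits, dim=-1)
        kl = (logp_ft.exp() * (logp_ft - logp_base)).sum(-1)   # KL per position
        total += kl.sum().item()
        count += kl.numel()
    return total / max(count, 1)
```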
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1120. EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Hand Reconstruction Egocentric Vision In-Context Learning
Robust 3D hand reconstruction is challenging in egocentric vision due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior works attempt to mitigate the challenges by scaling up training data or incorporating auxiliary cues, often falling short of effectively handling unseen contexts. In this paper, we introduce EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that achieves strong semantic alignment, visual consistency, and robustness under challenging egocentric conditions. Specifically, we develop (i) complementary exemplar retrieval strategies guided by vision-language models (VLMs), (ii) an ICL-tailored tokenizer that integrates multimodal context, and (iii) a Masked Autoencoders (MAE)-based architecture trained with 3D hand-guided geometric and perceptual objectives. By conducting comprehensive experiments on the ARCTIC and EgoExo4D benchmarks, our EgoHandICL consistently demonstrates significant improvements over state-of-the-art 3D hand reconstruction methods. We further show EgoHandICL's applicability by testing it on real-world egocentric cases and integrating it with EgoVLMs to enhance their hand-object interaction reasoning. Our code and data will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1121. Learning Massively Multitask World Models for Continuous Control
๐ฌ 17
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning world models continuous control
General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimens, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL) we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints. Website: https://newt-world-models.github.io
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1122. Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization
๐ฌ 7
โญ 6.00
๐ generative models
๐ท๏ธ Generative Model Image Quantization Autoregressive Modeling Image Generation Image Synthesis
Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage.
In this work, we analyze these fundamental challenges and provide a simple yet effective solution.
To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress-process-recover pipeline, enabling stable and effective codebook training.
By combining VQBridge with learning annealing, our VQN achieves full (100\%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ).
Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100\% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1123. Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
๐ฌ 18
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models; Efficient Training; Low-Rank; LoRA
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency.
In this work, we re-frame the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow.
Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training.
Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency.
We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters.
LoRA-Pre achieves the highest performance across all model sizes.
Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods.
Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios.
With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines.
Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach's effectiveness across both pre-training and fine-tuning paradigms.
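To make the reframing concrete: the EMA update $m_t = \beta m_{t-1} + (1-\beta) g_t$ is exactly one online gradient step on the regression loss $\tfrac{1}{2}\lVert m - g_t\rVert^2$ with step size $1-\beta$. The sketch below keeps that regressor in a rank-$r$ subspace; the fixed random projector is an illustrative stand-in for LoRA-Pre's learned low-rank decomposition.

```python
# Hedged sketch: low-rank momentum kept entirely in an r-dimensional subspace.
import torch

class LowRankMomentum:
    def __init__(self, shape, rank, beta=0.9, seed=0):
        d_out, d_in = shape
        g = torch.Generator().manual_seed(seed)
        self.P = torch.randn(d_in, rank, generator=g) / rank ** 0.5  # fixed projector (assumption)
        self.m = torch.zeros(d_out, rank)                            # compact EMA state
        self.beta = beta

    def update(self, grad):
        """One EMA step = one online gradient step on 0.5 * ||m - (grad @ P)||^2."""
        self.m = self.beta * self.m + (1 - self.beta) * (grad @ self.P)

    def direction(self):
        """Lift the compact momentum back to the full parameter shape."""
        return self.m @ self.P.T
```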
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1124. An Open-Ended Benchmark and Formal Framework for Adjuvant Research with MLLM
๐ฌ 24
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Adjuvant Scientific Benchmarks Multimodal Large Language Model
Adjuvants play a critical role in modulating immune responses and are central to the development of vaccines and immunotherapies. Yet progress in this field is constrained by data scarcity and incomplete understanding of mechanisms of action, which limit the transition from experience-based design to AI-driven approaches. To address these challenges, we present the first benchmark dedicated to adjuvants, constructed in an open-ended Q\&A format and annotated by domain experts. The benchmark comprises 1,294 Q\&A pairs and 1,364 formal descriptions, providing a resource for evaluating general-purpose multimodal large language models (MLLMs) and for developing domain-specific systems.
We systematically assess 11 closed-source and 18 open-source MLLMs across dimensions including domain-specific Q\&A, hallucination rejection, data generation, and instruction following. Results indicate that OpenAI-o1 (STS = 0.7495, LLM Score = 7.7) and DeepSeek-R1 (STS = 0.7415, LLM Score = 7.7) achieved the strongest performance among closed- and open-source models, respectively. In addition, we introduce a formal description framework for representing adjuvant design principles and immune mechanisms as structured abstractions, which can serve as building blocks for future domain-specialized MLLMs. Overall, this work provides a first step toward systematically integrating MLLMs into adjuvant research by offering a dedicated benchmark, comparative evaluation of existing models, and a formal foundation for future development. Data and code will be released at \href{https://anonymous.4open.science/status/Advancing-Adjuvants-1C2B}{Anonymous}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1125. SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
๐ฌ 28
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Models Spatial Reasoning
Humans can imagine and manipulate visual images mentally, a capability known as \textit{spatial visualization}.
While many multi-modal benchmarks assess reasoning on visible visual information, the ability to infer unseen relationships through spatial visualization remains insufficiently evaluated as a spatial skill.
Moreover, existing evaluations often rely on publicly sourced problems from IQ tests or math competitions, which risks data contamination and compromises assessment reliability.
To this end, we introduce \textbf{\textit{SpatialViz-Bench}}, a comprehensive multi-modal benchmark for \textit{spatial visualization} with \emph{12} tasks across \emph{4} sub-abilities, comprising \emph{1,180} programmatically generated problems within a scalable framework that can be expanded to ensure fair and continuously reliable evaluation.
Our evaluation of \emph{27} Multi-modal Large Language Models (MLLMs) reveals wide performance variations, demonstrates the benchmark's strong discriminative power, and uncovers counter-intuitive findings: Chain-of-Thought (CoT) prompting paradoxically degrades accuracy on open-source models.
Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs exhibit deficiencies in \textit{spatial visualization} tasks, thereby addressing a significant lacuna in the field.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1126. PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives
๐ฌ 23
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Political Consensus Finding Collective Decision Making Benchmark on Societal Considerations
Achieving political consensus is crucial yet challenging for the effective functioning of social governance. However, although frontier AI systems represented by large language models (LLMs) have developed rapidly in recent years, their capabilities in this area are still understudied. In this paper, we introduce PoliCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament over 13 years, ranging from 2009 to 2022, to evaluate the ability of LLMs to draft consensus resolutions based on divergent party positions under varying collective decision-making contexts and political requirements. Specifically, PoliCon incorporates four factors to build each task environment for finding different political consensus: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also developed an evaluation framework based on social choice theory for PoliCon, which simulates the real voting outcomes of different political parties to assess whether LLM-generated resolutions meet the requirements of the predetermined political consensus. Our experimental results demonstrate that even state-of-the-art models still fall short on complex tasks like passing resolutions by a two-thirds majority and addressing security issues, while uncovering their inherent partisan biases and revealing behaviors LLMs adopt to achieve consensus, such as prioritizing the stance of the dominant party instead of uniting smaller parties, which highlights PoliCon's promise as an effective platform for studying LLMs' ability to promote political consensus.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1127. SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ neural rendering 3d gaussians 3d vision
Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20$\times$ faster than ray tracing approaches and 1.5-14$\times$ faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1128. ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
๐ฌ 25
โญ 6.00
๐ generative models
๐ท๏ธ Large Reasoning Models KV Cache Compression Quantization Eviction Sparsity Thought-Aware Compression
The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over SoTA baselines.
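A minimal sketch of the thought-adaptive budgeting idea: score chain-of-thought segments by received attention mass, evict the least important ones, and give more precision to the important survivors. The thresholds and bit widths are illustrative assumptions, not ThinKV's policy.

```python
# Hedged sketch: plan per-thought KV precision and eviction from importance scores.
import numpy as np

def plan_kv_budget(thought_scores, keep_fraction=0.05):
    """thought_scores: 1D attention-mass score per thought segment.
    Returns (bits_per_thought, evict_mask)."""
    scores = np.asarray(thought_scores, dtype=float)
    n_keep = max(1, int(np.ceil(keep_fraction * len(scores))))
    order = np.argsort(-scores)
    evict = np.ones(len(scores), dtype=bool)
    evict[order[:n_keep]] = False           # keep only the most important thoughts
    bits = np.full(len(scores), 4)          # default low precision (assumption)
    bits[scores >= np.median(scores)] = 8   # important thoughts keep more bits
    bits[evict] = 0                         # evicted thoughts free their slots
    return bits, evict
```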
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1129. Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
๐ฌ 12
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM finetuning safety steganography
Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI fine-tuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on two open-source models, Phi-4 and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all three models, all stegotexts containing malicious content are incorrectly classified as safe.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1130. BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
๐ฌ 3
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ biosignal ai for healthcare humans and ai unsupervised cross-modal knowledge transfer
Biosignals offer valuable insights into the physiological states of the human body. Although biosignal modalities differ in functionality, signal fidelity, sensor comfort, and cost, they are often intercorrelated, reflecting the holistic and interconnected nature of human physiology. This opens up the possibility of performing the same tasks using alternative biosignal modalities, thereby improving the accessibility, usability, and adaptability of health monitoring systems. However, the limited availability of large labeled datasets presents challenges for training models tailored to specific tasks and modalities of interest. Unsupervised cross-modal knowledge transfer offers a promising solution by leveraging knowledge from an existing modality to support model training for a new modality. Existing methods are typically based on knowledge distillation, which requires running a teacher model alongside student model training, resulting in high computational and memory overhead. This challenge is further exacerbated by the recent development of foundation models that demonstrate superior performance and generalization across tasks at the cost of large model sizes. To this end, we explore a new framework for unsupervised cross-modal knowledge transfer of biosignals by training a lightweight bridge network to align the intermediate representations and enable information flow between foundation models and across modalities. Specifically, we introduce an efficient strategy for selecting alignment positions where the bridge should be constructed, along with a flexible prototype network as the bridge architecture. Extensive experiments across multiple biosignal modalities, tasks, and datasets show that BioX-Bridge reduces the number of trainable parameters by 88-99\% while maintaining or even improving transfer performance compared to state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1131. FlashWorld: High-quality 3D Scene Generation within Seconds
๐ฌ 21
โญ 6.00
๐ generative models
๐ท๏ธ 3D Scene Generation Multi-view Diffusion Models World Models Distribution Matching Distillation
We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, $10 \sim 100\times$ faster than previous works while possessing superior rendering quality.
Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation.
While it ensures 3D consistency, the 3D-oriented approach typically suffers from poor visual quality.
FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms.
Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes.
To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode.
This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference.
Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs.
Extensive experiments demonstrate the superiority and efficiency of our method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1132. Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Large Language Models Multimodal Reasoning Reinforcement Learning
Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities in MLLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting their further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately, thereby effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which will serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by the MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal math and general benchmarks demonstrate the effectiveness and robustness of our Perception-R1, which achieves superior performance on all benchmarks using only 1,442 training samples.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1133. Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
๐ฌ 20
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ model steering mechanistic interpretability
Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs.
We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences.
To this end, we build on the principles of *distributed alignment search (DAS)*, the standard for causal variable localization, to propose a new steering method: **Concept DAS (CDAS)**.
While we adopt the core mechanism of DAS, *distributed interchange intervention (DII)*, we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions.
CDAS differs from prior work in two main ways:
first, it learns interventions via weakly supervised distribution matching rather than probability maximization;
second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control.
On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale.
In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility.
These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1134. CoAct-1: Computer-using Multi-agent System with Coding Actions
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computer-using Agent Multi-agent System LLM Agent
Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still utilizing visual interaction when necessary. We evaluate our system on the challenging OSWorld and WindowsAgentArena benchmarks, where CoAct-1 achieves a new state-of-the-art success rate of 60.8% on OSWorld and 52.5% on WindowsAgentArena, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15 on OSWorld, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1135. A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction
๐ฌ 14
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Signed Graph Representation Learning Graph Neural Network Gaussian Copula Link Sign Prediction Linear Convergence
Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN (Ma et al., 2021). However, a naive modeling of edge-edge relations is computationally intractable even for a graph with moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines, maintaining competitive prediction performance to the state-of-the-art models.
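A hedged sketch of the Gramian parameterization the abstract describes: instead of storing a dense $m \times m$ edge-edge correlation matrix, keep an $m \times r$ edge embedding and form correlations on demand; the row normalization used to keep a unit diagonal is an illustrative choice, not the paper's exact model.

```python
# Hedged sketch: correlation matrix as the Gramian of normalized edge embeddings.
import torch
import torch.nn.functional as F

def correlation_row(edge_emb, i):
    """Row i of the implicit correlation matrix in O(m * r) time and memory."""
    Z = F.normalize(edge_emb, dim=1)      # unit-norm rows -> unit diagonal
    return Z @ Z[i]

def full_correlation(edge_emb):
    """Materialize the full Gramian (for small graphs / sanity checks only)."""
    Z = F.normalize(edge_emb, dim=1)
    return Z @ Z.T
```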
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1136. AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
๐ฌ 20
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Applications Robots Vision-Language-Action Models
Vision-Language-Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R², a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR²-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results across both the nuScenes and Waymo datasets demonstrate the state-of-the-art performance and robust generalization capacity of our proposed method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1137. Empowering Multi-Robot Cooperation via Sequential World Models
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Model-based Reinforcement Learning Multi-Agent Reinforcement Learning Multi-Robot Cooperation
Model-based reinforcement learning (MBRL) has shown significant potential in robotics due to its high sample efficiency and planning capability. However, extending MBRL to multi-robot cooperation remains challenging due to the complexity of joint dynamics and the reliance on synchronous communication. We propose SeqWM, which employs independent, autoregressive agent-wise world models to represent joint dynamics, where each agent generates its future trajectory and plans its actions based on the predictions of its predecessors. This design lowers modeling complexity, alleviates the reliance on communication synchronization, and enables the emergence of advanced cooperative behaviors through explicit intention sharing. Experiments in challenging simulated environments (Bi-DexHands and Multi-Quad) demonstrate that SeqWM outperforms existing state-of-the-art model-based and model-free baselines in both overall performance and sample efficiency, while exhibiting advanced cooperative behaviors such as predictive adaptation, temporal alignment, and role division. Furthermore, SeqWM has been successfully deployed on physical quadruped robots, demonstrating its effectiveness in real-world multi-robot systems. Demos and code are available at: https://sites.google.com/view/seqwm-marl
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1138. MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
๐ฌ 14
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ benchmark crosslingual multimodal instruction-following speech video
Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs.
However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations--hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. This parallel design enables a systematic evaluation of MLLMs' abilities to interpret instructions across languages and effectively integrate multimodal contextual information. Our benchmarking and analysis of 23 models highlight universal challenges across modalities and tasks, indicating substantial room for improvement in future MLLMs development. MCIF is released under CC-BY 4.0 license to promote open research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1139. SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training
๐ฌ 22
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Data selection; Submodular; Log-determinant Fisher information; Instruction tuning
Information-based data selection for instruction tuning is compelling: maximizing the log-determinant of the Fisher information yields a monotone submodular objective, enabling greedy algorithms to achieve a $(1-1/e)$ approximation under a cardinality budget. In practice, however, we identify that alleviating gradient conflicts (misalignment between per-sample gradients) is a key factor in slowing the decay of marginal log-determinant information gains, thereby preventing significant loss of information. We formalize this via an $\varepsilon$-decomposition that quantifies the deviation from ideal submodularity as a function of conflict statistics, yielding data-dependent approximation factors that tighten as conflicts diminish. Guided by this analysis, we propose SPICE, a conflict-aware selector that maximizes information while penalizing misalignment, and that supports early stopping and proxy models for efficiency. Empirically, SPICE selects subsets with higher log-determinant information than the original criteria, and these informational gains translate into performance improvements: across 8 benchmarks with LLaMA2-7B and Qwen2-7B, SPICE uses only 10% of the data, yet matches or exceeds 6 methods including full-data tuning, at substantially lower training cost.
Code is available at https://anonymous.4open.science/r/SPICE-6DF7/README.md.
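A hedged sketch (not the paper's exact objective) of conflict-aware greedy selection: pick samples by the marginal log-determinant gain of a gradient-feature Gram matrix, penalized by gradient conflict with the already-selected set; the penalty weight and ridge term are illustrative assumptions.

```python
# Hedged sketch: greedy log-determinant selection with a gradient-conflict penalty.
import numpy as np

def _logdet_gram(S, ridge):
    if len(S) == 0:
        return 0.0
    _, ld = np.linalg.slogdet(S @ S.T + ridge * np.eye(len(S)))
    return ld

def spice_like_select(grad_feats, budget, lam=1.0, ridge=1e-3):
    """grad_feats: (n, d) per-sample gradient features; returns selected indices."""
    G = grad_feats / (np.linalg.norm(grad_feats, axis=1, keepdims=True) + 1e-12)
    selected = []
    for _ in range(budget):
        base = G[selected]
        base_ld = _logdet_gram(base, ridge)
        best_i, best_score = None, -np.inf
        for i in range(len(G)):
            if i in selected:
                continue
            gain = _logdet_gram(np.vstack([base, G[i:i + 1]]), ridge) - base_ld
            conflict = float(np.clip(-(base @ G[i]), 0.0, None).sum())  # 0 if base empty
            score = gain - lam * conflict
            if score > best_score:
                best_score, best_i = score, i
        selected.append(best_i)
    return selected
```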
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1140. Relative Entropy Pathwise Policy Optimization
๐ฌ 18
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning parallel simulation value function ppo policy gradients policy optimization
Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e., differentiating the objective function directly through the sampled actions, alleviates these variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to the state of the art on two standard GPU-parallelized benchmarks, REPPO provides strong empirical performance at superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.
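A minimal PyTorch sketch of the kind of update the abstract describes: a pathwise (reparameterized) policy gradient through a learned Q-function combined with a relative-entropy penalty to the previous policy. The Gaussian policy parameterization and the fixed KL weight are illustrative assumptions, not REPPO's exact algorithm.

```python
# Hedged sketch: pathwise policy loss with a relative-entropy penalty.
import torch

def pathwise_kl_loss(policy, old_policy, q_net, states, kl_coef=0.1):
    """policy/old_policy map states to (mean, log_std); q_net(s, a) -> value."""
    mean, log_std = policy(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    actions = dist.rsample()                      # reparameterized: gradients flow via actions
    q = q_net(states, actions)
    with torch.no_grad():
        old_mean, old_log_std = old_policy(states)
    old_dist = torch.distributions.Normal(old_mean, old_log_std.exp())
    kl = torch.distributions.kl_divergence(old_dist, dist).sum(-1)  # per-state relative entropy
    return (-q.reshape(-1) + kl_coef * kl).mean()
```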
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1141. DAMR: Efficient and Adaptive Context-Aware Knowledge Graph Question Answering with LLM-Guided MCTS
๐ฌ 15
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Knowledge Graphs Question Answering LLMs
Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Existing methods primarily follow either the retrieve-then-reason paradigm, which relies on Graph Neural Networks (GNNs) or heuristic rules to extract static candidate paths, or dynamic path generation strategies that employ large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former lacks adaptability due to static path extraction and the absence of contextual refinement, while the latter suffers from high computational costs and limited evaluation accuracy because of their dependence on fixed scoring functions and repeated LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates LLM-guided Monte Carlo Tree Search (MCTS) with adaptive path evaluation to enable efficient and context-aware KGQA. DAMR leverages MCTS as a backbone, where an LLM-based planner selects the top-$k$ semantically relevant relations at each expansion step to effectively reduce the search space. To enhance evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, thereby capturing fine-grained semantic shifts during multi-hop reasoning. Furthermore, to mitigate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, enabling the scorer to continually adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1142. What Lies Beyond the View? Actively Constructing Spatial Beliefs in Foundation Models
๐ฌ 20
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Large Language Model Vision-Language Model Spatial Reasoning Spatial Agent Active Exploration
Current foundation models can answer spatial reasoning questions about a given image or text, yet they lack the fundamental ability to build a genuine spatial understanding of an environment through active exploration. This reflects a critical blind spot in prevailing evaluation protocols, which predominantly test passive reasoning on curated data rather than the active construction of knowledge under uncertainty. To address this, we introduce Theory of Space (ToS), a new framework analogous to the Theory of Mind. While Theory of Mind concerns an agent's ability to model the hidden mental states of others, ToS concerns its ability to construct, update, and utilize an internal belief about the unobserved structure of its spatial environment from local, incomplete observations. We implement ToS with a comprehensive benchmark featuring both text-based and visual environments. Instead of performing specific tasks in such environments, the primary objective is to build a complete and accurate spatial belief through curiosity-driven exploration. A core innovation of our framework is the direct probing of this internal belief: we prompt models to explicitly present their cognitive map at each step, allowing us to measure not only task performance but also the quality, consistency, and evolution of the underlying spatial model itself. By evaluating state-of-the-art models as both active explorers and passive reasoners (using logs from scripted proxy agents), we disentangle exploration strategy from reasoning ability. Our analysis reveals common failure modes in spatial belief management, such as egomotion update errors and the inability to maintain a globally consistent map. The ToS framework provides the concepts and tools necessary to evaluate and build agents with more robust, human-like spatial intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1143. AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLM Emotion Recognition Multimodal Reasoning
Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models (MLLMs) have shown strong performance on this task, two key challenges remain: (i) spurious associations between emotions and irrelevant audiovisual cues and (ii) hallucination of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce **EmoReAlM**, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations and modality agreement. We then propose **AVEm-DPO**, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over (i) responses exhibiting spurious associations or hallucinations and (ii) audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS and EMER demonstrate that our method significantly improves the performance of the reference baseline models (6-19\% relative improvement) in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1144. OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
๐ฌ 16
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Model Merging Task Vector Data-Free Optimization
Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, no benchmark exists for model merging research that clearly divides the tasks of MLLM training and evaluation. In this paper, $(i)$ we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. $(ii)$ We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48\%. $(iii)$ We find that model merging offers a promising way for building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.
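For orientation, plain task-vector merging (the baseline this line of work builds on) reduces to adding a scaled average of expert-minus-base parameter deltas back onto the base model. The sketch below shows only that baseline, assuming PyTorch state dicts with matching keys; the paper's method additionally denoises the task vectors and optimizes the merged vector, which is not reproduced here.

```python
import torch

def merge_task_vectors(base_state, expert_states, alpha=0.5):
    """Task-arithmetic baseline: theta_merged = theta_base + alpha * mean_i(theta_i - theta_base)."""
    merged = {}
    for name, base_param in base_state.items():
        task_vecs = [expert[name] - base_param for expert in expert_states]
        merged[name] = base_param + alpha * torch.stack(task_vecs).mean(dim=0)
    return merged
```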
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1145. PreferThinker: Reasoning-based Personalized Image Preference Assessment
๐ฌ 20
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Image Preference Assessment; Multimodal Large Language Model; Chain-of-Thought
Personalized image preference assessment aims to evaluate an individual user's image preferences by relying only on a small set of reference images as prior information. Existing methods mainly focus on general preference assessment, training models with large-scale data to tackle well-defined tasks such as text-image alignment. However, these approaches struggle to handle personalized preference because user-specific data are scarce and not easily scalable, and individual tastes are often diverse and complex. To overcome these challenges, we introduce a common preference profile that serves as a bridge across users, allowing large-scale user data to be leveraged for training profile prediction and capturing complex personalized preferences. Building on this idea, we propose a reasoning-based personalized image preference assessment framework that follows a \textit{predict-then-assess} paradigm: it first predicts a user's preference profile from reference images, and then provides interpretable, multi-dimensional scores and assessments of candidate images based on the predicted profile. To support this, we first construct a large-scale Chain-of-Thought (CoT)-style personalized assessment dataset annotated with diverse user preference profiles and high-quality CoT-style reasoning, enabling explicit supervision of structured reasoning. Next, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase to empower the model with structured reasoning capabilities, followed by reinforcement learning to incentivize the model to explore more reasonable assessment paths and enhance generalization. Furthermore, we propose a similarity-aware prediction reward to encourage better prediction of the user's preference profile, which facilitates the exploration of more reasonable assessments. Extensive experiments demonstrate the superiority of the proposed method. Our code and dataset will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1146. FACM: Flow-Anchored Consistency Models
๐ฌ 15
โญ 6.00
๐ generative models
๐ท๏ธ Image Generation Consistency Model
Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: Training the network exclusively on a shortcut objective leads to the catastrophic forgetting of the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow, ensuring high trajectory fidelity during training. We introduce the Flow-Anchored Consistency Model (FACM), where a Flow Matching (FM) task serves as a dynamic anchor for the primary CM shortcut objective. Key to this Flow-Anchoring approach is a novel expanded time interval strategy that unifies optimization for a single model while decoupling the two tasks to ensure stable, architecturally-agnostic training. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.70 with just one step (NFE=1) on ImageNet 256x256. To address the challenge of scalability, we develop a memory-efficient Chain-JVP that resolves key incompatibilities with FSDP. This method allows us to scale FACM training on a 14B parameter model (Wan 2.2), accelerating its Text-to-Image inference from 2x40 to 2-8 steps. Our code and pretrained models will be available to the public.
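The FM "anchor" task is standard conditional flow matching on a straight noise-to-data path. The sketch below shows that anchor loss alone under the usual linear-interpolation parameterization; the consistency shortcut objective and the paper's expanded time-interval strategy are omitted, and `model` is a placeholder velocity network.

```python
import torch

def flow_matching_loss(model, x1):
    """Conditional flow matching: on x_t = (1-t)*x0 + t*x1 the target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)                               # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))                # broadcast over non-batch dims
    xt = (1 - t_) * x0 + t_ * x1                            # point on the straight path
    v_target = x1 - x0                                      # instantaneous velocity
    v_pred = model(xt, t)
    return ((v_pred - v_target) ** 2).mean()
```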
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1147. Aegis: Automated Error Generation and Identification for Multi-Agent Systems
๐ฌ 12
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multi-Agent Systems; Failure attribution; Automated data generation; Learning
Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce *Aegis*, a novel framework for **A**utomated **e**rror **g**eneration and attr**i**bution for multi-agent **s**ystems. *Aegis* constructs a large dataset of **9,533** trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using a LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging fine-grained labels and the structured arrangement of positive-negative sample pairs, *Aegis* supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1148. AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
๐ฌ 4
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Agent Agentic System Failure Attribution
Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of \textbf{agentic system failure attribution}. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below $10\%$. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who\&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to $18.18\%$, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with $4.8\sim14.2\%$ performance gains, empowering self-correcting and self-evolving agentic AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1149. Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
๐ฌ 25
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Long Context Dense Attention Kernel Sparse Attention Kernel Context Parallel Mechanism Mask Pattern
Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context parallel training, which scale attention across multiple devices. However, systematic evaluation remains limited: operator-level comparisons are often incomplete, while context parallel strategies are typically framework-specific, with unclear performance analysis across contexts. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context parallel mechanisms with a modular and extensible interface for evaluation. The benchmark evaluates methods along two critical dimensions: (1) attention mask patterns, which strongly affect efficiency, scalability, and usability, and (2) sequence length and distributed scale, which determine performance under extreme long-context training. Through comprehensive experiments on a cluster of up to 96 GPUs, our benchmark enables reproducible comparisons, highlights method-specific trade-offs, and provides practical guidance for designing and deploying attention mechanisms in long-context LLM training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1150. Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Unified Model;Multi-Modal;Chain-of-Thought; Image Generation; Image Editing;
Chain-of-Thought (CoT) reasoning has proven effective in enhancing Large Language Models (LLMs) on complex tasks by decomposing problems into step-wise solutions. However, extending CoT to multi-modal settings remains challenging, as it requires modeling transitions of visual states alongside textual reasoning. Existing approaches often underperform due to limited capacity to model visual transitions or fragmented architectures. To overcome this limitation, we introduce Uni-CoT, a Unified Chain-of-Thought framework that captures structured visual transitions and seamlessly aligns them with textual logic, enabling coherent multimodal reasoning.
To mitigate the computational and training challenges inherent to multi-modal reasoning, Uni-CoT introduces a two-level reasoning paradigm: a macro-level CoT for high-level planning and a micro-level CoT for localized subtask execution. This hierarchical design reduces computational overhead while maintaining coherence. Additionally, Uni-CoT incorporates a structured training paradigm with auxiliary tasks to stabilize optimization and improve generalization. Experiments on reasoning-driven image generation and understanding benchmarks demonstrate that Uni-CoT achieves state-of-the-art performance and remarkable generalization, underscoring its potential for complex multi-modal reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1151. From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Native Vision-Language Models Vision-Language Primitive Holistic Vision-Language Buffer
The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, greatly narrowing the gap with top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLM development, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Code and weights will be publicly available to promote further research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1152. Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Efficient Vision-Language Models; Vision Token Pruning; Inference Acceleration; Visual Grounding
Vision token pruning has proven to be an effective acceleration technique for the Efficient Vision Language Model (VLM). However, existing pruning methods demonstrate excellent performance preservation in visual question answering (VQA) but suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that Nüwa achieves state-of-the-art performance on multiple VQA benchmarks (from 94\% to 95\%) and yields substantial improvements on visual grounding tasks (from 7\% to 47\%). Code is released.
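The second stage (text-guided pruning inside the LLM) can be pictured as keeping the vision tokens that receive the most attention from text tokens. The snippet below is a generic sketch of that scoring-and-gathering step with assumed tensor shapes; the paper's first-stage separation, alignment, and aggregation procedure is not shown.

```python
import torch

def text_guided_prune(vision_tokens, text_to_vision_attn, keep_ratio=0.25):
    """vision_tokens: (B, Nv, D); text_to_vision_attn: (B, Nt, Nv).
    Keep the keep_ratio fraction of vision tokens most attended by text tokens."""
    scores = text_to_vision_attn.mean(dim=1)                  # (B, Nv) relevance per vision token
    k = max(1, int(keep_ratio * vision_tokens.shape[1]))
    idx = scores.topk(k, dim=1).indices                       # indices of retained tokens
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1])
    return torch.gather(vision_tokens, 1, idx_exp)            # (B, k, D)
```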
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1153. Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective
๐ฌ 11
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Machine Learning Self-Supervised Learning Difficult Examples
Unsupervised contrastive learning has shown significant performance improvements in recent years, often approaching or even rivaling supervised learning in various tasks. However, its learning mechanism is fundamentally different from supervised learning. Previous works have shown that difficult examples (well-recognized in supervised learning as examples around the decision boundary), which are essential in supervised learning, contribute minimally in unsupervised settings. In this paper, perhaps surprisingly, we find that the direct removal of difficult examples, although it reduces the sample size, can boost the downstream classification performance of contrastive learning. To uncover the reasons behind this, we develop a theoretical framework modeling the similarity between different pairs of samples. Guided by this framework, we conduct a thorough theoretical analysis revealing that the presence of difficult examples negatively affects the generalization of contrastive learning. Furthermore, we demonstrate that the removal of these examples, and techniques such as margin tuning and temperature scaling, can enhance its generalization bounds, thereby improving performance.
Empirically, we propose a simple and efficient mechanism for selecting difficult examples and validate the effectiveness of the aforementioned methods, which substantiates the reliability of our proposed theoretical framework.
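As a rough illustration of the kind of selection mechanism involved, one simple proxy for "difficulty" is a sample's average cosine similarity to the rest of the dataset in embedding space; dropping the highest-scoring fraction then removes the hardest-to-separate examples. This is an assumption-laden stand-in, not the paper's selection rule.

```python
import torch
import torch.nn.functional as F

def drop_difficult_examples(embeddings, drop_ratio=0.1):
    """Score each sample by its mean similarity to all other samples and keep the easiest ones."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T
    sim.fill_diagonal_(0.0)
    difficulty = sim.mean(dim=1)                   # higher => harder to separate
    n_drop = int(drop_ratio * len(z))
    keep = difficulty.argsort()[: len(z) - n_drop] # indices of the retained (easier) examples
    return keep
```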
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1154. Pretraining with Re-parametrized Self-Attention: Unlocking Generalization in SNN-Based Neural Decoding Across Time, Brains, and Tasks
๐ฌ 36
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain-Machine Interface Neural Spike Decoding Spiking Neural Network Foundation Model
The emergence of large-scale neural activity datasets provides new opportunities to enhance the generalization of neural decoding models. However, it remains a practical challenge to design neural decoders for fully implantable brain-machine interfaces (iBMIs) that achieve high accuracy, strong generalization, and low computational cost, which are essential for reliable, long-term deployment under strict power and hardware constraints.
To address this, we propose the Re-parametrized self-Attention Spiking Neural Network (RAT SNN) with a cross-condition pretraining framework to integrate neural variability and adapt to stringent computational constraints.
Specifically, our approach introduces multi-timescale dynamic spiking neurons to capture the complex temporal variability of neural activity.
And we refine spike-driven attention within a lightweight, re-parameterized architecture that enables accumulate-only operations between spiking neurons without sacrificing decoding accuracy.
Furthermore, we develop a stepwise training pipeline to systematically integrate neural variability across conditions, including neural temporal drift, subjects and tasks.
Building on these advances, we construct a pretrained model capable of rapid generalization to unseen conditions with high performance.
We demonstrate that RAT SNN consistently outperforms leading SNN baselines and matches the performance of state-of-the-art artificial neural network (ANN) models in terms of decoding accuracy with much lower computational cost under both seen and unseen conditions across various datasets.
Collectively, Pretrained-RAT SNN represents a high-performance, highly generalizable, and energy-efficient prototype of an SNN foundation model for fully implantable BMIs (iBMIs).
Code is available at [RAT SNN GitHub](https://anonymous.4open.science/r/RAT-SNN-FA7E).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1155. Target-Aware Video Diffusion Models
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ Controllable video diffusion models Human-scene interaction Robotics planning
We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of humanโobject interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.
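The cross-attention alignment idea amounts to penalizing the discrepancy between the special target token's attention map and the input segmentation mask. A minimal version (normalizing both to distributions and taking an L1 penalty) is sketched below; the paper's selective application to specific blocks and regions is omitted, and the exact loss form here is an assumption.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(target_token_attn, target_mask, eps=1e-8):
    """target_token_attn: (B, H, W) cross-attention of the target token;
    target_mask: (B, H, W) binary segmentation mask."""
    attn = target_token_attn / (target_token_attn.sum(dim=(1, 2), keepdim=True) + eps)
    mask = target_mask / (target_mask.sum(dim=(1, 2), keepdim=True) + eps)
    return F.l1_loss(attn, mask)
```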
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1156. Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
๐ฌ 38
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning; Reinforcement Learning; Self-evolving
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1157. Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
๐ฌ 7
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ automated interpretability LLM features structured languages
Automated interpretability aims to translate large language model (LLM) features into human understandable descriptions. However, natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce *semantic regexes*, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regexes help people build accurate mental models of LLM features.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1158. HUMOF: Human Motion Forecasting in Interactive Social Scenes
๐ฌ 11
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ human motion forecasting scene-aware multi-person
Complex dynamic scenes present significant challenges for predicting human behavior due to the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behavior, thereby increasing the uncertainty in forecasting human motions.
Existing motion prediction methods thus struggle in these complex scenarios. In this paper, we propose an effective method for human motion forecasting in dynamic scenes. To achieve a comprehensive representation of interactions, we design a hierarchical interaction feature representation so that high-level features capture the overall context of the interactions, while low-level features focus on fine-grained details. Besides, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to efficiently utilize hierarchical features, thereby enhancing the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. We will release our code upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1159. Joint Distribution-Informed Shapley Values for Sparse Counterfactual Explanations
๐ฌ 14
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Counterfactual Explanations Shapley Values Optimization Explainable Machine Learning
Counterfactual explanations (CE) aim to reveal how small input changes flip a model's prediction, yet many methods modify more features than necessary, reducing clarity and actionability. We introduce COLA, a model- and generator-agnostic post-hoc framework that refines any given CE by computing a coupling via optimal transport (OT) between factual and counterfactual sets and using it to drive a Shapley-based attribution p-SHAP that selects a minimal set of edits while preserving the target effect. Theoretically, we show that OT minimizes an upper bound on the $W_1$ divergence between factual and counterfactual outcomes and that, under mild conditions, refined counterfactuals are guaranteed not to move farther from the factuals than the originals. Empirically, across four datasets, twelve models, and five CE generators, COLA achieves the same target effects with only 26-45% of the original feature edits. On a small-scale benchmark, COLA shows near-optimality.
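The OT step can be read as matching each factual point to a counterfactual point under a transport cost. For equal-sized sets this reduces to an assignment problem, as in the sketch below (squared Euclidean cost, solved with SciPy); the Shapley-based attribution p-SHAP that selects the minimal edit set is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_coupling(factuals, counterfactuals):
    """factuals: (n, d) array; counterfactuals: (m, d) array.
    Returns matched (factual_index, counterfactual_index) pairs under squared Euclidean cost."""
    cost = ((factuals[:, None, :] - counterfactuals[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```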
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1160. MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Self-play Multi-Agent System Strategic Games
Developing large language models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce **MARS**, an end-to-end RL framework that incentivizes **M**ulti-**A**gent **R**easoning of LLMs through **S**elf-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARS agent achieves significant performance gains of up to 10.0% on AIME, 6.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
Our code and models are publicly available at [https://anonymous.4open.science/r/MARS-LLM](https://anonymous.4open.science/r/MARS-LLM).
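One ingredient, agent-specific advantage normalization, simply standardizes learning signals within each agent rather than across the whole batch. A minimal version is sketched below; the turn-level advantage estimator itself is the paper's and is not reproduced, so `returns` here stands in for whatever per-turn return estimate it produces.

```python
import numpy as np

def agent_normalized_advantages(returns, agent_ids, eps=1e-8):
    """Normalize per-turn returns separately for each agent so that agents with
    different reward scales contribute comparable learning signals."""
    returns = np.asarray(returns, dtype=np.float64)
    agent_ids = np.asarray(agent_ids)
    adv = np.empty_like(returns)
    for a in np.unique(agent_ids):
        m = agent_ids == a
        adv[m] = (returns[m] - returns[m].mean()) / (returns[m].std() + eps)
    return adv
```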
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1161. Asynchronous Policy Gradient Aggregation for Efficient Distributed Reinforcement Learning
๐ฌ 11
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning federated learning distributed learning asynchronous methods
We study distributed reinforcement learning (RL) with policy gradient methods under asynchronous and parallel computations and communications. While non-distributed methods are well understood theoretically and have achieved remarkable empirical success, their distributed counterparts remain less explored, particularly in the presence of heterogeneous asynchronous computations and communication bottlenecks. We introduce two new algorithms, Rennala NIGT and Malenia NIGT, which implement asynchronous policy gradient aggregation and achieve state-of-the-art efficiency. In the homogeneous setting, Rennala NIGT provably improves the total computational and communication complexity while supporting the AllReduce operation. In the heterogeneous setting, Malenia NIGT simultaneously handles asynchronous computations and heterogeneous environments with strictly better theoretical guarantees. Our results are further corroborated by experiments, showing that our methods significantly outperform prior approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1162. RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
๐ฌ 25
โญ 6.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ logical reasoning rule-based reasoning reinforcement learning language models
Rule-based reasoning is acknowledged as one of the fundamental problems of reasoning. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating the need for static training mixtures engineered by humans. Evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% on eight ID tasks and $\Delta$10.4% on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior methods.
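Domain-aware dynamic sampling can be approximated as: track recent rewards per rule domain, upweight domains where the policy is doing poorly, and draw each batch from the resulting distribution. The sketch below makes specific choices (difficulty = 1 - mean reward, softmax-style weighting) that are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def domain_sampling_weights(history, temperature=1.0):
    """history maps domain -> list of recent rewards in [0, 1]; lower-reward domains get larger weights."""
    domains = list(history)
    difficulty = np.array([1.0 - np.mean(history[d]) for d in domains])
    logits = difficulty / max(temperature, 1e-8)
    w = np.exp(logits - logits.max())                 # softmax over domains
    return dict(zip(domains, w / w.sum()))

def sample_batch(problems_by_domain, weights, batch_size, rng=None):
    """Draw a batch by first sampling a domain, then a problem from that domain."""
    rng = rng or np.random.default_rng()
    domains = list(weights)
    p = np.array([weights[d] for d in domains])
    picks = rng.choice(len(domains), size=batch_size, p=p)
    return [rng.choice(problems_by_domain[domains[i]]) for i in picks]
```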
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1163. Topology and geometry of the learning space of ReLU networks: connectivity and singularities
๐ฌ 16
โญ 6.00
๐ learning theory
๐ท๏ธ learning dynamics topology neural networks ReLU networks geometry symmetry loss landscape gradient singularity connectedness
Understanding the properties of the parameter space in feed-forward ReLU networks is critical for effectively analyzing and guiding training dynamics. After initialization, training under gradient flow decisively restricts the parameter space to an algebraic variety that emerges from the homogeneous nature of the ReLU activation function. In this study, we examine two key challenges associated with feed-forward ReLU networks built on general directed acyclic graph (DAG) architectures: the (dis)connectedness of the parameter space and the existence of singularities within it. We extend previous results by providing a thorough characterization of connectedness, highlighting the roles of bottleneck nodes and balance conditions associated with specific subsets of the network. Our findings clearly demonstrate that singularities are intricately connected to the topology of the underlying DAG and its induced sub-networks. We discuss the reachability of these singularities and establish a principled connection with differentiable pruning. We validate our theory with simple numerical experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1164. Hyperbolic Aware Minimization: Implicit Bias for Sparsity
๐ฌ 21
โญ 6.00
๐ optimization
๐ท๏ธ Sparsity Implicit bias Sign flip Exponential update Training dynamics Bregman function
Understanding the implicit bias of optimization algorithms is key to explaining and improving the generalization of deep models. The hyperbolic implicit bias induced by pointwise overparameterization promotes sparsity, but also yields a small inverse Riemannian metric near zero, slowing down parameter movement and impeding meaningful parameter sign flips. To overcome this obstacle, we propose Hyperbolic Aware Minimization (HAM), which alternates a standard optimizer step with a lightweight hyperbolic mirror step. The mirror step incurs less compute and memory than pointwise overparameterization, reproduces its beneficial hyperbolic geometry for feature learning, and mitigates the small-inverse-metric bottleneck. Our characterization of the implicit bias in the context of underdetermined linear regression provides insight into the mechanism by which HAM consistently increases performance, even in the case of dense training, as we demonstrate in experiments with standard vision benchmarks. HAM is especially effective in combination with different sparsification methods, advancing the state of the art.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1165. DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics
๐ฌ 22
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Physics-based Modeling 3D Dynamics System Identification Differentiable Physics
Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1166. Systematic Biosafety Evaluation of DNA Language Models under Jailbreak Attacks
๐ฌ 13
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Jailbreak Attacks; DNA language models
DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Language Models have achieved success in designing synthetic functional DNA sequences, even whole genomes of novel bacteriophage, verified with wet lab experiments. Such remarkable generative power also brings severe biosafety concerns about whether DNA language models can design human viruses. With the goal of exposing vulnerabilities and informing the development of robust safeguarding techniques, we perform a systematic biosafety evaluation of DNA language models through the lens of jailbreak attacks. Specifically, we introduce JailbreakDNABench, a benchmark centered on high-priority human viruses, together with an end-to-end jailbreak framework, GeneBreaker. GeneBreaker integrates three key components: (1) an LLM agent equipped with customized bioinformatics tools to design high-homology yet non-pathogenic jailbreak prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer sequence generation toward pathogen-like outputs, and (3) a BLAST- and function-annotation-based evaluation pipeline to identify successful jailbreaks. On JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60\% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA language models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1167. ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains
๐ฌ 13
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ instance-level image retrieval image re-ranking local similarity generalization interpretability
Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1168. ProTDyn: A Foundation Protein Language Model for Thermodynamics and Dynamics Generation
๐ฌ 10
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Transformer Protein Language Model Protein ensemble generation Protein dynamics generative model
Molecular dynamics (MD) simulation has long been the principal computational tool for exploring protein conformational landscapes, but its application is limited by high computational cost. We present ProTDyn, a foundation protein language model that unifies conformational ensemble generation and multi-timescale dynamics modeling within a single framework. Unlike prior approaches that treat these tasks separately, ProTDyn allows flexible i.i.d. ensemble sampling and dynamic trajectory simulation. Across diverse protein systems, ProTDyn yields thermodynamically consistent ensembles, faithfully reproduces dynamical properties over multiple timescales, and generalizes to proteins beyond its training data, offering a scalable and efficient alternative to conventional MD simulations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1169. Thyme: Think Beyond Images
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLM Agentic Think with images Coding
Following OpenAI's introduction of the ``thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as that of proprietary models such as OpenAI o3, which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code.
In this paper, we make a preliminary attempt in this direction by introducing \textbf{Thyme} (\textbf{Th}ink Be\textbf{y}ond I\textbf{m}ag\textbf{e}s), a novel paradigm for enabling multimodal large language models to transcend existing ``think with images'' approaches by autonomously generating and executing diverse image processing and computational operations via executable code (Figure 2). This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement), but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial Supervised Fine-Tuning (SFT) on a curated dataset of 500K samples to teach code generation, followed by a Reinforcement Learning (RL) phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose \textbf{GRPO-ATS} (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. As shown in Figure 1, comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1170. Refining Hybrid Genetic Search for CVRP via Reinforcement Learning-Finetuned LLM
๐ฌ 4
โญ 6.00
๐ optimization
๐ท๏ธ Capacitated Vehicle Routing Large Language Model Reinforcement Finetuning
While large language models (LLMs) are increasingly used as automated heuristic designers for vehicle routing problems (VRPs), current state-of-the-art methods predominantly rely on prompting massive, general-purpose models like GPT-4. This work challenges that paradigm by demonstrating that a smaller, specialized LLM, when meticulously fine-tuned, can generate components that surpass expert-crafted heuristics within advanced solvers. We propose RFTHGS, a novel Reinforcement learning (RL) framework for Fine-Tuning a compact LLM to generate high-performance crossover operators for the Hybrid Genetic Search (HGS) solver, applied to the Capacitated VRP (CVRP). Our method employs a multi-tiered, curriculum-based reward function that progressively guides the LLM to master generating first compilable, then executable, and finally, superior-performing operators that exceed human expert designs. This is coupled with an operator caching mechanism that discourages plagiarism and promotes diversity during training. Comprehensive experiments show that our fine-tuned LLM produces crossover operators which significantly outperform the expert-designed ones in HGS. The performance advantage remains consistent, generalizing from small-scale instances to large-scale problems with up to 1000 nodes. Furthermore, RFTHGS exceeds the performance of leading neuro-combinatorial baselines, prompt-based methods, and commercial LLMs such as GPT-4o and GPT-4o-mini.
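The curriculum-based reward can be pictured as a tiered function: a penalty for operator code that fails to compile, a neutral reward for code that compiles but fails at runtime, and an increasing reward for operators that beat the expert baseline on solution cost. The thresholds and scales below are illustrative assumptions, not the paper's.

```python
def crossover_reward(compiles: bool, runs: bool, cost: float, baseline_cost: float) -> float:
    """Tiered reward for a generated crossover operator (lower routing cost is better)."""
    if not compiles:
        return -1.0                                   # tier 1: must compile
    if not runs:
        return 0.0                                    # tier 2: must execute without error
    improvement = (baseline_cost - cost) / max(abs(baseline_cost), 1e-8)
    return 1.0 + max(improvement, 0.0)                # tier 3: bonus for beating the expert baseline
```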
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1171. Eliciting Numerical Predictive Distributions of LLMs Without Auto-Regression
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ mechanistic interpretability uncertainty estimation LLMs time series probing
Large Language Models (LLMs) have recently been successfully applied to regression tasks---such as time series forecasting and tabular prediction---by leveraging their in-context learning abilities. However, their autoregressive decoding process may be ill-suited to continuous-valued outputs, where obtaining predictive distributions over numerical targets requires repeated sampling, leading to high computational cost and inference time. In this work, we investigate whether distributional properties of LLM predictions can be recovered _without_ explicit autoregressive generation. To this end, we study a set of regression probes trained to predict statistical functionals (e.g., mean, median, quantiles) of the LLM's numerical output distribution directly from its internal representations. Our results suggest that LLM embeddings carry informative signals about summary statistics of their predictive distributions, including the numerical uncertainty. This investigation opens up new questions about how LLMs internally encode uncertainty in numerical tasks, and about the feasibility of lightweight alternatives to sampling-based approaches for uncertainty-aware numerical predictions.
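A regression probe of this kind can be as small as a linear head on a frozen hidden state, trained with the pinball (quantile) loss against quantiles estimated from repeated LLM samples. The sketch below shows such a probe under those assumptions; the probe architectures and target functionals studied in the paper may differ.

```python
import torch
import torch.nn as nn

class QuantileProbe(nn.Module):
    """Linear probe mapping a frozen LLM hidden state to a few quantiles of the
    model's numerical predictive distribution."""
    def __init__(self, hidden_dim, n_quantiles=3):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, n_quantiles)

    def forward(self, h):                 # h: (B, hidden_dim)
        return self.linear(h)             # (B, n_quantiles)

def pinball_loss(pred, target, quantiles):
    """pred: (B, Q) predicted quantiles; target: (B,) sampled values; quantiles: (Q,) levels."""
    err = target.unsqueeze(1) - pred
    q = quantiles.to(pred.device).unsqueeze(0)
    return torch.maximum(q * err, (q - 1) * err).mean()
```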
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1172. Sparkle: A Robust and Versatile Representation for Point Cloud-based Human Motion Capture
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Motion Capture
Point cloud-based motion capture leverages rich spatial geometry and privacy-preserving sensing, but learning robust representations from noisy, unstructured point clouds remains challenging. Existing approaches face a difficult trade-off between point-based methods (geometrically detailed but noisy) and skeleton-based ones (robust but oversimplified). We address the fundamental challenge: how to construct an effective representation for human motion capture that balances expressiveness and robustness.
In this paper, we propose Sparkle, a structured representation unifying skeletal joints and surface anchors with explicit kinematic-geometric factorization. Our framework, SparkleMotion, learns this representation through hierarchical modules embedding geometric continuity and kinematic constraints. By explicitly disentangling internal kinematic structure from external surface geometry, SparkleMotion achieves state-of-the-art performance not only in accuracy but crucially in robustness and generalization under severe domain shifts, noise, and occlusion. Extensive experiments demonstrate our superiority across diverse sensor types and challenging real-world scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1173. Riemannian High-Order Pooling for Brain Foundation Models
๐ฌ 15
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ EEG brain-computer interface representation learning manifold learning
Electroencephalography (EEG) is a noninvasive technique for measuring brain electrical activity that supports a wide range of brain-computer interaction applications. Motivated by the breakthroughs of Large Language Models (LLMs), recent efforts have begun to explore large EEG foundation models trained on broad unlabeled corpora. However, most advances focus on improving the backbone while neglecting the classification head. Existing models often rely on a single class token, underutilizing the spatiotemporal structure and second-order statistics that are crucial for EEG decoding. We propose Riemannian High Order Pooling (RHOP), a plug-and-play module that injects principled Riemannian statistics into the classifier. RHOP maps each token to a quotient Gaussian jointly encoding mean and second-order information, yielding scale-invariant descriptors. Tokens are then aggregated by estimating a Riemannian Gaussian on the SPD manifold, where the Fréchet mean and covariance are embedded into an SPD descriptor. The resulting normalized vector is fused with the class token for prediction. RHOP is backbone-agnostic and integrates with modern EEG foundation models, e.g., BIOT and LaBraM. Across diverse EEG benchmarks, it improves accuracy, robustness, and efficiency under full fine-tuning, linear probing, and from-scratch training settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1174. Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning
๐ฌ 24
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning CUDA Code Generation High-Performance Computing
Code generation is a crucial research area in the field of artificial intelligence, holding the potential to revolutionize software development and streamline programming processes. However, generating high-performance code, which must execute quickly in low-latency scenarios, remains a formidable challenge. Existing methods often struggle to account for the irregularity of input sparse data in sparse programs and the need for domain-specific architectural knowledge, leading to sub-optimal performance. To tackle these issues, we propose the SparseRL framework. SparseRL leverages deep reinforcement learning, treating a pre-trained language model as a stochastic policy. It takes the row and column indices of non-zero elements in the sparse matrix as input and generates CUDA code as output for sparse matrix operations. We also introduce a domain-specific code generation mechanism for dynamic inputs, a sinusoidal embedding technique tailored for sparse matrices, and a hierarchical reward function that considers both code correctness and execution efficiency. Experimental results demonstrate SparseRL achieves state-of-the-art performance. In sparse matrix-vector multiplication (SpMV) tasks, it improves the compilation rate by 20% compared to existing methods, and the generated code runs 30% faster on average. For sparse matrix-dense matrix multiplication (SpMM) tasks, SparseRL also shows significant performance gains. These results highlight the effectiveness of SparseRL in generating high-performance CUDA code for sparse matrix operations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1175. Achieving low-bit Muon through subspace preservation and grid quantization
๐ฌ 23
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM memory-efficient quantization low-bit Muon optimizer
Training Large Language Models (LLMs) faces severe memory constraints due to the increasing size of model parameters and optimizer states. The Muon optimizer, which is based on matrix orthogonalization, has recently demonstrated significant potential and offers considerable memory advantages over AdamW by utilizing only the first moment. However, how to apply memory-reduction techniques to further compress the optimizer states of Muon remains underexplored. Directly applying existing methods may encounter significant difficulties due to the orthogonalization process. In this work, we investigate the low-bit compression of Muon and systematically analyze the quantization error exacerbated by orthogonalization. We identify that the error primarily originates from the top singular subspace and the outlier patterns of moment matrix appearing across both dimensions. To address this, we propose 4-bit-Muon-GRASP (GRid And Subspace Preserving), which compresses the Muon optimizer states to 4 bits using grid quantization, while preserving the top singular subspace with minimal overhead. We evaluate 4-bit-Muon-GRASP through pre-training on LLaMA-130M, 350M, and 1.1B architectures and fine-tuning on 7B models for various reasoning tasks. Extensive experiment results show that our 4-bit-Muon-GRASP achieves accuracy comparable to full-precision counterparts while reducing training memory consumption by up to 28\%. Code will be made public upon acceptance.
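The core idea, preserving the top singular subspace while quantizing the rest, can be sketched as: split the moment matrix into a low-rank part kept in full precision and a residual mapped onto a uniform 4-bit grid. The snippet below is a simplified illustration with a plain uniform grid; the paper's grid construction and outlier handling are likely more elaborate.

```python
import torch

def quantize_moment_subspace_preserving(moment, rank=8, n_levels=16):
    """Keep the top-`rank` singular subspace in full precision and quantize the
    residual to a uniform grid with `n_levels` levels (16 levels = 4 bits)."""
    U, S, Vh = torch.linalg.svd(moment, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vh[:rank]          # preserved subspace
    residual = moment - low_rank
    lo, hi = residual.min(), residual.max()
    scale = (hi - lo) / (n_levels - 1) + 1e-12
    codes = torch.clamp(torch.round((residual - lo) / scale), 0, n_levels - 1)
    dequant = codes * scale + lo                              # reconstructed residual
    return low_rank + dequant, codes.to(torch.uint8), (lo, scale)
```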
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1176. Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
๐ฌ 7
โญ 6.00
๐ generative models
๐ท๏ธ Massive Activations Diffusion Transformers Visual Detail Synthesis
Massive Activations (MAs) are a well-documented phenomenon across Transformer architectures, and prior studies in both LLMs and ViTs have shown that they play a substantial role in shaping model behavior. However, the nature and function of MAs within Diffusion Transformers (DiTs) remain largely unexplored. In this work, we systematically investigate these activations to elucidate their role in visual generation. We find that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of the output.
Building on these insights, we propose Detail Guidance (DG), a MAs-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded ``detail-deficient'' model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling joint enhancement of detail fidelity and prompt alignment. Extensive experiments demonstrate that our DG consistently improves local detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
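The guidance itself is a CFG-style extrapolation away from the degraded branch. The one-liner below shows that linear combination, where `eps_full` and `eps_detail_deficient` stand for the predictions of the intact model and of the model with its massive activations disrupted; how the degradation is performed and how DG is composed with CFG follow the paper and are not shown here.

```python
def detail_guidance(eps_full, eps_detail_deficient, scale=2.0):
    """Push the prediction away from the 'detail-deficient' branch and toward the full model."""
    return eps_detail_deficient + scale * (eps_full - eps_detail_deficient)
```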
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1177. Tree-sliced Sobolev IPM
๐ฌ 25
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ sliced optimal transport sobolev ipm tree-sliced wasserstein distance tree wasserstein distance
Recent work shows Tree-Sliced Optimal Transport to be an efficient and more expressive alternative to Sliced Wasserstein (SW), improving downstream performance. Tree-sliced metrics compare probability distributions by projecting measures onto tree metric spaces; a central example is the Tree-Sliced Wasserstein (TSW) distance, which applies the $1$-Wasserstein metric after projection. However, computing tree-based $p$-Wasserstein for general $p$ is costly, largely confining practical use to $p=1$. This restriction is a significant bottleneck, as higher-order metrics ($p > 1$) are preferred in gradient-based learning for their more favorable optimization landscapes. In this work, we revisit Sobolev integral probability metrics (IPM) on trees to obtain a practical generalization of TSW. Building on the insight that a suitably regularized Sobolev IPM admits a closed-form expression, we introduce TS-Sobolev, a tree-sliced metric that aggregates regularized Sobolev IPMs over random tree systems and remains tractable for all $p\ge1$; for $p>1$, TS-Sobolev has the same computational complexity as TSW at $p=1$. Notably, at $p=1$ it recovers TSW exactly. Consequently, TS-Sobolev serves as a drop-in replacement for TSW in practical applications, with additional flexibility in the choice of $p$. Furthermore, we extend this framework to define a corresponding metric for probability measures on hyperspheres. Experiments on Euclidean and spherical datasets show that TS-Sobolev and its spherical variant improve downstream performance in gradient flows, self-supervised learning, generative modeling, and text topic modeling over recent SW and TSW variants.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1178. Mixed-Curvature Tree-Sliced Wasserstein Distance
๐ฌ 12
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ mixed curvature space sliced optimal transport
Mixed-curvature spaces have emerged as a powerful alternative to their Euclidean counterpart, enabling data representations better aligned with the intrinsic structure of complex datasets. However, comparing probability distributions in such spaces remains underdeveloped: existing measures such as KL divergence and Wasserstein either rely on strong assumptions on distributions or incur high computational costs. The Sliced-Wasserstein (SW) framework provides an alternative approach for constructing distributional distances; however, its reliance on one-dimensional projections limits its ability to capture the geometry of the ambient space. To address this limitation, the Tree-Sliced Wasserstein (TSW) framework employs tree structures as a richer projected space. Motivated by the intuition that such a space is particularly suitable for representing the geometric properties of mixed-curvature manifolds, we introduce the Mixed-Curvature Tree-Sliced Wasserstein (MC-TSW), a novel discrepancy measure that is computationally efficient while faithfully capturing both the topological and geometric structures of mixed-curvature spaces. Specifically, we introduce an adaptation of tree systems and Radon transform to mixed-curvature spaces, which yields a closed-form solution for the optimal transport problem on the tree system. We further provide theoretical analysis on the properties of the Radon transform and the MC-TSW distance. Experimental results demonstrate that MC-TSW improves distributional comparisons over product-space-based distance and line-based baselines, and that mixed-curvature representations consistently outperform constant-curvature alternatives, highlighting their importance for modeling complex datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1179. Towards Efficient Constraint Handling in Neural Solvers for Routing Problems
๐ฌ 21
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Routing Problems; Deep Reinforcement Learning; Constraint Handling; Combinatorial Optimization
Neural solvers have achieved impressive progress in addressing simple routing problems, particularly excelling in computational efficiency. However, their advantages under complex constraints remain nascent, for which current constraint-handling schemes via feasibility masking or implicit feasibility awareness can be inefficient or inapplicable for hard constraints. In this paper, we present Construct-and-Refine (CaR), the first general and efficient constraint-handling framework for neural routing solvers based on explicit learning-based feasibility refinement. Unlike prior construction-search hybrids that target reducing optimality gaps through heavy improvements yet still struggle with hard constraints, CaR achieves efficient constraint handling by designing a joint training framework that guides the construction module to generate diverse and high-quality solutions well-suited for a lightweight improvement process, e.g., 10 steps versus 5k steps in prior work. Moreover, CaR presents the first use of construction-improvement-shared representation, enabling potential knowledge sharing across paradigms by unifying the encoder, especially in more complex constrained scenarios. We evaluate CaR on typical hard routing constraints to showcase its broader applicability. Results demonstrate that CaR achieves superior feasibility, solution quality, and efficiency compared to both classical and neural state-of-the-art solvers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1180. From Gradient Volume to Shapley Fairness: Towards Fair Multi-Task Learning
๐ฌ 26
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multi-task Learning Shapley Value Fair Optimization
Multi-task learning often suffers from gradient conflicts, leading to unfair optimization and degraded overall performance. To address this, we present SVFair, a Shapley value-based framework for fair gradient aggregation. We propose two scalable geometric conflict metrics: VolDet, a gram determinant volume metric, and VolDetPro, its sign-aware extension distinguishing antagonistic gradients. By integrating these metrics into Shapley value computation, SVFair quantifies each taskโs deviation from the overall gradient and rebalances updates toward fairness. In parallel, our Shapley value computation admits controllable complexity. Extensive experiments show that SVFair achieves state-of-the-art results across diverse supervised and reinforcement learning benchmarks, and further improves existing methods when integrated as a fairness-enhancing module.
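A minimal sketch of what a Gram-determinant "volume" metric over task gradients could look like, assuming gradients are flattened into rows of a matrix; the unit-norm normalisation and the name `vol_det` are illustrative assumptions, and the sign-aware VolDetPro extension is not shown.

```python
# Hedged sketch: Gram-determinant volume of (normalised) task gradients.
# Values near 1 indicate near-orthogonal gradients; values near 0 indicate
# strong alignment or conflict. Not the paper's exact metric.
import numpy as np

def vol_det(G, eps=1e-8):
    Gn = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)  # unit gradients
    gram = Gn @ Gn.T                                           # (tasks, tasks)
    return float(np.linalg.det(gram))

grads = np.random.randn(3, 1000)      # 3 tasks, 1000 shared parameters
print(vol_det(grads))
```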
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1181. Sheaves Reloaded: A Direction Awakening
๐ฌ 30
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ directed sheaf neural network directed graphs directed cellular sheaves
Sheaf Neural Networks (SNNs) are a powerful algebraic-topology generalization of Graph Neural Networks (GNNs), and have been shown to significantly improve our ability to model complex relational data. While the GNN literature proved that incorporating directionality can substantially boost performance in many real-world applications, no SNN approaches are known with such a capability. To address this limitation, we introduce the Directed Cellular Sheaf, a generalized cellular sheaf designed to explicitly account for edge orientations. Building on it, we define a corresponding sheaf Laplacian, the Directed Sheaf Laplacian $L^{\widetilde{\mathcal{F}}}$, which exploits the sheaf's structure to capture both the graph's topology and its directions. $L^{\widetilde{\mathcal{F}}}$ serves as the backbone of the Directed Sheaf Neural Network (DSNN), the first SNN model to embed a directional bias into its architecture. Extensive experiments on twelve real-world benchmarks show that DSNN consistently outperforms many baseline methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1182. LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
๐ฌ 31
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Multimodal Models Model Compression Fourier Domain Matrix Approximation
Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1183. MATRIX: Mask Track Alignment for Interaction-aware Video Generation
๐ฌ 20
โญ 6.00
๐ generative models
๐ท๏ธ video generative model
Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1184. Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Audio-Visual Learning Multimodal Learning Efficient Machine Learning Knowledge Distillation Audio-Visual Classification Audio-Visual Segmentation
We propose a method for audio-visual knowledge distillation. Existing methods typically distill from the latent embeddings or outputs. The former requires matching feature dimensions, if not the same architecture, between teacher and student models, while the latter supports any teacher-student pairing but tends to be less performant. Unlike them, we do not explicitly distill from the latent embeddings or outputs, but from the pairwise relationships between embeddings across samples for each modality; this is realized as a kernel, which is the crux of our method, ``Kernelized Token Distillation (KTD)''. Specifically, we tokenize and embed the input for a given modality, and compute the Gram matrix across tokens, from which we distill. As audio and visual modalities afford different information for a task, we adaptively modulate distillation by measuring the entropy of each modality, leading to an Entropy-Monitored Kernelized Token Distillation (EM-KTD) scheme. Our method allows flexibility in the complexity of the kernel function used to model relationships across tokens, which are selectively distilled to ensure high-fidelity supervision for the student. We evaluate EM-KTD on VGGSound and AVS-Bench, where we use 94\% fewer parameters than the teacher while preserving 96.9\% of its performance on audio-visual event recognition and 96.5\% on audio-visual segmentation.
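A minimal sketch of the Gram-matrix (kernel) idea described above, assuming token embeddings whose teacher and student dimensions may differ; the linear kernel, the Frobenius loss, and the entropy-proportional weighting are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch: distill from token-token Gram matrices rather than raw
# embeddings, weighted by a per-modality entropy term. All names are illustrative.
import numpy as np

def gram(tokens):
    tokens = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    return tokens @ tokens.T                       # (num_tokens, num_tokens)

def kernel_distill_loss(teacher_tokens, student_tokens):
    # Gram matrices have the same shape even when embedding dims differ
    return float(np.mean((gram(teacher_tokens) - gram(student_tokens)) ** 2))

def entropy_weight(probs, eps=1e-8):
    # normalised entropy of a modality's prediction distribution, in [0, 1]
    return float(-np.sum(probs * np.log(probs + eps)) / np.log(len(probs)))

teacher_audio = np.random.randn(16, 512)           # 16 tokens, teacher dim 512
student_audio = np.random.randn(16, 128)           # 16 tokens, student dim 128
probs_audio = np.array([0.7, 0.2, 0.1])            # toy prediction for this modality
print(entropy_weight(probs_audio) * kernel_distill_loss(teacher_audio, student_audio))
```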
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1185. ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision
๐ฌ 17
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Depth completion Unsupervised Learning 3D Reconstruction Multi-modal Learning
We propose a method for inferring an egocentric dense depth map from an RGB image and a sparse point cloud.
The crux of our method lies in modeling the 3D scene implicitly within the latent space and learning an inductive bias in an unsupervised manner through principles of Structure-from-Motion. To force the learning of this inductive bias, we propose to optimize for an ill-posed objective: predicting latent features that are not observed in the input view, but exist in the 3D scene. This is facilitated by means of rigid warping of latent features from the input view to a nearby or adjacent (co-visible) view of the same 3D scene. "Empty" regions in the latent space that correspond to regions occluded from the input view are completed by a Contextual eXtrapolation mechanism based on features visible in the input view. Once learned, the inductive bias can be transferred to modulate the features of the input view to improve fidelity.
We term our method "Occluded Region Completion as Supervision" or ORCaS. We evaluate ORCaS on VOID1500 and NYUv2 benchmark datasets, where we improve over the best existing method by 8.91% across all metrics. ORCaS also improves generalization from VOID1500 to ScanNet and NYUv2 by 15.7% and robustness to low density inputs by 31.2%. Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1186. Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ multi-agent system visual hallucination snowballing
Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure term, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The implementation source code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1187. QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model; Reinforcement Learning
Reinforcement learning (RL) has emerged as a central paradigm for training large language models (LLMs) in reasoning tasks. Yet recent studies question RL's ability to incentivize reasoning capacity beyond the base model. This raises a key challenge: how can RL be adapted to solve harder reasoning problems more effectively?
To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals.
Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress.
This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50\% (+10.73\%) on AIME24, 62.29\% (+12.79\%) on AIME25, and 41.67\% (+10.11\%) on HMMT25. Code, data and model are available at https://anonymous.4open.science/r/questa932.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1188. UniOD: A Universal Model for Outlier Detection across Diverse Domains
๐ฌ 14
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ outlier detection
Outlier detection (OD), distinguishing inliers and outliers in completely unlabeled datasets, plays a vital role in science and engineering. Although there have been many insightful OD methods, most of them require troublesome hyperparameter tuning (a challenge in unsupervised learning) and costly model training for every task or dataset. In this work, we propose UniOD, a universal OD framework that leverages labeled datasets to train a single model capable of detecting outliers of datasets with different feature dimensions and heterogeneous feature spaces from diverse domains. Specifically, UniOD extracts uniform and comparable features across different datasets by constructing and factorizing multi-scale point-wise similarity matrices. It then employs graph neural networks to capture comprehensive within-dataset and between-dataset information simultaneously, and formulates outlier detection tasks as node classification tasks.
As a result, once the training is complete, UniOD can identify outliers in datasets from diverse domains without any further model/hyperparameter selection and parameter optimization, which greatly improves convenience and accuracy in real applications. More importantly, we provide theoretical guarantees for the effectiveness of UniOD, consistent with our numerical results. We evaluate UniOD on 30 benchmark OD datasets against 17 baselines, demonstrating its effectiveness and superiority.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1189. GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies
๐ฌ 19
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning; Flow-matching policies;
Flow-matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor that presents significant untapped potential. This overlooked factor, along with the challenge of controlling policy stochasticity, constitutes two critical areas for advancing distilled flow-matching policies. To overcome these limitations, we propose GoldenStart (GS-flow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This state-conditioned prior repositions the starting points of the one-step generation process into high-Q regions, effectively providing a ``golden start'' that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative startpoint and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging the generative models and the practical actor-critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state-of-the-art approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1190. Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models
๐ฌ 29
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Privacy Leakage
Recent advances in multi-modal large reasoning models (MLRMs) have shown significant ability to interpret complex visual content. While these models possess impressive reasoning capabilities, they also introduce novel and underexplored privacy risks. In this paper, we identify a novel category of privacy leakage in MLRMs: Adversaries can infer sensitive geolocation information, such as users' home addresses or neighborhoods, from user-generated images, including selfies captured in private settings. To formalize and evaluate these risks, we propose a three-level privacy risk framework that categorizes images based on contextual sensitivity and potential for geolocation inference. We further introduce DoxBench, a curated dataset of 500 real-world images reflecting diverse privacy scenarios divided into 6 categories. Our evaluation across 13 advanced MLRMs and MLLMs demonstrates that most of these models outperform non-expert humans in geolocation inference and can effectively leak location-related private information. This significantly lowers the barrier for adversaries to obtain users' sensitive geolocation information. We further analyze and identify two primary factors contributing to this vulnerability: (1) MLRMs exhibit strong geolocation reasoning capabilities by leveraging visual clues in combination with their internal world knowledge; and (2) MLRMs frequently rely on privacy-related visual clues for inference without any built-in mechanisms to suppress or avoid such usage. To better understand and demonstrate real-world attack feasibility, we propose GeoMiner, a collaborative attack framework that decomposes the prediction process into two stages consisting of clue extraction and reasoning to improve geolocation performance. Our findings highlight the urgent need to reassess inference-time privacy risks in MLRMs to better protect users' sensitive information.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1191. ViPO: Visual Preference Optimization at Scale
๐ฌ 3
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion Model Image Generation Video Generation Visual Generation DPO
While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm for visual generation remains largely unexplored.
Current open-source preference datasets typically contain substantial conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn meaningful preferences, fundamentally hindering effective scaling. To enhance the robustness of preference algorithms against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence during training based on dataset characteristics, enabling effective learning across diverse data distributions from noisy to trivially simple patterns.
Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling key data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs (1024px) across five categories and 300K video pairs (720p+) across three categories. Leveraging state-of-the-art generative models and diverse prompts ensures consistent, reliable preference signals with balanced distributions.
Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates both our dataset quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets.
We comprehensively validate our approach across various visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87\% and 2.32\% gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. On our high-quality ViPO dataset, models achieve performance far exceeding that of models trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization. All models and datasets will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1192. Bandit Learning in Matching Markets Robust to Adversarial Corruptions
๐ฌ 12
โญ 6.00
๐ learning theory
๐ท๏ธ Bandits Matching markets Robust algorithms Adversarial corruptions
This paper investigates the problem of bandit learning in two-sided decentralized matching markets with adversarial corruptions. In matching markets, players on one side aim to learn their unknown preferences over arms on the other side through iterative online learning, with the goal of identifying the optimal stable match. However, in real-world applications, stochastic rewards observed by players may be corrupted by malicious adversaries, potentially misleading the learning process and causing convergence to a sub-optimal match. We study this problem under two settings: one where the corruption level $C$ (defined as the sum of the largest adversarial alterations to the feedback across rounds) is known, and another where it is unknown. For the known corruption setting, we develop a robust variant of the classical Explore-Then-Gale-Shapley (ETGS) algorithm by incorporating widened confidence intervals. For the unknown corruption case, we propose a Multi-layer ETGS race method that adaptively mitigates adversarial effects without prior corruption knowledge. We provide theoretical guarantees for both algorithms by establishing upper bounds on their optimal stable regret, and further derive the lower bound to demonstrate their optimality.
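As a hedged illustration of the known-corruption setting, the snippet below widens a standard Hoeffding-style confidence bound by a $C/n$ term; this is a common way to absorb adversarial perturbations in corruption-robust bandits, and the exact widening used inside the robust ETGS variant may differ.

```python
# Hedged sketch: a corruption-widened upper confidence bound for an arm's
# mean reward, assuming the corruption level C is known. Illustrative only.
import numpy as np

def widened_ucb(mean_hat, n_pulls, t, C, delta=0.05):
    radius = np.sqrt(2 * np.log(t / delta) / n_pulls)   # standard Hoeffding radius
    return mean_hat + radius + C / n_pulls               # widen by corruption budget

print(widened_ucb(mean_hat=0.6, n_pulls=40, t=1000, C=10.0))
```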
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1193. Property-Driven Protein Inverse Folding with Multi-Objective Preference Alignment
๐ฌ 13
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ protein design preference alignment
Protein sequence design must balance designability, defined as the ability to recover a target backbone, with multiple, often competing, developability properties such as solubility, thermostability, and expression.
Existing approaches address these properties through post hoc mutation, inference-time biasing, or retraining on property-specific subsets, yet they are target dependent and demand substantial domain expertise or careful hyperparameter tuning.
In this paper, we introduce ProtAlign, a multi-objective preference alignment framework that fine-tunes pretrained inverse folding models to satisfy diverse developability objectives while preserving structural fidelity.
ProtAlign employs a semi-online Direct Preference Optimization strategy with a flexible preference margin to mitigate conflicts among competing objectives and constructs preference pairs using in silico property predictors.
Applied to the widely used ProteinMPNN backbone, the resulting model MoMPNN enhances developability without compromising designability across tasks including sequence design for CATH 4.3 crystal structures, de novo generated backbones, and real-world binder design scenarios, making it an appealing framework for practical protein sequence design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1194. Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control
๐ฌ 17
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning world model humanoid
Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, the gap between large-scale pretraining and efficient finetuning on humanoids still exists. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch update and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate in simulation that these SAC-pretrained policies can be finetuned in new environments and out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy while stochastic exploration is instead confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during fine-tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1195. Queue Length Regret Bounds for Contextual Queueing Bandits
๐ฌ 17
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Queueing bandits contextual bandits logistic bandits
We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.
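A small sketch of the service model assumed above: the departure probability for a job with context x at server j is a logistic function of an unknown server-specific parameter. The greedy matching rule shown is only an illustration of how estimated rates could drive scheduling, not the paper's CQB-$\varepsilon$ or CQB-Opt algorithm.

```python
# Hedged sketch: logistic departure rates and a greedy job-server match on
# estimated parameters. Names and the greedy rule are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_match(job_contexts, theta_hat):
    """Pick the (job, server) pair maximising the estimated departure rate."""
    rates = sigmoid(job_contexts @ theta_hat.T)     # (num_jobs, num_servers)
    j, s = np.unravel_index(np.argmax(rates), rates.shape)
    return j, s, rates[j, s]

jobs = np.random.randn(5, 4)          # 5 queued jobs with 4-dim contexts
theta_hat = np.random.randn(3, 4)     # parameter estimates for 3 servers
print(greedy_match(jobs, theta_hat))
```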
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1196. Activation Steering with a Feedback Controller
๐ฌ 52
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ activation steering behaviour control alignment PID control mechanistic interpretability language models
Controlling the behaviors of large language models (LLMs) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
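A minimal sketch of a PID-style correction applied to a layer activation, assuming a unit-norm steering direction and a target projection onto it; the gains and the class name `PIDSteerer` are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: proportional, integral, and derivative corrections driven by
# the error between an activation's projection and a target value.
import numpy as np

class PIDSteerer:
    def __init__(self, v, target, kp=1.0, ki=0.1, kd=0.05):
        self.v = v / np.linalg.norm(v)   # steering direction (unit norm)
        self.target = target             # desired projection onto v
        self.kp, self.ki, self.kd = kp, ki, kd
        self.err_sum = 0.0               # integral of errors across layers
        self.prev_err = None             # previous layer's error (for D term)

    def step(self, h):
        """Apply one steering correction to activation h (1-D array)."""
        err = self.target - float(h @ self.v)       # feedback signal
        self.err_sum += err
        d_err = 0.0 if self.prev_err is None else err - self.prev_err
        self.prev_err = err
        correction = self.kp * err + self.ki * self.err_sum + self.kd * d_err
        return h + correction * self.v

# usage: steer a residual-stream activation layer by layer
steerer = PIDSteerer(v=np.random.randn(64), target=2.0)
h = np.random.randn(64)
for _ in range(12):                      # e.g. 12 transformer layers
    h = steerer.step(h)
```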
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1197. Reinforcing Diffusion Models by Direct Group Preference Optimization
๐ฌ 16
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Diffusion Models; Reinforcement Learning;
While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1198. MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
๐ฌ 20
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ SpeechLLMs Multimodal Speech Processing Linguistics LLM
Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken communication, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in speech. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. Notably, linguistic theory forms the foundation of speech language understanding (SLU), yet existing benchmarks have paid insufficient attention to this fundamental aspect and fail to capture the broader linguistic picture. To ground our benchmark in linguistic principles, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 22 advanced SpeechLLMs, we identify substantial room for improvement in existing models. MMSU establishes a new standard for comprehensive assessment of SLU, providing valuable insights for developing more sophisticated human-AI speech interaction systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1199. Lossless Vocabulary Reduction for Auto-Regressive Language Models
๐ฌ 9
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Language Models Next-Token Distribution Tokenization Vocabulary
Tokenization---the process of decomposing a given text into a sequence of subwords called tokens---is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, for example in model ensembling. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
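A toy sketch of the marginalisation step such a reduction involves: collapsing a next-token distribution onto a smaller vocabulary via a token-to-reduced-token map. The mapping and function names are assumptions for illustration; the full lossless construction in the paper also has to account for the remaining suffix of each original token, which is not shown here.

```python
# Hedged sketch: sum probabilities of large-vocab tokens that share the same
# first reduced-vocabulary token. Illustrative only, not the paper's algorithm.
import numpy as np

def reduce_distribution(p_large, first_reduced_token):
    """p_large: probabilities over the large vocabulary.
    first_reduced_token[i]: id of the first small-vocab token for large token i."""
    num_small = int(first_reduced_token.max()) + 1
    p_small = np.zeros(num_small)
    np.add.at(p_small, first_reduced_token, p_large)
    return p_small

p = np.random.dirichlet(np.ones(10))                 # toy next-token distribution
mapping = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 4])    # hypothetical token map
print(reduce_distribution(p, mapping))
```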
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1200. FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows
๐ฌ 19
โญ 6.00
๐ generative models
๐ท๏ธ Generative models Flow matching any-to-any generation
Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis.
Despite recent success, existing flow-based approaches remain inefficient: they require large-scale datasets, often with restrictive pairing constraints, incur high computation cost from modeling the joint distribution, and rely on multi-stage training pipelines.
We propose \textbf{FlowBind}, an efficient framework for any-to-any generation.
Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality.
Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities.
By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost.
Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6× fewer parameters and training 10× faster than prior methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1201. Selective Data Removal for Distributional Machine Unlearning
๐ฌ 18
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ unlearning theory privacy sample complexity machine learning statistical learning
Machine learning systems increasingly face requirements to remove entire domains of information, such as toxic language or biases, rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain's statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for Gaussian distributions and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15-82\% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.
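A hedged sketch of a distance-based selection rule in the spirit described above: rank unwanted-domain samples by their distance to the retained data and delete the farthest fraction first. The Euclidean distance and the fixed removal budget are simplifying assumptions, not the paper's criterion.

```python
# Hedged sketch: distance-based selection of a small subset to remove from the
# unwanted domain. Illustrative only.
import numpy as np

def select_for_removal(forget_X, keep_X, budget=0.2):
    center = keep_X.mean(axis=0)
    dists = np.linalg.norm(forget_X - center, axis=1)
    k = int(budget * len(forget_X))
    return np.argsort(dists)[-k:]            # indices of samples to delete

forget = np.random.randn(100, 5) + 2.0       # unwanted-domain samples
keep = np.random.randn(100, 5)               # retained samples
print(select_for_removal(forget, keep))
```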
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1202. Assembling the Mind's Mosaic: Towards EEG Semantic Intent Decoding
๐ฌ 13
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Electroencephalography (EEG) Brain-computer interface (BCI) Semantic Intent Neural decoding
Enabling natural communication through brain-computer interfaces (BCIs) remains one of the most profound challenges in neuroscience and neurotechnology. While existing frameworks offer partial solutions, they are constrained by oversimplified semantic representations and a lack of interpretability. To overcome these limitations, we introduce **Semantic Intent Decoding (SID)**, a novel framework that translates neural activity into natural language by modeling meaning as a flexible set of compositional semantic units.
SID is built on three core principles: semantic compositionality, continuity and expandability of semantic space, and fidelity in reconstruction.
We present **BrainMosaic**, a deep learning architecture implementing SID. BrainMosaic decodes multiple semantic units from EEG/SEEG signals using set matching and then reconstructs coherent sentences through semantic-guided reconstruction.
This approach moves beyond traditional pipelines that rely on fixed-class classification or unconstrained generation, enabling a more interpretable and expressive communication paradigm. Extensive experiments on multilingual EEG and clinical SEEG datasets demonstrate that SID and BrainMosaic offer substantial advantages over existing frameworks, paving the way for natural and effective BCI-mediated communication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1203. SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Scene Generation Part-aware 3D Generation
We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at \url{https://scenetransporter.github.io/}
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1204. IF-VidCap: Can Video Caption Models Follow Instructions?
๐ฌ 29
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Caption
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions.
Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities.
To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples.
Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness.
Our comprehensive evaluation of over 19 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity.
Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1205. PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation
๐ฌ 4
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Depth Estimation High Resolution
While current high-resolution depth estimation methods achieve strong results, they often suffer from computational inefficiencies due to reliance on heavyweight models and multiple inference steps, increasing inference time. To address this, we introduce PatchRefiner V2 (PRV2), which replaces heavy refiner models with lightweight encoders. This reduces model size and inference time but introduces noisy features. To overcome this, we propose a Coarse-to-Fine (C2F) module with a Guided Denoising Unit for refining and denoising the refiner features and a Noisy Pretraining strategy to pretrain the refiner branch to fully exploit the potential of the lightweight refiner branch. Additionally, we propose to adopt the Scale-and-Shift Invariant Gradient Matching (SSIGM) loss within local windows to enhance synthetic-to-real domain transfer. PRV2 outperforms state-of-the-art depth estimation methods on UnrealStereo4K in both accuracy and speed, using fewer parameters and faster inference. It also shows improved depth boundary delineation on real-world datasets like CityScapes, demonstrating its effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1206. Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
๐ฌ 16
โญ 6.00
๐ learning theory
๐ท๏ธ Transformer Signal Propagation Theory of Neural Networks Physics for Machine Learning
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1207. SR-Scientist: Scientific Equation Discovery With Agentic AI
๐ฌ 23
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Symbolic regression Equation Discovery Large Language Models Agentic AI
Recently, Large Language Models (LLMs) have been applied to scientific equation discovery, leveraging their embedded scientific knowledge for hypothesis generation. However, current methods typically confine LLMs to the role of an equation proposer within search algorithms like genetic programming. In this paper, we present SR-Scientist, a framework that elevates the LLM from a simple equation proposer to an autonomous AI scientist that writes code to analyze data, implements the equation as code, submits it for evaluation, and optimizes the equation based on experimental feedback. Specifically, we wrap the code interpreter into a set of tools for data analysis and equation evaluation. The agent is instructed to optimize the equation by utilizing these tools over a long horizon with minimal human-defined pipelines. Empirical results show that SR-Scientist outperforms baseline methods by an absolute margin of 6\% to 35\% on datasets covering four science disciplines. Additionally, we demonstrate our method's robustness to noise, the generalization of the discovered equations to out-of-domain data, and their symbolic accuracy. Furthermore, we develop an end-to-end reinforcement learning framework to enhance the agent's capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1208. ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning
๐ฌ 5
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-agent orchestration Real-world travel planning Constraints-aware planning
While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges, evaluating agents' abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi-agent framework designed to effectively handle this complex, constraint-aware nature of real-world travel planning tasks. ATLAS introduces a principled approach to address the fundamental challenges of constraint-aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative. More importantly, our work is the first to demonstrate quantitative effectiveness on real-world travel planning tasks with live information search and multi-turn feedback. In this realistic setting, ATLAS showcases its superior overall planning performance, achieving an 84% final pass rate, which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1209. From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation
๐ฌ 6
โญ 6.00
๐ generative models
๐ท๏ธ Auto-Regressive Image Generation Discrete Diffusion
Autoregressive (AR) models have emerged as a powerful framework for image generation, yet they remain bound by a fundamental limitation: once a prediction is made, it cannot be revised. Each step marches forward in a strict left-to-right sequence, causing small errors to accumulate and compromise the final image. In this work, we reimagine this process with TensorAR, a decoder-only AR model that shifts from predicting discrete tokens to predicting overlapping tensor windows. This simple change transforms image synthesis into a process of next-tensor prediction, enabling the model to refine earlier outputs while preserving the causal structure that defines autoregression. To guard against information leakage during training, we introduce a discrete tensor noising mechanism inspired by discrete diffusion theory, which injects categorical noise into input tensors. TensorAR is designed to be plug-and-play: unlike masked AR methods, it requires no architectural modifications, and unlike autoregressive diffusion, it preserves the familiar AR training paradigm. We evaluate TensorAR across both class-to-image and text-to-image tasks, showing consistent gains in generation quality and instruction-following ability, while achieving a superior balance between quality and latency. In doing so, TensorAR offers a new path forward for autoregressive generation---one where predictions are not just produced, but continually refined.
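A toy sketch of categorical input noising over a token window, one simple instance of the kind of discrete noise the abstract refers to; the replace-with-uniform rule, the noise rate, and the function name are assumptions for illustration.

```python
# Hedged sketch: replace a random fraction of token ids in a window with
# uniformly sampled codebook ids. Illustrative, not the paper's exact scheme.
import numpy as np

def noise_tokens(tokens, vocab, p=0.15, seed=0):
    rng = np.random.default_rng(seed)
    tokens = tokens.copy()
    mask = rng.random(tokens.shape) < p            # positions to corrupt
    tokens[mask] = rng.integers(0, vocab, size=int(mask.sum()))
    return tokens

window = np.arange(16).reshape(4, 4)               # a 4x4 window of token ids
print(noise_tokens(window, vocab=1024))
```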
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1210. CPQS-Tuning: A Model Self-Perception-Based Data Filtering Algorithm for Efficient Instruction Fine-Tuning
๐ฌ 18
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Instruction Fine-tuning LLMs Data Filtering CPQS Hidden States
Instruction fine-tuning is a key technique for enhancing the performance of large language models (LLMs), but low-quality and redundant data often hinder its effectiveness. Recent studies suggest that filtering a small amount of high-quality data for instruction fine-tuning can achieve faster and more efficient training performance. However, existing data filtering approaches predominantly depend on predefined evaluation models or manually designed metrics, without leveraging information from the target LLM itself. This limitation may result in a mismatch between the filtering criteria and the actual requirements of the LLM being fine-tuned, thereby reducing the effectiveness of the fine-tuning process. To address these issues, we propose a novel perspective: the hidden states of LLMs implicitly reflect the quality of the training data. Based on this insight, we propose a novel data filtering method that extracts the hidden states that reflect the target LLM's perception of the data as representative features, and builds a data classification model upon them, which outputs the Contrastive Perception Quality Score (CPQS) for dataset filtering. Our experiments are conducted in both general and downstream domains.
(1) In the general domain, our experiments show that training on under 10\% of the data from both the Alpaca\_GPT4 and DeepSeek-R1 synthesized reasoning datasets enables our method to outperform models trained on the complete datasets. Moreover, it surpasses the performance of current state-of-the-art data-selection techniques.
(2) In downstream tasks, our approach delivers an average performance gain exceeding 3.6\% over leading data-selection algorithms across multiple benchmarks, including GSM8K, HumanEval, and HumanEval-Plus.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1211. PolyGraphScore: a classifier-based metric for evaluating graph generative models
๐ฌ 21
โญ 6.00
๐ generative models
๐ท๏ธ graph generative models model evaluation maximum mean discrepancy generative models
Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics based on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely kernel and descriptor parametrization, making them incomparable across different graph descriptors.
We introduce PolyGraphScore (PGS), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon (JS) distance of graph distributions by fitting binary classifiers to distinguish between real and generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers approximates a variational lower bound on the JS distance between the two distributions. Resulting scores are constrained to the unit interval $[0,1]$ and are comparable across different graph descriptors. We further derive a theoretically grounded summary score that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGS provides a more robust and insightful evaluation compared to MMD metrics.
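A minimal sketch of the classifier-based bound idea, assuming graph descriptors are already featurised as vectors; the logistic-regression classifier and the normalisation by log 2 are illustrative choices, and a proper estimate would use held-out data rather than training-set probabilities.

```python
# Hedged sketch: a variational lower bound on the Jensen-Shannon divergence
# between real and generated graph descriptors, via a binary classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def js_lower_bound(real_feats, gen_feats):
    X = np.vstack([real_feats, gen_feats])
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(gen_feats))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_real = clf.predict_proba(real_feats)[:, 1]   # D(x) on real samples
    p_gen = clf.predict_proba(gen_feats)[:, 1]     # D(x) on generated samples
    # For any classifier D: JSD >= log 2 + 0.5 E_P[log D] + 0.5 E_Q[log(1 - D)]
    bound = np.log(2) + 0.5 * np.mean(np.log(p_real + 1e-12)) \
                      + 0.5 * np.mean(np.log(1 - p_gen + 1e-12))
    return max(0.0, bound / np.log(2))             # normalise to [0, 1]

real = np.random.randn(200, 8)                     # descriptor vectors, real graphs
fake = np.random.randn(200, 8) + 0.5               # descriptor vectors, generated graphs
print(js_lower_bound(real, fake))
```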
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1212. Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs
๐ฌ 27
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Privacy leakage;Inference-preventing optimization;Text anonymization;Attribute inference attack;Large language models
Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities.
To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference.
Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50\% to below 5\% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1213. Large Depth Completion Model from Sparse Observations
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Depth Completion
This work presents the Large Depth Completion Model (LDCM), a simple, effective, and robust framework for single-view metric depth estimation with sparse observations. Without relying on complex architectural designs, LDCM generates metric-accurate dense depth maps in one large transformer. It outperforms existing approaches across diverse datasets and sparse observations. We achieve this from two key perspectives: (1) maximizing the potential of existing monocular foundation models to improve sparse observations preprocessing, and (2) reformulating training objectives to better capture geometric structure and metric consistency. Specifically, a Poisson-based depth initialization module is first introduced to generate a uniform coarse dense depth map from diverse sparse observations, which serves as a strong structural prior for the network. Regarding the training objective, we replace the conventional depth head with a point map head that regresses per-pixel 3D coordinates in camera space, enabling the model to directly learn the underlying 3D scene structure instead of performing pixel-wise depth map restoration. Moreover, this design eliminates the need for camera intrinsic parameters, allowing LDCM to naturally produce metric-scaled 3D point maps. Extensive experiments demonstrate that LDCM consistently outperforms state-of-the-art methods across multiple benchmarks and varying sparsity priors in both depth completion and point map estimation, showcasing its effectiveness and strong generalization to unseen data distributions.
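As background for the point-map objective, the snippet below shows the standard relation between a depth map, camera intrinsics, and a camera-space point map, i.e. the quantity such a head regresses directly; it is not the paper's Poisson-based initialization module or training pipeline.

```python
# Hedged background sketch: unproject a depth map into a per-pixel 3D point map
# using intrinsics K. The intrinsics values below are arbitrary examples.
import numpy as np

def depth_to_point_map(depth, K):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))          # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    rays = np.linalg.inv(K) @ pix                            # back-projected rays
    return (rays * depth.reshape(1, -1)).T.reshape(h, w, 3)  # H x W x 3 points

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
depth = np.full((480, 640), 2.0)
print(depth_to_point_map(depth, K).shape)
```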
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1214. Cyber-Zero: Training Cybersecurity Agents without Runtime
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ capture the flag language model agents security vulnerability
Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness, and demonstrating that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1215. Translation Heads: Unveiling Attention's Role in LLM Multilingual Translation
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Multilinguistic Interpretability
Recently, large language models (LLMs) have made remarkable progress, with multilingual capability emerging as a core foundational strength. However, the internal mechanisms by which these models perform translation remain incompletely understood. In this paper, we elucidate the relationship between the attention mechanism in LLMs and their translation abilities. We find that certain attention heads, which we term token alignment heads, are specifically responsible for mapping tokens from the source language to the target language during inference.
Through a systematic investigation across various models, we confirm that these token alignment heads exhibit several key characteristics: (1) Universality: They are present in all LLMs we studied. (2) Sparsity: They constitute only a small fraction of all attention heads. (3) Consistency: The set of token alignment heads activated by the model shows strong consistency across different language pairs. (4) Causality: Interventionally removing these heads leads to a sharp decline in the model's translation performance, while randomly removing non-token alignment heads has little impact on translation ability. (5) Functional Specificity: Ablating token alignment heads disproportionately harms translation but has a varied impact on other multilingual tasks. We also traced the formation of token alignment heads during pre-training, revealing an evolutionary path of rapid proliferation, stabilization, and eventual pruning. Furthermore, we leverage these token alignment heads to filter multilingual training data, and our experiments show that these data can enhance the translation capabilities of the models.
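The causality test above amounts to zeroing out chosen heads and measuring the drop in translation quality. Below is a minimal, self-contained sketch of that kind of head ablation on a toy multi-head attention layer; the layer, shapes, and the choice of "head 3" are illustrative stand-ins, not the paper's code or its identified heads.

```python
# Minimal sketch (not the paper's code): zero selected attention heads before the
# output projection, as in a causal ablation of candidate token alignment heads.
import torch
import torch.nn.functional as F

def masked_multihead_attention(x, wq, wk, wv, wo, n_heads, head_mask):
    """x: (seq, d_model); head_mask: (n_heads,) of 0/1 floats."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ wq).view(seq, n_heads, d_head).transpose(0, 1)  # (heads, seq, d_head)
    k = (x @ wk).view(seq, n_heads, d_head).transpose(0, 1)
    v = (x @ wv).view(seq, n_heads, d_head).transpose(0, 1)
    attn = F.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)
    out = attn @ v                                            # (heads, seq, d_head)
    out = out * head_mask.view(n_heads, 1, 1)                 # ablate selected heads
    out = out.transpose(0, 1).reshape(seq, d_model)
    return out @ wo

torch.manual_seed(0)
d_model, n_heads, seq = 64, 8, 10
params = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4)]  # wq, wk, wv, wo
x = torch.randn(seq, d_model)

keep_all = torch.ones(n_heads)
drop_head_3 = keep_all.clone()
drop_head_3[3] = 0.0  # hypothetical "alignment head" to knock out

delta = (masked_multihead_attention(x, *params, n_heads, keep_all)
         - masked_multihead_attention(x, *params, n_heads, drop_head_3)).norm()
print(f"output change when ablating head 3: {float(delta):.3f}")
```

In the paper's setting the analogous intervention would be applied inside a full LLM and scored with translation metrics rather than an output-norm difference.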
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1216. Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning
๐ฌ 21
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Failure Reasoning Robotics Foundation Models
Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation learning. At inference time, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric.
Experiments across diverse environments show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30\% on failure detection rate and up to 100\% in reasoning measured through LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning beyond predefined failure modes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1217. On the Bayes Inconsistency of Disagreement Discrepancy Surrogates
๐ฌ 10
โญ 6.00
๐ learning theory
๐ท๏ธ Bayes consistency distribution shift disagreement discrepancy surrogate losses adversarial robustness
Deep neural networks often fail when deployed in real-world contexts due to distribution shift, a critical barrier to building safe and reliable systems. An emerging approach to address this problem relies on _disagreement discrepancy_, a measure of how the disagreement between two models changes under a shifting distribution. The process of maximizing this measure has seen applications in bounding error under shifts, testing for harmful shifts, and training more robust models. However, this optimization involves the non-differentiable zero-one loss, necessitating the use of practical surrogate losses. We prove that existing surrogates for disagreement discrepancy are not Bayes consistent, revealing a fundamental flaw: maximizing these surrogates can fail to maximize the true disagreement discrepancy. To address this, we introduce new theoretical results providing both upper and lower bounds on the optimality gap for such surrogates. Guided by this theory, we propose a novel disagreement loss that, when paired with cross-entropy, yields a provably consistent surrogate for disagreement discrepancy. Empirical evaluations across diverse benchmarks demonstrate that our method provides more accurate and robust estimates of disagreement discrepancy than existing approaches, particularly under challenging adversarial conditions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1218. Discovering Novel LLM Experts via Task-Capability Coevolution
๐ฌ 30
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model LLMs Minimal Criterion Coevolution Evolutionary Model Merging Synthetic Data Quality-Diversity Open-endedness
Frontier model developers aim to train models continually to possess emergent, diverse capabilities.
To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time.
Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run.
We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended \textit{Assessment Coevolving with Diverse Capabilities} (AC/DC).
AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation.
AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory.
In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without \textit{any} explicit benchmark optimization.
Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-N selection.
Our findings highlight the potential of coevolution as a means of discovering broader sets of capabilities from base LLMs.
Overall, AC/DC brings us one step closer to a profoundly new paradigm of LLM development, where continual improvements to the diversity of model capabilities can be accelerated by leveraging existing models as stepping stones to increasingly powerful models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1219. Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models
๐ฌ 20
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Layer-wise Pruning Cooperative Game Theory Shapley Value Approximation
While large language models (LLMs) demonstrate impressive performance across various tasks, their deployment in real-world scenarios is still constrained by high computational demands. Layer-wise pruning, a commonly employed strategy to mitigate inference costs, can partially address this challenge. However, existing approaches generally depend on static heuristic rules and fail to account for the interdependencies among layers, thereby limiting the effectiveness of the pruning process. To this end, this paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game in which each layer acts as a player and model performance serves as the utility. As computing exact Shapley values is computationally infeasible for LLMs, we propose using a lightweight surrogate network to estimate layer-wise marginal contributions. This network can predict LLM performance for arbitrary layer combinations at a low computational cost. Additionally, we employ stratified Monte Carlo mask sampling to further reduce the cost of Shapley value estimation. This approach captures inter-layer dependencies and dynamically identifies critical layers for pruning. Extensive experiments demonstrate the consistent superiority of our method in terms of perplexity and zero-shot accuracy, achieving more efficient and effective layer-wise pruning for large language models.
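To make the Shapley-with-surrogate idea concrete, here is a minimal sketch of permutation-based Monte Carlo Shapley estimation where a cheap callable predicts quality for an arbitrary layer-keep mask. The surrogate below is a toy stand-in (a random linear scorer), not the paper's trained network, and the sampling is plain Monte Carlo rather than the stratified scheme described above.

```python
# Minimal sketch (illustrative, not the paper's method verbatim): Monte Carlo
# estimation of per-layer Shapley values using a surrogate that scores layer masks.
import numpy as np

rng = np.random.default_rng(0)
n_layers = 12
true_weights = rng.random(n_layers)          # hidden "importance" of each layer (toy)

def surrogate_score(mask):
    """Stand-in for a trained surrogate: predicted quality for the kept layers."""
    return float(true_weights @ mask + 0.1 * mask.sum() ** 0.5)

def shapley_mc(n_samples=2000):
    values = np.zeros(n_layers)
    for _ in range(n_samples):
        perm = rng.permutation(n_layers)
        mask = np.zeros(n_layers)
        prev = surrogate_score(mask)
        for layer in perm:
            mask[layer] = 1.0
            cur = surrogate_score(mask)
            values[layer] += cur - prev      # marginal contribution of this layer
            prev = cur
    return values / n_samples

phi = shapley_mc()
prune_order = np.argsort(phi)                # lowest-contribution layers pruned first
print("estimated Shapley values:", np.round(phi, 3))
print("prune first:", prune_order[:3])
```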
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1220. Diffusion Alignment as Variational Expectation-Maximization
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion Model Alignment RLHF Test time search
Diffusion alignment aims to optimize diffusion models for the downstream objective. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity for both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.
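The alternation itself is easy to picture with a toy example. The sketch below uses a one-parameter Gaussian "generator" and a quadratic reward as placeholders (they are not a diffusion model or the paper's reward); the E-step searches for reward-aligned samples, and the M-step refits the generator toward them.

```python
# Toy sketch of the E-step / M-step alternation (placeholder generator and reward,
# not a diffusion model): search for reward-aligned samples, then refit on them.
import numpy as np

rng = np.random.default_rng(0)
mean, std = 0.0, 1.0                          # stand-in "generator" parameters
reward = lambda x: -(x - 3.0) ** 2            # hypothetical downstream reward

for it in range(10):
    # E-step: test-time search -- sample broadly, keep the most reward-aligned.
    candidates = rng.normal(mean, std, size=256)
    top = candidates[np.argsort(reward(candidates))[-32:]]
    # M-step: refine the generator toward the selected samples.
    mean = 0.5 * mean + 0.5 * top.mean()
    std = max(0.5 * std + 0.5 * top.std(), 0.2)   # keep some diversity
    print(f"iter {it}: mean={mean:.2f} std={std:.2f}")
```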
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1221. Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion Models RL Finetuning
Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity. Our code is available at https://anonymous.4open.science/r/SQDF-B66C
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1222. Multi-Head Low-Rank Attention
๐ฌ 27
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ ML System Efficient Decoding
Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip to on-chip memory at each step. Recent architectures like Multi-Head Latent Attention (MLA) significantly reduce the KV cache size to $4.5d_h$ per token per layer while maintaining high model quality. However, when using tensor parallelism (TP) with sufficient devices for inference, MLA still decodes slower than Grouped-Query Attention (GQA) because its single latent vector cannot be sharded, forcing each device to load $4.5 d_h$ versus $2 d_h$ for GQA. In this work, we propose Multi-Head Low-Rank Attention (MLRA), a TP-friendly attention mechanism that slashes the per-device KV cache under TP to just $1.5 d_h$. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8$\times$ decoding speedup over MLA.
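A quick back-of-the-envelope comparison makes the per-device figures tangible. The snippet below multiplies the per-token, per-layer loads quoted in the abstract (4.5, 2, and 1.5 units of $d_h$) out to a full cache; the head dimension, layer count, context length, and precision are illustrative assumptions.

```python
# Back-of-the-envelope sketch of per-device KV-cache traffic per decoding step,
# using the per-token, per-layer figures quoted above (values in units of d_h).
d_h = 128                 # head dimension (illustrative)
n_layers = 32             # illustrative
seq_len = 32_768          # context length held in the KV cache (illustrative)
bytes_per_elem = 2        # fp16/bf16

per_device_units = {"MLA": 4.5, "GQA": 2.0, "MLRA": 1.5}   # d_h per token per layer

for name, units in per_device_units.items():
    elems = units * d_h * n_layers * seq_len
    gib = elems * bytes_per_elem / 2 ** 30
    print(f"{name}: ~{gib:.2f} GiB of KV cache loaded per device per decoding step")
```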
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1223. GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space
๐ฌ 12
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Collaborative perception multi-modality multi-agent sensor fusion
In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1224. Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement
๐ฌ 21
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model LLM Safety Over-refusal SafetyAlignment
Large language models (LLMs) aligned for safety often suffer from over-refusal: the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content.
We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: $\textbf{D}$iscernment via $\textbf{C}$ontrastive $\textbf{R}$efinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1225. Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
๐ฌ 17
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal reasoning vision-language model action recognition
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, while maintaining computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1226. Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive Retrieval
๐ฌ 32
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Model Memory Retrieval Recollection-Familiarity Dual Process Personalization
Personalized large language models (LLMs) rely on memory retrieval to incorporate user-specific histories, preferences, and contexts. Existing approaches either overload the LLM by feeding all the user's past memory into the prompt, which is costly and unscalable, or simplify retrieval into a one-shot similarity search, which captures only surface matches. Cognitive science, however, shows that human memory operates through a dual process: Familiarity, offering fast but coarse recognition, and Recollection, enabling deliberate, chain-like reconstruction for deeply recovering episodic content.
Current systems lack both the ability to perform recollection retrieval and mechanisms to adaptively switch between the dual retrieval paths, leading to either insufficient recall or the inclusion of noise.
To address this, we propose RF-Mem (Recollection-Familiarity Memory Retrieval), a familiarity uncertainty-guided dual-path memory retriever.
RF-Mem measures the familiarity signal through the mean score and entropy. High familiarity leads to the direct top-$K$ Familiarity retrieval path, while low familiarity activates the Recollection path. In the Recollection path, the system clusters candidate memories and applies $\alpha$-mix with the query to iteratively expand evidence in embedding space, simulating deliberate contextual reconstruction.
This design embeds human-like dual-process recognition into the retriever, avoiding full-context overhead and enabling scalable, adaptive personalization. Experiments across three benchmarks and corpus scales demonstrate that RF-Mem consistently outperforms both one-shot retrieval and full-context reasoning under fixed budget and latency constraints. Our code can be found in the Supplementary Materials.
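As a rough illustration of the gating logic, the sketch below scores memories by cosine similarity, computes a mean-plus-entropy familiarity signal, and either takes the top-$K$ directly or iteratively mixes the query with a coarse candidate cluster. The thresholds, $\alpha$, hop count, and the crude clustering are illustrative assumptions, not the paper's exact rule.

```python
# Minimal sketch (illustrative thresholds, alpha, and clustering; not the paper's
# exact procedure): familiarity-gated dual-path retrieval over memory embeddings.
import numpy as np

rng = np.random.default_rng(0)
memories = rng.normal(size=(200, 64))
memories /= np.linalg.norm(memories, axis=1, keepdims=True)

def retrieve(query, k=5, mean_thr=0.15, ent_thr=4.5, alpha=0.5, hops=2):
    query = query / np.linalg.norm(query)
    sims = memories @ query
    probs = np.exp(sims * 10)
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    if sims.mean() > mean_thr and entropy < ent_thr:
        # Familiarity path: fast top-K retrieval.
        return np.argsort(sims)[-k:]
    # Recollection path: iteratively mix the query with clustered candidates.
    for _ in range(hops):
        cand = memories[np.argsort(sims)[-3 * k:]]        # coarse candidate cluster
        centroid = cand.mean(axis=0)
        query = (1 - alpha) * query + alpha * centroid    # alpha-mix expansion
        query /= np.linalg.norm(query)
        sims = memories @ query
    return np.argsort(sims)[-k:]

print(retrieve(rng.normal(size=64)))
```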
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1227. MetaVLA: Unified Meta Co-Training for Efficient Embodied Adaptation
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision-Language-Action models Efficient Robot Reasoning Generalization
Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0\% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by 76\%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1228. PRO-MOF: Policy Optimization with Universal Atomistic Models for Controllable MOF Generation
๐ฌ 4
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ metal-organic framework material generation AI for science physical modeling
Generating physically stable and novel metal-organic frameworks (MOFs) for inverse design that meet specific performance targets is a significant challenge. Existing generative models often struggle to explore the vast chemical and structural space effectively, leading to suboptimal solutions or mode collapse. To address this, we propose PRO-MOF, a hierarchical reinforcement learning (HRL) framework for controllable MOF generation. Our approach decouples the MOF design process into two policies: a high-level policy for proposing chemical building blocks and a low-level policy for assembling their 3D structures. By converting the deterministic Flow Matching model into a Stochastic Differential Equation (SDE), we enable the low-level policy to perform compelling exploration. The framework is optimized in a closed loop with high-fidelity physical reward signals provided by a pre-trained universal atomistic model (UMA). Furthermore, we introduce a Pass@K Group Relative Policy Optimization (GRPO) scheme that effectively balances exploration and exploitation by rewarding in-group diversity. Experiments on multiple inverse design tasks, such as maximizing CO2 working capacity and targeting specific pore diameters, show that PRO-MOF significantly outperforms existing baselines, including diffusion-based methods and genetic algorithms, in both success rate and the discovery of top-performing materials. Our work demonstrates that hierarchical reinforcement learning combined with a high-fidelity physical environment is a powerful paradigm for solving complex material discovery problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1229. Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge
๐ฌ 21
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ deep generative model molecular dynamics trajectory generation augmented bridge matching adjoint matching
Molecular Dynamics (MD) simulations provide a fundamental tool for characterizing molecular behavior at full atomic resolution, but their applicability is severely constrained by the computational cost. To address this, a surge of deep generative models has recently emerged to learn dynamics at coarsened timesteps for efficient trajectory generation, yet they either generalize poorly across systems or, due to limited molecular diversity of trajectory data, fail to fully exploit structural information to improve generative fidelity. Here, we present the Pretrained Variational Bridge (PVB) in an encoder-decoder fashion, which maps the initial structure into a noised latent space and transports it toward stage-specific targets through augmented bridge matching. This unifies training on both single-structure and paired trajectory data, enabling consistent use of cross-domain structural knowledge across training stages. Moreover, for protein-ligand complexes, we further introduce a reinforcement learning-based optimization via adjoint matching that speeds progression toward the holo state, which supports efficient post-optimization of docking poses. Experiments on proteins and protein-ligand complexes demonstrate that PVB faithfully reproduces thermodynamic and kinetic observables from MD while delivering stable and efficient generative dynamics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1230. GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
๐ฌ 21
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Jailbreak Attacks Evaluation System Benchmark
Despite the growing interest in jailbreaks as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. With a systematic measurement study based on 37 jailbreak studies since 2022, we find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. In this paper, we introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset and GuidedEval, an evaluation system integrated with detailed case-by-case evaluation guidelines. Experiments demonstrate that GuidedBench offers more accurate evaluations of jailbreak performance, enabling meaningful comparisons across methods. GuidedEval reduces inter-evaluator variance by at least 76.03%, ensuring reliable and reproducible evaluations. We reveal why existing jailbreak benchmarks fail to evaluate accurately and suggest better evaluation practices.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1231. THE PATH OF LEAST RESISTANCE: GUIDING LLM REASONING TRAJECTORIES WITH PREFIX CONSENSUS
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ Speculative reasoning LLM inference optimization
Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix self-consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands only a subset of promising paths, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, Math500, AIME 2024/2025, and GPQA-Diamond, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.
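The prefix-consensus step can be illustrated with a small embedding-space sketch: embed short prefixes of sampled traces, find the dominant cluster by cosine similarity, and only fully expand the traces in that cluster. The stand-in embeddings, similarity threshold, and sample counts below are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch (illustrative embeddings and threshold; not the paper's pipeline):
# cluster short reasoning prefixes and expand only the dominant cluster.
import numpy as np

rng = np.random.default_rng(0)

def embed(prefix_id):
    # Stand-in for a real prefix embedding (e.g., mean-pooled hidden states).
    base = np.random.default_rng(prefix_id % 3)            # 3 latent "answer modes"
    v = base.normal(size=32) + 0.3 * rng.normal(size=32)
    return v / np.linalg.norm(v)

prefixes = np.stack([embed(i) for i in range(16)])         # 16 sampled prefixes
sims = prefixes @ prefixes.T
cluster_sizes = (sims > 0.8).sum(axis=1)                   # neighbours per prefix
dominant = np.flatnonzero(sims[cluster_sizes.argmax()] > 0.8)

print("expand only these traces:", dominant.tolist())
print(f"token budget kept: {len(dominant)}/{len(prefixes)} full expansions")
```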
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1232. $AutoDrive\text{-}P^3$: Unified Chain of PerceptionโPredictionโPlanning Thought via Reinforcement Fine-Tuning
๐ฌ 22
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous Driving; Vision-Language Models; Reinforcement Learning
Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages, which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose ${AutoDrive\text{-}P^3}$, a novel framework that seamlessly integrates $\underline{\textbf{P}}$erception, $\underline{\textbf{P}}$rediction, and $\underline{\textbf{P}}$lanning through structured reasoning. We introduce the ${P^3\text{-}CoT}$ dataset to facilitate coherent reasoning and propose ${P^3\text{-}GRPO}$, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, ${AutoDrive\text{-}P^3}$ progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1233. Arbitrary-Shaped Image Generation via Spherical Neural Field Diffusion
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Models Image Generation Spherical Neural Field
Existing diffusion models excel at generating diverse content, but remain confined to fixed image shapes and lack the ability to flexibly control spatial attributes such as viewpoint, field-of-view (FOV), and resolution.
To fill this gap, we propose Arbitrary-Shaped Image Generation (ASIG), the first generative framework that enables precise spatial attribute control while supporting high-quality synthesis across diverse image shapes (e.g., perspective, panoramic, and fisheye).
ASIG introduces two key innovations: (1) a mesh-based spherical latent diffusion to generate a complete scene representation, with seam enforcement denoising strategy to maintain semantic and spatial consistency across viewpoints; and (2) a spherical neural field to sample arbitrary regions from the scene representation with coordinate conditions, enabling distortion-free generation at flexible resolutions.
Together, these designs give ASIG precise control over spatial attributes within a unified framework and enable high-quality generation across diverse image shapes. Experiments demonstrate clear improvements over prior methods specifically designed for individual shapes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1234. Rethinking Consistent Multi-Label Classification under Inexact Supervision
๐ฌ 6
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multi-label classification partial multi-label learning complementary multi-label learning.
Partial multi-label learning and complementary multi-label learning are two popular weakly supervised multi-label classification paradigms that aim to alleviate the high annotation costs of collecting precisely annotated multi-label data. In partial multi-label learning, each instance is annotated with a candidate label set, among which only some labels are relevant; in complementary multi-label learning, each instance is annotated with complementary labels indicating the classes to which the instance does not belong. Existing consistent approaches for the two paradigms either require accurate estimation of the generation process of candidate or complementary labels or assume a uniform distribution to eliminate the estimation problem. However, both conditions are usually difficult to satisfy in real-world scenarios. In this paper, we propose consistent approaches that do not rely on the aforementioned conditions to handle both problems in a unified way. Specifically, we propose two risk estimators based on first- and second-order strategies. Theoretically, we prove consistency w.r.t. two widely used multi-label classification evaluation metrics and derive convergence rates for the estimation errors of the proposed risk estimators. Empirically, extensive experimental results validate the effectiveness of our proposed approaches against state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1235. InfoScan: Information-Efficient Visual Scanning via Resource-Adaptive Walks
๐ฌ 25
โญ 6.00
๐ learning theory
๐ท๏ธ Vision Model; Scan Strategy; Markov Decision Processes; Information Scoring
High-resolution visual representation learning remains challenging due to the quadratic complexity of Vision Transformers and the limitations of existing efficient approaches, where fixed scanning patterns in recent Mamba-based models hinder content-adaptive perception.
To address these limitations, a novel Information-aware Scanning mechanism (InfoScan) tailored for state-space visual backbones is proposed, which dynamically allocates computational resources to the most salient regions of an image.
Specifically, InfoScan rigorously assesses the informativeness of image patches by integrating entropy with local structural analyses, formulates a joint optimization objective balancing fine-grained detail preservation and broader contextual coherence, and learns an adaptive scanning policy via reinforcement learning.
Built upon the innovative Visual Information State Space (VISS) block, InfoScan establishes a new family of models that achieve superior efficiency-accuracy trade-offs across diverse tasks.
Extensive empirical evaluation in different downstream vision tasks demonstrates that our information-driven dynamic scanning paradigm offers a robust and principled alternative to fixed or global-first traversal methods.
Collectively, our work positions adaptive, content-aware processing as a promising and effective new paradigm for efficient high-resolution visual representation.
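For intuition about the informativeness scoring, the sketch below combines a patch's intensity entropy with a simple gradient term and ranks patches for a content-adaptive visit order. The weighting, bin count, and score form are illustrative assumptions; the learned RL scanning policy and the VISS block are not represented here.

```python
# Minimal sketch of an entropy-plus-structure informativeness score for image
# patches (illustrative weighting; the learned RL scanning policy is not shown).
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224)).astype(np.float32)
P = 16                                                     # patch size

def patch_score(patch, w=0.5):
    hist, _ = np.histogram(patch, bins=32, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()                       # intensity entropy
    grad = np.abs(np.diff(patch, axis=0)).mean() + np.abs(np.diff(patch, axis=1)).mean()
    return w * entropy + (1 - w) * grad / 255.0            # local structure term

scores = np.array([
    patch_score(image[i:i + P, j:j + P])
    for i in range(0, 224, P) for j in range(0, 224, P)
])
scan_order = np.argsort(-scores)                           # most informative first
print("first 5 patches visited:", scan_order[:5].tolist())
```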
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1236. CoRA: Boosting Time Series Foundation Models for Multivariate Forecasting through Correlation-aware Adapter
๐ฌ 20
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting
Most existing Time Series Foundation Models (TSFMs) use channel-independent modeling and focus on capturing and generalizing temporal dependencies, while neglecting the correlations among channels or overlooking the different aspects of correlations. However, these correlations play a vital role in multivariate time series forecasting. To address this, we propose a Correlation-aware Adapter (**CoRA**), a lightweight plug-and-play method that requires only fine-tuning with TSFMs and captures different types of correlations to improve forecasting performance. Specifically, to reduce complexity, we innovatively decompose the correlation matrix into low-rank Time-Varying and Time-Invariant components. For the Time-Varying component, we further design learnable polynomials to learn dynamic correlations by capturing trends or periodic patterns. To learn positive and negative correlations that appear only among some variables, we introduce a novel dual contrastive learning method that identifies correlations through projection layers, regulated by a Heterogeneous-Partial contrastive loss during training, without introducing additional complexity in the inference stage. Extensive experiments on 10 real-world datasets demonstrate that CoRA improves the state-of-the-art TSFMs in average forecast performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1237. MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Mathematical Reasoning Data Augmentation
Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains model performance. Recent studies have demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs. In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the "Fill-in-the-Middle" task from code completion. By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-Fim dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains, creating MathFimer-expanded versions. Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on original data across various benchmarks such as GSM8K and MATH. Our approach offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on more powerful external models or expensive inference procedures.
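The prefix-suffix decomposition is simple to illustrate: hold out each interior step of a solution, condition on the prefix and suffix, and train the expander to reconstruct the missing step. The marker format and the example problem below are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch (illustrative markers/format): building fill-in-the-middle pairs
# from a step-by-step solution so an expander learns to insert missing steps.
solution_steps = [
    "Let x be the number of apples.",
    "Then 3x + 2 = 14.",
    "So 3x = 12.",
    "Therefore x = 4.",
]

def fim_pairs(steps):
    pairs = []
    for i in range(1, len(steps) - 1):            # each interior step becomes a target
        prefix = " ".join(steps[:i])
        suffix = " ".join(steps[i + 1:])
        prompt = f"<prefix>{prefix}</prefix><suffix>{suffix}</suffix><middle>"
        pairs.append((prompt, steps[i]))
    return pairs

for prompt, target in fim_pairs(solution_steps):
    print(prompt, "->", target)
```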
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1238. ADM-v2: Pursuing Full-Horizon Roll-out in Dynamics Models for Offline Policy Learning and Evaluation
๐ฌ 17
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Model-based Reinforcement Learning Offline Reinforcement Learning
Model-based methods for offline Reinforcement Learning transfer extensive policy exploration and evaluation to data-driven dynamics models, effectively saving real-world samples in the offline setting. We expect the dynamics model to allow the policy to roll out full-horizon episodes, which is crucial for ensuring sufficient exploration and reliable evaluation. However, many previous dynamics models exhibit limited capability in long-horizon prediction. This work follows the paradigm of the Any-step Dynamics Model (ADM) that improves future predictions by reducing bootstrapping prediction to direct prediction. We structurally decouple each recurrent forward of the RNN cell from the backtracked state and propose the second version of ADM (ADM-v2), making the direct prediction more flexible. ADM-v2 not only enhances the accuracy of direct predictions for making full-horizon roll-outs but also supports parallel estimation of the any-step prediction uncertainty to improve efficiency. The results on DOPE validate the reliability of ADM-v2 for policy evaluation. Moreover, via full-horizon roll-out, ADM-v2 for policy optimization enables substantial advancements, whereas other dynamics models degrade due to long-horizon error accumulation. We are the first to achieve SOTA under the full-horizon roll-out setting on both D4RL and NeoRL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1239. Emergent Discrete Controller Modules for Symbolic Planning in Transformers
๐ฌ 9
โญ 6.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Transformers symbolic planning discrete controller modules length generalization algorithmic reasoning
Transformers struggle with tasks that require symbolic planning loops, variable updates, and conditional branching, especially under length extrapolation. We introduce discrete controller modules that insert a small set of program primitives (ASSIGN, ADD, COMPARE, BRANCH) into Transformer blocks via a Gumbel-Softmax selector over operations and a compact program state of registers, flags, and optional memory. We prove that the augmented model can simulate any bounded-step program by mapping each primitive step to one controller step, and we bound the deviation of relaxed execution from its discrete trace by $O(\tau+\kappa^{-1})$ (selection temperature $\tau$, comparison sharpness $\kappa$). Empirically, the controller-augmented Transformer achieves strong length generalization on algorithmic benchmarks (Sorting, Sum-of-List, BFS), improving longest-length accuracy by up to $20$-$40$ points over strong baselines, and yields consistent gains on symbolic QA (DROP) and program-synthesis-style tasks (RobustFill) with reduced compositionality drop-off. The learned execution is interpretable: operation traces align with ground truth, register roles are linearly decodable, and targeted knockouts cause localized accuracy losses. The approach adds only $\sim$5-7% FLOPs and can be applied sparsely (every $p$-th layer).
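A minimal sketch of the core mechanism follows: a Gumbel-Softmax selector chooses one of the four primitives and applies it to a tiny register state. The module below is a toy stand-in; how the controller is wired into Transformer blocks, the flags, and the optional memory are omitted, and all dimensions are illustrative.

```python
# Minimal sketch of a Gumbel-Softmax selector over program primitives acting on a
# tiny register state (illustrative; the Transformer integration is omitted).
import torch
import torch.nn.functional as F

OPS = ["ASSIGN", "ADD", "COMPARE", "BRANCH"]

class Controller(torch.nn.Module):
    def __init__(self, d_model=32, tau=0.5):
        super().__init__()
        self.op_logits = torch.nn.Linear(d_model, len(OPS))
        self.operand = torch.nn.Linear(d_model, 1)
        self.tau = tau

    def forward(self, h, registers):
        sel = F.gumbel_softmax(self.op_logits(h), tau=self.tau, hard=True)  # one-hot op
        val = self.operand(h)
        assign = val.expand_as(registers)
        add = registers + val
        compare = (registers > val).float()
        branch = torch.where(registers[..., :1] > 0, registers, -registers)
        stacked = torch.stack([assign, add, compare, branch], dim=-1)
        return (stacked * sel.unsqueeze(-2)).sum(-1)        # selected op's result

torch.manual_seed(0)
ctrl = Controller()
h = torch.randn(2, 32)                 # per-step hidden state from a Transformer block
regs = torch.zeros(2, 4)               # 4 registers per example
print(ctrl(h, regs))
```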
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1240. Fantastic Tractor-Dogs and How Not to Find Them With Open-Vocabulary Detectors
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ open-vocabulary object detection vision-language false positives
Open-Vocabulary Detectors (OVDs) excel in zero-shot benchmarks, but we observe a critical flaw in real-world deployment: a high rate of confident false positive predictions on images that do not contain any target objects (e.g., detecting a tractor in an image of a dog). This issue is masked by standard benchmarks like COCO and LVIS, as they rarely contain images without any of the target classes present. We identify vision-language fusion layers in early-fusion OVD architectures (e.g., Grounding DINO or LLMDet) as the root cause, and show how they distribute irrelevant class information across image features when no prompted object is present. To mitigate background false positives without costly retraining, we propose a simple, training-free method: appending attention sink tokens to the input prompt. We show that such sinks can redirect spurious attention and dramatically reduce background false positives. Our approach significantly improves the performance of all six early-fusion models tested (e.g., boosting AP on LVIS by more than 5x at a false positive rate of 0.01 for some models), making them practical for real-world applications where images without the object of interest are much more prevalent.
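At the interface level, the idea can be pictured as follows: extra "sink" phrases are appended to the class prompt so spurious matches have somewhere harmless to go, and anything assigned to a sink is discarded. This is only a schematic simplification (the actual method operates on attention inside the fusion layers); `run_detector` and the sink phrases are stubs and hypothetical names, not a real detector API.

```python
# Schematic sketch only: append sink phrases to the class prompt and drop any boxes
# assigned to them. (`run_detector` is a stub standing in for an open-vocabulary
# detector; the real method works at the attention level, not by label filtering.)
SINK_PHRASES = ["background", "nothing"]          # hypothetical sink tokens

def run_detector(image, prompt_classes):
    # Stub: pretend the detector confidently "finds" the first prompted class.
    return [{"label": prompt_classes[0], "score": 0.9, "box": (10, 10, 50, 50)}]

def detect_with_sinks(image, classes):
    prompt = classes + SINK_PHRASES               # sinks absorb spurious matches
    detections = run_detector(image, prompt)
    return [d for d in detections if d["label"] not in SINK_PHRASES]

print(detect_with_sinks(image=None, classes=["tractor"]))
```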
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1241. Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Dataset Fully Open Reasoning
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT).
Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities.
Addressing these challenges, our work makes three primary contributions.
First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy.
Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases.
Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model, on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. A comprehensive ablation study further dissects the impact of our data curation process, revealing that each stage provides significant performance gains across a wide range of benchmarks.
Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1242. Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ foundation models relational deep learning relational data transformer
Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks.
The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures, and functional dependencies.
We propose the _Relational Transformer (RT)_, a cell-level architecture pretrained on diverse relational databases and directly applicable to unseen datasets and tasks, without any need for task- or dataset-specific fine-tuning or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel _Relational Attention_ mechanism over columns, rows, and primary-foreign key links.
Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance; on binary classification it averages 94\% of fully supervised AUROC in a single forward pass, and fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT's zero-shot transfer harnesses task-table context, column and feature attention, and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1243. MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
๐ฌ 19
โญ 6.00
๐ generative models
๐ท๏ธ Video generation; Diffusion models
We tackle the task of any-reference video generation, which aims to synthesize videos conditioned on arbitrary types and combinations of reference subjects, together with textual prompts. This task faces persistent challenges, including identity inconsistency, entanglement among multiple reference subjects, and copy-paste artifacts. To address these issues, we introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism, enabling flexible synthesis conditioned on diverse reference images and textual prompts. Specifically, masked guidance employs a region-aware masking mechanism combined with pixel-wise channel concatenation to preserve appearance features of multiple subjects along the channel dimension. This design preserves identity consistency and maintains the capabilities of the pre-trained backbone, without requiring any architectural changes. To mitigate subject confusion, we introduce a subject disentanglement mechanism which injects the semantic values of each subject derived from the text condition into its corresponding visual region. Additionally, we establish a four-stage data pipeline to construct diverse training pairs, effectively alleviating copy-paste artifacts. Extensive experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches, paving the way for scalable, controllable, and high-fidelity any-reference video synthesis. The code and video demos are available in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1244. How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation.
๐ฌ 17
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark Analysis Transferability
Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1245. From Neural Networks to Logical Theories: The Correspondence between Fibring Modal Logics and Fibring Neural Networks
๐ฌ 11
โญ 6.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ fibring modal logics logical expressiveness graph neural networks transformer encoders
Fibring of modal logics is a well-established formalism for combining countable families of modal logics into a single fibred language with common semantics, characterized by fibred models. Inspired by this formalism, fibring of neural networks was introduced as a neurosymbolic framework for combining learning and reasoning in neural networks. Fibring of neural networks uses the (pre-)activations of a trained network to evaluate a fibring function computing the weights of another network whose outputs are injected back into the original network. However, the exact correspondence between fibring of neural networks and fibring of modal logics was never formally established. In this paper, we close this gap by formalizing the idea of fibred models compatible with fibred neural networks. Using this correspondence, we then derive non-uniform logical expressiveness results for Graph Neural Networks (GNNs), Graph Attention Networks (GATs) and Transformer encoders. Longer-term, the goal of this paper is to open the way for the use of fibring as a formalism for interpreting the logical theories learnt by neural networks with the tools of computational logic.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1246. Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility
๐ฌ 10
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ LLMs cognitive science interpretability common sense reasoning
Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality. In this work, we identify linear representations that discriminate between modal categories within a variety of LMs, or modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.
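A modal difference vector of the kind described can be sketched as the difference of mean activations between two modal categories, with new sentences scored by projection onto it. The random vectors below are stand-ins for real LM hidden states, and the sign convention is an assumption for illustration.

```python
# Minimal sketch: a "modal difference vector" as the difference of mean activations
# between two modal categories, then scoring new items by projection. Random vectors
# stand in for real LM hidden states.
import numpy as np

rng = np.random.default_rng(0)
d = 128
possible_acts = rng.normal(0.3, 1.0, size=(50, d))     # activations for "possible" sentences
impossible_acts = rng.normal(-0.3, 1.0, size=(50, d))  # activations for "impossible" sentences

diff_vec = possible_acts.mean(axis=0) - impossible_acts.mean(axis=0)
diff_vec /= np.linalg.norm(diff_vec)

def modal_score(activation):
    """Projection along the modal difference vector; higher = more 'possible'."""
    return float(activation @ diff_vec)

new_item = rng.normal(0.3, 1.0, size=d)
print(f"score: {modal_score(new_item):.2f}")
```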
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1247. Learning a Game by Paying the Agents
๐ฌ 14
โญ 6.00
๐ learning theory
๐ท๏ธ No-Regret Learning Inverse Game Theory Revealed Preference Steering
We study the problem of learning the utility functions of no-regret learning agents in a repeated normal-form game.
Differing from most prior literature, we introduce a principal with the power to observe the agents playing the game, send agents signals, and give agents *payments* as a function of their actions.
We show that the principal can, using a number of rounds polynomial in the size of the game, learn the utility functions of all agents to any desired precision $\varepsilon > 0$, for any no-regret learning algorithms of the agents.
Our main technique is to formulate a zero-sum game between the principal and the agents, where the principal's strategy space is the set of all payment functions.
Finally, we discuss implications for the problem of *steering* agents to a desired equilibrium: in particular, we introduce, using our utility-learning algorithm as a subroutine, the first algorithm for steering arbitrary no-regret learning agents without prior knowledge of their utilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1248. ELEPHANT: Measuring and understanding social sycophancy in LLMs
๐ฌ 12
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language models sycophancy affirmation benchmark social sycophancy
LLMs are known to exhibit _sycophancy_: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce **social sycophancy**, characterizing sycophancy as excessive preservation of a user's _face_ (their desired self-image), and present **ELEPHANT**, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve the user's face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm *whichever side the user adopts* in 48% of cases--telling both the at-fault party and the wronged party that they are not wrong--rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets, and that while existing mitigation strategies for sycophancy are limited in effectiveness, model-based steering shows promise for mitigating these behaviors. Our work provides theoretical grounding and an empirical benchmark for understanding and addressing sycophancy in the open-ended contexts that characterize the vast majority of LLM use cases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1249. WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables
๐ฌ 10
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Speech Large Language Models SLLM Voice Assistant Benchmark
Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29\% to 59\%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1250. Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
๐ฌ 18
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Vision-language model Jailbreak Transferability
Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose **U**niversa**l** and **tra**nsferable jail**break** (**UltraBreak**), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1251. Boomerang Distillation Enables Zero-Shot Model Size Interpolation
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ knowledge distillation pretraining adaptive compute model interpolation
Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1252. Ctrl-World: A Controllable Generative World Model for Robot Manipulation
๐ฌ 9
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ World Model Vision-Language-Action Model (VLA)
Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability to handle unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels. Both of these processes are slow, costly, and difficult to scale.
World models offer a promising, scalable alternative by enabling policies to roll out in imagination.
However, a key challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies.
This requires a world model compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interactions, which is not achieved by previous works.
In this paper, we take a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach can improve policy success by 44.7\%. Videos can be found at https://sites.google.com/view/ctrl-world.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1253. Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ RLVR Large Language Model Risk-Sensitive Reinforcement Learning
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. Yet current methods face an exploration dilemma: standard RL struggles to escape the local optima of pre-trained LLMs' sharply peaked initial policies, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. We address this with a Risk-Sensitive Reinforcement Learning framework. By adopting a risk-seeking objective that interpolates between mean and maximum rewards, we derive a novel Risk-Sensitive GRPO (RS-GRPO) algorithm that emphasizes hard prompts to drive exploration. Across six mathematical reasoning benchmarks and five LLMs, RS-GRPO consistently improves pass@k performance while enhancing or maintaining pass@1.
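The mean-to-max interpolation at the heart of the risk-seeking objective can be illustrated with a small numerical sketch. The exponentially tilted baseline below is one standard way to realize such an interpolation; the function name, the parameter `beta`, and the GRPO-style normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def risk_sensitive_advantages(rewards, beta=1.0, eps=1e-8):
    """Group-relative advantages with a risk-seeking baseline (illustrative).

    The baseline (1/beta) * log(mean(exp(beta * r))) interpolates between the
    group mean (beta -> 0) and the group maximum (beta -> infinity), which is
    the kind of mean/max interpolation the abstract describes.
    """
    r = np.asarray(rewards, dtype=float)
    if beta <= 0:
        baseline = r.mean()
    else:
        m = r.max()  # max-shift for numerical stability of the log-mean-exp
        baseline = m + np.log(np.exp(beta * (r - m)).mean()) / beta
    return (r - baseline) / (r.std() + eps)  # GRPO-style per-group normalization

# Toy group of verifiable rewards (0/1 correctness) for a single hard prompt.
group = [0.0, 0.0, 1.0, 0.0]
print(risk_sensitive_advantages(group, beta=0.0))   # mean baseline (risk-neutral)
print(risk_sensitive_advantages(group, beta=50.0))  # near-max baseline (risk-seeking)
```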
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1254. On the $O(1/T)$ Convergence of Alternating Gradient Descent-Ascent in Bilinear Games
๐ฌ 19
โญ 6.00
๐ optimization
๐ท๏ธ Two-player zero-sum games Alternating gradient descent-ascent Performance estimation programming
We study the alternating gradient descent-ascent (AltGDA) algorithm in two-player zero-sum games.
Alternating methods, where players take turns to update their strategies, have long been recognized as simple and practical approaches for learning in games, exhibiting much better numerical performance than their simultaneous counterparts.
However, our theoretical understanding of alternating algorithms remains limited, and results are mostly restricted to the unconstrained setting.
We show that for two-player zero-sum games that admit an interior Nash equilibrium, AltGDA converges at an $O(1/T)$ ergodic convergence rate when employing a small constant stepsize. This is the first result showing that alternation improves over the simultaneous counterpart of GDA in the constrained setting.
For games without an interior equilibrium, we show an $O(1/T)$ local convergence rate with a constant stepsize that is independent of any game-specific constants.
In a more general setting, we develop a performance estimation programming (PEP) framework to jointly optimize the AltGDA stepsize along with its worst-case convergence rate.
The PEP results indicate that AltGDA may achieve an $O(1/T)$ convergence rate for a finite horizon $T$, whereas its simultaneous counterpart appears limited to an $O(1/\sqrt{T})$ rate.
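For readers who want to see the alternation concretely, the following toy experiment contrasts simultaneous and alternating GDA on an unconstrained bilinear game $f(x, y) = x^\top A y$; the matrix, stepsize, and horizon are arbitrary illustrative choices and do not reproduce the paper's constrained setting or PEP analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
eta, T = 0.05, 2000

def run_gda(alternating):
    x, y, traj = np.ones(3), np.ones(3), []
    for _ in range(T):
        x_new = x - eta * (A @ y)            # descent step for the min player
        # ascent step: the max player sees the *updated* x only when alternating
        y = y + eta * (A.T @ (x_new if alternating else x))
        x = x_new
        traj.append((x.copy(), y.copy()))
    return traj

for name, alt in [("simultaneous", False), ("alternating", True)]:
    traj = run_gda(alt)
    x_bar = np.mean([x for x, _ in traj], axis=0)    # ergodic (time-averaged) iterate
    print(f"{name:12s}  ||avg x|| = {np.linalg.norm(x_bar):10.3f}"
          f"  ||last x|| = {np.linalg.norm(traj[-1][0]):12.3f}")
```

On this toy problem the simultaneous iterates spiral outward, while the alternating iterates remain bounded and their time average drifts toward the equilibrium at the origin.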
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1255. Learning to Interpret Weight Differences in Language Models
๐ฌ 10
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ interpretability weight diffs lora
Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes ("weight diffs") are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of broadly understanding model weight changes in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train an introspection adapter, which can be applied to a compatible finetuned model to make it self-describe the weight changes. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using concise and accurate natural language descriptions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1256. The Mind's Transformer: Computational Neuroanatomy of LLM-Brain Alignment
๐ฌ 21
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ language model neuroscience brain alignment fMRI
The alignment of Large Language Models (LLMs) and brain activity provides a powerful framework to advance our understanding of cognitive neuroscience and artificial intelligence. In this work, we zoom into one of the fundamental units of LLMs--the transformer block--to provide the first systematic computational neuroanatomy of its internal operations and human brain activity during language processing. Analyzing 21 state-of-the-art LLMs across five model families, we extract and evaluate 13 distinct intermediate states per transformer block--from initial layer normalization through attention mechanisms to feed-forward networks (FFNs). Our analysis reveals three key findings: (1) The commonly used hidden states in LLMs are surprisingly suboptimal, with over 90\% of brain voxels in sensory and language regions better explained by previously unexplored intermediate computations; (2) Different computational stages within a single transformer block map to anatomically distinct brain systems, revealing an intra-block hierarchy where early attention states align with sensory cortices while later FFN states correspond to association areas--mirroring the cortical processing hierarchy; (3) Rotary Positional Embeddings (RoPE) specifically enhance alignment along the brain's auditory processing streams. Per-head queries with RoPE best explain 74\% of auditory cortex activity compared to 8\% without RoPE, providing the first neurobiological validation of this architectural component in LLMs. Building on these insights, we propose MindTransformer, a feature selection framework that learns brain-aligned representations from all intermediate states. MindTransformer achieves significantly improved brain alignment, with correlation improvements in primary auditory cortex exceeding gains from 456× model scaling. Our computational neuroanatomy approach opens new directions for understanding both biological intelligence through the lens of transformer computations and artificial intelligence through principles of brain organization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1257. Dens3R: A Foundation Model for 3D Geometry Prediction
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Foundation Model 3D Geometry Prediction
Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various tasks and highlight its potential for broader applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1258. IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
๐ฌ 5
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Image-Grounded Video Perception and Reasoning Multimodal llms Benchmark
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose \textbf{IV-Bench}, the first comprehensive benchmark for evaluating \emph{Image-Grounded Video Perception and Reasoning}. IV-Bench consists of 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform on image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. These findings collectively provide valuable insights for future research. Our code and data are released at \url{https://anonymous.4open.science/r/IV-Bench-A3F7}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1259. ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
๐ฌ 5
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multi-Timescale Benchmark Long Video Understanding
Understanding long videos requires Multimodal Large Language Models (MLLMs) to grasp multi-timescale information, often organized in hierarchies. However, current long-video understanding benchmarks either overlook multi-timescale design or distribute questions targeting different timescales across different videos. This approach entangles timescales with video content, thereby hindering a clear assessment of MLLM multi-timescale performance. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales--clip (seconds), shot (tens of seconds), event (minutes), and story (hours)--all within the same video content. This "within-content" multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 videos (avg. 86 min) from 5 main categories and 36 sub-categories, each with 4-8 carefully designed questions and at least one question targeting each timescale. Evaluating 22 MLLMs reveals a distinct U-shaped performance trend: higher accuracy at the shortest (clip) and longest (story) timescales, with a dip at intermediate levels. Furthermore, ablation studies demonstrate that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a crucial fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at \url{https://anonymous.4open.science/r/ScaleLong-7717}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1260. Reevaluating Policy Gradient Methods for Imperfect-Information Games
๐ฌ 16
โญ 6.00
๐ reinforcement learning
๐ท๏ธ imperfect-information games two-player zero-sum games reinforcement learning multi agent game theory
In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results on the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods.
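As a reminder of what the released exploitability computations measure, the snippet below evaluates exact exploitability in a tiny zero-sum matrix game (rock-paper-scissors). The paper's five large games require extensive-form best-response computations; this matrix-game version is only a conceptual illustration.

```python
import numpy as np

def exploitability(A, x, y):
    """Exact exploitability of the profile (x, y) in the zero-sum matrix game
    where the row player receives x^T A y.

    It sums both players' best-response improvements and equals 0 exactly at
    a Nash equilibrium.
    """
    row_br_value = np.max(A @ y)     # best row response against y
    col_br_value = np.min(x @ A)     # best column response against x
    return float(row_br_value - col_br_value)

# Rock-paper-scissors payoff matrix for the row player.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)

uniform = np.ones(3) / 3
biased = np.array([0.8, 0.1, 0.1])
print(exploitability(A, uniform, uniform))  # 0.0: uniform play is an equilibrium
print(exploitability(A, biased, uniform))   # ~0.7: the biased row player can be exploited
```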
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1261. ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ byte-level language modeling tokenization
Modern language models (LMs) still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce \textbf{ByteFlow Net}, a new architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. Our approach is grounded in information theory: ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, allowing the model to dynamically evaluate the information cost of grouping bytes and decide chunk boundaries during processing. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive, robust, and information-grounded language models.
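The abstract ties chunk boundaries to a coding rate of latent representations. One common coding-rate formula from the compression literature, and a greedy boundary rule built on it, are sketched below; whether ByteFlow Net uses this exact rate or decision rule is not stated in the abstract, so the function names, `eps`, and `threshold` should be read as assumptions.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z @ Z.T) of a d x n
    matrix whose columns are latent byte representations."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]

def chunk_boundaries(latents, threshold=0.4, eps=0.5):
    """Greedy compression-driven segmentation (illustrative): start a new chunk
    whenever appending the next latent raises the chunk's coding rate by more
    than `threshold`."""
    boundaries, start = [], 0
    for t in range(1, latents.shape[1]):
        gain = coding_rate(latents[:, start:t + 1], eps) - coding_rate(latents[:, start:t], eps)
        if gain > threshold:
            boundaries.append(t)
            start = t
    return boundaries

# Two synthetic "byte" segments lying in different one-dimensional subspaces.
u1, u2 = np.zeros((8, 1)), np.zeros((8, 1))
u1[0], u2[1] = 1.0, 1.0
latents = np.concatenate([u1 @ np.ones((1, 6)), u2 @ np.ones((1, 6))], axis=1)
print(chunk_boundaries(latents))  # -> [6]: a boundary where the subspace changes
```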
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1262. Frayed RoPE and Long Inputs: A Geometric Perspective
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ RoPE context length extension sink tokens clustering attention long context transformer language model
Rotary Positional Embedding (RoPE) is a widely adopted technique for encoding position in language models, which, while effective, causes performance breakdown when input length exceeds training length. Prior analyses assert (rightly) that long inputs cause channels to rotate "out of distribution," but it is not clear how extra rotation relates to or causes pathological behavior. Through empirical analysis, we advance a unified geometric understanding of attention behavior with RoPE. We find that attention induces tight clustering of separated key and query latent point clouds, allowing for the creation of sink tokens: placeholders that allow attention heads to avoid token mixing when not required. RoPE applied to longer inputs damages this key/query cluster separation, producing pathological behavior by inhibiting sink token functionality. From this geometric perspective, we propose RoPE-ID (In Distribution), a straightforward modification that allows attention layers to generalize to longer inputs out of the box: apply RoPE with high frequency to a subset of channels. We demonstrate the effectiveness of RoPE-ID for extended inputs using 1B and 3B parameter Transformers on the LongBench and RULER information retrieval benchmarks.
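Because the proposed fix is stated compactly ("apply RoPE with high frequency to a subset of channels"), the sketch below shows one plausible way to wire that into standard RoPE; the base, the number of high-frequency channel pairs, and the fixed frequency value are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE frequencies: theta_j = base**(-2j/dim) for each channel pair j."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)        # (dim/2,)
    return np.outer(positions, inv_freq)                    # (seq, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x (seq, dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def apply_rope_id(x, positions, base=10000.0, high_freq=1.0, n_high_pairs=8):
    """Illustrative RoPE-ID variant: the last `n_high_pairs` channel pairs use a
    fixed high rotation frequency, so they keep cycling through angles already
    seen during training even for positions far beyond the training length."""
    angles = rope_angles(positions, x.shape[1], base)
    angles[:, -n_high_pairs:] = np.outer(positions, np.full(n_high_pairs, high_freq))
    return apply_rope(x, angles)

q = np.random.default_rng(0).standard_normal((16, 64))      # (seq_len, head_dim)
print(apply_rope_id(q, positions=np.arange(16)).shape)      # (16, 64)
```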
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1263. ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations
๐ฌ 22
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Relational Hallucination; Interleaved Chain of Image and Text; Large Vision-Language Models
While Large Vision-Language Models (LVLMs) achieve strong performance in multimodal tasks, hallucinations continue to affect their reliability. Among the three categories of hallucinations (object, attribute, and relation), relation hallucinations account for the largest proportion but have received the least attention. To address this challenge, we propose \textbf{ChainMPQ} (\textbf{M}ulti-\textbf{P}erspective \textbf{Q}uestions guided Interleaved Text-image Reasoning \textbf{Chain}), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are sequentially input to the model, with textual and visual memories from earlier steps providing supporting context for subsequent ones, thereby forming an interleaved chain of image and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1264. GOOD: Geometry-guided Out-of-Distribution Modeling for Open-set Test-time Adaptation in Point Cloud Semantic Segmentation
๐ฌ 14
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Open-set Semantic Segmentation Online Domain Adaptation Point Cloud Segmentation
Open-set Test-time Adaptation (OSTTA) has been introduced to address the challenges of both online model optimization and open-set recognition. Despite the demonstrated success of OSTTA methodologies in 2D image recognition, their application to 3D point cloud semantic segmentation is still hindered by the complexities of point cloud data, particularly the imbalance between known (in-distribution, ID) and unknown (out-of-distribution, OOD) data, where known samples dominate and unknown instances are often sparse or even absent. In this paper, we propose a simple yet effective strategy, termed Geometry-guided Out-of-Distribution Modeling (GOOD), specifically designed to address OSTTA for 3D point cloud semantic segmentation. Technically, we first leverage geometric priors to cluster the point cloud into superpoints, thereby mitigating the numerical disparity between individual points and providing a more structured data representation. Then, we introduce a novel confidence metric to effectively distinguish between known and unknown superpoints. Additionally, prototype-based representations are integrated to enhance the discrimination between ID and OOD regions, facilitating robust segmentation. We validate the efficacy of GOOD across four benchmark datasets. Remarkably, on the Synth4D to SemanticKITTI task, GOOD outperforms HGL by 1.93%, 8.99%, and 7.91% in mIoU, AUROC, and FPR95, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1265. ExGRPO: Learning to Reason from Prior Successes
๐ฌ 19
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Large Reasoning Model Reinforcement Learning with Verifiable Rewards
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1266. ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Conformal Prediction Test-Time Scaling Speculative Decoding
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. Speculative decoding is a natural way to accelerate the scaling process; however, scaling along both the parallel and sequential dimensions poses significant challenges, including substantial memory-bound execution and synchronization overhead. We introduce *ATTS* (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework that follows the hypothesis testing process to address these challenges. By revisiting arithmetic intensity, *ATTS* identifies synchronization as the primary bottleneck. It enables asynchronous inference through online calibration and proposes an ordinal classification algorithm that supports a three-stage rejection sampling pipeline, scaling along both the sequential and parallel axes. Across experiments on the MATH, AMC23, AIME24, and AIME25 datasets and across multiple draft-target model families, we show that *ATTS* delivers up to *56.7x* speedup in test-time scaling and a *4.14x* throughput improvement, while maintaining accurate control of the rejection rate, reducing latency and memory overhead, and incurring no accuracy loss. By scaling both in parallel and sequential dimensions, we enable the 1.5B/70B draft/target model combination to achieve the performance of the state-of-the-art reasoning model o3-mini (high) on the AIME dataset. An anonymous repository is available at anonymous.4open.science/r/Asynchronous-Test-Time-Scaling-5940.
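The "statistically guaranteed" rejection step can be illustrated with a plain split-conformal threshold on verifier scores: calibrate a cutoff so that a correct answer is rejected with probability at most alpha, then filter drafts against it. The online calibration, ordinal classification, and three-stage pipeline of ATTS are not reproduced here, and every name below is illustrative.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal cutoff from verifier scores of known-correct answers.

    The k-th largest calibration score with k = ceil((n + 1) * (1 - alpha)),
    roughly the alpha-level lower quantile, guarantees that a fresh correct
    answer clears the cutoff with probability at least 1 - alpha, assuming
    exchangeability of calibration and test scores.
    """
    n = len(cal_scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return np.sort(cal_scores)[-k]

def rejection_sample(candidates, scores, tau):
    """Keep only draft answers whose verifier score clears the calibrated cutoff."""
    return [c for c, s in zip(candidates, scores) if s >= tau]

rng = np.random.default_rng(0)
cal = rng.normal(0.7, 0.1, size=200)              # verifier scores of correct answers
tau = conformal_threshold(cal, alpha=0.1)
drafts = [f"answer_{i}" for i in range(8)]
draft_scores = rng.normal(0.6, 0.15, size=8)      # verifier scores of fresh drafts
print(round(float(tau), 3), rejection_sample(drafts, draft_scores, tau))
```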
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1267. Fast Convergence of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks
๐ฌ 22
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ natural gradient descent over-parameterization physics-informed neural networks neural tangent kernel
In the context of over-parameterization, there is a line of work demonstrating that randomly initialized (stochastic) gradient descent (GD) converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. However, the convergence rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, leading to a slow training process. In this paper, we show that for training two-layer $\text{ReLU}^3$ Physics-Informed Neural Networks (PINNs), the learning rate can be improved from the smallest eigenvalue of the limiting Gram matrix to the reciprocal of the largest eigenvalue, implying that GD actually enjoys a faster convergence rate. Despite such improvements, the convergence rate is still tied to the least eigenvalue of the Gram matrix, leading to slow convergence. We then develop the positive definiteness of Gram matrices with general smooth activation functions and provide the convergence analysis of natural gradient descent (NGD) in training two-layer PINNs, demonstrating that the maximal learning rate can be $\mathcal{O}(1)$ and at this rate, the convergence rate is independent of the Gram matrix. In particular, for smooth activation functions, the convergence rate of NGD is quadratic. Numerical experiments are conducted to verify our theoretical results.
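For the quadratic loss, natural gradient descent coincides with a (damped) Gauss-Newton step, which is easy to sketch for a model that is linear in its parameters, such as a random-feature stand-in for a wide two-layer network. The example below is only meant to illustrate why the NGD step size can be $\mathcal{O}(1)$ while plain GD must use a much smaller rate; it is not the paper's PINN setup (no PDE residual terms).

```python
import numpy as np

def ngd_step(J, residual, lam=1e-6):
    """Damped natural-gradient / Gauss-Newton step for L(theta) = 0.5*||u(theta) - y||^2.

    With the Gram (NTK) matrix K = J J^T, the induced function-space update is
    u <- u - K (K + lam*I)^{-1} (u - y), so with negligible damping the residual
    is removed in essentially one step regardless of how ill-conditioned K is.
    """
    n = J.shape[0]
    return J.T @ np.linalg.solve(J @ J.T + lam * np.eye(n), residual)

# Toy over-parameterized model that is linear in theta: random ReLU features,
# standing in for a wide two-layer network near initialization.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)[:, None]
y = np.sin(np.pi * x[:, 0])
W, b = rng.standard_normal((1, 512)), rng.standard_normal(512)
features = np.maximum(x @ W + b, 0.0)        # J = d(model)/d(theta) for a linear head

for method in ("gd", "ngd"):
    theta = np.zeros(512)
    for _ in range(50):
        residual = features @ theta - y
        if method == "gd":
            theta -= 1e-4 * features.T @ residual   # small step forced by the spectrum
        else:
            theta -= ngd_step(features, residual)   # full O(1) natural-gradient step
    print(method, float(np.mean((features @ theta - y) ** 2)))
```

On this toy problem the natural-gradient run typically drives the training error many orders of magnitude below the plain-GD run within the same number of iterations, mirroring the learning-rate comparison described in the abstract.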
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1268. Towards Safe and Optimal Online Bidding: A Modular Look-ahead Lyapunov Framework
๐ฌ 13
โญ 6.00
๐ optimization
๐ท๏ธ online bidding budget constraints ROI constraints Lyapunov optimization
This paper studies online bidding subject to simultaneous budget and return-on-investment (ROI) constraints, which encodes the goal of balancing high volume and profitability. We formulate the problem as a general constrained online learning problem that can be applied to diverse bidding settings (e.g., first-price or second-price auctions) and feedback regimes (e.g., full or partial information), among others. We introduce L2FOB, a Look-ahead Lyapunov Framework for Online Bidding with strong empirical and theoretical performance. By combining optimistic reward and pessimistic cost estimation with the look-ahead virtual queue mechanism, L2FOB delivers safe and optimal bidding decisions. We provide adaptive guarantees: L2FOB achieves $O(\mathcal{E}_r(T,p)+(\nu^* / \rho) \mathcal{E}_c(T,p))$ regret and $O(\mathcal{E}_r(T,p)+\mathcal{E}_c(T,p))$ anytime ROI constraint violation, where $\mathcal{E}_r(T,p)$ and $\mathcal{E}_c(T,p)$ are cumulative estimation errors over $T$ rounds, $\rho$ is the average per-round budget, and $\nu^*$ is the offline optimal average reward. We instantiate L2FOB in several online bidding settings, demonstrating guarantees that match or improve upon the best-known results. These results are derived from the novel look-ahead design and Lyapunov stability analysis. Numerical experiments further validate our theoretical guarantees.
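The look-ahead virtual-queue idea follows the general drift-plus-penalty recipe: keep a queue of accumulated constraint violation and trade estimated value against queue-weighted estimated cost when choosing a bid. The toy class below sketches only that recipe; L2FOB's look-ahead update and its optimistic/pessimistic estimators are more refined, the budget constraint is omitted, and all constants here are arbitrary.

```python
import numpy as np

class LyapunovBidder:
    """Toy drift-plus-penalty bidder for an ROI-constrained auction (illustrative).

    A virtual queue Q accumulates ROI violation (target_roi * cost - value); bids
    are scored by estimated value minus (Q / V) times estimated cost, so a growing
    queue pushes the bidder toward cheaper, ROI-feasible bids.
    """
    def __init__(self, target_roi=1.2, V=10.0):
        self.target_roi, self.V, self.Q = target_roi, V, 0.0

    def choose_bid(self, bids, est_value, est_cost):
        score = est_value - (self.Q / self.V) * est_cost
        return bids[int(np.argmax(score))]

    def update(self, realized_value, realized_cost):
        # Queue grows by the per-round ROI violation and never drops below 0.
        self.Q = max(self.Q + self.target_roi * realized_cost - realized_value, 0.0)

rng = np.random.default_rng(0)
bidder, bids = LyapunovBidder(), np.linspace(0.1, 1.0, 10)
for _ in range(2000):
    v = rng.uniform(0.5, 1.5)                 # impression value, observed before bidding
    p_win = np.minimum(bids, 1.0)             # assumed win probability of each candidate bid
    b = bidder.choose_bid(bids, est_value=p_win * v, est_cost=p_win * bids)
    win = rng.random() < min(b, 1.0)          # first-price style outcome
    bidder.update(realized_value=v * win, realized_cost=b * win)
print(f"final virtual queue (ROI debt): {bidder.Q:.2f}")
```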
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1269. Physics-informed learning under mixing: How physical knowledge speeds up learning
๐ฌ 12
โญ 6.00
๐ learning theory
๐ท๏ธ learning with dependent data physics-informed machine learning convergence rates complexity-dependent bounds
A major challenge in physics-informed machine learning is to understand how the incorporation of prior domain knowledge affects learning rates when data are dependent. Focusing on empirical risk minimization with physics-informed regularization, we derive complexity-dependent bounds on the excess risk in probability and in expectation. We prove that, when the physical prior information is aligned, the learning rate improves from the (slow) Sobolev minimax rate to the (fast) optimal i.i.d. one without any sample-size deflation due to data dependence.
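Concretely, physics-informed ERM augments the data-fitting loss with a PDE-residual penalty on collocation points. A minimal sketch for a 1D Poisson-type prior $u''(x) = f(x)$ is given below; the network size, collocation grid, and weight `lam` are illustrative choices, and the paper's dependent-data setting is not modeled.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Supervised observations of u (the paper allows these to be dependent).
x_data = torch.linspace(-1, 1, 20).unsqueeze(1)
y_data = torch.sin(torch.pi * x_data)

# Collocation points where the physics prior u''(x) = -pi^2 sin(pi x) is enforced.
x_col = torch.linspace(-1, 1, 64).unsqueeze(1).requires_grad_(True)
f_col = (-torch.pi ** 2 * torch.sin(torch.pi * x_col)).detach()
lam = 0.1  # physics-regularization weight

for _ in range(500):
    opt.zero_grad()
    data_loss = ((net(x_data) - y_data) ** 2).mean()
    u = net(x_col)
    du = torch.autograd.grad(u.sum(), x_col, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x_col, create_graph=True)[0]
    pde_loss = ((d2u - f_col) ** 2).mean()
    loss = data_loss + lam * pde_loss          # physics-informed ERM objective
    loss.backward()
    opt.step()
print(float(data_loss), float(pde_loss))
```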
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1270. Bi-LoRA: Efficient Sharpness-Aware Minimization for Fine-Tuning Large-Scale Models
๐ฌ 17
โญ 6.00
๐ optimization
๐ท๏ธ low-rank adaptation efficient training generalization
Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large pre-trained models. Yet LoRA can face generalization challenges.
One promising way to improve generalization is Sharpness-Aware Minimization (SAM), which has proven effective for small-scale training scenarios. In this paper, we propose **Bi**-directional **Lo**w-**R**ank **A**daptation (Bi-LoRA), which introduces an auxiliary adversarial LoRA module. This design explicitly decouples sharpness optimization, handled by the auxiliary module, from task adaptation, performed by the primary module. Such a separation yields two key benefits. First, it transforms the sequential computation of the primary LoRA update and the adversarial perturbation into a parallel form, which roughly halves the training time and removes the main obstacle to applying SAM with LoRA. Second, it provides perturbations from the auxiliary module that do not collapse into the restricted optimization subspace of the primary module, enabling broader sharpness exploration and flatter minima. Bi-LoRA simultaneously achieves both efficiency and effectiveness within a single framework, as verified by extensive experiments across diverse architectures and tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1271. Obfuscated Activations Bypass LLM Latent-Space Defenses
๐ฌ 21
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Interpretability Adversarial Attack Jailbreaking Safety
_Latent-space_ monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners to detect harmful activations before they lead to undesirable actions. This prompts the question: can models execute harmful behavior _via inconspicuous latent states_? Here, we study such _obfuscated activations_. Our results are nuanced. We show that state-of-the-art latent-space defenses---such as activation probes and latent OOD detection---are vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our obfuscation attacks can reduce monitor recall from 100% down to 0% while still achieving a 90% jailbreaking success rate. However, we also find that certain probe architectures are more robust than others, and we discover the existence of an _obfuscation tax_: on a complex task (writing SQL code), evading monitors reduces model performance.
Together, our results demonstrate that white-box monitors are not robust to adversarial attack, while also providing concrete suggestions to alleviate, but not completely fix, this weakness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1272. PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
๐ฌ 11
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 4D Perception Camera Pose Estimation Depth Estimation Point Cloud Reconstruction
Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, point cloud reconstruction, and point tracking--all without post-processing. Training a geometry transformer for dynamic scenes from scratch, however, demands large-scale dynamic datasets and substantial computational resources, which are often impractical. To overcome this, we propose an efficient fine-tuning strategy that allows PAGE-4D to generalize to dynamic scenarios using only limited dynamic data and compute. In particular, we design a dynamics-aware aggregator that disentangles dynamic from static content for downstream scene understanding tasks: it first predicts a dynamics-aware mask, which then guides a dynamics-aware global attention mechanism. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. The source code and pretrained model weights are provided in the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1273. Revisiting Long-context Modeling from Context Denoising Perspective
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Modeling Long-context Understanding
Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, that can mislead model attention. In this paper, we conduct a fine-grained analysis of the context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify the noise information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model can achieve performance (50.92) comparable to GPT-4o (51.00).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1274. Robust Selective Activation with Randomized Temporal K-Winner-Take-All in Spiking Neural Networks for Continual Learning
๐ฌ 17
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking neural networks
The human brain exhibits remarkable efficiency in processing sequential information, a capability deeply rooted in the temporal selectivity and stochastic competition of neuronal activation. Current continual learning in spiking neural networks (SNNs) faces a critical challenge: balancing task-specific selectivity with adaptive resource allocation, and enhancing robustness to perturbations to mitigate catastrophic forgetting. Considering the intrinsic temporal dynamics of spiking neurons instead of traditional K-winner-take-all (K-WTA) based on firing rate, we explore how to make SNNs robust to temporal perturbations on lifelong learning tasks.
In this paper, we propose Randomized Temporal K-winner-take-all (RTK-WTA) SNNs for lifelong learning, a biologically grounded approach that integrates trace-dependent neuronal activation with probabilistic top-k selection. By dynamically prioritizing neurons based on their spatiotemporal relevance, RTK-WTA SNNs emulate the brain's ability to modulate neural resources in spatial and temporal dimensions while introducing controlled randomness to prevent overlapping task representations. We show theoretically that the proposed RTK-WTA SNNs enhance inter-class margins and robustness through expanded feature-space utilization. Experimental results show that RTK-WTA surpasses deterministic K-WTA by 3.07-5.0\% accuracy on splitMNIST and splitCIFAR100 with elastic weight consolidation. Controlled stochasticity balances temporal coherence and adaptability, offering a scalable framework for lifelong learning in neuromorphic systems.
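One concrete way to realize randomized temporal top-k selection is to perturb eligibility traces with Gumbel noise and keep the k largest perturbed values, which samples winners without replacement in proportion to exponentiated trace magnitudes. Whether RTK-WTA uses this particular perturbation is an assumption; the trace update below is likewise a simplified stand-in.

```python
import numpy as np

def randomized_k_wta(traces, k, temperature=1.0, rng=None):
    """Sample k 'winner' neurons via Gumbel-perturbed spatiotemporal traces.

    With temperature -> 0 this reduces to deterministic K-WTA on the traces;
    larger temperatures inject controlled randomness so the winning set varies
    across presentations (Gumbel-top-k sampling without replacement).
    """
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.random(traces.shape)))
    perturbed = traces / max(temperature, 1e-8) + gumbel
    winners = np.argsort(perturbed)[-k:]
    mask = np.zeros_like(traces)
    mask[winners] = 1.0
    return mask

def update_trace(trace, spikes, decay=0.9):
    """Simple exponential eligibility trace of recent spiking activity."""
    return decay * trace + spikes

rng = np.random.default_rng(0)
trace = np.zeros(100)
for _ in range(20):
    spikes = (rng.random(100) < 0.1).astype(float)      # toy spike trains
    trace = update_trace(trace, spikes)
    active_mask = randomized_k_wta(trace, k=10, temperature=0.5, rng=rng)
# Only the selected neurons would propagate activity at this step.
print(int(active_mask.sum()), "active neurons")
```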
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1275. Adversarially Pretrained Transformers may be Universally Robust In-Context Learners
๐ฌ 14
โญ 6.00
๐ learning theory
๐ท๏ธ Adversarial Robustness Transformer In-Context Learning
Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models, i.e., models that can robustly adapt to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can robustly generalize to unseen classification tasks through in-context learning from clean demonstrations (i.e., without requiring additional adversarial training or examples). This universal robustness stems from the model's ability to adaptively focus on robust features within given tasks. We also identify two open challenges for attaining robustness: the accuracy-robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models. While their training is expensive, the investment would prove worthwhile as downstream tasks can enjoy free adversarial robustness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1276. PCB-Bench: Benchmarking LLMs for Printed Circuit Board Placement and Routing
๐ฌ 20
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ LLMs Printed Circuit Board Placement and Routing Multimodal Benchmark
Recent advances in Large Language Models (LLMs) have enabled impressive capabilities across diverse reasoning and generation tasks. However, their ability to understand and operate on real-world engineering problems--such as Printed Circuit Board (PCB) placement and routing--remains underexplored due to the lack of standardized benchmarks and high-fidelity datasets. To address this gap, we introduce PCB-Bench, the first comprehensive benchmark designed to systematically evaluate LLMs in the context of PCB design. PCB-Bench spans three complementary task settings: (1) text-based reasoning with approximately 3,700 expert-annotated instances, consisting of over 1,800 question-answer pairs and their corresponding choice question versions, covering component placement, routing strategies, and design rule compliance; (2) multimodal image-text reasoning with approximately 500 problems requiring joint interpretation of PCB visuals and technical specifications, including component identification, function recognition, and visual trace reasoning; (3) real-world design comprehension using over 170 complete PCB projects with schematics, placement files, and design documentation. We design structured evaluation protocols to assess both generative and discriminative capabilities, and conduct extensive comparisons across state-of-the-art LLMs. Our results reveal substantial gaps in current models' ability to reason over spatial placements, follow domain-specific constraints, and interpret professional engineering artifacts. PCB-Bench establishes a foundational resource for advancing research toward more capable engineering AI, with implications extending beyond PCB design to broader structured reasoning domains.
Data and code are available at https://anonymous.4open.science/r/ICLR_submission_PCB-Bench-CDC5.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1277. Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
๐ฌ 20
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ dLLMs Inference Acceleration
Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks due to their bidirectional attention mechanisms. However, their computational complexity, scaling as $\mathcal{O}(L^3)$ with sequence length $L$, poses significant challenges for long-sequence and real-time applications, primarily due to the lack of compatibility with key-value caching and the non-autoregressive nature of denoising steps. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose \textbf{Dynamic-dLLM}, a training-free framework that enhances dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models like LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM significantly improves inference speed, attaining an average speedup exceeding 3$\times$ while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment without compromising performance. Code and models will be made publicly available.
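The adaptive parallel decoding idea can be illustrated with a confidence-thresholded unmasking loop: at every denoising step, all masked positions whose top-1 probability clears a (dynamically adjusted) threshold are committed at once. The dummy model and fixed threshold schedule below stand in for the learned calibration in APD, and the cache-budget component is not modeled.

```python
import numpy as np

MASK = -1

def parallel_decode_step(token_probs, tokens, threshold):
    """Commit, in parallel, every masked position whose top-1 probability clears
    the threshold; always commit at least the single most confident position so
    that decoding makes progress."""
    masked = np.where(tokens == MASK)[0]
    conf = token_probs[masked].max(axis=-1)
    commit = masked[conf >= threshold]
    if commit.size == 0:
        commit = masked[[int(np.argmax(conf))]]
    tokens = tokens.copy()
    tokens[commit] = token_probs[commit].argmax(axis=-1)
    return tokens, len(commit)

rng = np.random.default_rng(0)
vocab, seq_len = 50, 16
tokens, steps = np.full(seq_len, MASK), 0
while (tokens == MASK).any():
    # Stand-in for the dLLM denoiser: per-position distributions over the vocab.
    logits = rng.standard_normal((seq_len, vocab)) * 3.0
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # Simple decreasing threshold schedule standing in for APD's calibration.
    threshold = max(0.9 - 0.05 * steps, 0.3)
    tokens, _ = parallel_decode_step(probs, tokens, threshold)
    steps += 1
print(f"decoded {seq_len} tokens in {steps} denoising steps")
```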
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1278. Video Scene Segmentation with Genre and Duration Signals
๐ฌ 10
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Scene Segmentation Movie Scene Boundary Detection Video Temporal Segmentation
Video scene segmentation aims to detect semantically coherent boundaries in long-form videos, bridging the gap between low-level visual signals and high-level narrative understanding.
However, existing methods primarily rely on visual similarity between adjacent shots, which makes it difficult to accurately identify scene boundaries, especially when semantic transitions do not align with visual changes.
In this paper, we propose a novel approach that incorporates production-level metadata, specifically genre conventions and shot duration patterns, into video scene segmentation.
Our main contributions are three-fold:
(1) we leverage textual genre definitions as semantic priors to guide shot-level representation learning during self-supervised pretraining, enabling better capture of narrative coherence;
(2) we introduce a duration-aware anchor selection strategy that prioritizes shorter shots based on empirical duration statistics, improving pseudo-boundary generation quality;
(3) we propose a test-time shot splitting strategy that subdivides long shots into segments for improved temporal modeling.
Experimental results demonstrate state-of-the-art performance on MovieNet-SSeg and BBC datasets.
We introduce MovieChat-SSeg, extending MovieChat-1K with manually annotated scene boundaries across 1,000 videos spanning movies, TV series, and documentaries.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1279. Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation
๐ฌ 15
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotic Manipulation Imitation Learning 3D Representation Generalizable Policy
Articulated object manipulation is essential for various real-world robotic tasks, yet generalizing across diverse objects remains a major challenge. A key to generalization lies in understanding functional parts (e.g., door handles and knobs), which indicate where and how to manipulate across diverse object categories and shapes. Previous works attempted to achieve generalization by introducing foundation features, but these features are mostly 2D-based and do not specifically consider functional parts. When lifting these 2D features to geometry-profound 3D space, challenges arise, such as long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information. To address these issues, we propose \textbf{Part-Aware 3D Feature Field (PA3FF)}, a novel dense 3D feature with part awareness for generalizable articulated object manipulation. PA3FF is trained on 3D part proposals from large-scale labeled datasets via a contrastive learning formulation. Given point clouds as input, PA3FF predicts a continuous 3D feature field in a feedforward manner, where the distance between point features reflects the proximity of functional parts: points with similar features are more likely to belong to the same part. Building on this feature, we introduce the \textbf{Part-Aware Diffusion Policy (PADP)}, an imitation learning framework aimed at enhancing sample efficiency and generalization for robotic manipulation. We evaluate PADP on several simulated and real-world tasks, demonstrating that PA3FF consistently outperforms a range of 2D and 3D representations in manipulation scenarios, including CLIP, DINOv2, and Grounded-SAM, achieving state-of-the-art performance. Beyond imitation learning, PA3FF enables diverse downstream methods, including correspondence learning and segmentation tasks, making it a versatile foundation for robotic manipulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1280. On the Thinking-Language Modeling Gap in Large Language Models
๐ฌ 23
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Structural Causal Models
Large Language Models (LLMs) demonstrate remarkable capabilities in solving complicated reasoning tasks by imitating the human thinking process from human languages. However, even the most capable LLMs can still fail in tasks that are simple for humans. To understand the gap, we construct structural causal models of next-token predictors in human languages. As language is primarily a tool for humans to share knowledge instead of thinking, modeling human thinking from languages can integrate language expression biases into LLMs. More specifically, we show that LLMs can fail to understand implicit expressions -- expression patterns that occur less frequently during training. Consequently, LLMs can easily overlook critical information when biased by implicit expressions. We verify our theoretical claims with carefully constructed realistic datasets containing implicit expressions. Furthermore, we also propose a prompt-level intervention to instruct LLMs to carefully expand and focus on all the expressions available. The empirical success of the prompt-level intervention across 11 tasks and 4 representative LLMs, along with the improvements over general reasoning tasks, reaffirms our findings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1281. Exploratory Diffusion Model for Unsupervised Reinforcement Learning
๐ฌ 13
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning diffusion policy unsupervised reinforcement learning exploration
Unsupervised reinforcement learning (URL) pre-trains agents by exploring diverse states in reward-free environments, aiming to enable efficient adaptation to various downstream tasks. Without extrinsic rewards, prior methods rely on intrinsic objectives, but heterogeneous exploration data demand strong modeling capacity for both intrinsic reward design and policy learning. We introduce the **Ex**ploratory **D**iffusion **M**odel (**ExDM**), which leverages the expressive power of diffusion models to fit diverse replay-buffer distributions, thus providing accurate density estimates and a score-based intrinsic reward that drives exploration into under-visited regions. This mechanism substantially broadens state coverage and yields robust pre-trained policies. Beyond exploration, ExDM offers theoretical guarantees and practical algorithms for fine-tuning diffusion policies under limited interactions, overcoming instability and computational overhead from multi-step sampling. Extensive experiments on Maze2d and URLB show that ExDM achieves superior exploration and faster downstream adaptation, establishing new state-of-the-art results, particularly in environments with complex structure or cross-embodiment settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1282. Learning to summarize user information for personalized reinforcement learning from human feedback
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ pluralistic preference alignment RL finetuning of LLMs pluralistic reward modeling
As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model, meaning it assumes that everyone's preferences are the same.
We present a novel framework, **P**reference **L**earning **U**sing **S**ummarization (**PLUS**), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. Both the user-summarization model and reward model are trained simultaneously, creating an online co-adaptation loop. We show that in contrast to the standard Bradley-Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11-77\% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25\% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72\% win rate compared to 28\% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels, and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1283. Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies
๐ฌ 14
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Copula estimation dependence modelling diffusion non-parametric copula
Copulas are a fundamental tool for modelling multivariate dependencies in data, forming the method of choice in diverse fields and applications. However, the adoption of existing models for multimodal and high-dimensional dependencies is hindered by restrictive assumptions and poor scaling. In this work, we present methods for modelling copulas based on the principles of diffusions and flows. We design two processes that progressively forget inter-variable dependencies while leaving dimension-wise distributions unaffected, provably defining valid copulas at all times. We show how to obtain copula models by learning to remember the forgotten dependencies from each process, theoretically recovering the true copula at optimality. The first instantiation of our framework focuses on direct density estimation, while the second specialises in expedient sampling. Empirically, we demonstrate the superior performance of our proposed methods over state-of-the-art copula approaches in modelling complex and high-dimensional dependencies from scientific datasets and images. Our work enhances the representational power of copula models, empowering applications and paving the way for their adoption on larger scales and more challenging domains.
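For readers less familiar with copulas, the snippet below shows the probability-integral-transform view the paper builds on: pushing each dimension through its empirical CDF yields uniform marginals while leaving the dependence structure (the copula) untouched, which is precisely what the proposed processes are designed to forget and then remember. The diffusion and flow machinery itself is not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Correlated 2D data with deliberately non-Gaussian marginals.
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
x = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

# Probability integral transform: the empirical CDF of each dimension maps it to
# (0, 1), making every marginal uniform while leaving the joint ranks untouched.
u = np.column_stack([stats.rankdata(x[:, j]) / (len(x) + 1) for j in range(x.shape[1])])

rho_x, _ = stats.spearmanr(x[:, 0], x[:, 1])
rho_u, _ = stats.spearmanr(u[:, 0], u[:, 1])
print("marginal means (should be ~0.5):", u.mean(axis=0).round(3))
print("Spearman rho before vs. after transform:", round(float(rho_x), 3), round(float(rho_u), 3))
```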
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1284. Unified In-Context Video Editing
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ video editing; video generation; diffusion models
Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and the varying condition tokens "in context", and to support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves performance comparable to task specialists and exhibits emergent task composition abilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1285. TIPO: Text to Image with Text Pre-sampling for Prompt Optimization
๐ฌ 18
โญ 6.00
๐ generative models
๐ท๏ธ prompt optimization prompt engineering text-to-image
TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation. Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer, detailed versions. Conceptually, TIPO samples refined prompts from a targeted sub-distribution within the broader semantic space, preserving the original intent while significantly improving visual quality, coherence, and detail. Unlike resource-intensive methods based on large language models (LLMs) or reinforcement learning (RL), TIPO provides computational efficiency and scalability, opening new possibilities for effective, automated prompt engineering in T2I tasks.
We provide visual results and a human preference report to investigate TIPO's effectiveness. Experimental evaluations on benchmark datasets demonstrate substantial improvements in aesthetic quality, significant reduction of visual artifacts, and enhanced alignment with target distributions, along with strong human preference results. These results highlight the importance of targeted prompt engineering in text-to-image tasks and indicate broader opportunities for automated prompt refinement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1286. Visual Planning: Let's Think Only with Images
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ visual planning
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first'' tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1287. Learning Structure-Semantic Evolution Trajectories for Graph Domain Adaptation
๐ฌ 13
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Domain Adaptation Stochastic Differential Equations
Graph Domain Adaptation (GDA) aims to bridge distribution shifts between domains by transferring knowledge from well-labeled source graphs to given unlabeled target graphs.
One promising recent approach addresses graph transfer by discretizing the adaptation process, typically through the construction of intermediate graphs or stepwise alignment procedures.
However, such discrete strategies often fail in real-world scenarios, where graph structures evolve continuously and nonlinearly, making it difficult for fixed-step alignment to approximate the actual transformation process.
To address these limitations, we propose \textbf{DiffGDA}, a \textbf{Diff}usion-based \textbf{GDA} method that models the domain adaptation process as a continuous-time generative process. We formulate the evolution from source to target graphs using stochastic differential equations (SDEs), enabling the joint modeling of structural and semantic transitions.
To guide this evolution, a domain-aware network is introduced to steer the generative process toward the target domain, encouraging the diffusion trajectory to follow an optimal adaptation path.
We theoretically show that the diffusion process converges to the optimal solution bridging the source and target domains in the latent space.
Extensive experiments on 14 graph transfer tasks across 8 real-world datasets demonstrate DiffGDA consistently outperforms state-of-the-art baselines.
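A minimal sketch of the numerical machinery behind such SDE-based evolution: an Euler-Maruyama integrator for dX = f(X, t) dt + g(t) dW, where the drift stands in for a learned, domain-aware network. The drift and noise schedule below are illustrative placeholders, not DiffGDA's components.

```python
import torch

def euler_maruyama(x0, drift, diffusion, t0=0.0, t1=1.0, n_steps=100):
    """Integrate dX = drift(X, t) dt + diffusion(t) dW with Euler-Maruyama.

    `drift` is any callable (e.g., a learned, domain-aware network) and
    `diffusion` a scalar-valued noise schedule; both are placeholders here.
    """
    x = x0.clone()
    dt = (t1 - t0) / n_steps
    for i in range(n_steps):
        t = t0 + i * dt
        noise = torch.randn_like(x)
        x = x + drift(x, t) * dt + diffusion(t) * (dt ** 0.5) * noise
    return x

# Toy usage: an Ornstein-Uhlenbeck-like drift pulling latents toward a "target" mean.
target = torch.ones(8, 16)
x_src = torch.zeros(8, 16)
x_adapted = euler_maruyama(x_src, drift=lambda x, t: target - x, diffusion=lambda t: 0.1)
```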
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1288. Spiking Discrepancy Transformer for Point Cloud Analysis
๐ฌ 9
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks Point Cloud Processing Efficient Computing Brain-inspired Computing
Spiking Transformer has sparked growing interest, with the Spiking Self-Attention merging spikes with self-attention to deliver both energy efficiency and competitive performance. However, existing work primarily focuses on 2D visual tasks, and in the domain of 3D point clouds, the disorder and complexity of spatial information, along with the scale of the point clouds, present significant challenges. For point clouds, we introduce spiking discrepancy, measuring differences in spike features to highlight key information, and then construct the Spiking Discrepancy Attention Mechanism (SDAM). SDAM contains two variants: the Spiking Element Discrepancy Attention captures local geometric correlations between central points and neighboring points, while the Spiking Intensity Discrepancy Attention characterizes structural patterns of point clouds based on macroscopic spike statistics. Moreover, we propose a Spatially-Aware Spiking Neuron. Based on these, we construct a hierarchical Spiking Discrepancy Transformer. Experimental results demonstrate that our method achieves state-of-the-art performance among Spiking Neural Networks and exhibits impressive performance compared to Artificial Neural Networks, with few parameters and significantly lower theoretical energy consumption.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1289. Robustness in the Face of Partial Identifiability in Reward Learning
๐ฌ 22
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Inverse Reinforcement Learning Reward Learning Preference Based Reinforcement Learning Theory
In Reward Learning (ReL), we are given feedback on an unknown target reward, and the goal is to use this information to recover it in order to carry out some downstream application, e.g., planning. When the feedback is not informative enough, the target reward is only partially identifiable, i.e., there exists a set of rewards, called the feasible set, that are equally plausible candidates for the target reward. In these cases, the ReL algorithm might recover a reward function different from the target reward, possibly leading to a failure in the application. In this paper, we introduce a general ReL framework that permits quantifying the drop in "performance" suffered in the considered application because of identifiability issues. Building on this, we propose a robust approach to address the identifiability problem in a principled way, by maximizing the "performance" with respect to the worst-case reward in the feasible set. We then develop Rob-ReL, a ReL algorithm that applies this robust approach to the subset of ReL problems aimed at assessing a preference between two policies, and we provide theoretical guarantees on sample and iteration complexity for Rob-ReL. We conclude with some numerical simulations to illustrate the setting and empirically characterize Rob-ReL.
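Schematically, the robust approach described above can be written as a max-min problem; the notation below (feasible set $\mathcal{R}$, performance functional $J$) is introduced here for illustration and may differ from the paper's.

```latex
% Robust reward learning: optimize the downstream object (e.g., a policy or
% a preference decision) against the worst-case reward in the feasible set.
\pi^{\star} \in \arg\max_{\pi} \; \min_{r \in \mathcal{R}} \; J(\pi; r)
```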
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1290. Online Pseudo-Zeroth-Order Training of Neuromorphic Spiking Neural Networks
๐ฌ 12
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ neuromorphic computing spiking neural networks non-backpropagation training biological plausibility pseudo-zeroth-order
Brain-inspired neuromorphic computing with spiking neural networks (SNNs) is a promising energy-efficient computational approach. However, successfully training deep SNNs in a more biologically plausible and neuromorphic-hardware-friendly way is still challenging. Most recent methods leverage spatial and temporal backpropagation (BP), not adhering to neuromorphic properties. Despite the efforts of some online training methods, finding alternatives for spatial credit assignment that match the performance of spatial BP remains a significant problem. In this work, we propose a novel method, online pseudo-zeroth-order (OPZO) training. Our method only requires a single forward propagation with noise injection and direct top-down signals for spatial credit assignment, avoiding spatial BP's problem of symmetric weights and separate phases for layer-by-layer forward-backward propagation. OPZO solves the large variance problem of zeroth-order methods by the pseudo-zeroth-order formulation and momentum feedback connections, while having more guarantees than random feedback. Combined with online training, OPZO can pave the way toward on-chip SNN training. Experiments on neuromorphic and static datasets with both fully connected and convolutional networks demonstrate the effectiveness of OPZO with competitive performance compared with spatial BP, as well as estimated low training costs.
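For context, the sketch below is the vanilla two-point zeroth-order gradient estimate whose variance grows with the parameter count; this is the baseline issue the pseudo-zeroth-order formulation is designed to mitigate, not the OPZO algorithm itself.

```python
import torch

def two_point_zeroth_order_grad(loss_fn, params, sigma=1e-2):
    """Vanilla two-point zeroth-order gradient estimate (SPSA-style).

    Perturbs all parameters along a shared Gaussian direction and uses the
    loss difference to estimate the gradient; its variance grows with the
    number of parameters, which pseudo-zeroth-order formulations aim to avoid.
    """
    directions = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, d in zip(params, directions):
            p.add_(sigma * d)
        loss_plus = loss_fn()
        for p, d in zip(params, directions):
            p.add_(-2 * sigma * d)
        loss_minus = loss_fn()
        for p, d in zip(params, directions):
            p.add_(sigma * d)  # restore original parameters
    scale = (loss_plus - loss_minus) / (2 * sigma)
    return [scale * d for d in directions]
```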
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1291. The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ alignment bayesian inverse reinforcement learning uncertainty diagnostics
The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task, known as non-identifiability. This paper introduces a principled auditing framework that reframes reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1292. Teaching LLMs to Admit Uncertainty in OCR
๐ฌ 23
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Optical Character Recognition Visually Degraded Document Uncertainty LLM
Vision language models (VLMs) are increasingly replacing traditional OCR pipelines, but on visually degraded documents they often hallucinate, producing fluent yet incorrect text without signaling uncertainty. This occurs because current post-training emphasizes accuracy, which encourages models to guess even when uncertain. The problem persists in state-of-the-art systems and severely impacts OCR reliability. To improve the trustworthiness of OCR on degraded documents, we propose uncertainty-aware OCR. Rather than suppressing guesses, our model transcribes while explicitly bracketing spans it deems unreliable with uncertainty tags. To train our model, we use Group Relative Policy Optimization (GRPO). We define the usage rules for uncertainty tags and an evaluation protocol. We introduce a pseudo-labeled cold start and a multi-objective reward that balances transcription accuracy and uncertainty coverage while preventing reward hacking. We explore different combinations of cold start and reward granularity and verify the effect of reward parameters in preventing reward hacking and improving the corresponding metrics. We also introduce Blur-OCR, a challenging OCR benchmark for uncertainty-aware OCR on degraded documents. In detailed experiments, our model maintains transcription accuracy while achieving an uncertainty tag F1 score of 0.685, and substantially outperforms both open- and closed-source baselines.
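A minimal sketch of a span-level F1 for uncertainty tags, assuming spans are represented as character-offset pairs and scored by exact match; the paper's evaluation protocol and matching rule may differ.

```python
def span_f1(predicted_spans, reference_spans):
    """Span-level F1 for uncertainty tags, treating spans as (start, end) pairs.

    Exact-match scoring over character offsets; illustrative only.
    """
    pred, ref = set(predicted_spans), set(reference_spans)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g., two predicted uncertain spans, one matching the reference annotation
print(span_f1([(4, 9), (15, 20)], [(4, 9)]))  # ~0.667
```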
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1293. FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models KV Compression Context Extension
Existing key-value (KV) cache compression methods for large language models (LLMs) often rely on token eviction, which risks losing critical local information in both long prefilling and decoding scenarios. When extrapolating beyond the pretrained context length, their performance degrades sharply on long-context benchmarks. Motivated by the observation in the frequency domain that the context information is concentrated in the low-frequency components, we propose FreqKV, a parameter-free and architecture-agnostic approach. It iteratively compresses the increasing KV cache in the frequency domain, allowing models to process lengthy contexts efficiently. With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity. Extensive experiments on both prefilling and decoding stages demonstrate that FreqKV enables robust context window extension and consistently outperforms existing KV cache compression methods, highlighting its effectiveness for both understanding and generation in long contexts.
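A rough illustration (not the paper's algorithm) of frequency-domain compression along the token axis: keep only the lowest-frequency DCT components of a cache and invert them into a shorter summary sequence.

```python
import numpy as np
from scipy.fft import dct, idct

def lowfreq_compress(kv: np.ndarray, keep: int) -> np.ndarray:
    """Compress a (seq_len, dim) cache along the token axis in the frequency domain.

    Retains the `keep` lowest-frequency DCT components per channel and
    reconstructs a length-`keep` summary; a minimal illustration of
    frequency-domain KV compression, not the paper's exact procedure.
    """
    coeffs = dct(kv, axis=0, norm="ortho")   # (seq_len, dim) spectrum
    low = coeffs[:keep]                       # keep low frequencies only
    return idct(low, axis=0, norm="ortho")    # (keep, dim) compressed cache

cache = np.random.randn(1024, 64)
compressed = lowfreq_compress(cache, keep=256)  # shape (256, 64)
```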
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1294. From Assistant to Independent Developer – Are GPTs Ready for Software Development?
๐ฌ 13
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ software development app development coding agent LLM code model
Large language models (LLMs) have demonstrated remarkable capability in function-level code generation tasks.
Unlike isolated functions, real-world applications demand reasoning over the entire software system: developers must orchestrate how different components interact, maintain consistency across states over time, and ensure the application behaves correctly within the lifecycle and framework constraints. Yet, no existing benchmark adequately evaluates whether LLMs can bridge this gap and construct entire software systems from scratch.
To address this gap, we propose \tool, a benchmark consisting of 101 software development problems drawn from real-world Android apps. Given a natural language specification detailing the app functionality, a language model is tasked with \textbf{implementing the functionality into an Android app from scratch}. Developing an Android app from scratch requires understanding and coordinating app states, lifecycle management, and asynchronous operations, calling for LLMs to generate context-aware, robust, and maintainable code. To construct \tool, we design a multi-agent system to automatically summarize the main functionalities from app documents and navigate the app to synthesize test cases validating the functional correctness of app implementation. Following rigorous manual verification by Android development experts, \tool incorporates the test cases within an automated evaluation framework that enables reproducible assessment without human intervention, making it easily adoptable for future research. Our evaluation on 12 flagship LLMs shows that all evaluated models achieve low effectiveness, with the best-performing model (GPT-5) developing only 18.8\% functionally correct applications, highlighting fundamental limitations in current models' ability to handle complex, multi-component software engineering challenges.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1295. Inference-time scaling of diffusion models through classical search
๐ฌ 18
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ diffusion models inference-time scaling compositional generation search algorithms
Classical search algorithms have long underpinned modern artificial intelligence. In this work, we tackle the challenge of inference-time control in diffusion models, adapting generated outputs to meet diverse test-time objectives, using principles from classical search. We propose a general framework that orchestrates local and global search to efficiently navigate the generative space. It performs compute-efficient global exploration using breadth-first and depth-first tree search and employs a theoretically grounded, scalable local search via annealed Langevin MCMC. We evaluate our approach on a range of challenging domains, including planning, offline reinforcement learning, and image generation, and observe significant gains in both performance and efficiency over baseline methods. These results demonstrate that classical search offers a principled and practical foundation for inference-time scaling in diffusion models. By jointly scaling local and global search for the first time, our framework establishes a new Pareto frontier across challenging decision-making domains.
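A minimal sketch of the annealed Langevin MCMC local search mentioned above, assuming access to a learned score function `score_fn(x, sigma)` and a decreasing noise schedule; step sizes and schedules are illustrative.

```python
import torch

def annealed_langevin(x, score_fn, sigmas, steps_per_level=10, step_scale=0.1):
    """Annealed Langevin MCMC: refine samples locally with a learned score.

    `score_fn(x, sigma)` is assumed to approximate the score of the smoothed
    density at noise level `sigma`; `sigmas` is a decreasing noise schedule.
    """
    for sigma in sigmas:
        step = step_scale * sigma ** 2
        for _ in range(steps_per_level):
            noise = torch.randn_like(x)
            x = x + step * score_fn(x, sigma) + (2 * step) ** 0.5 * noise
    return x
```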
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1296. PepBenchmark: A Standardized Benchmark for Peptide Machine Learning
๐ฌ 20
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ peptide machine learning benchmark protein language models
Peptide therapeutics are widely regarded as the "third generation" of drugs, yet progress in peptide Machine Learning (ML) is hindered by the absence of standardized benchmarks. Here we present \textbf{PepBenchmark}, which standardizes datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) \textbf{PepBenchData}, a well-curated collection comprising 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) \textbf{PepBenchPipeline}, a standardized preprocessing pipeline that ensures consistent cleaning, representation conversion, and dataset splitting, addressing the quality issues that often arise from ad-hoc pipelines; and (3) \textbf{PepBenchLeaderboard}, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. Code is included in the supplementary material and will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1297. Spatially Informed Autoencoders for Interpretable Visual Representation Learning
๐ฌ 8
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ autoencoder visual representation point process conditional simulation interpretable machine learning self supervision spatial statistics
We introduce spatially informed variational autoencoders (SI-VAE) as self-supervised deep-learning models that use stochastic point processes to predict spatial organization patterns from images. Existing approaches to learning visual representations based on variational autoencoders (VAE) struggle to capture spatial correlations between objects or events, focusing instead on pixel intensities. We address this limitation by incorporating a point-process likelihood, derived from the Papangelou conditional intensity, as a self-supervision target. This results in a hybrid model that learns statistically interpretable representations of spatial localization patterns and enables zero-shot conditional simulation directly from images. Experiments with synthetic images show that SI-VAE improve the classification accuracy of attractive, repulsive, and uncorrelated point patterns from 48% (VAE) to over 80% in the worst case and 90% in the best case, while generalizing to unseen data. We apply SI-VAE to a real-world microscopy data set, demonstrating its use for studying the spatial organization of proteins in human cells and for using the representations in downstream statistical analysis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1298. PU-BENCH: A UNIFIED BENCHMARK FOR RIGOROUS AND REPRODUCIBLE PU LEARNING
๐ฌ 7
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ PU learning semi-supervised learning benchmark
Positive-Unlabeled (PU) learning, a challenging paradigm for training binary classifiers from only positive and unlabeled samples, is fundamental to many applications. While numerous PU learning methods have been proposed, the research is systematically hindered by the lack of a standardized and comprehensive benchmark for rigorous evaluation. Inconsistent data generation, disparate experimental settings, and divergent metrics have led to irreproducible findings and unsubstantiated performance claims. To address this foundational challenge, we introduce \textbf{PU-Bench}, the first unified open-source benchmark for PU learning. PU-Bench provides: 1) a unified data generation pipeline to ensure consistent input across configurable sampling schemes, label ratios, and labeling mechanisms; 2) an integrated framework of 16 state-of-the-art PU methods; and 3) standardized protocols for reproducible assessment. Through a large-scale empirical study on 8 diverse datasets (\textbf{2,560} evaluations in total), PU-Bench reveals a complex yet intuitive performance landscape, uncovering critical trade-offs between effectiveness and efficiency, as well as between robustness, label frequency, and selection bias. It is anticipated to serve as a foundational resource to catalyze reproducible, rigorous, and impactful research in the PU learning community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1299. Self-Consistency Improves the Trustworthiness of Self-Interpretable GNNs
๐ฌ 18
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Self-interpretable GNNs; Trustworthy; Consistency; Faithfulness
Graph Neural Networks (GNNs) achieve strong predictive performance but offer limited transparency in their decision-making. Self-Interpretable GNNs (SI-GNNs) address this by generating built-in explanations, yet their training objectives are misaligned with evaluation criteria such as faithfulness. This raises two key questions: (i) can faithfulness be explicitly optimized during training, and (ii) does such optimization genuinely improve explanation quality? We show that faithfulness is intrinsically tied to explanation self-consistency and can therefore be optimized directly. Empirical analysis further reveals that self-inconsistency predominantly occurs on unimportant features, linking it to redundancy-driven explanation inconsistency observed in recent work and suggesting untapped potential for improving explanation quality. Building on these insights, we introduce a simple, model-agnostic self-consistency (SC) training strategy. Without changing architectures or pipelines, SC consistently improves explanation quality across multiple dimensions and benchmarks, offering an effective and scalable pathway to more trustworthy GNN explanations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1300. Learning linear state-space models with sparse system matrices
๐ฌ 19
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ linear state-space models expectation-maximization algorithm system identification state estimation
Due to tractable analysis and control, linear state-space models (LSSMs) provide a fundamental mathematical tool for time-series data modeling in various disciplines. In particular, many LSSMs have sparse system matrices because interactions among variables are limited or only a few significant relationships exist. However, current learning algorithms for LSSMs lack the ability to learn system matrices with the sparsity constraint due to the similarity transformation. To address this issue, we impose sparsity-promoting priors on system matrices to balance modeling error and model complexity. By taking hidden states of LSSMs as latent variables, we then explore the expectation-maximization (EM) algorithm to derive a maximum a posteriori (MAP) estimate of both hidden states and system matrices from noisy observations. Based on the Global Convergence Theorem, we further demonstrate that the proposed learning algorithm yields a sequence converging to a local maximum or saddle point of the joint posterior distribution. Finally, experimental results on simulation and real-world problems illustrate that the proposed algorithm can preserve the inherent topological structure among variables and significantly improve prediction accuracy over classical learning algorithms.
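For reference, a generic linear state-space model and the MAP objective implied by sparsity-promoting (Laplace-type) priors on the system matrices; the notation is illustrative, and the paper's exact prior and parameterization may differ.

```latex
% Generic linear state-space model:
x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t,
\quad w_t \sim \mathcal{N}(0, Q), \; v_t \sim \mathcal{N}(0, R).
% With Laplace priors on the entries of A and C, the MAP estimate adds an
% \ell_1 penalty to the joint log-likelihood:
\max_{A,\, C,\, x_{1:T}} \; \log p(y_{1:T}, x_{1:T} \mid A, C)
\;-\; \lambda_A \lVert A \rVert_1 \;-\; \lambda_C \lVert C \rVert_1 .
```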
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1301. DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving
๐ฌ 14
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous Driving Task-Centric Paradigm Scalable State Space Model
Recent advances towards End-to-End Autonomous Driving (E2E-AD) focus on integrating modular designs into a unified framework for joint optimization. Most of these advances follow a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such manual ordering design can inevitably cause information loss and cumulative errors, lacking flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of image backbone and quadratic-complexity of attention mechanism also hinder the scalability and efficiency of E2E-AD system to handle spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method is designed to preserve spatial locality from ego-perspective, thus facilitating the ego-planning. Extensive experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and great efficiency of DriveMamba.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1302. Harmonized Cone for Feasible and Non-conflict Directions in Training Physics-Informed Neural Networks
๐ฌ 24
โญ 6.00
๐ optimization
๐ท๏ธ Physics-Informed Neural Networks Multi-loss Optimization Gradient Conflict Resolution Feasible Directions Nonconvex Convergence
Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving PDEs, yet training is difficult due to a multi-objective loss that couples PDE residuals, initial/boundary conditions, and auxiliary physics terms. Existing remedies often yield infeasible scaling factors or conflicting update directions, resulting in degraded performance. In this paper, we show that training PINNs requires jointly considering feasible scaling factors and a non-conflict direction. Through a geometric analysis of per-loss gradients, we define the $\textit{harmonized cone}$ as the intersection of their primal and dual cones, which characterizes directions that are simultaneously feasible and non-conflicting. Building on this, we propose $HARMONIC$ (HARMONIzed Cone gradient descent), a training procedure that computes updates within the harmonized cone by leveraging the Double Description method to aggregate extreme rays. Theoretically, we establish convergence guarantees in nonconvex settings and prove the existence of a nontrivial harmonized cone. Across standard PDE benchmarks, $HARMONIC$ generally outperforms state-of-the-art methods while ensuring feasible and non-conflict updates.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1303. villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
๐ฌ 16
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ embodied AI
Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies.
Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation. These results establish villa-X as a principled and scalable paradigm for learning generalizable robot manipulation policies. We believe it provides a strong foundation for future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1304. Captain Cinema: Towards Short Movie Generation
๐ฌ 4
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Generation Diffusion Transformer
We present **Captain Cinema**, a generation framework for short movie generation.
Given a detailed textual description of a movie storyline, our approach firstly generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model, which supports long context learning, to produce the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a curated cinematic dataset consisting of interleaved samples for video generation. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narratively consistent short films.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1305. CatalystBench: A Comprehensive Multi-Task Benchmark for Advancing Language Models in Catalysis Science
๐ฌ 22
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Scientific Benchmark AI for Science Catalyst Design Large Language Models Multi-task Learning Domain Adaptation
The discovery of novel catalytic materials is a cornerstone of chemical engineering and sustainable energy, yet it remains a complex, knowledge-intensive process. While Large Language Models (LLMs) have demonstrated remarkable potential in various scientific domains, their application to catalysis is hindered by the lack of specialized, multi-dimensional benchmarks to guide their development and evaluation. To bridge the critical gap, we introduce CatalystBench, a comprehensive and challenging benchmark meticulously constructed from scientific literature and public datasets, specifically designed to assess the capabilities of LLMs in the nuanced domain of catalyst design. The tasks covered by this benchmark dataset encompass the entire closed-loop process of catalyst development, including reading comprehension, experimental analysis, and scheme reasoning. Based on this benchmark, we propose a Multi-head Full-task (MFT) domain-specific fine-tuning method that employs coupling task-specific output heads. We systematically compare with other three distinct fine-tuning strategies: Single-Task (ST), Full-Task (FT) and Multi-head Single-Task (MST). The extensive experiments demonstrate that the MFT strategy consistently achieves the most substantial performance improvements across all tasks, underscoring the effectiveness of explicit multi-task architectures in complex scientific reasoning. The resulting CatalystLLM significantly outperforms a wide array of state-of-the-art open-source and closed-source models on CatalystBench. We will publicly release both the CatalystBench benchmark and the CatalystLLM model, providing the community with a robust evaluation framework and a powerful new tool to accelerate AI-driven research in catalytic materials science.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1306. Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
๐ฌ 27
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Object-centric learning diffusion models contrastive learning slot attention compositionality
Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1\% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results indicate potential applications of CODA as an effective framework for robust OCL in complex, real-world scenes. Code is available as supplementary material.
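A minimal sketch of a symmetric InfoNCE-style alignment loss between pooled slot features and image features, assuming matched pairs along the batch dimension; the paper's contrastive alignment objective may be formulated differently.

```python
import torch
import torch.nn.functional as F

def infonce_alignment(slot_feats, image_feats, temperature=0.07):
    """InfoNCE-style slot-image alignment over a batch.

    `slot_feats` and `image_feats` are (batch, dim) embeddings (e.g., pooled
    object slots and image encodings); matching pairs along the batch are
    treated as positives. Illustrative alignment objective only.
    """
    slots = F.normalize(slot_feats, dim=-1)
    images = F.normalize(image_feats, dim=-1)
    logits = slots @ images.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over rows and columns.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```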
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1307. Group-Normalized Implicit Value Optimization for Language Models
๐ฌ 18
โญ 6.00
๐ optimization
๐ท๏ธ LLM post-training
Fine-tuning Large Language Models (LLMs) with reinforcement learning (RL) has become a key technique for enhancing performance on a wide range of tasks, from user alignment to complex reasoning. However, this approach is often hindered by the difficulty of fine-grained credit assignment, as it typically relies on sparse rewards given only at the end of a completely generated sequence. Conventional solutions often require training an auxiliary value network known as critic, which introduces significant computational overhead and training instability. We present Group-Normalized Implicit Value Optimization (GN-IVO), a novel, critic-free algorithm that directly addresses this challenge. GN-IVO learns step-level values implicitly from the policy through a group-normalized distributional matching objective. This approach elegantly circumvents the need for an explicit critic and avoids the computation of the intractable partition function by normalizing values across a group of sampled model responses. Theoretically, we prove that our objective recovers the true value function up to a constant, guaranteeing that the optimal policy is preserved. We demonstrate the practical effectiveness of GN-IVO on a diverse set of text generation and reasoning tasks, showing that it consistently outperforms strong RL baselines for LLMs.
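A tiny illustration of group normalization over a set of sampled responses, the basic device that removes the need for an absolute baseline or partition function; this is a simplified stand-in, not GN-IVO's full distributional matching objective.

```python
import torch

def group_normalize(values: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize per-response values within a group of sampled responses.

    `values` has shape (group_size,); subtracting the group mean and dividing
    by the group standard deviation yields relative scores, the spirit of
    group-normalized objectives (illustrative only).
    """
    return (values - values.mean()) / (values.std(unbiased=False) + eps)

rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])   # e.g., scores of 4 sampled responses
print(group_normalize(rewards))
```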
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1308. SSDi8: Accurate and Efficient 8-bit Quantization for State Space Duality
๐ฌ 22
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Mamba-2 State Space Duality (SSD) Quantization
Recent advances in sequence modeling have highlighted Mamba as a state space architecture offering efficient long-range dependency modeling and providing a viable alternative to Transformers. Building upon this, Mamba-2 introduces the Structured State Space Duality (SSD), which integrates recurrent and attention modes to achieve efficiency and scalability. However, this architectural expansion substantially increases memory and latency overhead, underscoring the need for efficient compression strategies tailored to SSD. In this work, we present SSDi8, the first post-training quantization framework specifically designed for SSD to maintain a persistent INT8 path. SSDi8 introduces a reformulation that decouples element-wise multiplications from matrix multiplications, enabling reuse of quantized activations across modules. Moreover, SSDi8 adaptively quantizes channel-varying activations at cost-effective points, further reducing latency. On the accuracy side, SSDi8 explicitly leverages the intrinsic dimensional decomposition of SSD, exploiting distinct outlier distributions across axes, and incorporates an error correction term based on per-channel error statistics. Comprehensive experiments demonstrate that SSDi8 achieves accuracy comparable to FP16 while delivering up to 1.4X speedup in W4A8 and W8A8 settings. We further validate its robustness in resource-constrained environments by deploying it on the Orin Nano device.
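As background on what a persistent INT8 path entails, a minimal symmetric per-tensor quantize/dequantize sketch; SSDi8's per-channel error correction and adaptive quantization points are not reproduced here.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization (background illustration only)."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to float using the stored scale."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(x)
err = np.abs(dequantize_int8(q, s) - x).max()   # quantization error stays small
```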
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1309. IDER: IDEMPOTENT EXPERIENCE REPLAY FOR RELIABLE CONTINUAL LEARNING
๐ฌ 25
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning reliable idempotence
Catastrophic forgetting, the tendency of neural networks to forget previously learned knowledge when learning new tasks, has been a major challenge in continual learning (CL). To tackle this challenge, CL methods have been proposed and shown to reduce forgetting. Furthermore, CL models deployed in mission-critical settings can benefit from uncertainty awareness by calibrating their predictions to reliably assess their confidences. However, existing uncertainty-aware continual learning methods suffer from high computational overhead and incompatibility
with mainstream replay methods. To address this, we propose idempotent experience replay (IDER), a novel approach based on the idempotent property where repeated function applications yield the same output. Specifically, we first adapt the training loss to make the model idempotent on current data streams. In addition, we introduce an idempotence distillation loss. We feed the output of the current model back into the old checkpoint and then minimize the distance between this reprocessed output and the original output of the current model. This yields a simple and effective new baseline for building reliable continual learners, which can be seamlessly integrated with other CL approaches. Extensive experiments on different CL benchmarks demonstrate that IDER consistently improves prediction reliability while simultaneously boosting accuracy and reducing forgetting. Our results suggest the potential of idempotence as a promising principle for deploying efficient and trustworthy continual learning systems in real-world applications. Our code will be released upon publication.
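A minimal sketch of the two losses as described above, assuming the model's output can be fed back as its input (e.g., an autoencoder-style map or matching logit dimensions); the loss forms and detaching choices are illustrative.

```python
import torch
import torch.nn.functional as F

def idempotence_loss(model, x):
    """Encourage f(f(x)) = f(x) on the current data stream (illustrative).

    Assumes the model maps its own output space back to itself.
    """
    y = model(x)
    return F.mse_loss(model(y), y.detach())

def idempotence_distillation_loss(current_model, old_model, x):
    """Feed the current model's output through the frozen old checkpoint and
    pull it back toward the reprocessed output (illustrative)."""
    y = current_model(x)
    with torch.no_grad():
        y_old = old_model(y)   # old checkpoint reprocesses the current output
    return F.mse_loss(y, y_old)
```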
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1310. ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization
๐ฌ 9
โญ 6.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Autoformalization Large Language Models Dependency Graph Lean (Formal Language) Structural Fidelity Semantic Faithfulness
Proof autoformalization, the task of translating natural language theorems and proofs into machine-verifiable code, is a critical step for integrating large language models into rigorous mathematical workflows. Current approaches focus on producing executable code, but they frequently fail to preserve the semantic meaning and logical structure of the original human-written argument. To address this, we introduce ProofFlow, a novel pipeline that treats structural fidelity as a primary objective. ProofFlow first constructs a directed acyclic graph (DAG) to map the logical dependencies between proof steps. Then, it employs a novel lemma-based approach to systematically formalize each step as an intermediate lemma, preserving the logical structure of the original argument. To facilitate evaluation, we present a new benchmark of 184 undergraduate-level problems, manually annotated with step-by-step solutions and logical dependency graphs, and introduce ProofScore, a new composite metric to evaluate syntactic correctness, semantic faithfulness, and structural fidelity. Experimental results show our pipeline sets a new state-of-the-art for autoformalization, achieving a ProofScore of 0.545, substantially exceeding baselines like full-proof formalization (0.279), which processes the entire proof at once, and step-proof formalization (0.046), which handles each step independently. Our pipeline, benchmark, and score metric are open-sourced to encourage further progress at https://anonymous.4open.science/r/ProofFlow-351E.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1311. Analysis of approximate linear programming solution to Markov decision problem with log barrier function
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Markov decision programming reinforcement learning linear programming dynamic programming
There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been less commonly employed, although they have recently gained attention in contexts such as offline RL. The relative underuse of LP-based methods stems from the fact that the LP formulation leads to an inequality-constrained optimization problem, which is generally more challenging to solve effectively than Bellman-equation-based methods. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear naive, to the best of our knowledge, a thorough theoretical interpretation of this approach has not yet been developed. This paper aims to bridge this gap.
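For concreteness, the standard primal LP for a discounted MDP and the log-barrier surrogate it suggests (generic notation; the paper's exact formulation may differ):

```latex
% Standard primal LP for a discounted MDP with initial distribution \mu:
\min_{V} \; \sum_{s} \mu(s)\, V(s)
\quad \text{s.t.} \quad
V(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')
\quad \forall (s, a).
% Replacing the constraints with a log-barrier term gives an unconstrained
% objective amenable to gradient descent (barrier weight \tau > 0):
\min_{V} \; \sum_{s} \mu(s)\, V(s)
\;-\; \tau \sum_{s,a} \log\!\Big( V(s) - r(s,a) - \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big).
```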
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1312. Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting 3D Occupancy Prediction Open-vocabulary
The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present **PG-Occ**, an innovative **P**rogressive **G**aussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that **PG-Occ** achieves *state-of-the-art* performance with a relative **14.3\% mIoU improvement** over the previous best performing method. The source code and models will be made publicly available upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1313. MambaSL: Exploring Single-Layer Mamba for Time Series Classification
๐ฌ 25
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ modular selective SSM multi-head adaptive pooling skip connection single-layer Mamba time series classification
Despite recent advances in state space models (SSMs) such as Mamba across various sequence domains, research on their standalone capacity for time series classification (TSC) has remained limited.
We propose MambaSL, a framework that minimally redesigns the selective SSM and projection layers of a single-layer Mamba, guided by four TSC-specific hypotheses.
To address benchmarking limitations, namely restricted configurations, partial University of East Anglia (UEA) dataset coverage, and insufficiently reproducible setups, we re-evaluate 20 strong baselines across all 30 UEA datasets under a unified protocol.
Our results show that MambaSL achieves state-of-the-art performance on the UEA benchmark among 21 models, with statistically significant average improvements over baselines while ensuring reproducibility via public checkpoints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1314. TRACE: Your Diffusion Model is Secretly an Instance Edge Detector
๐ฌ 15
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ diffusion unsupervised instance segmentation weakly-supervised panoptic segmentation inference dynamics attention
High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain limited by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81× faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1315. FASA: FREQUENCY-AWARE SPARSE ATTENTION
๐ฌ 22
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Functional sparsity of FC; KV cache
The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance.
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance.
FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens.
Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset.
Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100\% of full-KV performance when keeping only 256 tokens, and achieves 2.56$\times$ speedup using just 18.9\% of the cache on AIME24.
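A simplified sketch of the general idea of scoring tokens with a cheap proxy before running full attention on a small subset, using an arbitrary slice of head dimensions as a stand-in for the paper's dominant frequency chunks; this is illustrative, not the FASA selection rule.

```python
import numpy as np

def sparse_attention_via_proxy(q, K, V, proxy_dims, top_k):
    """Score keys with a small subset of dimensions, then attend to the top-k.

    `q` is a (dim,) query, `K`/`V` are (seq_len, dim) caches, and `proxy_dims`
    indexes the dimensions used as a cheap importance proxy (standing in for
    "dominant frequency chunks"). Illustrative sketch only.
    """
    proxy_scores = K[:, proxy_dims] @ q[proxy_dims]   # cheap importance scores
    keep = np.argsort(proxy_scores)[-top_k:]          # indices of salient tokens
    scores = K[keep] @ q / np.sqrt(q.shape[0])        # full attention on the subset
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[keep]

q = np.random.randn(64)
K, V = np.random.randn(1024, 64), np.random.randn(1024, 64)
out = sparse_attention_via_proxy(q, K, V, proxy_dims=np.arange(8), top_k=256)
```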
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1316. Segment Any Events with Language
๐ฌ 28
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ event sensor event-based scene understanding open-vocabulary
Scene understanding with free-form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce **SEAL**, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given the visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation on OV-EIS, we curate four benchmarks that cover *label granularity* from coarse to fine class configurations and *semantic granularity* from instance-level to part-level understanding. Extensive experiments show that our SEAL largely outperforms proposed baselines in terms of performance and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of our SEAL achieving generic spatiotemporal OV-EIS that does not require any visual prompts from users in the inference. The code will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1317. Modality-free Graph In-context Alignment
๐ฌ 18
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph neural networks In-context learning Pretraining
In-context learning (ICL) converts static encoders into task-conditioned reasoners, enabling adaptation to new data from just a few examples without updating pretrained parameters. This capability is essential for graph foundation models (GFMs) to approach LLM-level generality. Yet current GFMs struggle with cross-domain alignment, typically relying on modality-specific encoders that fail when graphs are pre-vectorized or raw data is inaccessible. In this paper, we introduce **M**odality-**F**ree **G**raph **I**n-context **A**lignment (MF-GIA), a framework that makes a pretrained graph encoder promptable for few-shot prediction across heterogeneous domains without modality assumptions. MF-GIA captures domain characteristics through gradient fingerprints, which parameterize lightweight transformations that align pre-encoded features and indexed labels into unified semantic spaces. During pretraining, a dual prompt-aware attention mechanism with episodic objective learns to match queries against aligned support examples to establish prompt-based reasoning capabilities. At inference, MF-GIA performs parameter-update-free adaptation using only a few-shot support set to trigger cross-domain alignment and enable immediate prediction on unseen domains. Experiments demonstrate that MF-GIA achieves superior few-shot performance across diverse graph domains and strong generalization to unseen domains. The code is anonymously available [here](https://anonymous.4open.science/r/MF-GIA-submit-3212).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1318. Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting
๐ฌ 23
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Object counting MLLMs weakly-supervised class-agnostic counting
Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g., person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting.
Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the
MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1319. DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ text diffusion model; diffusion large language model; code generation
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4\% on EvalPlus) and reduces reliance on AR bias during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1320. A Study on PAVE Specification for Learnware
๐ฌ 8
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Learnware Model Specification Parameter Vector Learnware Identification Model Capability
The *Learnware* paradigm aims to help users solve machine learning tasks by leveraging existing well-trained models rather than starting from scratch. A learnware comprises a submitted model paired with a *specification* sketching its capabilities. For an open platform with continuously uploaded models, these specifications are essential to enabling users to identify helpful models, eliminating the requirement for prohibitively costly per-model evaluations. In previous research, specifications based on privacy-preserving reduced sets succeed in enabling learnware identification through distribution matching, but suffer from high sample complexity for learnwares from high-dimensional, unstructured data like images or text. In this paper, we formalize the **Pa**rameter **Ve**ctor (PAVE) specification for learnware identification, which utilizes the changes in pre-trained model parameters to inherently encode the model capability and task requirements, offering an effective solution for these learnwares. Theoretically, from the neural tangent kernel perspective, we establish a tight connection between PAVE and prior specifications, providing a theoretical explanation for their shared underlying principles. We further approximate the parameter vector in a low-rank space and analyze the approximation error bound, greatly reducing the computational and storage overhead. Extensive empirical studies demonstrate that the PAVE specification excels at identifying CV and NLP learnwares for reuse on given user tasks and, for the first time, succeeds in identifying helpful learnwares from an open learnware repository containing models of corrupted quality. Reusing identified learnwares to solve user tasks can even outperform user-fine-tuned pre-trained models in data-limited scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1321. Provably Accelerated Imaging with Restarted Inertia and Score-based Image Priors
๐ฌ 4
โญ 6.00
๐ optimization
๐ท๏ธ Image reconstruction accelerated iterative algorithms regularization by denoising score-based image prior restarted inertia
Fast convergence and high-quality image recovery are two essential features of algorithms for solving ill-posed imaging inverse problems.
Existing methods, such as regularization by denoising (RED), often focus on designing sophisticated image priors to improve reconstruction quality, while leaving convergence acceleration to heuristics.
To bridge the gap, we propose Restarted Inertia with Score-based Priors (RISP) as a principled extension of RED.
RISP incorporates a restarting inertia for fast convergence, while still allowing score-based image priors for high-quality reconstruction.
We prove that RISP attains a faster stationary-point convergence rate than RED, without requiring the convexity of the image prior.
We further derive and analyze the associated continuous-time dynamical system, offering insight into the connection between RISP and the heavy-ball ordinary differential equation (ODE).
Experiments across a range of imaging inverse problems demonstrate that RISP enables fast convergence while achieving high-quality reconstructions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1322. Generative View Stitching
๐ฌ 15
โญ 6.00
๐ generative models
๐ท๏ธ Video Generation Camera-guided Video Generation Video Diffusion Models
Autoregressive video diffusion models are capable of extremely long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce OmniGuidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd's Impossible Staircase.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1323. SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving
๐ฌ 23
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Arena Real-World GitHub Issues Adversarial Programming Retrieval-Augmented Generation Continuous Integration Code Benchmark
We present \textsc{SwingArena}, an adversarial evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, \textsc{SwingArena} models the collaborative process of software iteration by pairing LLMs as \textit{submitters}, who generate patches, and \textit{reviewers}, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. \textsc{SwingArena} presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. The complete codebase and benchmark are submitted at https://anonymous.4open.science/r/Swing-Bench and will be open-sourced after the anonymity period.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1324. Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ navigation foundation models Vision-and-Language Navigation
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance.
We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories.
The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1325. RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization
๐ฌ 19
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ symmetry discovery canonicalization equivariance
Real world data often exhibits unknown, instance-specific symmetries that rarely exactly match a transformation group $G$ fixed a priori. Class-pose decompositions aim to create disentangled representations by factoring inputs into invariant features and a pose $g\in G$ defined relative to a training-dependent, \emph{arbitrary} canonical representation. We introduce \textsc{recon}, a class-pose agnostic \emph{canonical orientation normalization} that corrects arbitrary canonicals via a simple right-multiplication, yielding \emph{natural}, data-aligned canonicalizations. This enables (i) unsupervised discovery of instance-specific pose distributions, (ii) detection of out-of-distribution poses and (iii) a plug-and-play \emph{test-time canonicalization layer}. This layer can be attached on top of any pre-trained model to infuse group invariance, improving its performance without retraining. We demonstrate results on 2D image benchmarks and extend unsupervised instance-level pose discovery to 3D groups.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1326. Omni-IML: Towards Unified Interpretable Image Manipulation Localization
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ document analysis tampered text detection vision foundation model
Existing Image Manipulation Localization (IML) methods rely heavily on task-specific designs, making them perform well only on the target IML task, while joint training on multiple IML tasks causes significant performance degradation, hindering real applications. To this end, we propose Omni-IML, the first generalist model designed to unify IML across diverse tasks. Specifically, Omni-IML achieves generalization through three key components: (1) a Modal Gate Encoder, which adaptively selects the optimal encoding modality per sample, (2) a Dynamic Weight Decoder, which dynamically adjusts decoder filters to the task at hand, and (3) an Anomaly Enhancement module that leverages box supervision to highlight the tampered regions and facilitate the learning of task-agnostic features. Beyond localization, to support interpretation of the tampered images, we construct Omni-273k, a large, high-quality dataset that includes natural language descriptions of tampered artifacts. It is annotated through our automatic, chain-of-thought annotation technique. We also design a simple-yet-effective interpretation module to better utilize these descriptive annotations. Our extensive experiments show that our single Omni-IML model achieves state-of-the-art performance across all four major IML tasks, providing a valuable solution for practical deployment and a promising direction for generalist models in image forensics. We will release our code and dataset.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1327. Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping
๐ฌ 20
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Model Editing Circuit Reshaping
Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training that is inefficient and unable to target specific reasoning errors. We introduce Reasoning Editing, a paradigm for selectively modifying specific reasoning patterns in LLMs while preserving other reasoning pathways. This task presents a fundamental trade-off between Generality, the ability of an edit to generalize across different tasks sharing the same reasoning pattern, and Locality, the ability to preserve other reasoning capabilities.
Through systematic investigation, we uncover the Circuit-Interference Law: edit interference between reasoning patterns is proportional to the overlap of their neural circuits. Guided by this principle, we propose REdit, the first framework to actively reshape neural circuits before editing, thereby modulating interference between reasoning patterns and mitigating the trade-off. REdit integrates three components: (i) Contrastive Circuit Reshaping, which directly addresses the generality-locality trade-off by disentangling overlapping circuits; (ii) Meta-Contrastive Learning, which extends transferability to novel reasoning patterns; and (iii) Dual-Level Protection, which preserves preexisting abilities by constraining reshaping update directions and regularizing task-level predictions.
Extensive experiments with Qwen-2.5-3B on propositional logic reasoning tasks across three difficulty levels demonstrate that REdit consistently achieves superior generality and locality compared to baselines, with additional validation in mathematics showing broader potential. Our code is available at https://anonymous.4open.science/r/REdit-DBD8.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1328. AlphaFlow: Understanding and Improving MeanFlow Models
๐ฌ 19
โญ 6.00
๐ generative models
๐ท๏ธ diffusion models mean flows mean flow models few-step diffusion one-step diffusion generative models imagenet
MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256×256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE). The source code and pre-trained checkpoints will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1329. Consistent Text-to-Image Generation via Scene De-Contextualization
๐ฌ 18
โญ 6.00
๐ generative models
๐ท๏ธ Text-to-Image generation Identity-preserving Prompt embedding editing Scene contextualization
Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-subject correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-subject correlation within the ID prompt's embedding by quantifying SVD directional stability to re-weight the corresponding eigenvalues adaptively. Critically, SDeC allows for per-scene use (one prompt per scene) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1330. Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark
๐ฌ 21
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Privacy
The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59\% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86\% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15\% of the time. These findings, demonstrated by our benchmark, underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically-aware alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1331. Flow-Disentangled Feature Importance
๐ฌ 13
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Interpretability Feature Importance Statistical Inference Correlation Distortion Uncertainty Quantification
Quantifying feature importance with valid statistical uncertainty is central to interpretable machine learning, yet classical model-agnostic methods often fail under feature correlation, producing unreliable attributions and compromising inference. Statistical approaches that address correlation through feature decorrelation have shown promise but remain restricted to $\ell_2$ loss, limiting their applicability across diverse machine learning tasks. We introduce Flow-Disentangled Feature Importance (FDFI), a model-agnostic framework that resolves these limitations by combining principled statistical inference with computational flexibility. FDFI leverages flow matching to learn flexible disentanglement maps that not only handle arbitrary feature distributions but also provide an interpretable pathway for understanding how importance is attributed through the data's correlation structure. The framework generalizes the decorrelation-based attribution to general differentiable loss functions, enabling statistically valid importance assessment for black-box predictors across regression and classification. We establish statistical inference theory, deriving semiparametric efficiency of FDFI estimators, which enables valid confidence intervals and hypothesis testing with Type I error control. Experiments demonstrate that FDFI achieves substantially higher statistical power than removal-based and conditional permutation approaches, while maintaining robust and interpretable attributions even under severe interdependence. These findings hold across synthetic benchmarks and a broad collection of real datasets spanning diverse domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1332. Unlocking Full Efficiency of Token Filtering in Large Language Model Training
๐ฌ 19
โญ 6.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Efficient LLM Training; Token Filtering;
Token filtering has been proposed to enhance the utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens is expected to reduce computational workloads, existing methods have not yet achieved a real-world efficiency boost. This is primarily due to two factors: (1) existing work has inadequate sparsity for speedup, and (2) token filtering operates within a sparsity range that is non-standard in existing machine learning (ML) libraries and thus cannot be efficiently supported. This paper presents Centrifuge, a system that leverages algorithm and system co-design to unleash the full efficiency of token filtering in LLM training. At the algorithm level, Centrifuge filters activations of inconsequential tokens in the attention backward kernel to amplify the sparsity in backward computation. At the system level, Centrifuge proposes an automatic workflow that transforms sparse GEMM into dimension-reduced dense GEMM for optimized efficiency using standard ML libraries. Evaluations on models with various scales, from 1.1B to 40B, demonstrate that Centrifuge reduces backpropagation time by up to 49.9\% and end-to-end training time by up to 34.7\% when filtering 50\% of tokens. Utility assessments indicate that Centrifuge preserves the utility benefits of token filtering and significantly enhances model performance by up to 26.6\% compared to standard training. Centrifuge is designed for seamless integration into existing LLM training frameworks, enabling systems already utilizing token filtering to accelerate training with just one line of code.
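The system-level transformation can be pictured as follows: once a share of tokens is filtered, the affected GEMMs become row-sparse, and gathering only the surviving rows yields a smaller dense GEMM that standard ML libraries execute efficiently before results are scattered back. The sketch below is a PyTorch illustration of that general idea, not Centrifuge's actual kernels or workflow.

```python
import torch

def filtered_matmul(x: torch.Tensor, w: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """x: [num_tokens, d_in], w: [d_in, d_out], keep: bool [num_tokens]."""
    out = x.new_zeros(x.shape[0], w.shape[1])
    idx = keep.nonzero(as_tuple=True)[0]       # indices of kept tokens
    dense = x.index_select(0, idx) @ w         # dimension-reduced dense GEMM
    out.index_copy_(0, idx, dense)             # scatter back; filtered rows stay zero
    return out

if __name__ == "__main__":
    x = torch.randn(8, 4)
    w = torch.randn(4, 3)
    keep = torch.tensor([True, False, True, True, False, False, True, True])
    y = filtered_matmul(x, w, keep)
    assert torch.allclose(y[keep], x[keep] @ w)
    assert torch.count_nonzero(y[~keep]) == 0
```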
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1333. FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion language model few step generation flow matching
Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce **FS-DFM**, Few-Step Discrete Flow-Matching, a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128× faster sampling and corresponding latency/throughput gains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1334. HiCache: A Plug-in Scaled-Hermite Upgrade for Taylor-Style Cache-then-Forecast Diffusion Acceleration
๐ฌ 4
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Acceleration Efficiency ML
Diffusion models have achieved remarkable success in content generation but suffer from prohibitive computational costs due to iterative sampling. While recent feature caching methods tend to accelerate inference through temporal extrapolation, these methods still suffer from severe quality loss due to the failure in modeling the complex dynamics of feature evolution. To solve this problem, this paper presents HiCache (Hermite Polynomial-based Feature Cache), a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials, the potentially theoretically optimal basis for Gaussian-correlated processes. Besides, we introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy, which is also effective when applied standalone to TaylorSeer. Extensive experiments demonstrate HiCache's superiority: achieving $5.55\times$ speedup on FLUX.1-dev while exceeding baseline quality, maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Moreover, HiCache can be naturally added to the previous caching methods to enhance their performance, e.g., improving ClusCa from $0.9480$ to $0.9840$ in terms of image rewards. Our code is included in the supplementary material, and will be released on GitHub.
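To make the cache-then-forecast upgrade concrete, the sketch below fits a short history of cached features with a scaled probabilists' Hermite basis (He_0, He_1, He_2) by least squares and extrapolates the next feature. The basis order, scaling, and fitting procedure here are illustrative assumptions of the general idea, not the paper's exact coefficients or dual-scaling rule.

```python
import numpy as np

def hermite_basis(t: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Columns are He_0, He_1, He_2 evaluated at t / scale (probabilists' Hermite)."""
    x = t / scale
    return np.stack([np.ones_like(x), x, x * x - 1.0], axis=-1)

def forecast_feature(cached_feats: np.ndarray, cached_steps: np.ndarray,
                     next_step: float, scale: float = 1.0) -> np.ndarray:
    """cached_feats: [n_cached, d]; cached_steps: [n_cached] timestep values."""
    B = hermite_basis(cached_steps, scale)                   # [n_cached, 3]
    coef, *_ = np.linalg.lstsq(B, cached_feats, rcond=None)  # [3, d]
    return hermite_basis(np.array([next_step]), scale) @ coef  # [1, d]

if __name__ == "__main__":
    steps = np.array([0.0, 1.0, 2.0, 3.0])
    feats = np.stack([0.5 * s ** 2 + 1.0 + np.zeros(4) for s in steps])  # quadratic feature
    pred = forecast_feature(feats, steps, next_step=4.0)
    assert np.allclose(pred, 0.5 * 16 + 1.0)  # exact for quadratic dynamics
```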
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1335. TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Process Reward Model Tabular Reasoning Tool Integration Test-time Scaling
Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored.
Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification.
We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM.
Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9\% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1336. Variational Reasoning for Language Models
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Models Variational Reasoning Reinforcement Learning
We introduce a **variational reasoning** framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where *an implicit weighting by model accuracy* naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models.
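For reference, the standard single-trace ELBO over a latent thinking trace z, together with its K-trace importance-weighted tightening, takes the familiar forms below (notation mine); the paper's multi-trace objective and forward-KL variant build on these.

```latex
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big]
  \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\right)

\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z_{1:K} \sim q_\phi}
     \Big[\log \tfrac{1}{K} \sum_{k=1}^{K}
       \tfrac{p_\theta(y, z_k \mid x)}{q_\phi(z_k \mid x, y)}\Big]
```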
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1337. SNaX: sparse narrow accelerated mixture of experts
๐ฌ 19
โญ 6.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Mixture of Experts GPU kernel
Mixture of Experts (MoE) models have emerged as the de-facto architecture for scaling up language models without significantly increasing the computational cost. Existing MoE methods optimize system efficiency or model architecture independently. We show that as MoE models get more granular and sparser, they become more memory-bound, and jointly optimizing the algorithms and the kernel design leads to a major improvement in MoE training throughput. We first propose a memory-efficient algorithm to compute the forward and backward of MoE with minimal activations saved. We then design GPU kernels that overlap memory IO latency with compute, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute brought by tile quantization. As a result, our method SNaX reduces activation memory by 45% and achieves a 1.80x throughput improvement on NVIDIA H100 GPUs compared to ScatterMoE for a fine-grained 7B MoE. Moreover, SNaX on 64 H100s achieves a training throughput of 213 billion tokens a day, comparable to ScatterMoE's 225 billion tokens a day on 96 H100s, for a 7B MoE model with token-choice routing trained with FSDP-2. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.18x speedup on kernel execution time compared to vanilla top-$K$ routing while maintaining similar downstream performance.
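One plausible realization of tile-aware token rounding (details assumed here, not the paper's exact rule): expert GEMMs are executed in fixed-size row tiles, so a partially filled last tile wastes compute; each expert can therefore keep only its highest-scoring assignments up to a multiple of the tile size.

```python
import torch

def round_expert_assignments(expert_ids, scores, num_experts, tile=128):
    """expert_ids, scores: 1-D tensors over routed (token, expert) assignments.
    Returns a boolean keep-mask of the same length."""
    keep = torch.zeros_like(scores, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        kept = (idx.numel() // tile) * tile           # largest full-tile multiple
        if kept == 0:
            continue
        order = scores[idx].argsort(descending=True)  # keep highest-scoring assignments
        keep[idx[order[:kept]]] = True
    return keep

if __name__ == "__main__":
    torch.manual_seed(0)
    expert_ids = torch.randint(0, 4, (1000,))
    scores = torch.rand(1000)
    keep = round_expert_assignments(expert_ids, scores, num_experts=4, tile=128)
    for e in range(4):
        assert keep[expert_ids == e].sum().item() % 128 == 0
```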
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1338. RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation
๐ฌ 14
โญ 6.00
๐ generative models
๐ท๏ธ Mean Flow Flow Matching Noise-injection Likelihood Maximization Multimodal Generation
Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves competitive, often (near) state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using 1-NFE, at a comparable computational cost to the baseline MeanFlows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1339. Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
๐ฌ 21
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Online video understanding; Video Question Answering; Vision-Language Models; Decision
Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63\% to 71.60\% on the StreamingBench benchmark.
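The timing rule can be pictured as a small controller: stream the video clip by clip, expose a progress score rho and a confidence score c at each step, and answer at the first step where both clear their thresholds, so the response time tracks the first-sufficient-evidence time. The sketch below assumes a hypothetical `step` callable returning (rho, confidence, draft_answer); the thresholds are placeholders.

```python
from typing import Callable, Iterable, Tuple

def answer_when_ready(clips: Iterable,
                      step: Callable[[object], Tuple[float, float, str]],
                      rho_thresh: float = 0.8,
                      conf_thresh: float = 0.7):
    """step(clip) -> (rho, confidence, draft_answer) for the stream so far."""
    for t, clip in enumerate(clips):
        rho, conf, draft = step(clip)
        print(f"t={t}: rho={rho:.2f}, c={conf:.2f}")   # externalized decision state
        if rho >= rho_thresh and conf >= conf_thresh:
            return t, draft                            # respond at time t_r = t
    return None, None                                  # evidence never became sufficient

if __name__ == "__main__":
    # Stub stream whose evidence becomes sufficient at the third clip.
    trace = [(0.3, 0.4, ""), (0.6, 0.6, ""), (0.9, 0.8, "the red car")]
    it = iter(trace)
    t_r, answer = answer_when_ready(range(3), lambda clip: next(it))
    assert (t_r, answer) == (2, "the red car")
```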
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1340. Relational Feature Caching for Accelerating Diffusion Transformers
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion transformer Feature Caching
Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps, and exploiting them for subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) an input feature of a module is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction. Specifically, we introduce relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors using the input features and performs full computations only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently outperforms prior approaches significantly. We will release our code publicly upon acceptance.
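A simplified scalar sketch of the input-output relationship idea (assumptions mine; the paper's estimator is richer): the module input is observed every step even when the module is skipped, so the input change can both rescale the cached output change and serve as a cheap trigger for full recomputation when the expected error is large.

```python
import torch

class RelationalCache:
    def __init__(self, module, change_thresh=0.35):
        self.module, self.change_thresh = module, change_thresh
        self.prev_in = self.prev_out = self.gain = None

    def __call__(self, x):
        if self.prev_in is None or self.gain is None:
            return self._full(x)                 # warm-up: two full steps needed
        rel_change = (x - self.prev_in).norm() / (self.prev_in.norm() + 1e-8)
        if rel_change > self.change_thresh:      # expected error too large: recompute
            return self._full(x)
        # Relational feature estimation: add a gain-scaled input change to the cached output.
        return self.prev_out + self.gain * (x - self.prev_in)

    def _full(self, x):
        y = self.module(x)
        if self.prev_in is not None:
            self.gain = ((y - self.prev_out).norm()
                         / ((x - self.prev_in).norm() + 1e-8)).item()
        self.prev_in, self.prev_out = x, y
        return y

if __name__ == "__main__":
    cache = RelationalCache(torch.nn.Linear(16, 16))
    x = torch.randn(1, 16)
    for _ in range(5):
        y = cache(x)
        x = x + 0.01 * torch.randn(1, 16)        # small drift between timesteps
```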
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1341. GhostEI-Bench: Do Mobile Agent Resilience to Environmental Injection in Dynamic On-Device Environments?
๐ฌ 16
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Mobile Agents Environmental Injection Benchmark GUI Agent Safety
Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile Graphical User Interfaces (GUIs).
However, their operation within dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection.
Unlike traditional prompt-based attacks that manipulate textual instructions, environmental injection contaminates the agent's visual perception by inserting adversarial UI elements, such as deceptive overlays or spoofed notifications, directly into the GUI.
This bypasses textual safeguards and can derail agent execution, leading to privacy leakage, financial loss, or irreversible device compromise.
To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark dedicated to assessing mobile agents under environmental injection attacks within dynamic, executable environments.
Moving beyond static image-based assessments, our benchmark injects adversarial events into realistic application workflows inside fully operational Android emulators, assessing agent performance across a range of critical risk scenarios.
We also introduce a novel evaluation protocol where a judge LLM performs fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding sequence of screenshots.
This protocol identifies the precise point of failure, whether in perception, recognition, or reasoning.
Our comprehensive evaluation of state-of-the-art agents reveals their profound vulnerability to deceptive environmental cues. The results demonstrate that current models systematically fail to perceive and reason about manipulated UIs.
GhostEI-Bench provides an essential framework for quantifying and mitigating this emerging threat, paving the way for the development of more robust and secure embodied agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1342. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
๐ฌ 19
โญ 6.00
๐ generative models
๐ท๏ธ block-wise training backpropagation-free training memory-efficient training
End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $\textit{DiffusionBlocks}$, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1343. Hyperspherical Latents Improve Continuous-Token Autoregressive Generation
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ autoregressive generation image generation diffusion
Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs---including after CFG---to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that the hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92).
To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.
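The core constraint is easy to state in code: every latent the AR model consumes or emits, including the classifier-free-guidance combination, is re-projected onto a fixed-radius hypersphere, which removes the scale component blamed for variance collapse. A minimal sketch (notation mine, not the authors' implementation):

```python
import torch

def to_sphere(z: torch.Tensor, radius: float = 1.0, eps: float = 1e-8) -> torch.Tensor:
    return radius * z / (z.norm(dim=-1, keepdim=True) + eps)

def guided_next_token(z_cond: torch.Tensor, z_uncond: torch.Tensor,
                      cfg_scale: float, radius: float = 1.0) -> torch.Tensor:
    z = z_uncond + cfg_scale * (z_cond - z_uncond)   # standard CFG extrapolation
    return to_sphere(z, radius)                      # re-project: norm stays constant

if __name__ == "__main__":
    z_c, z_u = torch.randn(2, 16), torch.randn(2, 16)
    out = guided_next_token(to_sphere(z_c), to_sphere(z_u), cfg_scale=4.0)
    assert torch.allclose(out.norm(dim=-1), torch.ones(2), atol=1e-5)
```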
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1344. UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity
๐ฌ 17
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ low-level vision image restoration all-in-one image restoration
Recently, considerable progress has been made in all-in-one image restoration. Generally, existing methods can be degradation-agnostic or degradation-aware. However, the former are limited in leveraging degradation estimation-based priors, and the latter suffer from the inevitable error in degradation estimation. Consequently, the performance of existing methods has a large gap compared to specific single-task models. In this work, we make a step forward in this topic, and present our UniRestorer with improved restoration performance. Specifically, we perform hierarchical clustering on degradation space, and train a multi-granularity mixture-of-experts (MoE) restoration model. Then, UniRestorer adopts both degradation and granularity estimation to adaptively select an appropriate expert for image restoration. In contrast to existing degradation-agnostic and -aware methods, UniRestorer can leverage degradation estimation to benefit degradation-specific restoration, and use granularity estimation to make the model robust to degradation estimation error. Experimental results show that our UniRestorer outperforms state-of-the-art all-in-one methods by a large margin, and is promising in closing the performance gap to specific single-task models. The code and pre-trained models will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1345. A Step to Decouple Optimization in 3DGS
๐ฌ 11
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3DGS optimizer regularization
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. Because 3DGS is an explicit representation optimized through gradient propagation among primitives, it directly adopts optimization practices widely accepted in deep neural networks (DNNs), such as synchronous weight updating and Adam with adaptive gradients. However, considering the physical significance and specific design of 3DGS, two details in its optimization are overlooked: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment estimates, which may lead to under- or over-effective regularization. Such complex coupling remains under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization. Through extensive experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, achieving better optimization efficiency and representation effectiveness simultaneously.
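A minimal sketch of the Sparse Adam component as described (my simplification, including a single global step count for bias correction): only Gaussians visible from the current viewpoint receive gradients, so the moment estimates and the parameter update are applied to those rows alone instead of touching the optimizer state of every primitive at every step.

```python
import torch

def sparse_adam_step(p, grad, m, v, visible, step, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8):
    """p, grad, m, v: [N, D] tensors; visible: bool [N] mask of rendered Gaussians."""
    b1, b2 = betas
    g = grad[visible]
    m[visible] = b1 * m[visible] + (1 - b1) * g
    v[visible] = b2 * v[visible] + (1 - b2) * g * g
    m_hat = m[visible] / (1 - b1 ** step)
    v_hat = v[visible] / (1 - b2 ** step)
    p[visible] -= lr * m_hat / (v_hat.sqrt() + eps)

if __name__ == "__main__":
    N, D = 1000, 3
    p, m, v = torch.randn(N, D), torch.zeros(N, D), torch.zeros(N, D)
    grad = torch.randn(N, D)
    visible = torch.rand(N) < 0.2
    before = p[~visible].clone()
    sparse_adam_step(p, grad, m, v, visible, step=1)
    assert torch.equal(p[~visible], before)   # unseen Gaussians are untouched
```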
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1346. Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Large Language Models Discrete Diffusion Models Inference Acceleration KV Cache AR-Diffusion Hybrid
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs to achieve rapid convergence. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to the vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1347. One-Step Flow Q-Learning: Addressing the Diffusion Policy Bottleneck in Offline Reinforcement Learning
๐ฌ 15
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement learning Diffusion Model Flow Matching Offline Reinforcement Learning
Diffusion Q-Learning (DQL) has established diffusion policies as a high-performing paradigm for offline reinforcement learning, but its reliance on multi-step denoising for action generation renders both training and inference slow and fragile. Existing efforts to accelerate DQL toward one-step denoising typically rely on auxiliary modules or policy distillation, sacrificing either simplicity or performance. It remains unclear whether a one-step policy can be trained directly without such trade-offs. To this end, we introduce One-Step Flow Q-Learning (OFQL), a novel framework that enables effective one-step action generation during both training and inference, without auxiliary modules or distillation. OFQL reformulates the DQL policy within the Flow Matching (FM) paradigm but departs from conventional FM by learning an average velocity field that directly supports accurate one-step action generation. This design removes the need for multi-step denoising and backpropagation-through-time updates, resulting in substantially faster and more robust learning. Extensive experiments on the D4RL benchmark show that OFQL, despite generating actions in a single step, not only significantly reduces computation during both training and inference but also outperforms multi-step DQL by a large margin. Furthermore, OFQL surpasses all other baselines, achieving state-of-the-art performance in D4RL.
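Schematically (not the paper's exact objective), the policy is a network interpreted as an average velocity over the whole noise-to-action path, so an action is produced in one step as a = z + u_theta(s, z) with z drawn from a Gaussian. The sketch below pairs that one-step sampler with a DQL-style actor loss combining Q-value maximization and a simple regression toward dataset actions; OFQL's actual flow-matching term is more involved.

```python
import torch
import torch.nn as nn

class OneStepFlowPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state, noise):
        # One-step action: noise plus a learned average velocity over the path.
        return noise + self.net(torch.cat([state, noise], dim=-1))

def actor_loss(policy, q_fn, states, dataset_actions, bc_weight=1.0):
    noise = torch.randn_like(dataset_actions)
    actions = policy(states, noise)
    return bc_weight * ((actions - dataset_actions) ** 2).mean() - q_fn(states, actions).mean()

if __name__ == "__main__":
    pol = OneStepFlowPolicy(state_dim=11, action_dim=3)
    q = lambda s, a: -(a ** 2).sum(-1)                 # stand-in critic
    s, a = torch.randn(32, 11), torch.randn(32, 3)
    actor_loss(pol, q, s, a).backward()
```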
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1348. Sample Lottery: Unsupervised Discovery of Critical Instances for LLM Reasoning
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Reinforcement Learning with Verifiable Reward
Reinforcement Learning with Verifiable Reward (RLVR) has equipped large language models (LLMs) with the capability of reasoning over complicated logical problems through policy optimization. However, conventional methods require complete annotation of the entire dataset and allocate computation uniformly over all samples. We articulate the lottery sample hypothesis in policy optimization of LLMs: a large training set contains a small subset that, when trained alone, yields performance comparable to that of the full dataset. This paper therefore explores the following question: How can we identify these lottery-winning samples from the original dataset without access to answers? Unlike prior efforts that analyze the effect of different samples in the training set with complete annotation, this paper focuses on the unsupervised discovery of critical instances for LLM reasoning and proposes a novel framework termed Complementary Conformal Selection (CONST). Specifically, CONST evaluates the importance of samples by considering two complementary components: procedural volatility and outcome volatility. Procedural volatility measures the potential variations during the LLM's reasoning process, while outcome volatility captures inconsistencies in the final answer. Subsequently, conformal prediction is used to obtain a prediction set whose cardinality serves as the criterion for selecting the lottery-winning samples for annotation. We also provide a theoretical analysis, showing that CONST can effectively approximate the optimal policy. Extensive experiments on various LLMs across different datasets demonstrate the effectiveness of CONST.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1349. ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
๐ฌ 23
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Unified Multimodal Model Generation Benchmark Cross-modal Reasoning
Unified multimodal models (UMMs) have shown remarkable advances in jointly understanding and generating text and images. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning: textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. As such, existing benchmarks rarely require the use of one modality to guide, verify, or refine outputs in the other. They therefore fail to capture a central aspiration of unified multimodal models, namely to support seamless reasoning across modalities. We address this gap with **ROVER**, a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1,285 tasks grounded in 2,048 images, spanning two complementary settings. **Verbally-augmented reasoning for visual generation** evaluates whether models can use structured verbal prompts and reasoning chains to guide faithful image synthesis. **Visually-augmented reasoning for verbal generation** evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes. Experiments on 17 state-of-the-art UMMs reveal two key findings: (i) cross-modal reasoning capabilities strongly correlate with visual generation performance, particularly for interleaved image-text generation; and (ii) current models remain severely limited in visually-augmented reasoning, showing relative strength in perception and physical modeling but weakness in logical tasks. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation. More information on **Anonymous Page**: https://anony0923.github.io
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1350. ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation
๐ฌ 16
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Offline reinforcement learning support constraint flow model
Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed behavior policy dataset without additional environment interaction. One common challenge that arises in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM learns a BC flow policy with a bounded source distribution to capture the support of the action distribution, then optimizes a reflected flow that generates bounded noise for the BC flow while keeping the support, to maximize the performance. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality and using a constant set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on the performance profile curves.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1351. Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Masked Diffusion Models Training Variance Training Stability Mask Schedule Mask Sampling
Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from **inherently** much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. Currently, there has been no theoretical explanation or systematic solution. In this paper, we derive **the first decomposition** of MDM training variance into three sources: {A} masking pattern noise, {B} masking rate noise, and {C} data noise -- while ARMs are only affected by {C}. This cleanly explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a **Pareto-optimal** $t$-sampler that minimizes training variance by sampling harder $t$ values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce {A}. Experiments show that, compared to standard MDM training, our methods improve accuracy by **7-8\%** on complex reasoning tasks, while simultaneously reducing run-to-run variability to **near ARM levels**, substantially narrowing the gap with strong ARM baselines; in most settings, even the best runs of baseline methods remain below the worst run of our method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1352. No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
๐ฌ 20
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models reinforcement learning with verifiable rewards llm reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward, so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
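The mechanism can be sketched as an advantage-shaping rule (one way to realize the idea; the exact shaping function is the paper's): prompts whose responses differ in reward keep the usual group-normalized GRPO advantage, while zero-variance prompts receive a signed advantage modulated per token by policy entropy, so they still contribute a learning signal.

```python
import torch

def shaped_advantages(rewards: torch.Tensor, token_entropy: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """rewards: [G] per-response 0/1 rewards; token_entropy: [G, T] per token.
    Returns token-level advantages of shape [G, T]."""
    if rewards.std() > 0:                                # ordinary GRPO prompt
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        return adv[:, None].expand_as(token_entropy)
    sign = 1.0 if rewards[0] > 0 else -1.0               # zero-variance prompt
    ent = token_entropy / (token_entropy.mean() + 1e-6)  # normalized token entropy
    return sign * alpha * ent

if __name__ == "__main__":
    ent = torch.rand(4, 8)
    print(shaped_advantages(torch.tensor([1., 0., 1., 1.]), ent).shape)   # mixed group
    print(shaped_advantages(torch.tensor([1., 1., 1., 1.]), ent).mean())  # all correct -> positive
```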
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1353. Know When to Abstain: Optimal Selective Classification with Likelihood Ratios
๐ฌ 12
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ selective classification
Selective classification enhances the reliability of predictive models by allowing them to abstain from making uncertain predictions. In this work, we revisit the design of optimal selection functions through the lens of the NeymanโPearson lemma, a classical result in statistics that characterizes the optimal rejection rule as a likelihood ratio test. We show that this perspective not only unifies the behavior of several post-hoc selection baselines, but also motivates new approaches to selective classification which we propose here. A central focus of our work is the setting of covariate shift, where the input distribution at test time differs from that at training. This realistic and challenging scenario remains relatively underexplored in the context of selective classification. We evaluate our proposed methods across a range of vision and language tasks, including both supervised learning and vision-language models. Our experiments demonstrate that our Neyman-Pearson-informed methods consistently outperform existing baselines, indicating that likelihood ratio-based selection offers a robust mechanism for improving selective classification under covariate shifts.
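A toy instance of a likelihood-ratio selection rule in the Neyman-Pearson spirit (my construction for illustration, not the paper's): fit one-dimensional Gaussians to a confidence score on held-out correct versus incorrect predictions, then abstain whenever the ratio p_correct(s)/p_incorrect(s) falls below a threshold tau.

```python
import numpy as np
from scipy.stats import norm

def fit_lr_selector(scores: np.ndarray, correct: np.ndarray):
    """Fit class-conditional Gaussians to the score and return the likelihood-ratio function."""
    mu1, sd1 = scores[correct].mean(), scores[correct].std() + 1e-6
    mu0, sd0 = scores[~correct].mean(), scores[~correct].std() + 1e-6
    return lambda s: norm.pdf(s, mu1, sd1) / (norm.pdf(s, mu0, sd0) + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Held-out calibration data: correct predictions tend to have higher scores.
    correct = rng.random(2000) < 0.7
    scores = np.where(correct, rng.normal(0.8, 0.1, 2000), rng.normal(0.5, 0.15, 2000))
    lr = fit_lr_selector(scores, correct)
    accept = lr(scores) >= 1.0                  # predict only when the LR clears tau = 1
    assert correct[accept].mean() > correct.mean()  # selective accuracy improves
```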
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1354. EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
๐ฌ 20
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Agents Test Time Learning
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1355. How to Lose Inherent Counterfactuality in Reinforcement Learning
๐ฌ 9
โญ 6.00
๐ reinforcement learning
๐ท๏ธ counterfactuality inherent skills reinforcement learning
Learning in high-dimensional MDPs with complex state dynamics became possible with the progress achieved in reinforcement learning research.
At the same time, deep neural policies have been observed to be highly unstable with respect to minor variations in their state space, causing volatile and unpredictable behaviour.
To alleviate this volatility, a line of work has suggested techniques that cope with the problem by explicitly regularizing the temporal difference loss to ensure local $\epsilon$-invariance in the state space.
In this paper, we provide theoretical foundations on the impact of $\epsilon$-local invariance training on the deep neural policy manifolds.
Our comprehensive theoretical and experimental analysis reveals that standard reinforcement learning inherently learns counterfactual values while recent training techniques that focus on explicitly enforcing $\epsilon$-local invariance cause policies to lose counterfactuality, and further result in learning misaligned and inconsistent values.
In connection to this analysis, we further highlight that this line of training methods breaks the core intuition and the biological inspiration of reinforcement learning, and introduces an intrinsic gap between how natural intelligence understands and interacts with an environment and how AI agents trained via $\epsilon$-local invariance methods do. The misalignment, inaccuracy, and loss of counterfactuality revealed in our paper further demonstrate the need to rethink how truly reliable and generalizable reinforcement learning policies are established.
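For concreteness, the kind of objective this line of work studies can be sketched as a TD loss plus an $\epsilon$-ball invariance penalty. The uniform perturbation and the penalty weight `lam` below are illustrative assumptions, not a specific published method.

```python
import torch

def td_loss_with_local_invariance(q_net, s, a, r, s_next, gamma=0.99, eps=0.01, lam=1.0):
    """q_net: state -> Q-values; s, s_next: (B, d) float; a: (B,) long; r: (B,) float."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values
    td = torch.nn.functional.mse_loss(q_sa, target)
    # penalize Q-value changes under small, uniformly sampled state perturbations
    delta = (torch.rand_like(s) * 2 - 1) * eps
    inv = torch.nn.functional.mse_loss(q_net(s + delta), q_net(s))
    return td + lam * inv
```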
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1356. AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ Video-to-Audio Generation; Audio Generation;
Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data (e.g., conflating acoustically distinct sounds like different dog barks under coarse labels), and textual ambiguity in describing microacoustic features (e.g., "metallic clang" failing to distinguish impact transients and resonance decay). These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose **AC-Foley**, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables: fine-grained sound synthesis (e.g., footsteps with distinct timbres on wood, marble, or gravel), timbre transfer (e.g., transforming a violin's melody into the bright, piercing tone of a suona), zero-shot generation of sounds (e.g., creating unique weapon sound effects without training on firearm datasets) and better audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with SOTA video-to-audio methods even without audio conditioning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1357. Out of the Memory Barrier: A Highly Memory-Efficient Training System for LLMs with Million-Token Contexts
๐ฌ 19
โญ 6.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ LLM NLP Long-Context LLM Memory Efficient Training
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint ($\mathcal{O}(1)$) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available for review at https://anonymous.4open.science/r/oomb/README.md.
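A minimal PyTorch sketch of the chunk-recurrent idea (not the OOMB system itself): the sequence is consumed in fixed-size chunks whose activations are recomputed during the backward pass, so activation memory stays roughly constant while the running KV state passed between chunks becomes the dominant cost. The `block` callable and its cache format are assumptions.

```python
import torch
from torch.utils.checkpoint import checkpoint

def chunk_recurrent_forward(block, x, chunk=4096):
    """block(chunk_input, kv) -> (chunk_output, new_kv); x: (seq_len, dim).
    Each chunk's activations are recomputed in the backward pass."""
    outs = []
    kv = torch.zeros(0, x.size(1), device=x.device, dtype=x.dtype)  # growing KV state
    for start in range(0, x.size(0), chunk):
        out, kv = checkpoint(block, x[start:start + chunk], kv, use_reentrant=False)
        outs.append(out)
    return torch.cat(outs, dim=0)
```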
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1358. On the Universality and Complexity of GNN for Solving Second-order Cone Programs
๐ฌ 13
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph neural network second order cone programming learning to optimize Weisfeiler-Lehman test sample complexity
Graph Neural Networks (GNNs) have demonstrated both empirical efficiency and universal expressivity for solving constrained optimization problems such as linear and quadratic programming. However, extending this paradigm to more general convex problems with universality guarantees, particularly Second-Order Cone Programs (SOCPs), remains largely unexplored.
We address this challenge by proposing a novel graph representation that captures the inherent structure of conic constraints. We then establish a key universality theorem: *there exist GNNs that can provably approximate essential SOCP properties, including instance feasibility and optimal solutions*. We further derive the sample complexity for GNN generalization based on Rademacher complexity, filling an important gap for Weisfeiler-Lehman-based GNNs in learning-to-optimize paradigms.
Our results provide a rigorous foundation linking GNN expressivity and generalization power to conic optimization structure, opening new avenues for scalable, data-driven SOCP solvers. The approach extends naturally to $p$-order cone programming for any $p \geq 1$ while preserving universal expressivity and requiring no structural modifications to the GNN architecture. Numerical experiments on randomly generated SOCPs and real-world power grid problems demonstrate the effectiveness of our approach, achieving superior prediction accuracy with significantly fewer parameters than fully connected neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1359. LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
๐ฌ 15
โญ 6.00
๐ generative models
๐ท๏ธ Video Generation Video Customization Diffusion Models Multi-Subject Generation Face-Attribute Alignment
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose $\textbf{\textit{LumosX}}$, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that $\textit{LumosX}$ achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1360. Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo
๐ฌ 14
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Posterior Sampling Sampling
We study the problem of posterior sampling in the context of score based generative models. We have a trained score network for a prior $p(x)$, a measurement model $p(y|x)$, and are tasked with sampling from the posterior $p(x|y)$. Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general "tilting" problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time.
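A hedged sketch of one annealed Langevin step for approximate posterior sampling, assuming `score_net(x, sigma)` estimates the noised-prior score and `log_likelihood(x)` is a differentiable measurement log-likelihood. This is a generic recipe included to ground the setup, not necessarily the paper's exact algorithm.

```python
import torch

def annealed_langevin_step(x, score_net, log_likelihood, sigma, step_size):
    """One Langevin update using the noised-prior score plus the measurement score."""
    x = x.detach().requires_grad_(True)
    lik_grad = torch.autograd.grad(log_likelihood(x).sum(), x)[0]
    posterior_score = score_net(x, sigma) + lik_grad   # prior score + measurement score
    noise = torch.randn_like(x)
    return (x + step_size * posterior_score + (2.0 * step_size) ** 0.5 * noise).detach()
```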
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1361. Non-Asymptotic Analysis of Efficiency in Conformalized Regression
๐ฌ 20
โญ 6.00
๐ optimization
๐ท๏ธ conformal prediction efficiency conformalized regression quantile regression uncertainty quantification
Conformal prediction provides prediction sets with coverage guarantees. The informativeness of conformal prediction depends on its efficiency, typically quantified by the expected size of the prediction set. Prior work on the efficiency of conformalized regression commonly treats the miscoverage level $\alpha$ as a fixed constant. In this work, we establish non-asymptotic bounds on the deviation of the prediction set length from the oracle interval length for conformalized quantile and median regression trained via SGD, under mild assumptions on the data distribution. Our bounds of order $\mathcal{O}(1/\sqrt{n} + 1/(\alpha^2 n) + 1/\sqrt{m} + \exp(-\alpha^2 m))$ capture the joint dependence of efficiency on the proper training set size $n$, the calibration set size $m$, and the miscoverage level $\alpha$. The results identify phase transitions in convergence rates across different regimes of $\alpha$, offering guidance for allocating data to control excess prediction set length. Empirical results are consistent with our theoretical findings.
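The quantities in the bound become concrete with the standard split-conformal recipe for quantile regression. The sketch below is the usual CQR calibration step, with assumed pre-trained quantile models, included only to fix what $m$, $\alpha$, and the prediction-set length refer to.

```python
import numpy as np

def cqr_interval(q_lo, q_hi, X_cal, y_cal, X_test, alpha=0.1):
    """q_lo / q_hi: trained lower/upper quantile regressors (callables on arrays)."""
    lo, hi = q_lo(X_cal), q_hi(X_cal)
    scores = np.maximum(lo - y_cal, y_cal - hi)               # conformity scores
    m = len(y_cal)
    level = min(np.ceil((m + 1) * (1 - alpha)) / m, 1.0)       # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return q_lo(X_test) - q, q_hi(X_test) + q                  # calibrated interval
```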
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1362. Reverse-Engineered Reasoning for Open-Ended Generation
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reasoning open-ended generation synthetic data
While the "deep reasoning" paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning, reinforcement learning (RL) and instruction distillation, falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process "forwards" through trial-and-error or imitation, REER works "backwards" from known good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1363. MaskInversion: Localized Embeddings via Optimization of Explainability Maps
๐ฌ 9
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ vision encoder localized embedding CLIP
Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions.
To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time.
MaskInversion starts with initializing an embedding token and compares its explainability map, derived from the pretrained model, to the query mask.
The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, allowing MaskInversion to be used with any pre-trained model.
As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation.
The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation. We evaluate the proposed method on all those tasks on several datasets such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7 and show its capabilities compared to other SOTA approaches.
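The optimization loop is easy to picture; the sketch below is a conceptual reconstruction in which `explain_map` is an assumed differentiable helper and the MSE objective stands in for whatever discrepancy the paper actually minimizes (the gradient-decomposition trick is omitted).

```python
import torch

def mask_inversion(model, image, mask, explain_map, steps=100, lr=0.05):
    """explain_map(model, image, emb) -> differentiable (H, W) explainability map."""
    emb = torch.zeros(model.embed_dim, requires_grad=True)   # only this vector is trained
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        heat = explain_map(model, image, emb)
        loss = torch.nn.functional.mse_loss(heat, mask)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()  # localized region embedding; the backbone stays frozen
```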
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1364. Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers
๐ฌ 12
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ unlearnable examples data protection linear model shortcut linearity
Collecting web data to train deep models has become increasingly common, raising concerns about unauthorized data usage. To mitigate this issue, unlearnable examples introduce imperceptible perturbations into data, preventing models from learning effectively. However, existing methods typically rely on deep neural networks as surrogate models for perturbation generation, resulting in significant computational costs. In this work, we propose Perturbation-Induced Linearization (PIL), a computationally efficient yet effective method that generates perturbations using only linear surrogate models. PIL achieves comparable or better performance than existing surrogate-based methods while reducing computational time dramatically. We further reveal a key mechanism underlying unlearnable examples: inducing linearization to deep models, which explains why PIL can achieve competitive results in a very short training time. Beyond this, we provide an analysis about the property of unlearnable examples under percentage-based partial perturbation. Our work not only provides a practical approach for data protection but also offers insights into what makes unlearnable examples effective.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1365. Train on Validation (ToV): Fast data selection with applications to fine-tuning
๐ฌ 13
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Data selection Influence function Instruction tuning LLM
State-of-the-art machine learning often follows a two-stage process: $(i)$ pre-training on large, general-purpose datasets; $(ii)$ fine-tuning on task-specific data. In fine-tuning, selecting training examples that closely reflect the target distribution is crucial. However, it is often the case that only a few samples are available from the target distribution. Existing data selection methods treat these target samples as a validation set and estimate the effect of adding or removing a single sample from the training pool by performing inference on the validation set.
We propose a simpler and faster alternative that inverts the usual role of train and validation: we perform inference on the training pool before and after fine-tuning on the validation set. We then select samples whose predictions change the most. Our key insight is that the training samples most affected by fine-tuning on a small validation set tend to be the most beneficial for reducing test loss on the target distribution. Experiments on instruction tuning and named entity recognition tasks show that, in most cases, our method achieves lower test log-loss than state-of-the-art approaches. We support our findings with theoretical analysis.
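A rough sketch of the role-inverted selection idea: briefly fine-tune on the small target (validation) set, then rank training-pool examples by how much their predictions moved. The function names are placeholders, and the Euclidean prediction shift is an assumed measure, not necessarily the paper's.

```python
import numpy as np

def select_by_prediction_change(predict, finetune_on, train_pool, val_set, k):
    """predict(X) -> (N, C) predictions; finetune_on(val) -> a new predict function."""
    before = np.asarray(predict(train_pool))
    predict_after = finetune_on(val_set)        # brief fine-tuning on the target samples
    after = np.asarray(predict_after(train_pool))
    change = np.linalg.norm(after - before, axis=-1)   # per-example prediction shift
    return np.argsort(-change)[:k]                     # indices of the most-affected examples
```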
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1366. ProstaTD: Bridging Surgical Triplet from Classification to Fully Supervised Detection
๐ฌ 17
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Surgical Triplet Endoscopy Benchmark Dataset Evaluation
Surgical triplet detection is a critical task in surgical video analysis, with significant implications for performance assessment and training novice surgeons. However, existing datasets like CholecT50 lack precise spatial bounding box annotations, rendering triplet classification at the image level insufficient for practical applications. The inclusion of bounding box annotations is essential to make this task meaningful, as they provide the spatial context necessary for accurate analysis and improved model generalizability. To address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy. ProstaTD offers clinically defined temporal boundaries and high-precision bounding box annotations for each structured triplet activity. The dataset comprises 71,775 video frames and 196,490 annotated triplet instances, collected from 21 surgeries performed across multiple institutions, reflecting a broad range of surgical practices and intraoperative conditions. The annotation process was conducted under rigorous medical supervision and involved more than 60 contributors, including practicing surgeons and medically trained annotators, through multiple iterative phases of labeling and verification. To further facilitate future general-purpose surgical annotation, we developed two tailored labeling tools to improve efficiency and scalability in our annotation workflows. In addition, we created a surgical triplet detection evaluation toolkit that enables standardized and reproducible performance assessment across studies. ProstaTD is the largest and most diverse surgical triplet dataset to date, moving the field from simple classification to full detection with precise spatial and temporal boundaries and thereby providing a robust foundation for fair benchmarking.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1367. Neural Force Field: Few-shot Learning of Generalized Physical Reasoning
๐ฌ 14
โญ 6.00
๐ causal reasoning
๐ท๏ธ Physical reasoning few-shot learning
Physical reasoning is a remarkable human ability that enables rapid learning and generalization from limited experience. Current AI models, despite extensive training, still struggle to achieve similar generalization, especially in out-of-distribution (OOD) settings. This limitation stems from their inability to abstract core physical principles from observations. A key challenge is developing representations that can efficiently learn and generalize physical dynamics from minimal data. Here we present Neural Force Field (NFF), a framework extending Neural Ordinary Differential Equation (NODE) to learn complex object interactions through force field representations, which can be efficiently integrated through an ordinary differential equation (ODE) solver to predict object trajectories. Unlike existing approaches that rely on discrete latent spaces, NFF captures fundamental physical concepts such as gravity, support, and collision in continuous explicit force fields. Experiments on three challenging physical reasoning tasks demonstrate that NFF, trained with only a few examples, achieves strong generalization to unseen scenarios. This physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement. Our work suggests that incorporating physics-inspired representations into learning systems can help bridge the gap between artificial and human physical reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1368. Parallel Sampling from Masked Diffusion Models via Conditional Independenceย Testing
๐ฌ 9
โญ 6.00
๐ generative models
๐ท๏ธ masked diffusion models language models inference
Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling, rather than sequential, left-to-right generation. This means potentially much faster inference. However, effective parallel sampling faces two competing requirements: (i) simultaneously updated tokens must be conditionally independent, and (ii) updates should prioritise high-confidence predictions. These goals conflict because high-confidence predictions often cluster and depend on each other, limiting opportunities for parallel updates.
We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our method identifies token dependencies and removes lower-confidence tokens from conflicting groups. This produces sets of indices for unmasking that satisfy both independence and confidence criteria. Our approach ensures improved parallel unmasking through approximate conditional independence testing.
Our experiments show that PUNT delivers a superior trade-off between accuracy and compute when compared to other strong training-free baselines, especially for generation of longer sequences. On the IFEval benchmark, it achieves up to 16% higher accuracy over baseline methods, including sequential generation (one-by-one). These gains hold across different values of hyperparameters, mitigating the need for brittle hyperparameter tuning. Moreover, we observe that PUNT induces an emergent hierarchical generation strategy, where the model first establishes high-level paragraph structure before local refinement, suggesting a planning-like generation process that contributes to strong alignment performance.
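A toy sketch of confidence-then-dependence filtering for parallel unmasking. `predict_dist` is an assumed helper returning per-position token distributions, and the greedy pairwise check is a simplification of, not a faithful rendering of, the paper's conditional-independence test.

```python
import numpy as np

def pick_parallel_unmask(seq, masked_positions, predict_dist, conf_thresh=0.9, dep_thresh=0.05):
    """predict_dist(seq) -> (L, V) per-position token distributions (numpy array)."""
    base = predict_dist(seq)
    conf = base.max(axis=-1)
    candidates = sorted((i for i in masked_positions if conf[i] >= conf_thresh),
                        key=lambda i: -conf[i])
    chosen, committed = [], np.array(seq)
    for i in candidates:
        trial = committed.copy()
        trial[i] = int(base[i].argmax())
        new = predict_dist(trial)
        # accept i only if committing it barely moves the already-chosen positions
        if all(np.abs(new[j] - base[j]).sum() < dep_thresh for j in chosen):
            chosen.append(i)
            committed = trial
    return chosen  # positions that can be unmasked in parallel this step
```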
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1369. Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Compound AI System Heterogenous Configuration Optimization Local Rewards
Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1370. ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Alignment Benchmark Agent LLM Agent evaluation Decision-making
As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1371. Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
๐ฌ 20
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety monitoring Polynomial classifiers Interpretability
Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our anonymous code is available at https://anonymous.4open.science/r/tpc-anon-0708.
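An illustrative truncated-polynomial monitor, evaluated term by term with an early exit once the partial score is clearly above or below threshold. The elementwise-power feature map and the fixed exit margin are assumptions; the paper's cascade may calibrate exits differently.

```python
import numpy as np

def tpc_score(x, term_weights, threshold=0.0, margin=1.0):
    """term_weights[d]: coefficients for the degree-(d+1) elementwise feature x**(d+1)."""
    score = 0.0
    for d, w in enumerate(term_weights):
        score += float(w @ (x ** (d + 1)))       # add one polynomial term
        if abs(score - threshold) > margin:      # clear case: exit early
            return score, d + 1                  # partial score, number of terms used
    return score, len(term_weights)              # ambiguous case: use all terms
```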
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1372. VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ JEPA VLM video-language efficiency
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model can focus on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that can reduce the number of decoding operations by approximately 2.85x while maintaining similar performance compared to dense non-adaptive uniform decoding. Beyond generation, the embedding-space formulation naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves comparable performance to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE, and POPEv2, despite only having 1.6B parameters.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1373. What Generative Search Engines Like and How to Optimize Web Content Cooperatively
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ generative engine optimization generative engines preference rule discovery reinforcement learning
By employing large language models (LLMs) to retrieve documents and generate natural language responses, Generative Engines, such as Google AI overview and ChatGPT, provide significantly enhanced user experiences and have rapidly become the new form of search. Their rapid adoption also drives the need for Generative Engine Optimization (GEO), as content providers are eager to gain more traction from them. In this paper, we introduce AutoGEO, a framework to automatically learn generative engine preferences when using retrieved contents for response generation, and rewrite web contents for more such traction. AutoGEO first prompts frontier LLMs to explain generative engine preferences and extract meaningful preference rules from these explanations. Then it uses preference rules as context engineering for AutoGEO$\_\text{API}$, a prompt-based GEO system, and as rule-based rewards to train AutoGEO$\_\text{Mini}$, a cost-effective GEO model. Experiments on the standard GEO-Bench and two newly constructed benchmarks using real user queries demonstrate the effectiveness of AutoGEO in enhancing content traction while preserving search utility. Analyses confirm the learned rules' robustness and ability to capture distinct preferences across domains, and the AutoGEO systems' ability to embed them in content optimization. The learned preference rules, our models, and codes will be open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1374. The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
๐ฌ 18
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Long Horizon Agents
Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illusion of slowing progress, as even marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. So, we propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. First, we find that larger models can correctly execute significantly more turns even when small models have near-perfect single-turn accuracy. We then observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations---curiously, we observe a self-conditioning effect---models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. But, we find that thinking mitigates self-conditioning, and also enables execution of much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of tasks they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1375. Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method
๐ฌ 8
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Semantic uncertainty Large language models quantum physics
Large language models (LLMs) exhibit strong generative capabilities but remain vulnerable to confabulations, fluent yet unreliable outputs that vary arbitrarily even under identical prompts. Leveraging a quantum tensor network-based pipeline, we propose a quantum physics-inspired uncertainty quantification framework that accounts for the aleatoric uncertainty in token sequence probability for semantic equivalence-based clustering of LLM generations. In turn, this offers a principled and interpretable scheme for hallucination detection. We further introduce an entropy-maximization strategy that prioritizes high-certainty, semantically coherent outputs and highlights entropy regions where LLM decisions are likely to be unreliable, offering practical guidelines for when human oversight is warranted. We evaluate the robustness of our scheme under different generation lengths and quantization levels, dimensions overlooked in prior studies, demonstrating that our approach remains reliable even in resource-constrained deployments. A total of 116 experiments on TriviaQA, NQ, SVAMP, and SQuAD across multiple architectures (Mistral-7B, Mistral-7B-instruct, Falcon-rw-1b, LLaMA-3.2-1b, LLaMA-2-13b-chat, LLaMA-2-7b-chat, LLaMA-2-13b and LLaMA-2-7b) show consistent improvements in AUROC and AURAC over state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1376. GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
๐ฌ 13
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Geometric Reasoning Benchmarking Foundation Models
Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra, including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes, covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via non-linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation.
Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ will be publicly available, providing a structured platform to highlight and address critical gaps in geometric intelligence, facilitating future progress in robust, geometry-aware representation learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1377. Debiased and Denoised Projection Learning for Incomplete Multi-view Clustering
๐ฌ 6
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Incomplete multi-view clustering projection debiasing and denoising robust contrastive learning.
Multi-view clustering achieves outstanding performance but relies on the assumption of complete multi-view samples. However, certain views may be partially unavailable due to failures during acquisition or storage, resulting in distribution shifts across views. Although some incomplete multi-view clustering (IMVC) methods have been proposed, they still confront the following limitations: 1) Missing-view data imputation methods introduce unnecessary computational complexity; 2) Consensus representation imputation methods typically ignore the inter-view distribution bias due to missing views. To tackle these issues, we propose a novel IMVC method based on projection debiasing and denoising (PDD). Specifically, it utilizes the unbiased projection learned from complete views to refine the biased projection learned from data with missing views. Additionally, we introduce a robust contrastive learning scheme for consensus projection to mitigate the cluster collapse risk induced by misalignment noise. Comprehensive experiments demonstrate that PDD achieves superior performance compared with state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1378. Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks Preserving Action Understanding Ability
๐ฌ 15
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Video Understanding & Activity Analysis
Temporal Video Grounding (TVG) aims to localize video segments corresponding to a given textual query, which often describes human actions. However, we observe that current methods, usually optimizing for high temporal Intersection-over-Union (IoU), frequently struggle to accurately recognize or understand the underlying actions in both the video and query, thus reducing the effectiveness of these methods. To address this, we propose a novel TVG framework that integrates inversion-based TVG as auxiliary objectives to maintain the model's action understanding ability. We introduce three kinds of inversion TVG tasks derived from the original TVG annotations: (1) Verb Completion, predicting masked verbs (actions) in queries given video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions containing query-relevant actions given video segments. These inversion tasks are entirely derived from the original TVG tasks and are probabilistically integrated with them within a reinforcement learning framework. By leveraging carefully designed reward functions, the model preserves its ability to understand actions, thereby improving the accuracy of temporal grounding. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1\% improvement in R1@0.7 on Charades-STA for a 3B model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1379. Data Selection for LLM Alignment Using Fine-Grained Preferences
๐ฌ 13
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Data Selection Preference Alignment
Large language model (LLM) alignment aims to ensure that the behavior of LLMs meets human preferences. While collecting data for multiple fine-grained, aspect-specific preferences is becoming increasingly feasible, existing alignment methods typically work on a single preference and thus struggle with conflicts inherent in such aggregated datasets. As one early attempt, in this paper, we propose a data-centric approach to align LLMs through the effective use of fine-grained preferences. Specifically, we formulate the problem as a direct fine-grained preference optimization and introduce preference divergence (PD), which quantifies inter-aspect preference conflicts. Instead of directly tackling the consequent complicated optimization, we recast it as a data selection problem and propose a simple yet effective strategy, which identifies a subset of data corresponding to the most negative PD values, for efficient training. We theoretically analyze the loss-bound optimality of our selection strategy and conduct extensive empirical studies on varied settings and datasets to demonstrate that our practical selection method could achieve consistent improvement against standard full-data alignment, using even just 30% of the data. Our work supports the view that LLM alignment using fine-grained preferences is highly feasible.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1380. PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
๐ฌ 11
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ AI safety annotator disagreement personalized alignment value pluralism benchmark
Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions: the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that relate to imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1381. Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement learning; Large Language Model; Active Learning; Reasoning
Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting, due to ignoring \textbf{objective uncertainty} when only selecting by subjective uncertainty. This work proposes an \textbf{uncertainty consistency} metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC estimation is difficult. Therefore, we introduce a new online variant, computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30\% of the data, effectively reducing the cost of RLVR for reasoning tasks. The code is available at https://anonymous.4open.science/r/uncertainty-consistency-235C.
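For reference, the offline alignment check boils down to the standard point-biserial correlation between a model's continuous subjective uncertainty and its binary correctness; the snippet below is that textbook formula only, not the paper's online variant, which is built from normalized advantages instead.

```python
import numpy as np

def point_biserial(uncertainty, correct):
    """Correlation between continuous subjective uncertainty and binary correctness."""
    u = np.asarray(uncertainty, dtype=float)
    c = np.asarray(correct, dtype=bool)
    p = c.mean()                                   # fraction of correct answers
    return (u[c].mean() - u[~c].mean()) / u.std() * np.sqrt(p * (1.0 - p))
```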
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1382. Neodragon: Mobile Video Generation Using Diffusion Transformer
๐ฌ 11
โญ 6.00
๐ generative models
๐ท๏ธ Text to Video Generation Flow Matching Diffusion Transformer Diffusion Models Mobile Video Generation Step Distillation Block Pruning Text-Encoder Distillation Asymmetric Decoder Distillation
We propose Neodragon, a video DiT (Diffusion Transformer) designed to run on a low-power NPU present in devices such as phones and laptop computers. We demonstrate that, despite video transformers' huge memory and compute cost, mobile devices can run these models when carefully optimised for efficiency. To achieve this level of efficiency, i) we replace the original large Text-Encoder with a much smaller one with minimal quality loss through our novel distillation framework which doesn't require any image or video data. ii) We propose an Asymmetric Decoder distillation approach which allows us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the video generation pipeline. iii) With our Block Pruning strategy, we remove entire blocks from the MMDiT denoiser based on their relative importance and recover original performance through a two-stage distillation process. iv) We reduce the diffusion sampling cost using our novel extended version of DMD (Distribution Matching Distillation) for the Pyramidal Flow-Matching objective. Neodragon generates 49 frames of [640$\times$1024] resolution within 7.6 seconds on the Qualcomm Hexagon NPU with a VBench total score of 81.61, setting a new state-of-the-art for mobile video generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1383. LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ large language models retrieval augmented generation graph retrieval augmented generation efficiency
Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose Linear Graph-based Retrieval-Augmented Generation (LinearRAG), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four benchmark datasets demonstrate that LinearRAG significantly outperforms baseline models. Our code and datasets are available at https://anonymous.4open.science/r/LinearRAG-C205/.
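A rough sketch of a two-stage, relation-free retrieval pass in the spirit of the description above; the data structures and the additive scoring are illustrative assumptions standing in for the paper's "local semantic bridging" and "global importance aggregation".

```python
import numpy as np

def two_stage_retrieve(query_vec, entity_vecs, entity_to_passages, top_entities=20, top_passages=5):
    """entity_vecs: (E, d) embeddings; entity_to_passages[e]: ids of passages mentioning entity e."""
    # (i) activate entities by semantic similarity to the query
    sims = entity_vecs @ query_vec
    active = np.argsort(-sims)[:top_entities]
    # (ii) aggregate activated-entity similarity into passage-level importance
    scores = {}
    for e in active:
        for pid in entity_to_passages[int(e)]:
            scores[pid] = scores.get(pid, 0.0) + float(sims[e])
    return sorted(scores, key=scores.get, reverse=True)[:top_passages]
```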
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1384. Adaptive Regularization for Large-Scale Sparse Feature Embedding Models
๐ฌ 13
โญ 6.00
๐ optimization
๐ท๏ธ adaptive regularization CTR estimation large-scale sparse feature optimization one-epoch overfitting
The one-epoch overfitting problem has drawn widespread attention, especially in CTR and CVR estimation models in search, advertising, and recommendation domains. These models, which rely heavily on large-scale sparse categorical features, often suffer a significant decline in performance when trained for multiple epochs. Although recent studies have proposed heuristic solutions, the fundamental cause of this phenomenon remains unclear. In this work, we present a theoretical explanation grounded in Rademacher complexity, supported by empirical experiments, to explain why overfitting occurs in models with large-scale sparse categorical features. Based on this analysis, we propose a regularization method that adaptively constrains the norm budget of embedding layers. Our approach not only prevents the severe performance degradation observed during multi-epoch training, but also improves model performance within a single epoch. This method has already been deployed in online production systems.
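One simple way such a norm budget could be enforced is to rescale embedding rows after each optimizer step; the sketch below is a hedged illustration in which the log-frequency budget schedule is an assumption, not the paper's Rademacher-derived rule.

```python
import torch

def clip_embedding_rows(embedding, row_counts, base_budget=1.0):
    """Rescale rows of an nn.Embedding whose norm exceeds a frequency-dependent budget.
    embedding: torch.nn.Embedding; row_counts: (num_rows,) occurrence counts."""
    with torch.no_grad():
        budget = base_budget * torch.log1p(row_counts.float()).clamp(min=1.0)  # per-row budget
        norms = embedding.weight.norm(dim=1).clamp(min=1e-12)
        embedding.weight.mul_((budget / norms).clamp(max=1.0).unsqueeze(1))
```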
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1385. DAG-Math: Graph-Guided Mathematical Reasoning in LLMs
๐ฌ 23
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs mathematical reasoning directed acyclic graphs
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a certain rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce **logical closeness**, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's output) adheres to the DAG structure, providing evaluation beyond classical PASS@$k$ metrics. Building on this, we introduce the *DAG-MATH* CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families, even when PASS@$k$ is comparable, highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework provides a balance between free-form CoT and formal proof systems, offering actionable diagnostics for LLM reasoning evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1386. Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
๐ฌ 12
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Temporal Knowledge Graph Inductive Learning
Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time-aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed-world assumption, which fails to account for emerging entities not present during training. Notably, these entities continuously join the network without historical interactions. Empirical study reveals that emerging entities are widespread in TKGs, comprising roughly 25\% of all entities. The absence of historical interactions for these entities leads to significant performance degradation in reasoning tasks. In contrast, we observe that entities with semantic similarities often exhibit comparable interaction histories, suggesting the presence of transferable temporal patterns. Inspired by this insight, we propose TransFIR (Transferable Inductive Reasoning), a novel framework that leverages historical interaction sequences from semantically similar known entities to support inductive reasoning. Specifically, we propose a codebook-based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from similar entities. Experimental results demonstrate that TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6\% in Mean Reciprocal Rank (MRR) across multiple datasets. The implementations are available at https://anonymous.4open.science/r/TransFIR-C72F.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1387. Reversible PrimitiveโComposition Alignment for Continual VisionโLanguage Learning
๐ฌ 18
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ continual learning vision-language models catastrophic forgetting
Vision-language (VL) models are increasingly deployed in non-stationary settings, yet under sequential adaptation they often preserve primitive recognition while losing compositional structure, especially with tight rehearsal budgets and no task IDs. We address this gap by asking how a continual VL system can maintain structurally dependable behaviour while safeguarding zero-shot performance. We introduce Compo-ReAlign, a structure-first recipe built around three components: a reversible composer that maps primitive embeddings to compositions by design, a multi-positive InfoNCE that jointly aligns textual and composed views of the same target, and a spectral trust region that clips updates when alignment sensitivity inflates. Across compositional DIL and multi-domain MTIL retrieval, Compo-ReAlign sets a new state of the art, improves over the strongest prior by +2.4 R@1, and reduces forgetting by 40%. We provide a compact, reversible alignment head with geometry-aware training for compositionally robust VL continual learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1388. Efficient Zero-shot Inpainting with Decoupled Diffusion Guidance
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion models zero-shot guidance
Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields Gaussian posterior transitions which are simple and efficient to sample, sidestepping backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost.
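For context, the classic backprop-free baseline for zero-shot inpainting is the RePaint-style combine step shown below, which stitches a noised copy of the observation into each reverse sample. It is included only to contrast with gradient-based surrogate guidance; the paper's decoupled Gaussian surrogate is a different construction.

```python
import torch

def inpaint_combine_step(x_t, reverse_step, y, mask, alpha_bar_prev):
    """reverse_step(x_t) -> x_{t-1} sample from the unconditional model.
    y: clean observation; mask = 1 on observed pixels; alpha_bar_prev: scalar tensor."""
    x_unknown = reverse_step(x_t)                                    # model-generated content
    noise = torch.randn_like(y)
    x_known = alpha_bar_prev.sqrt() * y + (1 - alpha_bar_prev).sqrt() * noise
    return mask * x_known + (1 - mask) * x_unknown                   # stitch observed and generated parts
```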
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1389. CTBench: Cryptocurrency Time Series Generation Benchmark
๐ฌ 9
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Time Series Generation Crypto-centric Benchmark Cryptocurrency Markets Financial Evaluation Measure Suite
Synthetic time series are vital for data augmentation, stress testing, and prototyping in quantitative finance. Yet in cryptocurrency markets, characterized by 24/7 trading, extreme volatility, and rapid regime shifts, existing Time Series Generation (TSG) methods and benchmarks often fall short, jeopardizing practical utility. Most prior work targets non-financial or traditional financial domains, focuses narrowly on classification and forecasting while neglecting crypto-specific complexities, and lacks critical financial evaluations, particularly for trading applications. To bridge these gaps, we introduce \textbf{CTBench}, the first \textbf{C}ryptocurrency \textbf{T}ime series generation \textbf{Bench}mark. It curates an open-source dataset of 452 tokens and evaluates models across 13 metrics spanning forecasting accuracy, rank fidelity, trading performance, risk assessment, and computational efficiency. A key innovation is a dual-task evaluation framework: the Predictive Utility measures how well synthetic data preserves temporal and cross-sectional patterns for forecasting, while the Statistical Arbitrage assesses whether reconstructed series support mean-reverting signals for trading. We systematically benchmark eight state-of-the-art models from five TSG families across four market regimes, revealing trade-offs between statistical quality and real-world profitability. Notably, CTBench provides ranking analysis and practical guidance for deploying TSG models in crypto analytics and trading applications. The source code is available at \url{https://anonymous.4open.science/r/CTBench-F5A3/}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1390. CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Bilateral Grid Appearance Harmonization 3D Reconstruction
Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial individually, often introduce photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade novel view synthesis.
Joint optimization of scene-specific representations and per-image appearance embeddings has been proposed to address this issue, but at the cost of increased computational complexity and slower training.
In this work, we propose a generalizable, feed-forward approach that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner. Our model processes hundreds of frames in a single step, enabling efficient large-scale harmonization, and seamlessly integrates into downstream 3D reconstruction models, providing cross-scene generalization without requiring scene-specific retraining. To overcome the lack of paired data, we employ a hybrid self-supervised rendering loss leveraging 3D foundation models, improving generalization to real-world variations.
Extensive experiments show that our approach outperforms or matches the reconstruction quality of existing scene-specific optimization methods with appearance modeling, without significantly affecting the training time of baseline 3D models.
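For readers unfamiliar with bilateral grids, the sketch below shows the usual HDRNet-style way a predicted grid of per-cell affine colour transforms is sliced and applied to an image. The grid layout, the choice of luminance as guidance, and all shapes are assumptions of this sketch and may differ from the paper's parameterisation.

```python
import torch
import torch.nn.functional as F

def apply_bilateral_grid(image, grid):
    """Slice a bilateral grid of affine colour transforms and apply it per pixel.

    image: (B, 3, H, W) input frame in [0, 1]
    grid:  (B, 12, D, Gh, Gw) a 3x4 affine transform stored at each grid cell
    """
    B, _, H, W = image.shape
    # Guidance value (luminance here) picks the depth slice per pixel.
    guide = image.mean(dim=1, keepdim=True)                        # (B, 1, H, W)

    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    xs = xs.expand(B, H, W)
    ys = ys.expand(B, H, W)
    zs = guide.squeeze(1) * 2 - 1                                   # map [0,1] -> [-1,1]
    coords = torch.stack([xs, ys, zs], dim=-1).unsqueeze(1)         # (B, 1, H, W, 3)

    # Trilinear slice: one 3x4 affine matrix per pixel.
    affine = F.grid_sample(grid, coords, align_corners=True)        # (B, 12, 1, H, W)
    affine = affine.squeeze(2).reshape(B, 3, 4, H, W)

    rgb1 = torch.cat([image, torch.ones_like(guide)], dim=1)        # (B, 4, H, W)
    return torch.einsum("bijhw,bjhw->bihw", affine, rgb1)

out = apply_bilateral_grid(torch.rand(1, 3, 64, 64), torch.randn(1, 12, 8, 16, 16))
```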
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1391. FieryGS: In-the-Wild Fire Synthesis with Physics-Integrated Gaussian Splatting
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting Physics Simulation Combustion Simulation Novel View Synthesis
We consider the problem of synthesizing photorealistic, physically plausible combustion effects in in-the-wild 3D scenes. Traditional CFD and graphics pipelines can produce realistic fire effects but rely on handcrafted geometry, expert-tuned parameters, and labor-intensive workflows, limiting their scalability to the real world. Recent scene modeling advances like 3D Gaussian Splatting (3DGS) enable high-fidelity real-world scene reconstruction, yet lack physical grounding for combustion. To bridge this gap, we propose FieryGS, a physically-based framework that integrates physically-accurate and user-controllable combustion simulation and rendering within the 3DGS pipeline, enabling realistic fire synthesis for real scenes. Our approach tightly couples three key modules: (1) multimodal large-language-model-based physical material reasoning, (2) efficient volumetric combustion simulation, and (3) a unified renderer for fire and 3DGS. By unifying reconstruction, physical reasoning, simulation, and rendering, FieryGS removes manual tuning and automatically generates realistic, controllable fire dynamics consistent with scene geometry and materials. Our framework supports complex combustion phenomena, including flame propagation, smoke dispersion, and surface carbonization, with precise user control over fire intensity, airflow, ignition location, and other combustion parameters. Evaluated on diverse indoor and outdoor scenes, FieryGS outperforms all comparative baselines in visual realism, physical fidelity, and controllability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1392. Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Architectures
๐ฌ 21
โญ 6.00
๐ learning theory
๐ท๏ธ learning dynamics gradient flow simplicity bias
Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1393. Bayesian Ensemble for Sequential Decision-Making
๐ฌ 11
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Ensemble Methods Reinforcement Learning
Ensemble learning is a practical family of methods for uncertainty modeling, particularly useful for sequential decision-making problems like recommendation systems and reinforcement learning tasks. The posterior on likelihood parameters is approximated by sampling an ensemble member from a predetermined index distribution, with the ensemble's diversity reflecting the degree of uncertainty. In this paper, we propose Bayesian Ensemble (BE), a lightweight yet principled Bayesian layer atop existing ensembles. BE treats the selection of an ensemble member as a bandit problem in itself, dynamically updating a sampling distribution over members via Bayesian inference on observed rewards. This contrasts with prior works that rely on fixed, uniform sampling. We extend this framework to both bandit learning and reinforcement learning, introducing Bayesian Ensemble Bandit and Bayesian Ensemble Deep Q-Network for diverse decision-making problems. Extensive experiments on both synthetic and real-world environments demonstrate the effectiveness and efficiency of BE.
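A minimal sketch of the core idea, assuming a Bernoulli reward and a Beta posterior per member (the paper's actual posterior and reward model are not specified in the abstract): member selection becomes Thompson sampling over the ensemble.

```python
import numpy as np

class BayesianEnsembleSelector:
    """Treat the choice of ensemble member as a Bernoulli bandit and keep a
    Beta posterior over each member's success probability (illustrative only)."""

    def __init__(self, n_members, seed=0):
        self.alpha = np.ones(n_members)   # Beta prior successes
        self.beta = np.ones(n_members)    # Beta prior failures
        self.rng = np.random.default_rng(seed)

    def select(self):
        # Thompson sampling: draw one sample per member, act with the argmax.
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, member, reward):
        # reward assumed binarised into [0, 1]
        self.alpha[member] += reward
        self.beta[member] += 1.0 - reward

# Usage sketch: pick a member, run it for one decision step, feed back the reward.
selector = BayesianEnsembleSelector(n_members=5)
m = selector.select()
selector.update(m, reward=1.0)
```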
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1394. Convex Dominance in Deep Learning: A Scaling Law of Loss and Learning Rate
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Convex optimization Scaling law Learning rate transfer
Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via the learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as $80\times$ across training horizons and $70\times$ across model sizes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1395. Reasoning in Space via Grounding in the World
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ 3d spatial reasoning 3d visual grounding
In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the $\textit{Grounded-Spatial Reasoner (GS-Reasoner)}$ to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective \emph{dual-path pooling} mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without extra tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the $\textit{Grounded Chain-of-Thought (GCoT)}$ dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1396. Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models
๐ฌ 21
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Multi-Commodity Flow Multimodal Language Models Resource Allocation
The multi-commodity flow (MCF) problem is a fundamental topic in network flow and combinatorial optimization, with broad applications in transportation, communication, logistics, and beyond. Nowadays, the rapid expansion of allocation systems has posed challenges for existing optimization engines in balancing optimality and tractability. In this paper, we present Pram, the first ML-based method that leverages the reasoning power of multimodal language models (MLMs) for addressing this trade-off dilemma, a pressing need for service providers. As part of our proposal, Pram (i) quickly computes high-quality allocations by dividing the original problem into local subproblems, which are then resolved by an MLM-powered "agent", and (ii) ensures global consistency by harmonizing these subproblems via a multi-agent reinforcement learning algorithm. Theoretically, we show that Pram, which learns to perform gradient descent in context, provably converges to the optimum within the family of MCF problems.
Empirically, on real-world datasets and public topologies, Pram achieves performance comparable to, and in some cases even surpassing, that of linear programming solvers (very close to the optimal solution), with substantially lower runtimes (one to two orders of magnitude faster).
Moreover, Pram exhibits strong robustness (<10\% performance degradation under failures or bursts), demonstrating MLM's generalization ability to unforeseen events.
Our anonymous codebase is available at https://anonymous.4open.science/r/Pram, with experimental datasets attached in the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1397. Self-Speculative Masked Diffusions
๐ฌ 18
โญ 6.00
๐ generative models
๐ท๏ธ mask diffusion generative models speculative decoding speculative sampling LLM
We present self-speculative masked diffusions, a new class of masked diffusion generative models for discrete data that require significantly fewer function evaluations to generate samples. Standard masked diffusion models predict factorized logits over currently masked positions. A number of masked positions are then sampled; however, the factorization approximation means that sampling too many positions in one go leads to poor sample quality. As a result, many simulation steps, and therefore neural network function evaluations, are required to generate high-quality data. We reduce the computational burden by generating \emph{non-factorized} predictions over masked positions. This is achieved by modifying the final transformer attention mask from non-causal to causal, enabling draft token generation and parallel validation via a novel, model-integrated speculative sampling mechanism. This results in a non-factorized predictive distribution over masked positions in a single forward pass. We apply our method to GPT-2-scale text modelling and protein sequence generation, finding that we can achieve a ~2x reduction in the required number of network forward passes relative to standard masked diffusion models.
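For context, the sketch below shows the standard speculative-sampling accept/reject rule that draft-and-verify schemes of this kind rely on. The paper integrates drafting and validation inside a single masked-diffusion forward pass, which this generic, stand-alone version does not capture.

```python
import torch

def speculative_accept(p_draft, p_target, draft_tokens, generator=None):
    """Standard speculative-sampling validation: accept each draft token with
    probability min(1, p_target / p_draft); on the first rejection, resample
    from the residual distribution max(p_target - p_draft, 0), renormalised.

    p_draft, p_target: (T, V) per-position distributions
    draft_tokens:      (T,) tokens proposed by the draft pass
    """
    accepted = []
    for t, tok in enumerate(draft_tokens.tolist()):
        ratio = p_target[t, tok] / p_draft[t, tok].clamp_min(1e-12)
        if torch.rand((), generator=generator) < ratio.clamp(max=1.0):
            accepted.append(tok)
        else:
            residual = (p_target[t] - p_draft[t]).clamp_min(0.0)
            residual = residual / residual.sum().clamp_min(1e-12)
            accepted.append(int(torch.multinomial(residual, 1)))
            break  # everything after the first rejection is discarded
    return accepted
```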
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1398. Demystifying Emergent Exploration in Goal-Conditioned RL
๐ฌ 11
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Goal-Conditioned RL Contrastive RL Emergent exploration Cognitive interpretability
In this work, we take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning. We study Single-Goal Contrastive Reinforcement Learning (SGCRL) (Liu et al., 2025), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula. We combine theoretical analysis of the algorithm's objective function with controlled experiments to understand what drives its exploration. We show that SGCRL maximizes implicit rewards shaped by its learned representations. These representations automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter. Our experiments also demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1399. Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction
๐ฌ 9
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Single Image Face Reconstruction Face Tracking Foundation Model Finetuning
We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the state-of-the-art (SoTA) by over 15\% in terms of geometric accuracy for posed facial expressions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1400. Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models
๐ฌ 3
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Diffusion Language Model Training-Free
Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1401. CHAMMI-75: pre-training multi-channel models with heterogeneous microscopy images
๐ฌ 21
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ microscopy representation learning multi-channel imaging self-supervised learning biology
Quantifying cell morphology using images and machine learning has proven to be a powerful tool to study the response of cells to treatments. However, the models used to quantify cellular morphology are typically trained with a single microscopy imaging type and under controlled experimental conditions. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different numbers of channels), or because the target experimental conditions are out of distribution. Here, we present CHAMMI-75, a dataset of 2.8M heterogeneous, multi-channel microscopy images drawn from 75 diverse biological studies. We curated this resource from publicly available sources to investigate cellular morphology models that are channel-adaptive and can process any microscopy image type. Our experiments show that training with CHAMMI-75 can improve performance in multi-channel bioimaging tasks, opening the way to create the next generation of cellular morphology models for biological studies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1402. Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering
๐ฌ 12
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Uncertainty Quantification LLMs RAG Contextual QA Hallucinations
Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify \emph{epistemic uncertainty}. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: \emph{context-reliance} (using the provided context rather than parametric knowledge), \emph{context comprehension} (extracting relevant information from context), and \emph{honesty} (avoiding intentional lies). Using a top-down interpretability approach, we extract these features by using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring a negligible inference overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1403. Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
๐ฌ 36
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reasoning Paradigms Parallel Thinking RL LLM
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging. Existing methods mainly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced learning rather than exploration and generalization. To address this issue, we propose **Parallel-R1**, the first reinforcement learning (RL) framework that instills parallel thinking for complex real-world reasoning tasks. Our framework employs a progressive curriculum that addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking behavior, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully elicits parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on difficult tasks with RL. Further analysis reveals a distinct shift in the model's thinking patterns: in the early stage, it utilizes parallel thinking as an exploration strategy, while in the later stage, it employs this ability for multi-perspective verification.
Most significantly, we validate parallel thinking as a **mid-training exploration scaffold**, where this intermediate phase unlocks a higher performance ceiling after RL, yielding a **42.9%** improvement over the sequential RL baseline.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1404. Interaction Field Matching: Overcoming Limitations of Electrostatic Models
๐ฌ 17
โญ 6.00
๐ generative models
๐ท๏ธ generative models distribution transfer electrostatics
Electrostatic field matching (EFM) has recently appeared as a novel physics-inspired paradigm for data generation and transfer using the idea of an electric capacitor. However, it requires modeling electrostatic fields using neural networks, which is non-trivial because of the necessity to take into account the complex field outside the capacitor plates. In this paper, we propose Interaction Field Matching (IFM), a generalization of EFM which allows using general interaction fields beyond the electrostatic one. Furthermore, inspired by strong interactions between quarks and antiquarks in physics, we design a particular interaction field realization which solves the problems that arise when modeling electrostatic fields in EFM. We demonstrate its performance on a series of toy and image data transfer problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1405. ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
๐ฌ 12
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Object-Scene Composition Gaussian Splatting Surface Octahedral Probes
Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields causes inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360° reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 26 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. The code and dataset will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1406. Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints
๐ฌ 25
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Entropy Continuous Control Large Language Models Image Classification
We propose ERA, a new paradigm for entropy-constrained policies via output activation. It guarantees a minimum sampling entropy by transforming the outputs of the last layer. Our approach demonstrates broad effectiveness across different domains: 1) for large language models (LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
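As a hedged illustration of an entropy-constraining output activation (not the paper's exact ERA transform), the sketch below enforces a minimum differential entropy for a diagonal Gaussian policy by uniformly shifting the predicted log-standard-deviations.

```python
import math
import torch

def entropy_floor_gaussian(mean, raw_log_std, min_entropy):
    """Shift the per-dimension log-stds of a diagonal Gaussian policy so that
    its total differential entropy never falls below `min_entropy`.
    Illustrative only; the paper's transform is not reproduced here.
    """
    d = mean.shape[-1]
    const = 0.5 * d * math.log(2 * math.pi * math.e)
    entropy = raw_log_std.sum(dim=-1) + const                   # H of N(mean, diag(std^2))
    deficit = torch.clamp(min_entropy - entropy, min=0.0) / d   # growth needed per dimension
    log_std = raw_log_std + deficit.unsqueeze(-1)
    return mean, log_std

# Usage: the policy head outputs (mean, raw_log_std); sampling uses the adjusted log_std.
mean, log_std = entropy_floor_gaussian(
    torch.zeros(2, 6), torch.full((2, 6), -3.0), min_entropy=0.0)
```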
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1407. Single-stream Policy Optimization
๐ฌ 4
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Single-stream Policy Optimization Large Language Models Reinforcement Learning
We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by $+3.4\ \text{percentage points} (\mathrm{pp})$ over GRPO, driven by substantial absolute point gains on challenging datasets, including $+7.3\ \mathrm{pp}$ on BRUMO 25, $+4.4\ \mathrm{pp}$ on AIME 25, $+3.3\ \mathrm{pp}$ on HMMT 25, and achieves consistent relative gain in pass@$k$ across the evaluated $k$ values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
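A minimal sketch of the group-free baseline idea, with a plain EMA standing in for the paper's KL-adaptive update rate (an assumption of this sketch): one persistent value estimate per prompt, and advantages normalised globally over the batch rather than per group.

```python
import torch

class PersistentValueTracker:
    """One persistent value estimate per prompt, updated from each new reward.
    SPO's tracker adapts its rate to a KL signal; a fixed EMA rate stands in
    for that here (an assumption of this sketch)."""

    def __init__(self, ema_rate=0.1):
        self.values = {}          # prompt_id -> running value estimate
        self.ema_rate = ema_rate

    def advantage(self, prompt_ids, rewards):
        baselines = torch.tensor(
            [self.values.get(pid, 0.0) for pid in prompt_ids])
        adv = rewards - baselines
        # Global normalisation across the whole batch (no per-group statistics).
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        for pid, r in zip(prompt_ids, rewards.tolist()):
            v = self.values.get(pid, r)
            self.values[pid] = (1 - self.ema_rate) * v + self.ema_rate * r
        return adv

tracker = PersistentValueTracker()
adv = tracker.advantage(["p1", "p2", "p3"], torch.tensor([1.0, 0.0, 1.0]))
```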
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1408. Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
๐ฌ 22
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Long-Context LLM Position Embedding
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-range dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases.
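To make the idea concrete, the sketch below computes rotary attention scores while keeping both the real and imaginary parts of the complex dot product. The linear mixing weight `lam` is an illustrative assumption, since the paper's exact dual-component combination rule is not given here.

```python
import torch

def rope_complex_scores(q, k, positions, lam=0.5, base=10000.0):
    """Attention scores from rotary embeddings that keep both components of the
    complex dot product (standard RoPE keeps only the real part).

    q, k:      (T, D) query/key vectors, D even
    positions: (T,) integer token positions
    """
    T, D = q.shape
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)   # (D/2,)
    angles = positions.float()[:, None] * freqs[None, :]                # (T, D/2)
    rot = torch.polar(torch.ones_like(angles), angles)                  # e^{i*theta}

    # Pair adjacent channels into complex numbers and rotate by position.
    qc = torch.view_as_complex(q.float().reshape(T, D // 2, 2).contiguous()) * rot
    kc = torch.view_as_complex(k.float().reshape(T, D // 2, 2).contiguous()) * rot

    inner = torch.einsum("id,jd->ij", qc, kc.conj())                    # complex (T, T)
    return (inner.real + lam * inner.imag) / (D ** 0.5)

scores = rope_complex_scores(torch.randn(8, 64), torch.randn(8, 64), torch.arange(8))
```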
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1409. DistDF: Time-series Forecasting Needs Joint-distribution Wasserstein Alignment
๐ฌ 16
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ time-series forecasting
Training time-series forecast models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach seeks to minimize the conditional negative log-likelihood of the label sequence, typically estimated using the mean squared error. However, this estimation proves to be biased in the presence of label autocorrelation. In this paper, we propose DistDF, which instead achieves alignment by minimizing a discrepancy between the conditional forecast and label distributions. Because conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a newly proposed joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. This discrepancy admits tractable, differentiable estimation from empirical samples and integrates seamlessly with gradient-based training. Extensive experiments show that DistDF improves the performance of diverse forecast models and achieves state-of-the-art forecasting performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1410. A High Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
๐ฌ 12
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ high-quality dataset multimodal dataset interleaved image-text synergy interleaved evaluation
Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation.
However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets.
To address this, we introduce \textbf{InterSyn}, a dataset that features:
(1) large scale, comprising 1.8M multimodal samples;
(2) high quality, supported by our proposed \textbf{Self-Evaluation with Iterative Refinement (SEIR)} method for rigorous automated quality refinement;
(3) rich instructional diversity, ensured through diverse well-designed question templates, based on human preferences and covering a 3500-topic hierarchy.
These characteristics make InterSyn particularly well-suited for training LMMs with interactive image-text generation capabilities.
To evaluate these capabilities, we propose \textbf{SynJudge}, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS).
These scores are complementary, covering both content and quality as well as cross-modal interaction, thereby forming a comprehensive evaluation framework.
Experimental results on InterSyn subsets of up to 200K samples show that 25K–50K already yield substantial improvements, while scaling to 100K/200K brings further gains in TCC, ICC, and especially ITS, highlighting InterSyn's:
(1) scalability, as performance consistently improves with more data;
(2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1411. ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting
๐ฌ 11
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Deep Learning; Spatiotemporal Analysis; Weather Forecasting
Weather forecasting is a fundamental task in spatiotemporal data analysis, with broad applications across a wide range of domains. Existing data-driven forecasting methods typically model atmospheric dynamics over a fixed short time interval, e.g., 6 hours, and rely on naive autoregression-based rollout for long-term forecasting, e.g., 5 days. However, this paradigm suffers from two key limitations: (1) it often inadequately models the spatial and multi-scale temporal dependencies inherent in global weather systems, and (2) the rollout strategy struggles to balance error accumulation with the capture of fine-grained atmospheric variations. In this study, we propose ARROW, an Adaptive-Rollout Multi-scale temporal Routing method for Global Weather Forecasting. To contend with the first limitation, we construct a multi-interval forecasting model that forecasts weather across different time intervals. Within the model, the Shared-Private Mixture-of-Experts captures both shared patterns and specific characteristics of atmospheric dynamics across different time scales, while Ring Positional Encoding accurately encodes the circular latitude structure of the Earth when representing spatial information. For the second limitation, we develop an adaptive rollout scheduler based on reinforcement learning, which selects the most suitable time interval to forecast according to the current weather state. Experimental results demonstrate that ARROW achieves state-of-the-art performance in global weather forecasting, establishing a promising paradigm in this field.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1412. ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Omni-modal large language models training-free guidance decoding language model reasoning
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1413. Riemannian Variational Flow Matching for Material and Protein Design
๐ฌ 30
โญ 6.00
๐ generative models
๐ท๏ธ Flow matching variational inference riemannian manifolds material generation metal-organic framework protein backbone generation
We present Riemannian Gaussian Variational Flow Matching (RG-VFM), a geometric extension of Variational Flow Matching (VFM) for generative modeling on manifolds. Motivated by the benefits of VFM, we derive a variational flow matching objective for manifolds with closed-form geodesics based on Riemannian Gaussian distributions. Crucially, in Euclidean space, predicting endpoints (VFM), velocities (FM), or noise (diffusion) is largely equivalent due to affine interpolations. However, on curved manifolds this equivalence breaks down. For this reason, we formally analyze the relationship between our model and Riemannian Flow Matching (RFM), revealing that the RFM objective lacks a curvature-dependent penalty -- encoded via Jacobi fields -- that is naturally present in RG-VFM. Based on this relationship, we hypothesize that endpoint prediction provides a stronger learning signal by directly minimizing geodesic distances. Experiments on synthetic spherical and hyperbolic benchmarks, as well as real-world tasks in material and protein generation, demonstrate that RG-VFM more effectively captures manifold structure and improves downstream performance over Euclidean and velocity-based baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1414. RLAP-CLIP: Continual Multimodal Learning with Prototype Adaptation and Difficulty-Aware Routing
๐ฌ 5
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Multimodal Learning; Prototype Optimization; Mixture-of-Experts
Vision-language models, such as CLIP, achieve strong zero-shot performance through contrastive pre-training but face significant challenges in class-incremental image classification scenarios. When learning new tasks sequentially, current methods suffer from degradation in prototype quality due to passive averaging and underutilize their visual adaptation capabilities. We propose RLAP-CLIP, which addresses these limitations through three components. First, Reinforcement Learning-based Prototype Optimization (RLPO) formulates prototype construction as a reinforcement learning problem to actively optimize class separability rather than relying on simple averaging. Second, difficulty-aware cross-modal fusion uses a mixture-of-experts to route samples through specialized processing pathways based on complexity. Third, dual-modal prompting balances visual and textual adaptation. Experiments on eight image classification benchmarks demonstrate consistent improvements, with RLAP-CLIP achieving average accuracy gains of 3.72-4.46 points and final accuracy improvements of 0.49-4.48 points over other methods, validating that RLAP-CLIP achieves state-of-the-art performance. Our source code is available at [RLAP-CLIP](https://anonymous.4open.science/r/197165541613026132779/RLAP-CLIP).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1415. Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models
๐ฌ 14
โญ 6.00
๐ generative models
๐ท๏ธ discrete diffusion masked diffusion math reasoning image generation reinforcement learning GRPO
Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, confounding reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce **MaskGRPO**, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. Across math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, **doubling** reinforcement learning gains while speeding up training by up to **30%**. This study establishes MaskGRPO as a systematic policy optimization approach and the **first** practical policy optimization route for discretized visual diffusion. The code is enclosed in the supplementary material and will be open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1416. Video-GPT via Next Clip Diffusion
๐ฌ 19
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Video; Diffusion; LLM
GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe spatial-temporal details in the visual world, whereas video sequences capture such details well. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next-token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show that our Video-GPT achieves state-of-the-art performance on video prediction, a key factor for world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks in both video generation and understanding, showing strong generalization capacity on downstream tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1417. Scaling Linear Attention with Sparse State Expansion
๐ฌ 21
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Linear Attention Language Model
The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Supported by efficient parallelized implementations, our design achieves effective classification and highly discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.5 on AIME24 and 50.2 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
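A rough sketch of a row-sparse state update of the kind described, assuming a learned classifier matrix over state rows and a top-k softmax selection; the shapes, the classifier `W_cls`, and the update rule are illustrative assumptions, not the paper's exact parameterisation.

```python
import torch

def sparse_state_update(S, k_t, v_t, W_cls, top_k=4):
    """Row-sparse linear-attention state update: a softmax "classifier" scores
    which state rows the incoming token should write to, and only the top-k
    rows are updated.

    S:     (M, Dv) recurrent state with M rows (slots)
    k_t:   (Dk,)  key of the current token
    v_t:   (Dv,)  value of the current token
    W_cls: (M, Dk) scores each state row against the key
    """
    logits = W_cls @ k_t                           # (M,) row scores
    probs = torch.softmax(logits, dim=-1)
    top_p, top_idx = probs.topk(top_k)
    weights = top_p / top_p.sum()                  # renormalise over the selected rows
    S = S.clone()
    S[top_idx] = S[top_idx] + weights[:, None] * v_t[None, :]
    return S

S = sparse_state_update(torch.zeros(64, 128), torch.randn(32), torch.randn(128),
                        torch.randn(64, 32))
```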
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1418. UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
๐ฌ 20
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D clothed human reconstruction image-based reconstruction human digitization SMPL multi-view diffusion model
We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module PCFA, that selectively fuses information from multiple reference images w.r.t. target poses, enabling better identity preservation and nearly constant memory footprint, with more observations. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer-15\%$\\downarrow$, P2S-18\%$\\downarrow$ on PuzzleIOI) and texture fidelity (PSNR-21\%$\\uparrow$, LPIPS 46\%$\\downarrow$ on 4D-Dress). UP2You is efficient (1.5 minutes per person), and versatile (supports arbitrary pose control, and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1419. Multilingual Routing in Mixture-of-Experts
๐ฌ 16
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ mixture-of-expert cross-lingual transfer multilingual model modularity math interpretability LLM model steering
Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
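The intervention lends itself to a very small sketch: add a bias to the router logits of the experts most frequently activated on parallel English data, but only in middle layers. The layer range, bias strength, and number of boosted experts below are assumptions of this sketch, not values from the paper.

```python
import torch

def steer_router_logits(router_logits, english_expert_counts, layer_idx,
                        middle_layers=range(10, 22), alpha=1.0, top_m=8):
    """Boost the routing logits of experts that fire most often on English text,
    only in a band of middle decoder layers (illustrative values).

    router_logits:         (T, E) per-token routing logits at this layer
    english_expert_counts: (E,)  how often each expert fired on parallel English data
    """
    if layer_idx not in middle_layers:
        return router_logits
    boosted = english_expert_counts.topk(top_m).indices
    bias = torch.zeros_like(router_logits[0])
    bias[boosted] = alpha
    return router_logits + bias

logits = steer_router_logits(torch.randn(5, 64), torch.randint(0, 100, (64,)), layer_idx=14)
```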
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1420. WideSearch: Benchmarking Agentic Broad Info-Seeking
๐ฌ 25
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ LLM Evaluation Info-Seeking Benchmark Search Agent
From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0\%, with the best performer reaching just 7\%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100\% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1421. Fisher-Rao Sensitivity for Out-of-Distribution Detection in Deep Neural Networks
๐ฌ 16
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Deep Learning Out-Of Distribution Detection Information Geometry
Deep neural networks often remain overconfident on Out-of-Distribution (OoD) inputs. We revisit this problem through Riemannian information geometry. We model the network's predictions as a statistical manifold and find that OoD inputs exhibit higher local Fisher-Rao sensitivity. By quantifying this sensitivity with the trace of the Fisher Information Matrix (FIM), we derive a unifying geometric connection between two common OoD signals: feature magnitude and output uncertainty. Analyzing the limitations of this multiplicative form, we extend our analysis using a product manifold construction. This provides a theoretical framework for the robust additive scores used in state-of-the-art (SOTA) detectors and motivates our final, competitive method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1422. Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
๐ฌ 23
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ AI for biology Protocol Generation Scientific Reasoning Large Language Model Reinforcement Learning
The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the "Sketch-and-Fill" paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1423. DR-Submodular Maximization with Stochastic Biased Gradients: Classical and Quantum Gradient Algorithms
๐ฌ 11
โญ 6.00
๐ optimization
๐ท๏ธ DR-submodular Maximization Stochastic Biased Gradients Zero-Order Optimization Quantum Gradient Estimation Approximation Algorithms
In this work, we investigate DR-submodular maximization using stochastic biased gradients, which is a more realistic but challenging setting than stochastic unbiased gradients. We first generalize the Lyapunov framework to incorporate biased stochastic gradients, characterizing the adverse impacts of bias and noise. Leveraging this framework, we consider not only conventional constraints but also a novel constraint class: convex sets with a largest element, which naturally arises in applications such as resource allocation. For this constraint, we propose a $1/e$-approximation algorithm for non-monotone DR-submodular maximization, surpassing the $1/4$ hardness bound for general convex constraints. As a direct application of stochastic biased gradients, we consider zero-order DR-submodular maximization and introduce both classical and quantum gradient estimation algorithms. For each constraint we consider, while retaining the same approximation ratio, the iteration complexity of our classical zero-order algorithms is $O(\epsilon^{-3})$, matching that of stochastic unbiased gradients; our quantum zero-order algorithms reach $O(\epsilon^{-1})$ iteration complexity, on par with classical first-order algorithms, demonstrating quantum acceleration, which we validate in numerical experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1424. BioBO: Biology-informed Bayesian Optimization for Perturbation Design
๐ฌ 23
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Bayesian optimization; Biological priors; Perturbation design
Efficient design of genomic perturbation experiments is crucial for accelerating drug discovery and therapeutic target identification, yet exhaustive perturbation of the human genome remains infeasible due to the vast search space of potential genetic interactions and experimental constraints. Bayesian optimization (BO) has emerged as a powerful framework for selecting informative interventions, but existing approaches often fail to exploit domain-specific biological prior knowledge. We propose Biology-Informed Bayesian Optimization (BioBO), a method that integrates Bayesian optimization with multimodal gene embeddings and enrichment analysis, a widely used tool for gene prioritization in biology, to enhance surrogate modeling and acquisition strategies.
BioBO combines biologically grounded priors with acquisition functions in a principled framework, which biases the search toward promising genes while maintaining the ability to explore uncertain regions.
Through experiments on established public benchmarks and datasets, we demonstrate that BioBO improves labeling efficiency by 25-40\%, and consistently outperforms conventional BO by identifying top-performing perturbations more effectively. Moreover, by incorporating enrichment analysis, BioBO yields pathway-level explanations for selected perturbations, offering mechanistic interpretability that links designs to biologically coherent regulatory circuits.
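One simple, hedged way to fold a biological prior into an acquisition function, in the spirit of the description above: weight Expected Improvement by an exponentiated enrichment score. The multiplicative form and the temperature `lam` are assumptions of this sketch, not necessarily BioBO's exact rule.

```python
import numpy as np
from scipy.stats import norm

def prior_weighted_ei(mu, sigma, best_y, prior_score, lam=1.0):
    """Expected Improvement multiplied by an exponentiated, normalised prior.

    mu, sigma:   GP posterior mean/std for each candidate gene
    prior_score: e.g. an enrichment-analysis score per gene, larger = more promising
    """
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    ei = (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)      # maximisation EI
    prior = np.exp(lam * (prior_score - prior_score.max()))     # in (0, 1], numerically stable
    return ei * prior

# Pick the next gene to perturb from surrogate predictions and prior scores.
scores = prior_weighted_ei(np.array([0.2, 0.5, 0.4]), np.array([0.3, 0.1, 0.2]),
                           best_y=0.45, prior_score=np.array([1.0, 0.2, 2.0]))
next_gene = int(np.argmax(scores))
```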
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1425. Interpretable 3D Neural Object Volumes for Robust Conceptual Reasoning
๐ฌ 22
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Interpretability Robustness 3D-aware classification with concepts Sparse volumetric object representation 3D consistency
With the rise of deep neural networks, especially in safety-critical applications, robustness and interpretability are crucial to ensure their trustworthiness. Recent advances in 3D-aware classifiers that map image features to a volumetric representation of objects, rather than relying solely on 2D appearance, have greatly improved robustness on out-of-distribution (OOD) data. However, such classifiers have not yet been studied from the perspective of interpretability. Meanwhile, current concept-based XAI methods often neglect OOD robustness. We aim to address both aspects with CAVE - Concept Aware Volumes for Explanations - a new direction that unifies interpretability and robustness in image classification. We design CAVE as a robust and inherently interpretable classifier that learns sparse concepts from a 3D object representation. We further propose 3D Consistency (3D-C), a metric to measure the spatial consistency of concepts. Unlike existing metrics that rely on human-annotated parts in images, 3D-C leverages ground-truth object meshes as a common surface to project and compare explanations across concept-based methods. CAVE achieves competitive classification performance while discovering consistent and meaningful concepts across images in various OOD settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1426. Spotlight on Token Perception for Multimodal Reinforcement Learning
๐ฌ 5
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Reasoning LVLM Reinforcement Learning
While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose $\textbf{V}$isually-$\textbf{P}$erceptive $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{VPPO}$), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
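A minimal sketch of the dual mechanism described above, under assumed names and thresholds (not the paper's implementation): the trajectory advantage is reweighted by the rollout's overall visual dependency, and the policy-gradient loss is restricted to the most visually dependent tokens.

```python
import torch

def vppo_style_loss(logprobs, advantage, token_dependency, top_frac=0.2):
    """Hypothetical sketch of a VPPO-style token-selective policy gradient.
    logprobs, token_dependency: (T,) tensors for one rollout; advantage: scalar."""
    traj_dep = token_dependency.mean()                 # overall visual dependency
    weighted_adv = advantage * traj_dep                # advantage reweighting
    k = max(1, int(top_frac * token_dependency.numel()))
    pivotal = torch.zeros_like(token_dependency, dtype=torch.bool)
    pivotal[token_dependency.topk(k).indices] = True   # perceptually pivotal tokens
    return -(weighted_adv * logprobs[pivotal]).mean()  # update only on pivotal tokens

loss = vppo_style_loss(torch.randn(32), torch.tensor(1.3), torch.rand(32))
```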
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1427. CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Chain-of-Thought (CoT); Task Vectors; Model Steering; Large Language Models (LLMs)
Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher-student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.
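A minimal sketch of the generic steering-vector mechanism that task/CoT vectors rely on: a vector added to a chosen layer's hidden states via a forward hook. The toy module, layer index, and the way the vector is obtained are illustrative assumptions; the paper's extraction and teacher-student training are not shown.

```python
import torch
import torch.nn as nn

hidden = 64
model = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(6)])  # toy stand-in
cot_vector = nn.Parameter(torch.zeros(hidden))   # would be extracted or learned

def add_cot_vector(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + cot_vector                   # steer this layer's hidden states

handle = model[3].register_forward_hook(add_cot_vector)  # inject at one layer
out = model(torch.randn(2, hidden))
handle.remove()
```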
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1428. Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
๐ฌ 6
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ World Action Model; Embodied AI; Vision-language-action; Robotic Manipulation
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that jointly learns visual representations and action policies within a single video-generative framework. At its core, GE-Base is a large-scale instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Building on this foundation, GE-Act employs a lightweight flow-matching decoder to map latent representations into executable action trajectories, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. Trained on over 1 million manipulation episodes, GE supports both short- and long-horizon tasks, and generalizes across embodiments. All code, models, and benchmarks will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1429. The Forecast After the Forecast: A Post-Processing Shift in Time Series
๐ฌ 19
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting Post-Processing Fine-Tuning
Time series forecasting has long been dominated by advances in model architecture, with recent progress driven by deep learning and hybrid statistical techniques. However, as forecasting models approach diminishing returns in accuracy, a critical yet underexplored opportunity emerges: the strategic use of post-processing. In this paper, we address the last-mile gap in time-series forecasting, which is to improve accuracy and uncertainty without retraining or modifying a deployed backbone. We propose $\delta$-Adapter, a lightweight, architecture-agnostic way to boost deployed time series forecasters without retraining. $\delta$-Adapter learns tiny, bounded modules at two interfaces: input nudging (soft edits to covariates) and output residual correction. We provide local descent guarantees, $O(\delta)$ drift bounds, and compositional stability for combined adapters.
Meanwhile, it can act as a feature selector by learning a sparse, horizon-aware mask over inputs to select important features, thereby improving interpretability.
In addition, it can serve as a distribution calibrator for uncertainty quantification. To this end, we introduce a Quantile Calibrator and a Conformal Corrector that together deliver calibrated, personalized intervals with finite-sample coverage.
Our experiments across diverse backbones and datasets show that $\delta$-Adapter improves accuracy and calibration with negligible compute and no interface changes.
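A minimal sketch of the two adapter interfaces named above (input nudging and output residual correction) around a frozen backbone. Shapes, module names, and the tanh-based delta bound are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DeltaAdapter(nn.Module):
    """Sketch of a bounded, architecture-agnostic adapter around a deployed forecaster."""
    def __init__(self, n_features, horizon, delta=0.1):
        super().__init__()
        self.delta = delta
        self.in_nudge = nn.Linear(n_features, n_features)   # soft edits to covariates
        self.out_corr = nn.Linear(horizon, horizon)         # output residual correction

    def forward(self, x, backbone):
        # x: (batch, lookback, n_features); backbone params stay frozen,
        # but gradients still flow through it to the adapter modules.
        nudged = x + self.delta * torch.tanh(self.in_nudge(x))      # bounded input edit
        base = backbone(nudged)                                     # (batch, horizon)
        return base + self.delta * torch.tanh(self.out_corr(base))  # bounded residual fix

# toy usage with a frozen linear "backbone"
backbone = nn.Sequential(nn.Flatten(), nn.Linear(24 * 3, 12))
for p in backbone.parameters():
    p.requires_grad_(False)
adapter = DeltaAdapter(n_features=3, horizon=12)
pred = adapter(torch.randn(8, 24, 3), backbone)
```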
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1430. Natural Identifiers for Privacy and Data Audits in Large Language Models
๐ฌ 25
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ privacy auditing natural identifiers dataset inference differential privacy LLMs
Assessing the privacy of large language models (LLMs) presents significant challenges. In particular, most existing methods for auditing *differential privacy* require the insertion of specially crafted canary data *during training*, making them impractical for auditing already-trained models without costly retraining. Additionally, *dataset inference*, which audits whether a suspect dataset was used to train a model, is *infeasible* without access to a private non-member held-out dataset. Yet, such held-out datasets are often unavailable or difficult to construct for real-world cases since they have to be from the same distribution (IID) as the suspect data. These limitations severely hinder the ability to conduct scalable, *post-hoc* audits. To enable such audits, this work introduces **natural identifiers (NIDs)** as a novel solution to the above-mentioned challenges. NIDs are structured random strings, such as cryptographic hashes and shortened URLs, naturally occurring in common LLM training datasets. Their format enables the generation of unlimited additional random strings from the same distribution, which can act as alternative canaries for audits and as same-distribution held-out data for dataset inference. Our evaluation highlights that indeed, using NIDs, we can facilitate post-hoc differential privacy auditing *without any retraining* and enable dataset inference for any suspect dataset containing NIDs without the need for a private non-member held-out dataset.
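A small sketch of what NID-format canaries could look like: fresh random strings drawn from the same format family as identifiers already present in training corpora (e.g., hex digests or short-URL slugs). The formats, alphabet, and URL prefix below are illustrative assumptions, not the paper's exact construction.

```python
import hashlib
import secrets

def fresh_hash_canary() -> str:
    # 64 hex characters, matching the format of SHA-256 digests seen in corpora
    return hashlib.sha256(secrets.token_bytes(32)).hexdigest()

def fresh_short_url_canary(length: int = 7) -> str:
    # hypothetical short-URL slug format
    alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return "https://short.example/" + "".join(secrets.choice(alphabet) for _ in range(length))

canaries = [fresh_hash_canary() for _ in range(3)] + [fresh_short_url_canary()]
print(canaries)  # same distribution as in-corpus NIDs, but guaranteed non-members
```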
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1431. Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
๐ฌ 32
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ unified model; generation helps understanding; 3d scene understanding; novel view synthesis
This paper presents Omni-View, which extends the unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that ``generation facilitates understanding". Consisting of understanding model, texture module, and geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the modelโs holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1432. High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning
๐ฌ 3
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs hallucination abstention
Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
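A minimal sketch of how a HALT-style response could be assembled from scored fragments: keep fragments while estimated prefix correctness stays above a threshold, then abstain with "Unsure from Here". The threshold rule and helper names are illustrative assumptions; only the fragment/abstention idea follows the abstract.

```python
def halt_response(fragments, min_prefix_correctness=1.0):
    """fragments: list of (text, is_correct) pairs, in order."""
    kept, correct = [], 0
    for i, (text, is_correct) in enumerate(fragments, start=1):
        correct += int(is_correct)
        if correct / i < min_prefix_correctness:
            kept.append("Unsure from Here")   # partial abstention marker
            break
        kept.append(text)
    return " ".join(kept)

frags = [("Marie Curie was born in 1867.", True),
         ("She won two Nobel Prizes.", True),
         ("She was born in Vienna.", False)]
print(halt_response(frags))  # truncates before the incorrect fragment and abstains
```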
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1433. Adaptive Width Neural Networks
๐ฌ 13
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Neural Networks Learning the Number of Neurons Adaptive Width Learning Dynamic Architectures Information Compression Variational Inference
For almost 70 years, researchers have typically selected the width of neural networks' layers either manually or through automated hyperparameter tuning methods such as grid search and, more recently, neural architecture search. This paper challenges the status quo by introducing an easy-to-use technique to learn an \textit{unbounded} width for a neural network's layers \textit{during training}. The method jointly optimizes the width and the parameters of each layer via standard backpropagation. We apply the technique to a broad range of data domains such as tables, images, text, sequences, and graphs, showing how the width adapts to the task's difficulty. A by-product of our width-learning approach is easy truncation of the trained network at virtually zero cost, yielding a smooth trade-off between performance and compute resources. Alternatively, one can dynamically compress the network for as long as performance does not degrade.
For recent foundation models trained on large datasets, where billions of parameters and huge training costs make hyper-parameter tuning infeasible, our approach offers a viable alternative for learning layer widths.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1434. Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ 3D Computer Vision 3D Vision-language Modeling Part-aware 3D understanding Multimodal Large Language Model
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1435. PROS: Towards Compute-Efficient RLVR via Rollout Prefix Reuse
๐ฌ 5
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ RLVR reasoning
Large reasoning models (LRMs) trained with *Reinforcement Learning with Verifiable Rewards* (RLVR) have achieved remarkable progress on complex reasoning tasks. However, RLVR heavily relies on on-policy rollout generation, whose cost grows rapidly with rollout length and model size, eventually becoming the training bottleneck. Our empirical analysis reveals that independent rollouts for the same query often share similar early steps, indicating substantial redundancy. To address this, we propose **Pros** (**P**refix **R**euse for **O**n-policy **S**ampling), a paradigm that reuses promising prefixes of historical rollouts in RLVR training. **Pros** appends these self-generated partial rollouts to the original queries to form *Augmented Queries*, which are then used as regular training inputs in subsequent iterations, thereby reducing redundant computation. To select training batches from the augmented queries, **Pros** adopts a hierarchical Bayesian model to estimate their pass rates and prioritize those with the highest reward uncertainty. Experiments across diverse settings show that **Pros** consistently improves training efficiency and achieves higher accuracy than strong baselines. These results highlight **Pros** as a practical path toward scalable and compute-efficient RLVR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1436. Non-Asymptotic Analysis of (Sticky) Track-and-Stop
๐ฌ 10
โญ 6.00
๐ learning theory
๐ท๏ธ Multi-Armed Bandit Theory Pure Exploration Fixed-Confidence
In pure exploration problems, a statistician sequentially collects information to answer a question about some stochastic and unknown environment. The probability of returning a wrong answer should not exceed a maximum risk parameter $\delta$, and good algorithms make as few queries to the environment as possible. The Track-and-Stop algorithm is a pioneering method to solve these problems. Specifically, it is well known that it enjoys asymptotically optimal sample complexity guarantees as $\delta \to 0$ whenever the map from the environment to its correct answers is single-valued (e.g., best-arm identification with a unique optimal arm). The Sticky Track-and-Stop algorithm extends these results to settings where, for each environment, there might exist multiple correct answers (e.g., $\epsilon$-optimal arm identification). Although both methods are optimal in the asymptotic regime, their non-asymptotic guarantees remain unknown. In this work, we fill this gap and provide non-asymptotic guarantees for both algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1437. Critical attention scaling in long-context transformers
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models attention scaling long-context length scaling rank-collapse phase transition YaRN Qwen
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While $\text{\emph{attention scaling}}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking.
We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens.
Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
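A minimal sketch of the scaling the abstract analyzes: attention scores multiplied by a factor $\beta_n \asymp \log n$ before the softmax. The function below is a toy illustration; the exact schedules used by YaRN/Qwen are not reproduced.

```python
import math
import torch
import torch.nn.functional as F

def scaled_attention(q, k, v):
    """Polylogarithmic attention scaling: scores are multiplied by beta_n ~ log n."""
    n, d = q.shape[-2], q.shape[-1]
    beta_n = math.log(max(n, 2))                       # critical scaling, beta_n ~ log n
    scores = beta_n * (q @ k.transpose(-2, -1)) / math.sqrt(d)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4096, 64)                    # long-context toy example
out = scaled_attention(q, k, v)
```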
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1438. Pitfalls in Evaluating Language Model Forecasters
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ forecasting evaluation criticism leakage standards LLMs prediction future benchmarks
Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions as evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1439. Inter-Agent Relative Representations for Multi-Agent Option Discovery
๐ฌ 10
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Option Discovery Multi-agent Reinforcement Learning
Temporally extended actions improve the ability to explore and plan in single-agent settings. In multi-agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi-agent options particularly challenging. Existing multi-agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviors. Toward addressing these limitations, we describe a novel approach for multi-agent option discovery. Specifically, we propose a joint-state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the Fermat state, and use it to define a measure of spreadness, capturing team-level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two multi-agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1440. Generative Value Conflicts Reveal LLM Priorities
๐ฌ 9
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM alignment value alignment evaluation moral dilemmas
Past work seeks to align large language model (LLM)-based assistants with a target set of values, but such assistants are frequently forced to make tradeoffs *between* values when deployed. In response to the scarcity of value conflict in existing alignment datasets, we introduce ConflictScope, an automatic pipeline to evaluate how LLMs prioritize different values. Given a user-defined value set, ConflictScope automatically generates scenarios in which a language model faces a conflict between two values sampled from the set. It then prompts target models with an LLM-written ``user prompt'' and evaluates their free-text responses to elicit a ranking over values in the value set. Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended value conflict settings. However, including detailed value orderings in models' system prompts improves alignment with a target ranking by 14%, showing that system prompting can achieve moderate success at aligning LLM behavior under value conflict. Our work demonstrates the importance of evaluating value prioritization in models and provides a foundation for future work in this area.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1441. Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Offline reinforcement learning Offline-to-online settings Multi-step operator
We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($\lambda$) (CPQL).
Our algorithm adapts Peng's Q($\lambda$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator.
To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with the *multi-step* operator by fully leveraging offline trajectories.
The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization.
CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees --- a milestone that previous conservative approaches could not achieve.
Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines.
In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the framework of offline-to-online learning.
Using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and attain robust performance improvement.
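For reference, a minimal sketch of the standard Peng's Q($\lambda$) multi-step targets computed backward along one offline trajectory, $G_t = r_t + \gamma[(1-\lambda)\max_a Q(s_{t+1},a) + \lambda G_{t+1}]$. The conservative regularization that distinguishes CPQL, and terminal-state handling, are omitted.

```python
import numpy as np

def pengs_q_lambda_targets(rewards, q_next_max, gamma=0.99, lam=0.7):
    """rewards[t] = r_t, q_next_max[t] = max_a Q(s_{t+1}, a); non-terminal trajectory assumed."""
    T = len(rewards)
    targets = [0.0] * T
    next_return = q_next_max[-1]                  # bootstrap beyond the last step
    for t in reversed(range(T)):
        blended = (1 - lam) * q_next_max[t] + lam * next_return
        targets[t] = rewards[t] + gamma * blended
        next_return = targets[t]
    return targets

targets = pengs_q_lambda_targets(np.random.rand(5), np.random.rand(5))
```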
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1442. DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models
๐ฌ 26
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion Model Inference-Time Scaling Variance Reduction Sequential Monte Carlo Guidance
We study inference-time scaling for diffusion models, where the goal is to adapt a pre-trained model to new target distributions without retraining. Existing guidance-based methods are simple but introduce bias, while particle-based corrections suffer from weight degeneracy and high computational cost. We introduce *DriftLite*, a lightweight, training-free particle-based approach that steers the inference dynamics on-the-fly with provably optimal stability control. DriftLite exploits a fundamental degree of freedom in the Fokker-Planck equation between the drift and particle potential, and yields two practical instantiations: *Variance- and Energy-Controlling Guidance (VCG/ECG)* for approximating the optimal drift with modest and scalable overhead. Across Gaussian mixture models, particle systems, and large-scale protein-ligand co-folding problems, DriftLite consistently reduces variance and improves sample quality over pure guidance and sequential Monte Carlo baselines. These results highlight a principled, efficient route toward scalable inference-time adaptation of diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1443. Bures Generalized Category Discovery
๐ฌ 5
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ generalized category discovery Bures metric quantum informatics
Generalized Category Discovery (GCD) seeks to discover categories by clustering unlabeled samples that mix known and novel classes. The prevailing recipe enforces compact clustering, but this pursuit is largely blind to representation geometry: it over-compresses token manifolds, distorts eigen-structure, and yields brittle feature distributions that undermine discovery. We argue that GCD requires not more compression, but geometric restoration of an over-flattened feature space. Drawing inspiration from quantum information science, which similarly pursues representational completeness, we introduce Bures-Isotropy Alignment (BIA), which optimizes the class-token covariance toward an isotropic prior by minimizing the Bures distance. Under a mild trace constraint, BIA admits a practical surrogate equivalent to maximizing the nuclear norm of stacked class tokens, thereby promoting isotropic, non-collapsed subspaces without altering architectures. The induced isotropy homogenizes the eigen-spectrum and raises the von Neumann entropy of class-token autocorrelation, improving both cluster separability and class-number estimation. BIA is plug-and-play, implemented in a few lines on unlabeled batches, and consistently boosts strong GCD baselines on coarse- and fine-grained benchmarks, improving overall accuracy and reducing class-number estimation errors. By restoring the geometry of token manifolds rather than compressing them blindly, BIA supplies compactness for known classes and cohesive emergence for novel ones, advancing robust open-world discovery.
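A minimal sketch of the nuclear-norm surrogate mentioned in the abstract: maximizing $\|Z\|_*$ over stacked class tokens (here as a loss term to add to a baseline objective). Centering and the loss weight are assumptions; the trace constraint and the exact batching are not reproduced.

```python
import torch

def bia_surrogate(class_tokens):
    """class_tokens: (batch, dim). Minimizing the negative nuclear norm
    pushes the token covariance toward an isotropic, non-collapsed spectrum."""
    Z = class_tokens - class_tokens.mean(dim=0, keepdim=True)   # center the batch
    return -torch.linalg.matrix_norm(Z, ord="nuc")               # maximize ||Z||_*

loss_bia = bia_surrogate(torch.randn(256, 768, requires_grad=True))
loss_bia.backward()   # would be added (with a weight) to the baseline GCD loss
```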
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1444. Benchmarking Open-ended Segmentation
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Benchmarking Open-ended Segmentation Evaluation Protocol Lexical Alignment
Open-ended segmentation requires models capable of generating free-form descriptions of previously unseen concepts and regions. Despite advancements in model development, current evaluation protocols for open-ended segmentation tasks fail to capture the true semantic accuracy of the generated descriptions. We empirically demonstrate that embedding-based similarity score mappings diverge significantly from human judgments. To address this issue, we introduce a novel mapping function that considers multiple lexical relationships between free-form outputs and test-vocabulary labels, yielding much closer alignment with human annotations. We integrate this mapping into a robust evaluation framework and re-benchmark previous state-of-the-art methods. Additionally, we present the first Multi-modal Large-Language Model trained with a contrastive objective to jointly align visual regions and textual descriptions, achieving new state-of-the-art results in open-ended panoptic segmentation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1445. SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient Evaluation LLM Evaluation
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs.
In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem.
Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of an MLP to handle the sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's~$\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios.
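A toy sketch of the sparse-anchor idea: learn non-negative item weights by gradient descent so that a weighted score on a few items matches the full-benchmark mean score. The L1 penalty, weight parameterization, and sizes are assumptions; the paper's MLP parameterization and importance scores are not reproduced.

```python
import torch

torch.manual_seed(0)
perf = torch.rand(40, 300)                        # toy model-item performance matrix
full_score = perf.mean(dim=1)                     # full-benchmark score per model

w = (0.01 * torch.rand(300)).requires_grad_()     # raw item weights
opt = torch.optim.Adam([w], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    weights = torch.relu(w)                       # non-negative weights
    est = perf @ weights                          # score estimated from weighted items
    loss = ((est - full_score) ** 2).mean() + 1e-3 * weights.sum()   # fit + sparsity
    loss.backward()
    opt.step()

anchors = torch.topk(torch.relu(w.detach()), k=20).indices   # items kept as anchors
```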
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1446. Micro-Macro Coupled Koopman Modeling on Graph for Traffic Flow Prediction
๐ฌ 16
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Koopman Operator; Traffic Flow Prediction
Traffic systems are inherently multi-scale: microscopic vehicle interactions and macroscopic flow co-evolve nonlinearly. Microscopic models capture local interactions but miss flow evolution; macroscopic models enforce aggregated consistency yet overlook stochastic vehicle-level dynamics. We propose Micro-Macro Coupled Koopman Modeling (MMCKM), which lifts the coupled dynamics to a high-dimensional linear observation space for a unified linear-operator representation. Unlike grid-based discretizations, MMCKM adopts a vehicle-centric dynamic graph that preserves microscopic perturbations while respecting macroscopic conservation laws by discretizing PDEs onto this graph. At the micro scale, scenario-adaptive Koopman evolvers selected by an Intent Discriminator are designed to model vehicle dynamics. A Koopman control module explicitly formulates how the flow state influences individual vehicles, yielding bidirectional coupling. To our knowledge, this is the first work to jointly model vehicle trajectories and traffic flow density using a unified Koopman framework without requiring historical trajectories. The proposed MMCKM is validated for trajectory prediction on NGSIM and HighD. While MMCKM uses only real-time measurements, it achieves comparable or even higher accuracy than history-dependent baselines. We further analyze the effect of the operator interval and provide ablations to show the improvements from intent inference, macro-to-micro control, and diffusion. Code and implementation details are included to facilitate reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1447. Search Arena: Analyzing Search-Augmented LLMs
๐ฌ 11
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Web Search Human-AI Interaction
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce \textbf{Search Arena}, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations and types of cited sources, even when the cited content does not directly support the associated claims, uncovering a gap between perceived and actual credibility. To assess cross-setting performance, we conduct cross-arena analyses by testing search-augmented LLMs in a general purpose chat environment and conventional LLMs in search-heavy settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1448. Multiplicative Diffusion Models: Beyond Gaussian Latents
๐ฌ 26
โญ 6.00
๐ generative models
๐ท๏ธ score-based diffusion generative modeling multiplicative noise non-Gaussian latent variables conservative dynamics heavy-tailed distributions Fokker-Planck equation
We introduce a new class of generative models based on multiplicative score-driven diffusion. In contrast to classical diffusion models that rely on additive Gaussian noise, our construction is driven by skew-symmetric multiplicative noise. It yields conservative forward-backward dynamics inspired by principles of physics. We prove that the forward process converges exponentially fast to a tractable non-Gaussian latent distribution, and we characterize this limit explicitly. A key property of our diffusion is that it preserves the distribution of data norms, resulting in a latent space that is inherently data-aware. Unlike the standard Gaussian prior, this structure better adapts to heavy-tailed and anisotropic data, providing a closer match between latent and observed distributions.
On the algorithmic side, we derive the reverse-time stochastic differential equation and associated probability flow, and show that sliced score matching furnishes a consistent estimator for the backward dynamics. This estimation procedure is equivalent to maximizing an evidence lower bound (ELBO), bridging our framework with established variational principles.
Empirically, we demonstrate the advantages of our model in challenging settings, including correlated Cauchy distributions and experimental fluid dynamics data (d=1024). Across these tasks, our approach more accurately captures extreme events and tail behavior than classical diffusion models, particularly in the low-data regime.
Our results suggest that multiplicative conservative diffusions open a principled alternative to current score-based generative models, with strong potential for domains where rare but critical events dominate.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1449. PerFit: Exploring Personalization Shifts in Representation Space of LLMs
๐ฌ 3
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Personalization Large Language Models
Personalization has become a pivotal field of study in contemporary intelligent systems. While large language models (LLMs) excel at general knowledge tasks, they often struggle with personalization, i.e., adapting their outputs to individual user expectations. Existing approaches that steer LLM behavior to meet users' implicit preferences and behavior patterns, primarily relying on tune-free methods (e.g., RAG, PAG) or parameter fine-tuning methods (e.g., LoRA), struggle to balance effectiveness and efficiency. Moreover, the mechanisms underlying personalized preferences remain underexplored. To address these challenges, we first uncover key patterns of user-specific information embedded in the representation space. Specifically, we find that (1) personalized information lies within a low-rank subspace represented by vectors, and (2) these vectors demonstrate both a collective shift shared across users and a personalized shift unique to each individual user. Building on these insights, we introduce PerFit, a novel two-stage solution that directly fine-tunes interventions in the hidden representation space by addressing both collective and user-specific shifts, thereby achieving precise steering of LLMs with minimal parameter overhead. Experimental results demonstrate that PerFit delivers strong performance across six datasets while cutting the number of parameters by an average of 92.3% compared to the state-of-the-art method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1450. DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
๐ฌ 10
โญ 6.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ LLM Attention Mechanism Deterministic Training
Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization.
To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks.
Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1451. EXP-Bench: Can AI Conduct AI Research Experiments?
๐ฌ 4
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ AI Agents
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading AI agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1452. SiNGER: A Clearer Voice Distills Vision Transformers Further
๐ฌ 12
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Vision foundation models model compression knowledge distillation representation learning
Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality.
When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals.
The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts.
Then, the refined teacher's features are distilled to a student.
We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification.
Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance on multiple downstream tasks and producing clearer and more interpretable representations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1453. Composition of Memory Experts for Diffusion World Models
๐ฌ 11
โญ 6.00
๐ generative models
๐ท๏ธ World Model Diffusion Model Memory Generative Models Video Generation
World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet, existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we suggest decoupling future-past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion-based framework that integrates heterogeneous memory models through a contrastive product-of-experts formulation. Our approach instantiates three complementary roles: a short-term memory expert that captures fine local dynamics, a long-term memory expert that stores episodic history in external diffusion weights via lightweight test-time finetuning, and a spatial long-term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real-world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory-augmented diffusion world models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1454. Towards Understanding the Shape of Representations in Protein Language Models
๐ฌ 10
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Protein Language Models Shape Analysis Transformers
While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other.
We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between amino acids, but these encodings start to degrade at larger context lengths. The most structurally faithful encoding tends to occur close to, but before, the last layer of the models, indicating that training a folding model on top of these layers might lead to improved folding performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1455. Distilling to Hybrid Attention Models via KL-Guided Layer Selection
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Linear attention Hybrid architectures Distillation Layer selection Inference efficiency
Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1456. Bayesian Parameter Shift Rules in Variational Quantum Eigensolvers
๐ฌ 11
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ parameter shift rule variational quantum eigensolver quantum computing confidence region Gaussian process
Parameter shift rules (PSRs) are key techniques for efficient gradient estimation in variational quantum eigensolvers (VQEs). In this paper, we propose their Bayesian variant, where Gaussian processes with appropriate kernels are used to estimate the gradient of the VQE objective. Our Bayesian PSR offers flexible gradient estimation from observations at arbitrary locations with uncertainty information, and reduces to the generalized PSR in special cases. In stochastic gradient descent (SGD), the flexibility of Bayesian PSR allows reuse of observations in previous steps, which accelerates the optimization process. Furthermore, the accessibility to the posterior uncertainty, along with our proposed notion of gradient confident region (GradCoRe), enables us to minimize the observation costs in each SGD step. Our numerical experiments show that the VQE optimization with Bayesian PSR and GradCoRe significantly accelerates SGD, and outperforms the state-of-the-art methods, including sequential minimal optimization.
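For context, a minimal sketch of the classical two-point parameter shift rule that the Bayesian variant generalizes, using a toy cosine objective as a stand-in for a circuit expectation value. This is not the paper's GP-based estimator or the GradCoRe procedure.

```python
import numpy as np

def parameter_shift_gradient(expectation_fn, theta, shift=np.pi / 2):
    """d<E>/d theta_i = (E(theta + s e_i) - E(theta - s e_i)) / (2 sin s)
    for gates generated by Pauli operators (classical PSR)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += shift
        minus[i] -= shift
        grad[i] = (expectation_fn(plus) - expectation_fn(minus)) / (2 * np.sin(shift))
    return grad

# toy "VQE objective": a sum of cosines stands in for a measured expectation value
energy = lambda t: np.cos(t).sum()
g = parameter_shift_gradient(energy, np.array([0.3, 1.2, -0.7]))   # equals -sin(theta)
```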
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1457. LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models
๐ฌ 21
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Model Text-to-SQL
Natural Language to SQL (NL2SQL) aims to translate natural language queries into executable SQL statements, offering non-expert users intuitive access to databases. While recent approaches leveraging large-scale private LLMs such as GPT-4 have achieved state-of-the-art results, they face two critical challenges: the lack of openness and reproducibility, and the prohibitive computational cost of test-time scaling. To address these issues, we explore improving the model-level performance of small-scale public LLMs in NL2SQL under resource-constrained settings. Our exploratory experiments reveal the potential of task decomposition for enhancing NL2SQL performance, but also highlight the difficulty of enabling LLMs to decompose queries effectively. Motivated by these findings, we propose LearNAT, a novel framework designed to enhance LLMs' decomposition capabilities. LearNAT introduces (1) a Decomposition Synthesis Procedure, which leverages AST-guided search with pruning strategies to generate verifiable and efficient decompositions, and (2) Margin-Aware Reinforcement Learning, which provides fine-grained preference optimization for multi-step reasoning beyond standard DPO. Extensive experiments on benchmark datasets demonstrate that LearNAT significantly improves the performance of small-scale LLMs, achieving results comparable to GPT-4 with only a 7B parameter model. These results validate the effectiveness of verifiable decomposition and fine-grained preference learning in advancing NL2SQL towards openness, transparency, and efficiency.
Our code is publicly available at https://anonymous.4open.science/r/LearNAT.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1458. Training-free Counterfactual Explanation for Temporal Graph Model Inference
๐ฌ 12
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Temporal Graph Neural Networks Explainability Training Free
Temporal graph neural networks (TGNN) extend graph neural networks to dynamic networks and have demonstrated strong predictive power. However, interpreting TGNNs remains far less explored than their static-graph counterparts. This paper introduces TEMporal Graph eXplainer (TemGX), a training-free, post-hoc framework that helps users interpret and understand TGNN behavior by discovering the temporal subgraphs, and their evolution, that are responsible for TGNN outputs of interest. We introduce a class of explainability measures that extends influence maximization, in terms of structural influence and time decay, to model temporal influence. We formulate the explanation task as a constrained optimization problem, and propose fast algorithms to discover explanations with guarantees on their temporal explainability. Our experimental study verifies the effectiveness and efficiency of TemGX for TGNN explanation, compared with state-of-the-art explainers. We also showcase how TemGX supports inference queries for dynamic network analysis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1459. Evolution and compression in LLMs: on the emergence of human-aligned categorization
๐ฌ 15
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ LLMs information theory semantics
Converging evidence suggests that human systems of semantic categories achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy tradeoff. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-aligned semantic systems? To address this question, we focus on color categorization --- a key testbed of cognitive theories of categorization with uniquely rich human data --- and replicate with LLMs two influential human studies. First, we conduct an English color-naming study, showing that LLMs vary widely in their complexity and English-alignment, with larger instruction-tuned models achieving better alignment and IB-efficiency. Second, to test whether these LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via a method we refer to as Iterated in-Context Language Learning (IICLL). We find that akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency. However, only a model with strongest in-context capabilities (Gemini 2.0) is able to recapitulate the wide range of near-optimal IB-tradeoffs observed in humans, while other state-of-the-art models converge to low-complexity solutions. These findings demonstrate how human-aligned semantic categories can emerge in LLMs via the same fundamental principle that underlies semantic efficiency in humans.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1460. Soft-Masked Diffusion Language Models
๐ฌ 14
โญ 6.00
๐ generative models
๐ท๏ธ Masked diffusion language models continuous feedback code generation
Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that efficiently adapts masked diffusion language models to incorporate SM. We demonstrate that training a 169M parameter model from scratch with SM yields superior perplexity and MAUVE scores compared to binary masking baselines. Similarly, a pretrained model can be enhanced with SM through continued pretraining. Finally, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
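A minimal sketch of the soft-masking idea for a retained mask position: blend the [MASK] embedding with the probability-weighted embeddings of the top-k tokens predicted at the previous decoding step. The blend rule, `alpha`, and `k` are assumptions; the paper's training recipe is not shown.

```python
import torch
import torch.nn.functional as F

def soft_mask_embeddings(logits, embedding, mask_id, alpha=0.5, k=8):
    """logits: (vocab,) predictions for this position from the previous step."""
    probs = F.softmax(logits, dim=-1)
    topp, topi = probs.topk(k)
    topp = topp / topp.sum()                             # renormalize over top-k
    guess = topp @ embedding.weight[topi]                # weighted top-k embedding, (hidden,)
    return (1 - alpha) * embedding.weight[mask_id] + alpha * guess

emb = torch.nn.Embedding(32000, 512)
soft = soft_mask_embeddings(torch.randn(32000), emb, mask_id=4)
```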
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1461. Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs
๐ฌ 15
โญ 6.00
๐ generative models
๐ท๏ธ Discrete Diffusion Instruction Tuning NLP
Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive models, offering flexible generation orders and strong performance on complex reasoning tasks.
However, instruction-tuned dLLMs exhibit a critical vulnerability we term <eos> overflow: as the allocated sequence length increases, responses paradoxically become shorter, collapsing into early termination or degenerating into streams of <eos> tokens.
Although noticed in practice, this issue has not been systematically analyzed. We trace its root cause to the dual role of <eos> as both termination and padding, which concentrates probability mass on <eos> at later positions and propagates backward to trigger early termination.
To address this, we introduce Rainbow Padding, a simple remedy that replaces repeated <eos> placeholders with a repeating cycle of distinct padding tokens, distributing probability mass and breaking <eos> dominance.
Experiments show that Rainbow Padding substantially improves length robustness and output quality, with as few as seven padding tokens sufficient to prevent early termination.
Moreover, the method integrates efficiently into existing instruction-tuned models: LoRA fine-tuning for a single epoch on minimal data yields significant improvements, making this solution highly practical.
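A minimal sketch of the padding scheme described above: instead of filling the unused tail of a response with repeated <eos> tokens, the tail cycles through a small set of distinct padding tokens. The token names and the cycle length of seven are illustrative (the abstract only reports that as few as seven padding tokens suffice).

```python
# Toy illustration of rainbow padding at the sequence level.
EOS = "<eos>"
RAINBOW = [f"<pad_{i}>" for i in range(7)]   # small set of distinct padding tokens (names assumed)

def pad_response(tokens, target_len):
    padded = list(tokens) + [EOS]                     # a single terminator
    i = 0
    while len(padded) < target_len:
        padded.append(RAINBOW[i % len(RAINBOW)])      # repeating cycle of distinct pads, not repeated <eos>
        i += 1
    return padded

print(pad_response(["The", "answer", "is", "42", "."], 12))
```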
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1462. Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields
๐ฌ 14
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Physical reasoning video prediction
Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce **Neural Gaussian Force Field (NGFF)**, an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators. To support training, we also present **GSCollision**, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1463. A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
๐ฌ 11
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large language models Self-refine
Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6% compared to standard generation, while also achieving an 8.2% improvement in accuracy. Our code and all baselines used in the paper are available on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1464. Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression
๐ฌ 15
โญ 6.00
๐ learning theory
๐ท๏ธ Model Collapse High Dimensional Regression Overparametrization
Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of min-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares, and additionally in low dimensions. For ridge regression, we further analyze two popular model classes -- the random-effects model and the spiked covariance model -- demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we uncover that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real-data over synthetic. We study three additional settings: (i) where real data is fixed and fresh labels are not obtained at each iteration, (ii) where covariates vary across iterations but fresh real labels are available each time, and (iii) where covariates vary with time but only a fraction of them receive fresh real labels at each iteration. Across these diverse settings, we characterize when model collapse is inevitable and when synthetic data improves learning.
We validate our theoretical results with extensive simulations.
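The toy simulation below illustrates the iterative real/synthetic label-mixing scheme in one of the simplest settings (fixed covariates, fresh real labels each round, min-$\ell_2$-norm interpolation) and compares a few mixing weights, including the reciprocal of the golden ratio (about 0.618). It is a sketch of the setup, not a reproduction of the paper's analysis, and the dimensions and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 200, 0.5                    # overparameterized: d > n
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d) / np.sqrt(d)
X_test = rng.standard_normal((2000, d))

def iterate(lam, T=20):
    """Iteratively refit a min-l2-norm interpolator on a lam / (1 - lam) mix of
    fresh real labels and synthetic labels from the previous round (toy setting)."""
    beta_hat = np.zeros(d)
    for _ in range(T):
        y_real = X @ beta_true + sigma * rng.standard_normal(n)   # fresh real labels
        y_synth = X @ beta_hat                                    # synthetic labels from previous model
        y_mix = lam * y_real + (1 - lam) * y_synth
        beta_hat = np.linalg.pinv(X) @ y_mix                      # min-norm interpolation
    return np.mean((X_test @ (beta_hat - beta_true)) ** 2)        # excess test risk

golden = (1 + 5 ** 0.5) / 2
for lam in [1.0, 0.5, 1 / golden]:                                # 1/golden is roughly 0.618
    print(f"lam={lam:.3f}  test risk={iterate(lam):.4f}")
```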
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1465. Byzantine-Robust Federated Learning with Learnable Aggregation Weights
๐ฌ 15
โญ 6.00
๐ optimization
๐ท๏ธ Federated Learning Byzantine Robustness Distributed Optimization
Federated Learning (FL) enables clients to collaboratively train a global model without sharing their private data. However, the presence of malicious (Byzantine) clients poses significant challenges to the robustness of FL, particularly when data distributions across clients are heterogeneous. In this paper, we propose a novel Byzantine-robust FL optimization problem that incorporates adaptive weighting into the aggregation process. Unlike conventional approaches, our formulation treats aggregation weights as learnable parameters, jointly optimizing them alongside the global model parameters. To solve this optimization problem, we develop an alternating minimization algorithm with strong convergence guarantees under adversarial attack.
We analyze the Byzantine resilience of the proposed objective.
We evaluate the performance of our algorithm against state-of-the-art Byzantine-robust FL approaches across various datasets and attack scenarios. Experimental results demonstrate that our method consistently outperforms existing approaches, particularly in settings with highly heterogeneous data and a large proportion of malicious clients.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1466. EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning
๐ฌ 20
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Bayesian RL epistemic uncertainty exploration
At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, $\texttt{EUBRL}$, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a specific class of priors in infinite-horizon discounted MDPs. Empirically, we evaluate $\texttt{EUBRL}$ on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that $\texttt{EUBRL}$ achieves superior sample efficiency, scalability, and consistency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1467. Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport
๐ฌ 23
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Lossy Compression Image Compression Image Restoration Image Inpainting Optimal Transport Multi-task Learning Rate-Distortion-Perception Tradeoff Rate-Distortion-Classification Tradeoff Deep Learning Unsupervised Learning
We study cross-domain lossy compression, where the encoder observes a degraded source while the decoder reconstructs samples from a distinct target distribution. The problem is formulated as constrained optimal transport with two constraints on compression rate and classification loss. With shared common randomness, the one-shot setting reduces to a deterministic transport plan, and we derive closed-form distortion-rate-classification (DRC) and rate-distortion-classification (RDC) tradeoffs for Bernoulli sources under Hamming distortion. In the asymptotic regime, we establish analytic DRC/RDC expressions for Gaussian models under mean-squared error. The framework is further extended to incorporate perception divergences (Kullback-Leibler and squared Wasserstein), yielding closed-form distortion-rate-perception-classification (DRPC) functions. To validate the theory, we develop deep end-to-end compression models for super-resolution (MNIST), denoising (SVHN, CIFAR-10, ImageNet, KODAK), and inpainting (SVHN) problems, demonstrating the consistency between the theoretical results and empirical performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1468. CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
๐ฌ 15
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Mathematical Reasoning Benchmark Circuit Benchmark Symbolic Reasoning
Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present \textbf{CircuitSense}, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of eight state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85\% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19\%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence. Our synthetic pipeline code is available at \href{https://anonymous.4open.science/r/CircuitSense-8AC7/README.md}{URL}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1469. Critical Confabulation: Can LLMs Hallucinate for Social Good?
๐ฌ 13
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models AI for Social Good Hallucination and Confabulation Narrative Modeling Data Contamination and Memorization Computational Creativity Evidence-Grounded Generation
LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory), the use of LLM hallucinations to "fill-in-the-gap" for omissions in archives due to social and political inequality, and reconstruct divergent yet evidence-bound narratives for history's "hidden figures". We simulate these gaps with an open-ended narrative cloze task: asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate audited (for data contamination), fully-open models (the OLMo-2 family) and unaudited open-weight and proprietary baselines under a range of prompts designed to elicit controlled and useful hallucinations. Our findings validate LLMs' foundational narrative understanding capabilities to perform critical confabulation, and show how controlled and well-specified hallucinations can support LLM applications for knowledge production without collapsing speculation into a lack of historical accuracy and fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1470. Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Real-time Music Accompaniment Music Generation Reinforcement Learning Adversarial Machine Learning
Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as ``reward hacking'', affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1471. Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias
๐ฌ 16
โญ 6.00
๐ learning theory
๐ท๏ธ Norm scaling law deep linear neural network linear regression
For overparameterized linear regression with isotropic Gaussian design and the minimum-$\ell_p$-norm interpolator with $p\in(1,2]$, we give a unified, high-probability characterization of how the family of parameter norms $\{ \lVert \widehat{w_p} \rVert_r \}_{r \in [1,p]}$ scales with sample size.
We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal
*spike* and a *bulk* of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition
$n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$'s which plateau from those that continue to grow with an explicit exponent.
This unified solution resolves the scaling of *all* $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation,
and explains in one picture which norms saturate and which increase as $n$ grows.
We then study diagonal linear networks (DLNs) trained by gradient descent.
By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential,
we show empirically that DLNs inherit the same elbow/threshold laws,
providing a predictive bridge between explicit and implicit bias.
Given that many generalization proxies depend on $\lVert \widehat{w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $\ell_r$ norm is used.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1472. It's All Just Vectorization: einx, a Universal Notation for Tensor Operations
๐ฌ 17
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Tensor notation tensor programming einx einsum einops
Tensor operations represent a cornerstone of modern scientific computing. However, the Numpy-like notation adopted by predominant tensor frameworks is often difficult to read and write and prone to so-called shape errors, in part because inconsistent rules are followed across a large, complex collection of operations. Alternatives like einsum and einops have gained popularity, but are inherently restricted to a few operations and lack the generality required for a universal model of tensor programming.
To derive a better paradigm, we revisit vectorization as a function for transforming tensor operations, and use it to both lift lower-order operations to higher-order operations, and conceptually decompose higher-order operations to lower-order operations and their vectorization.
Building on the universal nature of vectorization, we introduce einx, a universal notation for tensor operations. It uses declarative, pointful expressions that are defined by analogy with loop notation and represent the vectorization of tensor operations. The notation reduces the large APIs of existing frameworks to a small set of elementary operations, applies consistent rules across all operations, and enables a clean, readable and writable representation in code. We provide an implementation of einx that is embedded in Python and integrates seamlessly with existing tensor frameworks: https://github.com/REMOVED_FOR_REVIEW
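For readers unfamiliar with the notation, a few illustrative expressions in the spirit of the public einx package are shown below; the exact calls are assumptions about its API rather than examples taken from the paper.

```python
# Illustrative einx-style expressions (API details assumed, not from the paper):
# brackets mark reduced axes, and reshaping/einsum-like ops use one shared notation.
import numpy as np
import einx

x = np.random.rand(4, 16, 16, 3)        # batch, height, width, channels
w = np.random.rand(3, 8)

y = einx.mean("b [h w] c", x)                                           # spatial mean-pool
z = einx.rearrange("b (h p) (w q) c -> b h w (p q c)", x, p=2, q=2)     # patchify
o = einx.dot("b h w c, c d -> b h w d", x, w)                           # generalized einsum

print(y.shape, z.shape, o.shape)
```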
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1473. PYRREGULAR: A Unified Framework for Irregular Time Series, with Classification Benchmarks
๐ฌ 7
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ irregular time series classification
Irregular temporal data, characterized by varying recording frequencies, differing observation durations, and missing values, presents significant challenges across fields like mobility, healthcare, and environmental science. Existing research communities often overlook or address these challenges in isolation, leading to fragmented tools and methods. To bridge this gap, we introduce a unified framework, and the first standardized dataset repository for irregular time series classification, built on a common array format to enhance interoperability. This repository comprises 34 datasets on which we benchmark 12 classifier models from diverse domains and communities. This work aims to centralize research efforts and enable a more robust evaluation of irregular temporal data analysis methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1474. Flow Actor-Critic for Offline Reinforcement Learning
๐ฌ 22
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning offline reinforcement learning flow actor-critic flow policies flow matching
Datasets in offline reinforcement learning (RL) often exhibit complex and multi-modal distributions, necessitating expressive policies that capture such distributions beyond widely-used Gaussian policies. To handle such complex and multi-modal datasets, in this paper, we propose Flow Actor-Critic, a new actor-critic method for offline RL, based on recent flow policies. The proposed method not only uses the flow model for the actor, as in previous flow policies, but also exploits the expressive flow model for conservative critic acquisition to prevent Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on the accurate proxy behavior model obtained as a byproduct of the flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance on offline RL test datasets, including the D4RL and recent OGBench benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1475. Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
๐ฌ 14
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Vision Language Model Spatial Reasoning Multiview Images
Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents, such as robots and self-driving cars, typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable performance gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding (SU). To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in 12% and 56% average improvements on multi-choice QA and absolute distance estimation, respectively. Ego3D-VLM can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level SU in real-world, multi-view environments. Code is available in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1476. RoRE: Rotary Ray Embedding for Generalised Multi-Modal Scene Understanding
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ transformers attention geometric vision multi-modal vision novel view synthesis thermal fisheye
Transformers have emerged as powerful implicit rendering models, capable of performing geometric reasoning and producing photorealistic novel views in a single feedforward pass. A central challenge in these architectures is how to inject camera parameters into the transformer in a way that generalises across diverse sensing conditions. In this work, we present Rotary Ray Embedding (RoRE), an approach that embeds image patches directly as rays, using a learning-based rotary positional embedding (RoPE). This ray-based formulation provides a unified and general representation, improving robustness to unconventional camera geometries and sensing modalities. We evaluate our approach on conventional perspective imagery, fisheye cameras, and multi-modal RGB-thermal setups, showing that a single network can flexibly integrate arbitrary numbers of cameras and modalities into a coherent scene representation. Experiments demonstrate improved generalisation and cross-modal consistency compared to existing methods, highlighting the potential for relative ray-based embeddings to build adaptable, plug-and-play vision systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1477. SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin
๐ฌ 28
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ decompilation binary analysis
Large Language Models (LLMs) have emerged as a promising approach for binary decompilation. However, existing LLM-based decompilers remain limited in their ability to recover a program's source-level structure together with its original identifiers.
To mitigate this, we introduce SK2Decompile, a novel two-phase approach that decompiles from the skeleton (semantic structure) to the skin (identifiers) of programs. Specifically, we first apply a Structure Recovery model to translate a program's binary code into an Intermediate Representation (IR), deriving the program's "skeleton", i.e., preserving control flow and data structures while obfuscating all identifiers with generic placeholders. We also apply reinforcement learning to reward the model for producing program structures that adhere to the syntactic and semantic rules expected by compilers. Second, we apply an Identifier Naming model to produce meaningful identifiers that reflect actual program semantics, deriving the program's "skin". We train the Identifier Naming model with a separate reinforcement learning objective that rewards the semantic similarity between its predictions and the reference code. This two-phase decompilation process allows the correctness and readability of decompilation to be advanced independently.
Our evaluations indicate that SK2Decompile significantly outperforms the SOTA baselines, achieving a 21.6% average re-executability rate gain over GPT-5-mini on the HumanEval dataset and a 29.4% average R2I improvement over Idioms on the GitHub2025 benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1478. How Dark Patterns Manipulate Web Agents
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Agents Redteaming Evaluations Reasoning Foundation Models
Deceptive UI designs, widely instantiated across the web and commonly known as dark patterns, manipulate users into performing actions misaligned with their goals. In this paper, we show that dark patterns are highly effective in steering agent trajectories, posing a significant risk to agent robustness. To quantify this risk, we introduce DECEPTICON, an environment for testing individual dark patterns in isolation. DECEPTICON includes 850 web navigation tasks with dark patterns (600 generated tasks and 250 real-world tasks), designed to measure instruction-following success and dark pattern effectiveness. Across SOTA agents, we find dark patterns successfully steer agent trajectories towards malicious outcomes in over 70% of tested generated and real-world tasks. Moreover, we find that dark pattern effectiveness correlates positively with model size and test-time reasoning, making larger, more capable models more susceptible. Leading countermeasures against adversarial attacks, including in-context prompting and guardrail models, fail to consistently reduce the success rate of dark pattern interventions. Our findings reveal dark patterns as a latent and unmitigated risk to web agents, highlighting the urgent need for robust defenses against manipulative designs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1479. Enhancing Vision Transformers for Object Detection via Context-Aware Token Selection and Packing
๐ฌ 8
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ vision transformer object detection
In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, these advancements come at the cost of inefficiency and substantial computational expense, especially when dealing with sparse data. While sparse attention mechanisms have been introduced to mitigate these issues by pruning tokens involved in attention, they often lack context-awareness and intelligence, frequently limiting the number of selected tokens uniformly across different inputs. To address these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer and packs these selected tokens into new batches, allowing for a variable number of tokens to be used in GPU batch training and inference. Through extensive experiments on diverse datasets and multiple computer vision tasks, our method demonstrates superior performance and efficiency, including a 0.5-2.7 AP improvement in object detection and a 10.9%-24.9% reduction in computation.
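A minimal sketch of the select-and-pack idea: a cheap gating layer scores tokens, a per-image threshold keeps a variable number of them, and the survivors are re-packed into a padded batch with an attention mask. The gating form and the fixed threshold are assumptions; SPA's actual selection and packing may differ in both.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

B, N, D = 4, 196, 256
tokens = torch.randn(B, N, D)
gate = torch.nn.Linear(D, 1)                         # low-cost gating layer (form assumed)

scores = gate(tokens).squeeze(-1).sigmoid()          # (B, N) per-token informativeness scores
keep = scores > 0.5                                  # variable number of tokens kept per image

selected = [tokens[b][keep[b]] for b in range(B)]    # ragged list of kept tokens
lengths = torch.tensor([s.shape[0] for s in selected])
packed = pad_sequence(selected, batch_first=True)    # (B, max_kept, D) for batched attention
attn_mask = torch.arange(packed.shape[1])[None, :] < lengths[:, None]   # mask out padding

print(packed.shape, lengths.tolist())
```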
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1480. Take Note: Your Molecular Dataset Is Probably Aligned
๐ฌ 22
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ molecular machine learning datasets orientation bias equivariance 3D orientation
Massive training datasets are fueling the astounding progress in molecular machine learning. Since these datasets are typically generated with computational chemistry codes which do not randomize pose, the resulting geometries are usually not randomly oriented. While cheminformaticians are well aware of this fact, it can be a real pitfall for machine learners entering the burgeoning field of molecular machine learning. We demonstrate that molecular poses in the popular datasets QM9, QMugs and OMol25 are indeed biased. While this fact can easily be overlooked by visual inspection alone, we show that a simple classifier can separate original data samples from randomly rotated ones with high accuracy. Second, we validate empirically that neural networks can and do exploit the orientedness in these datasets by successfully training a model on chemical property regression using the molecular orientation as _sole_ input. Third, we present visualizations of all molecular orientations and confirm that chemically similar molecules tend to have similar canonical poses. In summary, we recall and document orientational bias in the prevalent datasets that machine learners should be aware of.
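The sketch below mimics the spirit of the first experiment on synthetic data: anisotropic point clouds standing in for molecules are either left in a canonical pose or randomly rotated, and a linear classifier on a simple orientation-sensitive feature (the flattened coordinate covariance) tries to tell the two apart. With the real datasets one would use the stored geometries instead; everything here is illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def make_molecule(aligned):
    # Anisotropic "molecule": long axis along x when aligned, otherwise randomly rotated.
    coords = rng.standard_normal((12, 3)) * np.array([3.0, 1.0, 0.5])
    if not aligned:
        coords = coords @ Rotation.random().as_matrix().T
    return coords

def features(coords):
    c = coords - coords.mean(0)
    return (c.T @ c).ravel()            # flattened covariance/gyration tensor (orientation-sensitive)

labels = np.repeat([0, 1], 500)          # 0 = canonical pose, 1 = randomly rotated
X = np.array([features(make_molecule(aligned=lab == 0)) for lab in labels])
print(cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean())
```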
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1481. Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
๐ฌ 13
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Ranking and Preference Learning Latent Variable Models
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single-reward model. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction. Our code and data are available at https://anonymous.4open.science/r/SPL-0111
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1482. Efficient Message-Passing Transformer for Error Correcting Codes
๐ฌ 10
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Channel coding Error correcting codes Transformer-based decoder Message-passing decoder Neural decoder Transformer Efficient attention module
Error correcting codes (ECCs) are a fundamental technique for ensuring reliable communication over noisy channels. Recent advances in deep learning have enabled transformer-based decoders to achieve state-of-the-art performance on short codes; however, their computational complexity remains significantly higher than that of classical decoders due to the attention mechanism. To address this challenge, we propose EfficientMPT, an efficient message-passing transformer that significantly reduces computational complexity while preserving decoding performance. A key feature of EfficientMPT is the Efficient Error Correcting (EEC) attention mechanism, which replaces expensive matrix multiplications with lightweight vector-based element-wise operations. Unlike standard attention, EEC attention relies only on query-key interaction with a global query vector, efficiently encoding global contextual information for ECC decoding. Furthermore, EfficientMPT can serve as a foundation model, capable of decoding various code classes and long codes via fine-tuning. In particular, EfficientMPT achieves 85% and 91% memory reductions and 47% and 57% FLOPs reductions compared to ECCT for the $(648,540)$ and $(1056,880)$ standard LDPC codes, respectively.
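One plausible instantiation of attention with a single global query vector, in the spirit of the description above: scores are computed per position against one pooled query, so no N x N attention matrix is formed. The actual EEC attention of EfficientMPT may differ in its details.

```python
import torch
import torch.nn.functional as F

def global_query_attention(x, Wq, Wk, Wv):
    """x: (B, N, D). A single global query summarizes all positions, so the score
    computation is O(N*D) vector/element-wise work instead of an (N x N) matrix.
    This is an assumed, simplified form of the 'global query' idea, not the paper's
    exact EEC attention."""
    q = (x @ Wq).mean(dim=1)                    # (B, D) global query vector
    k = x @ Wk                                  # (B, N, D)
    v = x @ Wv                                  # (B, N, D)
    scores = (k * q.unsqueeze(1)).sum(-1)       # (B, N) query-key interaction, element-wise
    attn = F.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
    return (attn.unsqueeze(-1) * v).sum(1)      # (B, D) globally pooled context

B, N, D = 2, 648, 32
x = torch.randn(B, N, D)
out = global_query_attention(x, torch.randn(D, D), torch.randn(D, D), torch.randn(D, D))
print(out.shape)  # torch.Size([2, 32])
```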
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1483. MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
๐ฌ 13
โญ 6.00
๐ generative models
๐ท๏ธ video diffusion models physical plausability
Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models' insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1484. Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models
๐ฌ 9
โญ 6.00
๐ generative models
๐ท๏ธ Quantization Diffusion
Diffusion models have shown remarkable performance in image synthesis by progressively estimating a smooth transition from a Gaussian distribution of noise to a real image. Unfortunately, their practical deployment is limited by slow inference speed, high memory usage, and the computational demands of the noise estimation process. Post-training quantization (PTQ) emerges as a promising solution to accelerate sampling and reduce the memory overhead of diffusion models. Existing PTQ methods for diffusion models typically apply uniform weights to calibration samples across timesteps, which is sub-optimal since data at different timesteps may contribute differently to the diffusion process. Additionally, due to varying activation distributions and gradients across timesteps, a uniform quantization approach is sub-optimal. Each timestep requires a different gradient direction for optimal quantization, and treating them equally can lead to conflicting gradients that degrade performance. In this paper, we propose a novel PTQ method that addresses these challenges by assigning appropriate weights to calibration samples.
Specifically, our approach learns to assign optimal weights to calibration samples to align the quantized model's gradients across timesteps, facilitating the quantization process. Extensive experiments on CIFAR-10, LSUN-Bedrooms, and ImageNet datasets demonstrate the superiority of our method compared to other PTQ methods for diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1485. ON THE ROLE OF IMPLICIT REGULARIZATION OF STOCHASTIC GRADIENT DESCENT IN GROUP ROBUSTNESS
๐ฌ 18
โญ 6.00
๐ learning theory
๐ท๏ธ Spurious Correlations Stochastic Gradient Descent (SGD) Implicit Regularization
Training with stochastic gradient descent (SGD) at moderately large learning rates has been observed to improve robustness against spurious correlations, i.e., strong correlations between non-predictive features and target labels. Yet, the mechanism underlying this effect remains unclear. In this work, we identify batch size as an additional critical factor and show that robustness gains arise from the implicit regularization of SGD, which intensifies with larger learning rates and smaller batch sizes. This implicit regularization reduces reliance on spurious or shortcut features, thereby enhancing robustness while preserving accuracy. Importantly, this effect appears unique to SGD: gradient descent (GD) does not confer the same benefit and may even exacerbate shortcut reliance. Theoretically, we establish this phenomenon in linear models by leveraging statistical formulations of spurious correlations, proving that SGD systematically suppresses spurious feature dependence. Empirically, we demonstrate that the effect extends to deep neural networks across multiple benchmarks. For the experiments and code, please refer to this \href{https://github.com/ICLR2026-submission/implicit-regularization-in-group-robustness}{GitHub repository}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1486. R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ model-based reinforcement learning world models representation learning
A central challenge in image-based Model-Based Reinforcement Learning (MBRL) is to learn representations that distill task-essential information from irrelevant details. While promising, approaches that learn representations by reconstructing input images often waste capacity on spatially large but task-irrelevant visual information, such as backgrounds. Decoder-free methods address this issue by leveraging data augmentation (DA) to enforce robust representations, but the reliance on such external regularizers to prevent collapse severely limits their versatility. To address this, we propose R2-Dreamer, an MBRL framework that introduces a self-supervised objective acting as an internal regularizer, thus preventing collapse without resorting to DA. The core of our method is a feature redundancy reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. In evaluations on standard continuous control benchmarks, including DMC Vision and Meta-World, R2-Dreamer achieves performance competitive with strong baselines, including the leading decoder-based agent DreamerV3 and its decoder-free counterpart that relies on DA. Notably, thanks to its simple decoder-free design, R2-Dreamer achieves 1.59x faster training than DreamerV3. Furthermore, its effectiveness is highlighted on a challenging benchmark with tiny but task-relevant objects (DMC-Subtle), where our approach demonstrates substantial gains over all baselines. These results show that R2-Dreamer provides a versatile, high-performance framework for decoder-free MBRL by incorporating an effective internal regularizer.
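A minimal sketch of the Barlow Twins-style redundancy-reduction objective the abstract refers to: the cross-correlation matrix between two embedding views is pushed toward the identity (diagonal to 1, off-diagonal to 0). How R2-Dreamer constructs the two views inside the world model is not shown here.

```python
import torch

def redundancy_reduction_loss(z1, z2, lam=5e-3):
    """Barlow Twins-style loss on two batches of embeddings z1, z2 of shape (n, d).
    The trade-off weight lam is a typical choice, not taken from the paper."""
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    n, d = z1.shape
    c = (z1.T @ z2) / n                                        # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()             # pull diagonal toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push off-diagonal toward 0
    return on_diag + lam * off_diag

z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
print(redundancy_reduction_loss(z1, z2))
```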
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1487. THE SELF-RE-WATERMARKING TRAP: FROM EXPLOIT TO RESILIENCE
๐ฌ 24
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ watermarking deep learning AI Security Re-Watermarking attack
Watermarking has been widely used for copyright protection of digital images. Deep learning-based watermarking systems have recently emerged as more robust and effective than traditional methods, offering improved fidelity and resilience against attacks. Among the various threats to deep learning-based watermarking systems, self-re-watermarking attacks represent a critical and underexplored challenge. In such attacks, the same encoder is maliciously reused to embed a new message into an already watermarked image. This process effectively prevents the original decoder from retrieving the original watermark without introducing perceptual artifacts. In this work, we make two key contributions. First, we introduce the self-re-watermarking threat model as a novel attack vector and demonstrate that existing state-of-the-art watermarking methods consistently fail under such attacks. Second, we develop a self-aware deep watermarking framework to defend against this threat. Our key insight for mitigating the risk of self-re-watermarking is to limit the sensitivity of the watermarking models to the inputs, thereby resisting re-embedding of new watermarks. To achieve this, we propose a self-aware deep watermarking framework that extends Lipschitz constraints to the watermarking process, regulating encoder-decoder sensitivity in a principled manner. In addition, the framework incorporates re-watermarking adversarial training, which further constrains sensitivity to distortions arising from re-embedding. The proposed method provides theoretical bounds on message recoverability under malicious encoder-based re-watermarking and demonstrates strong empirical robustness against diverse scenarios of re-watermarking attempts. In addition, it maintains high visual fidelity and demonstrates competitive robustness against common image processing distortions compared to state-of-the-art watermarking methods. This work establishes a robust defense against both standard distortions and self-re-watermarking attacks. The implementation will be made publicly available on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1488. Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster
๐ฌ 4
โญ 6.00
๐ generative models
๐ท๏ธ diffusion generative models variational inference
Discrete diffusion models are a powerful class of generative models that demonstrate strong performance across many domains. However, for efficiency, discrete diffusion typically parameterizes the generative (reverse) process with factorized distributions, which makes it difficult for the model to learn a target process in a small number of steps and necessitates a long, computationally expensive sampling procedure. To reduce the gap between the target and model distributions and enable few-step generation, we introduce a learnable noising (forward) process for discrete diffusion. Instead of fixing a Markovian forward chain, we adopt a non-Markovian formulation and introduce learnable marginal and posterior distributions. This allows the generative process to remain factorized while matching the target defined by the noising process. We train all parameters end-to-end under the standard variational objective.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1489. Post-hoc Probabilistic Vision-Language Models
๐ฌ 17
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Uncertainty Quantification Active Fine-Tuning Bayesian Deep Learning Vision-Language Models
Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1490. The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology
๐ฌ 14
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Persistent Homology Interpretability Topological Data Analysis Representation Geometry Large Language Models AI Security Adversarial Attacks Sparse Autoencoders
Existing interpretability methods for Large Language Models (LLMs) often fall short by focusing on linear directions or isolated features, overlooking the high-dimensional, nonlinear, and relational geometry within model representations. This study focuses on how adversarial inputs systematically affect the internal representation spaces of LLMs, a topic which remains poorly understood. We propose the application of persistent homology (PH) to measure and understand the geometry and topology of the representation space when the model is under external adversarial influence. Specifically, we use PH to systematically interpret six state-of-the-art models under two distinct adversarial conditions, indirect prompt injection and backdoor fine-tuning, and uncover a consistent topological signature of adversarial influence. Across architectures and model sizes, adversarial inputs induce "topological compression", where the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, and more dispersed large-scale ones. This topological signature is statistically robust across layers, highly discriminative, and provides interpretable insights into how adversarial effects emerge and propagate. By quantifying the shape of activations and neuron-level information flow, our architecture-agnostic framework reveals fundamental invariants of representational change, offering a complementary perspective to existing interpretability methods.
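A minimal sketch of the kind of measurement involved: persistence diagrams of an activation point cloud computed with the ripser package, summarized by the number and lifetimes of H1 features. The activation arrays and the low-dimensional projection are placeholders; the paper's full topological-compression analysis goes well beyond this single summary.

```python
import numpy as np
from ripser import ripser

# Stand-ins for (token x hidden) activations after projecting to a low dimension;
# real usage would pull these from a chosen layer under benign and adversarial prompts.
acts_benign = np.random.randn(300, 32)
acts_attack = np.random.randn(300, 32) * 0.3

for name, X in [("benign", acts_benign), ("attack", acts_attack)]:
    dgms = ripser(X, maxdim=1)["dgms"]          # H0 and H1 persistence diagrams
    h1 = dgms[1]
    lifetimes = h1[:, 1] - h1[:, 0]
    mean_life = lifetimes.mean() if len(h1) else 0.0
    print(name, "H1 features:", len(h1), "mean lifetime:", mean_life)
```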
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1491. Provably Tracking Equivalent Mechanistic Interpretations Across Neural Networks
๐ฌ 19
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability
Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: the lack of a well-defined notion of a valid interpretation; and, the ad hoc nature of generating and searching for such explanations. In this paper, we address these challenges by formally defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are (approximately) equivalent if and only if all of their possible implementations are also (approximately) equivalent. We develop tractable algorithms to estimate interpretive equivalence and case study their use on Transformer-based models. To analyze our algorithms, we introduce necessary and sufficient conditions for interpretive equivalence grounded in the similarity of their neural representations. As a result, we provide the first theoretical guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1492. SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions
๐ฌ 26
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Flow matching Safety guarantees Planning and Control
Generative planners based on flow matching (FM) produce high-quality paths in a single or a few ODE steps, but their sampling dynamics offer no formal safety guarantees and can yield incomplete paths near constraints. We present SafeFlowMatcher, a planning framework that couples FM with control barrier functions (CBFs) to achieve both real-time efficiency and certified safety. SafeFlowMatcher uses a two-phase prediction-correction (PC) integrator: (i) a prediction phase integrates the learned FM once (or a few steps) to obtain a candidate path without intervention; (ii) a correction phase refines this path with a vanishing time-scaled vector field and a CBF-based quadratic program that minimally perturbs the vector field. We prove a barrier certificate for the resulting flow system, establishing forward invariance of a robust safe set and finite-time convergence to the safe set. In addition, by enforcing safety only on the executed path, rather than on all intermediate latent paths, SafeFlowMatcher avoids distributional drift and mitigates local trap problems. Moreover, SafeFlowMatcher attains faster, smoother, and safer paths than diffusion- and FM-based baselines on maze navigation, locomotion, and robot manipulation tasks. Extensive ablations corroborate the contributions of the PC integrator and the barrier certificate.
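For intuition, the sketch below shows a generic CBF-based safety filter in its simplest form: single-integrator dynamics, one circular obstacle, and the closed-form solution of the minimal-perturbation QP for a single affine constraint. It is not SafeFlowMatcher's prediction-correction integrator, only the CBF-QP building block it relies on.

```python
import numpy as np

def cbf_qp_closed_form(x, u_nom, obstacle, radius, alpha=1.0):
    """Minimal CBF safety filter for single-integrator dynamics xdot = u and a circular
    obstacle, with barrier h(x) = ||x - c||^2 - r^2 >= 0. With one affine constraint,
    the QP  min ||u - u_nom||^2  s.t.  grad_h . u + alpha*h >= 0  has a closed form.
    Generic sketch; SafeFlowMatcher's full formulation is not reproduced here."""
    h = np.sum((x - obstacle) ** 2) - radius ** 2
    grad_h = 2 * (x - obstacle)
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0:                                        # nominal action is already safe
        return u_nom
    return u_nom - slack * grad_h / (grad_h @ grad_h)     # minimal perturbation onto the constraint

x = np.array([0.9, 0.0])
u_nom = np.array([1.0, 0.0])                              # nominal step heading toward the obstacle
u_safe = cbf_qp_closed_form(x, u_nom, obstacle=np.array([2.0, 0.0]), radius=1.0)
print(u_safe)
```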
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1493. Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
๐ฌ 12
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ vision-language models benchmark dataset medical AI evaluation reasoning-intensive tasks
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning.
We introduce \texttt{Neural-MedBench}, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics.
Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings.
Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1494. Death of the Novel(ty): Beyond N-Gram Novelty as a Metric for Textual Creativity
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ creativity creative writing evaluation creativity evaluation machine creativity n-gram novelty
$N$-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and $n$-gram novelty through 7542 expert writer annotations ($n=26$) of novelty, pragmaticality, and sensicality via \emph{close reading} of human and AI-generated text. We find that while $n$-gram novelty is positively associated with expert writer-judged creativity, $\approx 91\%$ of top-quartile expressions by $n$-gram novelty are not judged as creative, cautioning against relying on $n$-gram novelty alone. Furthermore, unlike human-written text, higher $n$-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
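For context, n-gram novelty is typically computed as the fraction of an output's n-grams that never occur in the reference (training) corpus, as in the small sketch below; the paper's argument is precisely that this quantity alone ignores sensicality and pragmaticality.

```python
# Toy n-gram novelty: share of generated n-grams unseen in the reference corpus.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(text, corpus, n=5):
    gen = text.lower().split()
    seen = ngrams(corpus.lower().split(), n)
    gen_ngrams = [tuple(gen[i:i + n]) for i in range(len(gen) - n + 1)]
    if not gen_ngrams:
        return 0.0
    return sum(g not in seen for g in gen_ngrams) / len(gen_ngrams)

corpus = "the rain fell softly on the quiet town as evening settled in"
print(ngram_novelty("the rain fell softly on the sleeping sea of rooftops", corpus, n=3))
```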
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1495. Supporting High-Stakes Decision Making Through Interactive Preference Elicitation in the Latent Space
๐ฌ 6
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian optimization preference elicitation autoencoder LLM
High-stakes, infrequent consumer decisions, such as housing selection, challenge conventional recommender systems due to sparse interaction signals, heterogeneous multi-criteria objectives, and high-dimensional feature spaces.
This work presents an interactive preference elicitation framework that couples preferential Bayesian optimization (PBO) with two complementary components: (i) large language models (LLMs) that interpret natural language input to produce personalized probabilistic priors over feature utility weights to mitigate cold start, and (ii) an autoencoder (AE)-based latent representation that reduces effective dimensionality for sample-efficient exploration. The framework learns a latent utility function from user pairwise comparisons observed and integrated in real-time.
We evaluate the developed method on rental real estate datasets from two major European cities. The results show that executing PBO in an AE latent space improves final pairwise ranking accuracy by 12%. For LLM-based preference prior generation, we find that direct, LLM-driven weight specification is outperformed by a static prior, while probabilistic weight priors that use LLMs only to rank feature importance achieve 25% better pairwise accuracy on average than a direct approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1496. Understanding the Role of Training Data in Test-Time Scaling
๐ฌ 11
โญ 6.00
๐ learning theory
๐ท๏ธ Language models Learning theory Chains-of-Thought Inference compute Test error
Test-time scaling improves the reasoning capabilities of large language models (LLMs) by allocating extra compute to generate longer Chains-of-Thoughts (CoTs). This enables models to tackle more complex problems by breaking them down into additional steps, backtracking, and correcting mistakes. Despite the strong performance demonstrated by OpenAI's o1 and DeepSeek R1, the conditions in the training data under which long CoTs emerge, and when such long CoTs improve performance, remain unclear. In this paper, we study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. Our analysis provides a theoretical explanation for several intriguing observations: First, at any fixed test error, increasing test-time compute allows us to reduce the number of in-context examples (context length) in training prompts. Second, if the skills required to solve a downstream task are not sufficiently present in the training data, increasing test-time compute can harm performance. Finally, we characterize task hardness via the smallest eigenvalue of its feature covariance matrix and show that training on a diverse, relevant, and hard set of tasks results in the best performance for test-time scaling. We confirm our findings with experiments on large, nonlinear transformer architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1497. Uncertainty-Aware Diagnostics for Physics-Informed Machine Learning
๐ฌ 8
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ physics informed gaussian process model selection uncertainty quantification
Physics-informed machine learning (PIML) integrates prior physical information, often in the form of differential equation constraints, into the process of fitting ML models to physical data. Popular PIML approaches, including neural operators, physics-informed neural networks, and neural ordinary differential equations, are typically fit to objectives that simultaneously include both data and physical constraints. However, the multi-objective nature of this approach creates ambiguity in the measurement of model quality. This is related to a poor understanding of epistemic uncertainty, and it can lead to surprising failure modes, even when existing metrics suggest strong fits. Working within a Gaussian process regression framework, we introduce the Physics-Informed Log Evidence (PILE) score. Bypassing the ambiguities of test losses, the PILE score is a single, uncertainty-aware metric that provides a selection principle for hyperparameters of a physics-informed model. We show that PILE minimization yields excellent choices for a wide variety of model parameters, including kernel bandwidth, least squares regularization weights, and even kernel function selection. We also show that, prior to data acquisition, a special data-free case of the PILE score identifies a-priori kernel choices that are "well adapted" to a given PDE. Beyond the kernel setting, we anticipate that the PILE score can be extended to PIML at large, and we outline approaches to do so.
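As a point of reference, the sketch below shows plain evidence-based hyperparameter selection for a GP fit, choosing the kernel bandwidth that maximizes the log marginal likelihood. The PILE score builds on this idea but additionally folds the physics (PDE) constraint into the evidence, which is not reproduced here; the data and kernel choices below are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(40)

# Score candidate lengthscales by the log marginal likelihood (model evidence).
for ls in [0.01, 0.1, 0.5, 2.0]:
    gp = GaussianProcessRegressor(RBF(ls) + WhiteKernel(0.01), optimizer=None).fit(X, y)
    print(f"lengthscale={ls:<4}  log evidence={gp.log_marginal_likelihood_value_:.2f}")
```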
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1498. Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape
๐ฌ 9
โญ 6.00
๐ learning theory
๐ท๏ธ Saddle-to-Saddle Implicit bias Low-rank bias Bottleneck rank
When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a \textit{low-rank bias} in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least $\ell^{\frac{1}{4}}$ larger than any other singular value. We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles with increasing bottleneck rank.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1499. Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning
๐ฌ 18
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Goal-conditioned Reinforcement Learning Quasimetric RL Eikonal Partial Differential Equation
Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.
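A hedged sketch of how an Eikonal PDE constraint can be imposed on a goal-conditioned distance network with autograd, using only sampled states and goals (trajectory-free, as described in the abstract). The network, sampling, and unit-speed field are illustrative assumptions; the actual Eik-QRL objective may differ:

```python
import torch
import torch.nn as nn

class DistanceNet(nn.Module):
    """d_theta(s, g): learned quasimetric-style distance from state s to goal g."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Softplus(),   # distances are non-negative
        )

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1)).squeeze(-1)

def eikonal_loss(model, s, g):
    """Penalize deviation of ||grad_s d(s, g)|| from 1 plus the boundary condition d(g, g) = 0."""
    s = s.clone().requires_grad_(True)
    d = model(s, g)
    grad_s = torch.autograd.grad(d.sum(), s, create_graph=True)[0]
    pde_residual = ((grad_s.norm(dim=-1) - 1.0) ** 2).mean()
    boundary = (model(g, g) ** 2).mean()
    return pde_residual + boundary

# Usage on randomly sampled states and goals (no trajectories needed).
state_dim = 4
model = DistanceNet(state_dim)
s, g = torch.randn(64, state_dim), torch.randn(64, state_dim)
loss = eikonal_loss(model, s, g)
loss.backward()
```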
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1500. DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
๐ฌ 17
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark Agent Tool Call
Despite demonstrating extraordinary capabilities in code generation and software issue resolution, AI agents' capabilities across the full software DevOps cycle remain unknown.
Unlike pure code generation, handling the DevOps cycle in real-world software, which spans development, deployment, and management, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions.
However, existing benchmarks focus on isolated problems and lack environments and tool interfaces for DevOps.
We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation.
DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go.
We develop a semi-automated data collection mechanism with rigorous, non-trivial expert effort to ensure task coverage and quality.
Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and remain unable to handle new tasks such as monitoring and build and configuration.
These results highlight the need for further research on automating the full DevOps cycle with AI agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1501. Pallatom-Ligand: an All-Atom Diffusion Model for Designing Ligand-Binding Proteins
๐ฌ 16
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Diffusion Protein Design Ligand Binding
Small-molecule ligands extend protein functionality beyond natural amino acids, enabling sophisticated processes like catalysis, signal transduction, and light harvesting. However, designing proteins with high affinity and selectivity for arbitrary ligands remains a major challenge. We present Pallatom-Ligand, a diffusion model that performs end-to-end generation of ligand-binding proteins at atomic resolution. By directly learning the joint distribution of all atoms in the protein-ligand complexes, Pallatom-Ligand delivers state-of-the-art performance, achieving the highest *in silico* success rates in a comprehensive benchmark. In addition, Pallatom-Ligand's novel conditioning framework enables programmable control over global protein fold and atomic-level ligand solvent accessibility. With these capabilities, Pallatom-Ligand opens new opportunities for exploring the protein function space, advancing both generative modeling and computational protein engineering.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1502. Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
๐ฌ 11
โญ 6.00
๐ generative models
๐ท๏ธ Information bottleneck Generalisation Large Language models Latent space reasoning Representation learning Memory consolidation KV-cache compression Predictive encoding Reasoning Information theory
Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space "thinking" (i.e., chains of thought). A growing line of work pushes this extra computation into the model's latent space (adjacent to standard decoding), which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent or special-token rollouts, (ii) residual/activation steering, and (iii) memory compression via cache pruning, merging, or summarization. An underexplored alternative is memory consolidation and reconsolidation, two processes in the brain that are responsible for stabilising newly formed memory traces, and, upon recall, transiently rendering established traces plastic such that they can integrate new contextual information before restabilising. In a Transformer LLM, this can be seen as analogous to performing in-place global rewrites of incoming KV segments, and rewrites of past segments conditioned on newly observed tokens. In this work, we give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning. We do this through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input information compression and retention of predictive information in latent representations. We prove using IB theory that vanilla decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then introduce the Bottlenecked Transformer, which augments a decoder-only backbone LLM with a lightweight Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at newline-delimited reasoning step boundaries. The processor consolidates recently written KV entries and reconsolidates a small, top-$k$ attention-selected set of prior entries, conditioned on recent context. We evaluate our Bottlenecked Transformer architecture on seven mathematical reasoning benchmarks, with four backbone LLMs. Our model sees consistent performance gains over vanilla Transformers and pause-token augmented Transformer baselines, with gains of up to +6.6pp for selected tasks and backbones.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1503. Code World Models for General Game Playing
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models code world models code generation information set MCTS planning partial observability two-player games imperfect information games
The reasoning abilities of Large Language Models (LLMs) are increasingly being applied to classical board and card games, but the dominant approach---involving prompting for direct move generation---has significant drawbacks. It relies on the model's implicit, fragile pattern-matching capabilities, leading to frequent illegal moves and strategically shallow play. Here we introduce an alternative approach: We use the LLM to translate natural language rules and game trajectories into a formal, executable world model represented as Python code. This generated model---comprising functions for state transition, legal move enumeration, and termination checks---serves as a verifiable simulation engine for high-performance planning algorithms like Monte Carlo tree search (MCTS). In addition, we prompt the LLM to generate heuristic value functions (to make MCTS more efficient), and inference functions (to estimate hidden states in imperfect information games). Our method offers three distinct advantages compared to directly using the LLM as a policy: (1) Verifiability: The generated CWM serves as a formal specification of the game's rules, allowing planners to algorithmically enumerate valid actions and avoid illegal moves, contingent on the correctness of the synthesized model; (2) Strategic Depth: We combine LLM semantic understanding with the deep search power of classical planners; and (3) Generalization: We direct the LLM to focus on the meta-task of data-to-code translation, enabling it to adapt to new games more easily. We evaluate our agent on 10 different games, of which 4 are novel and created for this paper. 5 of the games are fully observed (perfect information), and 5 are partially observed (imperfect information). We find that our method outperforms or matches Gemini 2.5 Pro in 9 out of the 10 considered games.
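A hedged sketch of the kind of executable world-model interface the LLM is asked to synthesize (state transition, legal-move enumeration, termination checks), with a toy Nim-like game standing in for a generated CWM and a random rollout standing in for MCTS. Class and method names are assumptions, not the paper's API:

```python
import random

class NimCWM:
    """Toy code world model: a pile of stones, players alternately take 1 or 2, taker of the last stone wins."""

    def initial_state(self):
        return (7, 0)                      # (stones_left, player_to_move)

    def legal_moves(self, state):
        stones, _ = state
        return [m for m in (1, 2) if m <= stones]

    def next_state(self, state, move):
        stones, player = state
        return (stones - move, 1 - player)

    def is_terminal(self, state):
        return state[0] == 0

    def winner(self, state):
        return 1 - state[1]                # the player who just moved took the last stone

def random_rollout(cwm, state):
    """Simplest possible planner stub; MCTS would replace this with guided tree search."""
    while not cwm.is_terminal(state):
        state = cwm.next_state(state, random.choice(cwm.legal_moves(state)))
    return cwm.winner(state)

cwm = NimCWM()
wins = sum(random_rollout(cwm, cwm.initial_state()) == 0 for _ in range(1000))
print(f"player 0 wins {wins}/1000 random playouts")
```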
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1504. Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test-time verification Coverage Approximate verifier ROC
While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator's *coverage*, (ii) the verifier's *region of convergence* (ROC), and (iii) the sampling algorithm's *sub-optimality*. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality-coverage curve exhibits three regimes: a *transport regime*, where sub-optimality increases with coverage; a *policy improvement regime*, where sub-optimality may decrease with coverage, depending on the verifier's ROC; and a *saturation regime*, where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms, *sequential* and *batched*, and examine how their computational complexities shape these trade-offs. Empirical results with \texttt{Qwen}, \texttt{Llama}, and \texttt{Gemma} models corroborate our theoretical findings.
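A hedged Monte Carlo sketch of how an imperfect verifier, reduced to a single true/false-positive-rate operating point, interacts with generator coverage under a best-of-$n$-style selection rule. The numbers and the selection rule are illustrative assumptions and do not reproduce the paper's transport analysis:

```python
import numpy as np

def bon_accuracy(coverage, tpr, fpr, n, trials=20000, rng=None):
    """Probability that best-of-n with a binary verifier returns a correct sample.

    coverage: per-sample probability that the generator produces a correct answer.
    tpr/fpr:  verifier true/false positive rates.
    Rule: return a verifier-accepted sample if one exists, otherwise a random sample.
    """
    rng = rng or np.random.default_rng(0)
    correct = rng.random((trials, n)) < coverage
    accept = np.where(correct, rng.random((trials, n)) < tpr, rng.random((trials, n)) < fpr)
    any_accept = accept.any(axis=1)
    first_accepted_correct = correct[np.arange(trials), np.argmax(accept, axis=1)]
    fallback_correct = correct[:, 0]                 # random fallback when nothing is accepted
    picked = np.where(any_accept, first_accepted_correct, fallback_correct)
    return picked.mean()

for n in (1, 4, 16, 64):
    print(n, round(bon_accuracy(coverage=0.2, tpr=0.8, fpr=0.2, n=n), 3))
```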
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1505. Hierarchical Multi-Stage Recovery Framework for Kronecker Compressed Sensing
๐ฌ 4
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Compressed sensing;Kronecker product;Restricted isometry property;Hierarchical sparsity;Tensor operation
In this paper, we study the Kronecker compressed sensing problem, which focuses on recovering sparse vectors from linear measurements obtained via the Kronecker product of two or more matrices. We first introduce the hierarchical view of Kronecker compressed sensing, showing that the Kronecker product measurement matrix probes the sparse vector at different levels, following a block-wise and hierarchical structure. Leveraging this insight, we develop a versatile multi-stage sparse recovery algorithmic framework and tailor it to three different sparsity models: standard, hierarchical, and Kronecker-supported. We further analyze the restricted isometry property of Kronecker product matrices under different sparsity models, and provide theoretical recovery guarantees for our multi-stage algorithm. Simulations demonstrate that our method achieves comparable recovery performance to other state-of-the-art techniques while substantially reducing run time owing to the hierarchical, multi-stage recovery process.
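A small numpy check of the basic identity behind Kronecker-structured measurements: applying $(A \otimes B)$ to a vectorized signal is a two-sided matrix product, which is what allows the measurement operator to be applied level by level without ever forming the large matrix. Dimensions are arbitrary and sparsity is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 10))      # outer-level measurement matrix
B = rng.normal(size=(8, 12))      # inner-level measurement matrix
X = rng.normal(size=(12, 10))     # signal reshaped as a matrix (sparsity omitted in this toy)

# (A kron B) vec(X) == vec(B X A^T), with column-major (Fortran-order) vectorization.
lhs = np.kron(A, B) @ X.flatten(order="F")
rhs = (B @ X @ A.T).flatten(order="F")
print(np.allclose(lhs, rhs))      # True: measurements factor level by level
```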
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1506. Fine-Grained Iterative Adversarial Attacks with Limited Computation Budget
๐ฌ 22
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial Attack Efficiency Robustness
This work tackles a critical challenge in AI safety research under limited compute: given a fixed computation budget, how can one maximize the strength of iterative adversarial attacks? Coarsely reducing the number of attack iterations lowers cost but substantially weakens effectiveness. To maximize attack efficacy within a constrained budget, we propose a fine-grained control mechanism that selectively recomputes layer activations at both the iteration and layer levels. Extensive experiments show that our method consistently outperforms existing baselines at equal cost. Moreover, when integrated into adversarial training, it attains comparable performance with only 30\% of the original budget.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1507. Query-Aware Flow Diffusion for Graph-Based RAG with Retrieval Guarantees
๐ฌ 19
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph-Based RAG Training-Free Retrieval Multi-Hop Reasoning Query-Aware Graph Traversal Subgraph Recovery Guarantees
Graph-based Retrieval-Augmented Generation (RAG) systems leverage interconnected knowledge structures to capture complex relationships that flat retrieval struggles with, enabling multi-hop reasoning. Yet most existing graph-based methods suffer from (i) heuristic designs lacking theoretical guarantees for subgraph quality or relevance and/or (ii) the use of static exploration strategies that ignore the query's holistic meaning, retrieving neighborhoods or communities regardless of intent. We propose *Query-Aware Flow Diffusion RAG* (QAFD-RAG), a training-free framework that dynamically adapts graph traversal to each query's holistic semantics. The central innovation is *query-aware traversal*: during graph exploration, edges are dynamically weighted by how well their endpoints align with the query's embedding, guiding flow along semantically relevant paths while avoiding structurally connected but irrelevant regions. These query-specific reasoning subgraphs enable the first statistical guarantees for query-aware graph retrieval, showing that QAFD-RAG recovers relevant subgraphs with high probability under mild signal-to-noise conditions. The algorithm converges exponentially fast, with complexity scaling with the retrieved subgraph size rather than the full graph. Experiments on question answering and text-to-SQL tasks demonstrate consistent improvements over state-of-the-art graph-based RAG methods.
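A hedged sketch of the query-aware traversal idea: edges are reweighted by how well their endpoints align with the query embedding, and mass is then diffused from seed nodes over the reweighted graph. The damped propagation below is a simple stand-in; the paper's flow-diffusion algorithm and its recovery guarantees are more involved:

```python
import numpy as np

def query_aware_diffusion(adj, node_emb, query_emb, seeds, alpha=0.15, iters=50):
    """Personalized diffusion over a query-reweighted graph (illustrative stand-in for flow diffusion).

    adj:      (n, n) binary adjacency matrix.
    node_emb: (n, d) node embeddings; query_emb: (d,) query embedding.
    seeds:    indices of retrieval seed nodes.
    Returns a relevance score per node; high-scoring nodes form the retrieved subgraph.
    """
    sim = node_emb @ query_emb
    sim = (sim - sim.min()) / (sim.ptp() + 1e-9)              # align scores to [0, 1]
    W = adj * np.minimum(sim[:, None], sim[None, :])          # edge weight: weaker endpoint's query alignment
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-9)    # row-normalize to a transition matrix
    p = np.zeros(len(adj)); p[seeds] = 1.0 / len(seeds)
    x = p.copy()
    for _ in range(iters):                                    # damped propagation from the seeds
        x = (1 - alpha) * (W.T @ x) + alpha * p
    return x

rng = np.random.default_rng(0)
n, d = 12, 16
adj = (rng.random((n, n)) < 0.3).astype(float); np.fill_diagonal(adj, 0)
scores = query_aware_diffusion(adj, rng.normal(size=(n, d)), rng.normal(size=d), seeds=[0])
print(np.argsort(-scores)[:5])                                # candidate subgraph nodes
```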
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1508. Free Energy Mixer
๐ฌ 25
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Sequence Modeling Attention Transformer
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
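A hedged numpy sketch of the free-energy (log-sum-exp) read: treat the attention weights as a prior over positions and tilt it per channel by the values, so the read moves from the usual convex average toward per-channel selection as the inverse temperature grows. The gating and exact parameterization of FEM are not shown:

```python
import numpy as np
from scipy.special import logsumexp

def free_energy_read(prior_logits, values, beta):
    """Per-channel free-energy read.

    prior_logits: (T,)  attention scores (e.g. q.k / sqrt(d)) over T positions.
    values:       (T, C) value vectors.
    beta:         inverse temperature; beta -> 0 recovers the softmax-weighted average,
                  while large beta approaches a per-channel max over the prior's support.
    """
    log_prior = prior_logits - logsumexp(prior_logits)            # log of the softmax prior
    # out_c = (1/beta) * log sum_i exp(log_prior_i + beta * v_{i,c})
    return logsumexp(log_prior[:, None] + beta * values, axis=0) / beta

rng = np.random.default_rng(0)
scores, V = rng.normal(size=5), rng.normal(size=(5, 4))
softmax = np.exp(scores - logsumexp(scores))
print(np.allclose(free_energy_read(scores, V, beta=1e-4), softmax @ V, atol=1e-3))  # ~ convex average
print(free_energy_read(scores, V, beta=50.0), V.max(axis=0))                        # ~ per-channel max
```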
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1509. FERD: Fairness-Enhanced Data-Free Adversarial Robustness Distillation
๐ฌ 19
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Data-Free Robustness Distillation; Robust Fairness
Data-Free Robustness Distillation (DFRD) aims to transfer robustness from the teacher to the student without accessing the training data. While existing methods focus on overall robustness, they overlook robust fairness issues, leading to severe disparities in robustness across categories. In this paper, we find two key problems: (1) a student model distilled with equal class proportions behaves significantly differently across distinct categories; and (2) the robustness of the student model is not stable across different attack targets. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework to adjust the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy to synthesize more samples for the less robust categories, thereby improving their robustness. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppresses the dominance of class-specific non-robust features, providing a more balanced representation across all categories. Then, FERD constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target class constraint to avoid biased attack directions, which distributes the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets show that FERD achieves state-of-the-art worst-class robustness under all adversarial attacks (e.g., the worst-class robustness under FGSM and AutoAttack is improved by 15.1% and 6.4% using MobileNetV2 on CIFAR-10), demonstrating superior performance in both robustness and fairness. Our code is available at: [https://anonymous.4open.science/r/FERD-2A48/](https://anonymous.4open.science/r/FERD-2A48/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1510. Measuring LLM Novelty As The Frontier Of Original And High-Quality Output
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ generation evaluation memorization novelty benchmark creativity
As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to model training data, but original outputs can be of low quality. In contrast, non-expert judges more reliably score quality but may favor memorized outputs, limiting the reliability of human preference as a metric. We introduce a new novelty metric for LLM generations that balances originality and quality---the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. Using this framework, we identify trends that affect the novelty of generations from three families of open-data models (OLMo, OLMo-2, and Pythia) on three creative tasks---story completion, poetry writing, and creative tool use. We find that model-generated text from some base LLMs is less novel than human-written text from the internet. However, increasing model scale (OLMo 1B to 7B to 32B) and post-training reliably improves novelty due to improvements in output quality. We also find that improving the base model at the same scale (e.g., OLMo 7B to OLMo-2 7B) leads to higher novelty due to higher originality. Finally, we observe that inference-time methods, such as prompting and providing novel in-context examples, have a much smaller effect on novelty, often increasing originality at the expense of quality. This highlights the need for further research into more effective elicitation strategies as we use models for creative applications.
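The metric is stated concretely, so a minimal sketch is easy to give: the harmonic mean of the fraction of unseen n-grams and a task quality score. The tiny n-gram index and the fixed quality value below are stand-ins for the training-corpus membership test and the task-specific scorer:

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty_score(generation_tokens, training_ngrams, quality, n=5):
    """Harmonic mean of originality (fraction of unseen n-grams) and a quality score in [0, 1]."""
    gen = ngrams(generation_tokens, n)
    originality = sum(g not in training_ngrams for g in gen) / max(len(gen), 1)
    if originality == 0 or quality == 0:
        return 0.0
    return 2 * originality * quality / (originality + quality)

# Toy usage: in practice the n-gram index covers the model's full training corpus.
train_index = ngrams("the cat sat on the mat and looked at the dog".split(), 5)
gen = "the cat sat on the mat while writing a sonnet about dogs".split()
print(novelty_score(gen, train_index, quality=0.8))
```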
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1511. On the identifiability of causal graphs with multiple environments
๐ฌ 16
โญ 6.00
๐ causal reasoning
๐ท๏ธ causal discovery; heterogeneous data; multiple environment; nonlinear independent component analysis
Causal discovery from i.i.d. observational data is known to be generally ill-posed. We demonstrate that if we have access to the distribution induced by a structural causal model, and additional data from (in the best case) *only two* environments that sufficiently differ in the noise statistics, the unique causal graph is identifiable. Notably, this is the first result in the literature that guarantees recovery of the entire causal graph with a constant number of environments and arbitrary nonlinear mechanisms. Our only constraint is the Gaussianity of the noise terms; however, we propose potential ways to relax this requirement. Of independent interest, we expand on the well-known duality between independent component analysis (ICA) and causal discovery; recent advances have shown that nonlinear ICA can be solved from multiple environments, provided there are at least as many environments as sources. We show that the same can be achieved for causal discovery while having access to much less auxiliary information.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1512. Chessformer: A Unified Architecture for Chess Modeling
๐ฌ 17
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Transformers Interpretability Human-Aligned AI Chess Action Prediction
Chess has played a uniquely important historical role as a testbed domain for artificial intelligence. Applying new architectures to improve absolute chess performance, and more recently to predict human moves at specified skill levels, has therefore garnered attention in the machine learning literature. Current approaches to these problems employ transformer models with widely varying architectural designs, and use unintuitive tokenization schemes that are not amenable to interpretability techniques, which hinders their applicability for teaching and human-AI interaction. We introduce Chessformer, a novel chess transformer model design that consists of an encoder-only model which processes chessboard squares as input tokens, instead of moves or the entire position, a dynamic positional encoding scheme that allows the model to flexibly adapt to the unique geometries present in chess, and an attention-based policy output design. We show that Chessformer advances the state of the art in all three major chess modeling goals: it significantly improves the chess-playing performance of a state-of-the-art chess engine, it surpasses the previous best human move-matching prediction performance with a much smaller model, and it enables substantial interpretability benefits. Our unified approach constitutes a broad advance across several important tasks in chess AI, and also demonstrates the benefits of carefully adapting transformers' tokenization systems, output systems, and positional encodings to reflect the structure of a domain of interest.
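A hedged sketch of the square-as-token input described above: each of the 64 squares becomes one token whose id encodes the occupying piece, with the square index feeding the positional encoding. The vocabulary, ordering, and any extra state tokens (side to move, castling rights) are assumptions:

```python
import numpy as np

PIECE_VOCAB = {p: i for i, p in enumerate(".PNBRQKpnbrqk")}   # empty square + white/black pieces

def board_to_tokens(board_rows):
    """Map an 8x8 board (list of 8 strings, rank 8 first) to 64 square tokens plus position indices."""
    squares = [PIECE_VOCAB[c] for row in board_rows for c in row]
    positions = np.arange(64)                                  # square index fed to the positional encoding
    return np.array(squares), positions

start = [
    "rnbqkbnr",
    "pppppppp",
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",
    "RNBQKBNR",
]
tokens, pos = board_to_tokens(start)
print(tokens.shape, tokens[:8])   # (64,) -- one token per square, here the black back rank
```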
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1513. Composite Optimization with Error Feedback: the Dual Averaging Approach
๐ฌ 13
โญ 6.00
๐ optimization
๐ท๏ธ Composite Optimization Distributed Optimization Communication Compression Error Feedback Dual Averaging
Communication efficiency is a central challenge in distributed machine learning training, and message compression is a widely used solution. However, standard Error Feedback (EF) methods (Seide et al., 2014), though effective for smooth unconstrained optimization with compression (Karimireddy et al., 2019), fail in the broader and practically important setting of composite optimization, which captures, e.g., objectives consisting of a smooth loss combined with a non-smooth regularizer or constraints. The theoretical foundation and behavior of EF in the general composite setting remain largely unexplored. In this work, we consider composite optimization with EF. We point out that the basic EF mechanism and its analysis no longer stand when a composite part is involved. We argue that this is because of a fundamental limitation in the method and its analysis technique. We propose a novel method that combines _Dual Averaging_ with EControl (Gao et al., 2024), a state-of-the-art variant of the EF mechanism, and achieves for the first time a strong convergence analysis for composite optimization with error feedback. Along with our new algorithm, we provide a novel analysis template for inexact dual averaging methods, which may be of independent interest. We also provide experimental results to complement our theoretical findings.
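For readers unfamiliar with the mechanism being extended, here is a hedged single-worker sketch of classic error feedback with a top-$k$ compressor on a smooth unconstrained problem; the paper's contribution, combining dual averaging with EControl to handle the composite term, is not reproduced:

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd(grad_fn, x0, lr=0.1, k=2, steps=200):
    """Classic error feedback (in the style of Seide et al., 2014) for smooth unconstrained problems."""
    x, e = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = grad_fn(x)
        compressed = top_k(lr * g + e, k)   # transmit only a sparse message
        e = lr * g + e - compressed         # remember what the compressor dropped
        x = x - compressed
    return x

# Toy smooth objective f(x) = 0.5 * ||x - b||^2, with gradient x - b.
b = np.array([1.0, -2.0, 3.0, 0.5, -1.5])
print(ef_sgd(lambda x: x - b, np.zeros(5), k=2))   # converges to b despite heavy compression
```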
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1514. A Bayesian Nonparametric Framework For Learning Disentangled Representations
๐ฌ 12
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ representation learning disentangled representations unsupervised learning nonparametric methods
Disentangled representation learning aims to identify and organize the underlying sources of variation in observed data. However, learning disentangled representations without any additional supervision necessitates inductive biases to solve the fundamental identifiability problem of uniquely recovering the true latent structure and parameters of the data-generating process from observational data alone. Existing methods address this by imposing heuristic inductive biases that typically lack theoretical identifiability guarantees. They also rely on strong regularization to impose these inductive biases, creating an inherent trade-off in which stronger regularization improves disentanglement but limits the latent capacity to represent underlying variations. To address both challenges, we propose a principled generative model with a Bayesian nonparametric hierarchical mixture prior that embeds inductive biases within a provably identifiable framework for unsupervised disentanglement. Specifically, the hierarchical mixture prior imposes the structural constraints necessary for identifiability guarantees, while the nonparametric formulation enables inference of sufficient latent capacity to represent the underlying variations without violating these constraints. To enable tractable inference under this nonparametric hierarchical prior, we develop a structured variational inference framework with a nested variational family that both preserves the hierarchical structure of the identifiable generative model and approximates the expressiveness of the nonparametric prior. We evaluate our proposed probabilistic model on standard disentanglement benchmarks, 3DShapes and MPI3D datasets characterized by diverse source variation distributions, to demonstrate that our method consistently outperforms strong baseline models through structural biases and a unified objective function, obviating the need for auxiliary regularization constraints or careful hyperparameter tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1515. Orbital Transformers for Predicting Wavefunctions in Time-Dependent Density Functional Theory
๐ฌ 16
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Machine learning density functional theory Time dependent neural PDE solver
We aim to learn wavefunctions simulated by time-dependent density functional theory (TDDFT), which can be efficiently represented as linear combination coefficients of atomic orbitals. In real-time TDDFT, the electronic wavefunctions of a molecule evolve over time in response to an external excitation, enabling first-principles predictions of physical properties such as optical absorption, electron dynamics, and high-order response. However, conventional real-time TDDFT relies on time-consuming propagation of all occupied states with fine time steps. In this work, we propose OrbEvo, which is based on an equivariant graph transformer architecture and learns to evolve the full electronic wavefunction coefficients across time steps. First, to account for the external field, we design an equivariant conditioning scheme that encodes both the strength and direction of the external electric field and breaks the symmetry from SO(3) to SO(2). Furthermore, we design two OrbEvo models, OrbEvo-WF and OrbEvo-DM, using wavefunction pooling and the density matrix as the interaction mechanism, respectively. Motivated by the central role of the density functional in TDDFT, OrbEvo-DM encodes the density matrix aggregated from all occupied electronic states into feature vectors via tensor contraction, providing a more intuitive approach to learning the time evolution operator. We adopt a training strategy specifically tailored to limit the error accumulation of time-dependent wavefunctions over autoregressive rollout. To evaluate our approach, we generate TDDFT datasets consisting of 5,000 different molecules from the QM9 dataset and 1,500 molecular configurations of the malonaldehyde molecule from the MD17 dataset. Results show that our OrbEvo model accurately captures the quantum dynamics of excited states under an external field, including time-dependent wavefunctions, the time-dependent dipole moment, and optical absorption spectra characterized by dipole oscillator strength. It also shows strong generalization capability on the diverse molecules in the QM9 dataset.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1516. Bi-Criteria Metric Distortion
๐ฌ 13
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Social Choice Theory Metric Distortion
Selecting representatives based on voters' preferences is a fundamental problem in social choice theory. While cardinal utility functions offer a detailed representation of preferences, voters often cannot precisely quantify their affinity towards a given candidate. As a result, modern voting systems rely on ordinal rankings to simplistically represent preference profiles. In quantifying the suboptimality of solutions due to the loss of information when using ordinal preferences, the metric distortion framework models voters and candidates as points in a metric space, with distortion bounding the efficiency loss. Prior works within this framework use the distance between a voter and a candidate in the underlying metric as the cost of selecting the candidate for the given voter, with a goal of minimizing the sum (utilitarian) or maximum (egalitarian) of costs across voters. For deterministic election mechanisms selecting a single winning candidate, the best possible distortion is known to be 3 for any metric, as established by Gkatzelis, Halpern, and Shah (FOCS'20). In contrast, for randomized mechanisms, distortions cannot be lower than $2.112$, as shown by Charikar and Ramakrishnan (SODA'22), and there exists a mechanism with a distortion guarantee of $2.753$, according to Charikar, Ramakrishnan, Wang, and Wu (SODA'24 Best Paper Award). Our work asks: can one obtain a better approximation compared to an optimal candidate by selecting a committee of $k$ candidates ($k \ge 1$), where the cost of a voter is defined to be its distance to the closest candidate in the committee? We affirmatively answer this question by introducing the concept of bi-criteria approximation within the metric distortion framework. In the line metric, it is possible to achieve optimal cost with only $O(1)$ candidates. In contrast, we also prove that in both the two-dimensional and tree metrics -- which naturally generalize the line metric -- achieving optimal cost is impossible unless all candidates are selected. These results apply to both utilitarian and egalitarian objectives. Our results establish a stark separation between the line metric and the 2D or tree metric in the context of the metric distortion problem.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1517. Hybrid Reinforcement: when reward is sparse, better to be dense
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Hybrid rewards for reinforcement learning
Post-training for reasoning in large language models has increasingly relied on verifiable rewards: deterministic checkers that provide $0$-$1$ correctness signals. While reliable, such binary feedback is brittle---many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates sparse verifier signals with dense reward model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms reward model-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
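A hedged sketch of the hybrid reward idea: bound reward-model scores inside verifier-defined groups so dense scores refine ordering without ever flipping a verifier decision, and weight prompts by the spread of the resulting rewards. The exact normalization intervals and weighting rule used by HERO are assumptions here:

```python
import numpy as np

def hybrid_rewards(verifier, rm_scores, span=0.4):
    """Combine binary verifier labels with dense reward-model scores for one prompt's rollouts.

    verifier:  (n,) array of 0/1 correctness labels.
    rm_scores: (n,) raw reward-model scores.
    Correct rollouts land in [1 - span, 1], incorrect ones in [0, span]:
    dense scores refine ordering but never overturn a verifier decision.
    """
    rewards = np.zeros_like(rm_scores, dtype=float)
    for label, (lo, hi) in ((1, (1 - span, 1.0)), (0, (0.0, span))):
        mask = verifier == label
        if mask.any():
            s = rm_scores[mask]
            s = (s - s.min()) / (s.max() - s.min() + 1e-9)      # min-max normalize within the group
            rewards[mask] = lo + s * (hi - lo)
    return rewards

verifier = np.array([1, 1, 0, 0, 0])
rm = np.array([0.7, 0.9, 0.8, 0.2, 0.5])
r = hybrid_rewards(verifier, rm)
prompt_weight = r.std()           # variance-aware weighting: emphasize prompts with mixed outcomes
print(r, prompt_weight)
```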
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1518. A New Approach to Controlling Linear Dynamical Systems
๐ฌ 11
โญ 6.00
๐ learning theory
๐ท๏ธ Online Convex Optimization Online Control Linear Dynamical Systems
We propose a new method for controlling linear dynamical systems under adversarial disturbances and cost functions. Our algorithm achieves a running time that scales polylogarithmically with the inverse of the stability margin, improving upon prior methods with polynomial dependence while maintaining the same regret guarantees. The technique, which may be of independent interest, is based on a novel convex relaxation that approximates linear control policies using spectral filters constructed from the eigenvectors of a specific Hankel matrix.
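A hedged sketch of the spectral-filtering construction alluded to: fix a Hankel matrix, take its top eigenvectors, and use them as filters over the recent input/disturbance history so a controller can act linearly on the filtered features. The particular Hankel entries below, $H_{ij} = 2/((i+j)^3 - (i+j))$, follow earlier spectral filtering work and may not match the paper's exact choice:

```python
import numpy as np

def spectral_filters(history_len, num_filters):
    """Top eigenpairs of a fixed Hankel matrix; the eigenvectors serve as filters over the history."""
    i = np.arange(1, history_len + 1)
    H = 2.0 / ((i[:, None] + i[None, :]) ** 3 - (i[:, None] + i[None, :]))
    eigvals, eigvecs = np.linalg.eigh(H)                   # ascending order
    return eigvals[::-1][:num_filters], eigvecs[:, ::-1][:, :num_filters]

def filtered_features(inputs, filters):
    """Project the recent input history onto each spectral filter (one feature vector per filter)."""
    T = filters.shape[0]
    history = inputs[-T:][::-1]                            # most recent input first
    return filters.T @ history                             # shape (num_filters, input_dim)

sigma, phi = spectral_filters(history_len=64, num_filters=8)
u = np.random.default_rng(0).normal(size=(200, 3))         # disturbance/input sequence
print(filtered_features(u, phi).shape)                     # (8, 3); a controller acts linearly on these
```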
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1519. Convergent Differential Privacy Analysis for General Federated Learning
๐ฌ 21
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Differential privacy federated learning
The powerful cooperation of federated learning (FL) and differential privacy (DP) provides a promising paradigm for large-scale private clients. However, existing analyses in FL-DP mostly rely on the composition theorem and cannot tightly quantify the privacy leakage: the resulting bound is tight for a few communication rounds but eventually becomes arbitrarily loose and divergent. This also implies a counterintuitive judgment, suggesting that FL-DP may not provide adequate privacy support during long-term training under constant-level noisy perturbations, yielding a discrepancy between theoretical and experimental results. To further investigate the convergent privacy and reliability of the FL-DP framework, in this paper we comprehensively evaluate the worst-case privacy of two classical methods under non-convex and smooth objectives based on $f$-DP analysis. With the aid of the shifted interpolation technique, we prove that the privacy of Noisy-FedAvg has a tight convergent bound. Moreover, with the regularization of the proxy term, the privacy of Noisy-FedProx has a stable constant lower bound. Our analysis thus provides a solid theoretical foundation for the reliability of privacy in FL-DP. Meanwhile, our conclusions can be losslessly converted to other classical DP analytical frameworks, e.g.,
$(\epsilon,\delta)$-DP and Rényi-DP (RDP), to provide more fine-grained understandings of FL-DP frameworks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1520. A Tale of Two Smoothness Notions: Adaptive Optimizers and Non-Euclidean Descent
๐ฌ 11
โญ 6.00
๐ optimization
๐ท๏ธ adaptive optimizer steepest descent loss geometry convergence rate
Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction in their analyses, however, lies in the smoothness assumptions they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum, a guarantee unattainable under standard smoothness. We further develop an analogous comparison for stochastic optimization by introducing adaptive variance, which parallels adaptive smoothness and leads to qualitatively stronger guarantees than the standard variance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1521. A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control
๐ฌ 9
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Stochastic Optimal Control Schrödinger operator eigenfunction learning long-horizon control
High-dimensional stochastic optimal control (SOC) becomes harder with longer planning horizons: existing methods scale linearly in the horizon $T$, with performance often deteriorating exponentially. We overcome these limitations for a subclass of linearly-solvable SOC problems---those whose uncontrolled drift is the gradient of a potential. In this setting, the Hamilton-Jacobi-Bellman equation reduces to a linear PDE governed by an operator $\mathcal{L}$. We prove that, under the gradient drift assumption, $\mathcal{L}$ is unitarily equivalent to a Schrödinger operator $\mathcal{S} = -\Delta + \mathcal{V}$ with purely discrete spectrum, allowing the long-horizon control to be efficiently described via the eigensystem of $\mathcal{L}$. This connection provides two key results: first, for a symmetric linear-quadratic regulator (LQR), $\mathcal{S}$ matches the Hamiltonian of a quantum harmonic oscillator, whose closed-form eigensystem yields an analytic solution to the symmetric LQR with arbitrary terminal cost. Second, in a more general setting, we learn the eigensystem of $\mathcal{L}$ using neural networks. We identify implicit reweighting issues with existing eigenfunction learning losses that degrade performance in control tasks, and propose a novel loss function to mitigate this. We evaluate our method on several long-horizon benchmarks, achieving an order-of-magnitude improvement in control accuracy compared to state-of-the-art methods, while reducing memory usage and runtime complexity from $\mathcal{O}(Td)$ to $\mathcal{O}(d)$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1522. Sampling Complexity of TD and PPO in RKHS
๐ฌ 18
โญ 6.00
๐ learning theory
๐ท๏ธ Kernel method Kernel gradient descent PPO Temporal difference
We revisit Proximal Policy Optimization (PPO) from a function-space perspective.
Our analysis decouples policy evaluation and improvement in a reproducing kernel Hilbert space (RKHS):
(i) a kernelized temporal-difference (TD) critic performs efficient RKHS-gradient updates using only one-step state-action transition samples.
(ii) a KL-regularized, natural-gradient policy step exponentiates the evaluated action-value, recovering a PPO/TRPO-style proximal update in continuous state-action spaces.
We provide non-asymptotic, instance-adaptive guarantees whose rates depend on RKHS entropy, unifying tabular, linear, Sobolev, Gaussian, and Neural Tangent Kernel (NTK) regimes, and we derive a sampling rule for the proximal update that ensures the optimal $k^{-1/2}$ convergence rate for stochastic optimization.
Empirically, the theory-aligned schedule improves stability and sample efficiency on common control tasks (e.g., CartPole, Acrobot), while our TD-based critic attains favorable throughput versus a GAE baseline.
Altogether, our results place PPO on a firmer theoretical footing beyond finite-dimensional assumptions and clarify when RKHS-proximal updates with kernel-TD critics yield global policy improvement with practical efficiency.
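A hedged sketch of the proximal policy step in (ii), shown for a discrete action set: the new policy exponentiates the evaluated action-value on top of the old policy, with the step size controlling how far the update moves from the prior policy. The paper works in continuous state-action spaces with an RKHS critic, which this toy omits:

```python
import numpy as np

def proximal_policy_step(old_probs, q_values, eta):
    """KL-regularized update: pi_new(a|s) proportional to pi_old(a|s) * exp(eta * Q(s, a))."""
    logits = np.log(old_probs + 1e-12) + eta * q_values
    logits -= logits.max()                     # numerical stability
    new_probs = np.exp(logits)
    return new_probs / new_probs.sum()

pi = np.array([0.25, 0.25, 0.25, 0.25])        # old policy at some state
q = np.array([1.0, 0.0, -0.5, 0.2])            # critic's action-value estimates
for eta in (0.1, 1.0, 10.0):                   # small eta: conservative step; large eta: near-greedy
    print(eta, np.round(proximal_policy_step(pi, q, eta), 3))
```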
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1523. A Unification of Discrete, Gaussian, and Simplicial Diffusion
๐ฌ 12
โญ 6.00
๐ generative models
๐ท๏ธ discrete diffusion simplicial diffusion gaussian diffusion generative models proteins dna
To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from a numerically unstable stochastic process. Ideally, we could see each of these models as instances of the same underlying framework, enabling practitioners to switch between models for downstream applications. However, previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.
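A hedged sketch of the neutral Wright-Fisher resampling step that the unification is built on: simplex-valued category frequencies evolve by multinomial resampling, with the population size controlling how noisy the dynamics are. The paper's actual forward and reverse diffusion parameterizations are not reproduced:

```python
import numpy as np

def wright_fisher_trajectory(p0, population_size, generations, rng):
    """Neutral Wright-Fisher process: category frequencies resampled multinomially each generation."""
    p = np.asarray(p0, dtype=float)
    traj = [p]
    for _ in range(generations):
        counts = rng.multinomial(population_size, p)
        p = counts / population_size          # stays on the probability simplex
        traj.append(p)
    return np.stack(traj)

rng = np.random.default_rng(0)
p0 = np.array([0.25, 0.25, 0.25, 0.25])       # e.g. a uniform distribution over DNA bases
small_pop = wright_fisher_trajectory(p0, population_size=20, generations=50, rng=rng)
large_pop = wright_fisher_trajectory(p0, population_size=20000, generations=50, rng=rng)
print(small_pop[-1], large_pop[-1])           # small populations drift toward the simplex corners much faster
```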
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1524. Best-of-three-worlds Analysis for Dueling Bandits with Borda Winner
๐ฌ 9
โญ 6.00
๐ learning theory
๐ท๏ธ dueling bandits; borda winner; best of three worlds; FTRL
The dueling bandits (DB) problem addresses online learning from relative preferences, where the learner queries pairs of arms and receives binary win-loss feedback. Most existing work focuses on designing algorithms for specific stochastic or adversarial environments. Recently, a unified algorithm has been proposed that achieves convergence across all settings. However, this approach relies on the existence of a Condorcet winner, which is often not guaranteed, particularly when the preference matrix changes in the adversarial setting. For the more general Borda winner objective, no unified framework currently exists that simultaneously achieves optimal regret across these environments.
In this paper, we explore how the follow-the-regularized-leader (FTRL) algorithm can be employed to achieve this objective. We propose a hybrid negative entropy regularizer and demonstrate that it enables us to achieve $\tilde{O}(K^{1/3} T^{2/3})$ regret in the adversarial setting, ${O}({K \log^2 T}/{\Delta_{\min}^2})$ regret in the stochastic setting, and $O({K \log^2 T }/{\Delta_{\min}^2} + ({C^2 K \log^2 T }/{\Delta_{\min}^2})^{1/3})$ regret in the corrupted setting, where $K$ is the arm set size, $T$ is the horizon, $\Delta_{\min}$ is the minimum gap between the optimal and sub-optimal arms, and $C$ is the corruption level. These results align with the state-of-the-art in individual settings, while eliminating the need to assume a specific environment type. We also present experimental results demonstrating the advantages of our algorithm over baseline methods across different environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1525. Talking Points: Describing and Localizing Pixels
๐ฌ 11
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Keypoint Description Keypoint Localization Pixel-Level Grounding Reinforcement Learning Vision-Language Model
Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context.
Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint.
For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy.
To evaluate our results, we establish a new evaluation protocol. Instead of comparing the text description produced by our method to a ground-truth description, we use the localizer to measure how close the point predicted from the generated description is to the ground-truth point.
Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart.
The bidirectional nature of our framework enables applications in both keypoint-guided image understanding and language-guided precise localization. Our dataset and code will be published upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1526. COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
๐ฌ 20
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Steerable Generation Large language models Representation Engineering Test-time Intervention Learning Dynamics
Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches either capture steering signals suboptimally from labeled examples or require hundreds to thousands of examples, optimized with procedures specific to each behavioral target. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95\% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer enables real-time adaptation to new steering objectives and facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1527. Reward Models Inherit Value Biases from Pretraining
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reward models value alignment finetuning preference learning large language models RLHF AI safety bias pretraining
Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.
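A hedged sketch of the implicit reward mentioned above, formed from the log-probability gap between an instruction-tuned model and its pre-trained base. The toy unigram scorer below is only a stand-in for a real LM scoring function, and the scale factor is an assumption:

```python
import math

def sequence_logprob(model, prompt, response):
    """Toy stand-in: model is a dict mapping token -> probability; real use would call an LM scorer."""
    return sum(math.log(model.get(tok, 1e-6)) for tok in response.split())

def implicit_reward(tuned_model, base_model, prompt, response, beta=1.0):
    """Implicit reward: log-probability gap between the instruction-tuned and the base model."""
    return beta * (sequence_logprob(tuned_model, prompt, response)
                   - sequence_logprob(base_model, prompt, response))

# Toy unigram "models" that merely illustrate the scoring; real models condition on the prompt.
base = {"help": 0.02, "others": 0.02, "win": 0.05, "alone": 0.05}
tuned = {"help": 0.08, "others": 0.08, "win": 0.03, "alone": 0.03}
print(implicit_reward(tuned, base, prompt="", response="help others"))   # > 0: communal phrasing scored higher
print(implicit_reward(tuned, base, prompt="", response="win alone"))     # < 0
```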
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1528. ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning
๐ฌ 10
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Planning Dataset and Benchmark Large Language Models
We introduce ACPBench Hard, a dataset of generative, open-ended questions that an LLM needs to answer in order to plan. Models that perform well on these tasks could in principle be integrated into a planner or be used directly as a policy. We discuss the complexity of these tasks as well as the complexity of validating the correctness of their answers and present validation algorithms for each task. Equipped with these validators, we test the performance of a variety of models on our tasks and find that for most of these tasks, the performance of even the largest models is still subpar. The models do not possess even the most basic capability of identifying which actions can be performed in a given state. No model outperforms any other on our proposed tasks and, with a few exceptions, all tested language models score below 65\%, indicating that even the current frontier language models as well as so-called reasoning models have a long way to go before they can reliably reason about planning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1529. Discovering Hierarchical Software Engineering Agents via Bandit Optimization
๐ฌ 18
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Multi-armed bandit Model selection Software engineering
Large language models (LLMs) are increasingly applied to software engineering (SWE), but they struggle on real-world tasks that are long-horizon and often out of distribution. Current systems typically adopt monolithic designs where a single model attempts to interpret ambiguous issues, navigate large codebases, and implement fixes in one extended reasoning chain. This design makes it difficult to generalize beyond training data. Inspired by how human engineers decompose problems into sub-tasks, we argue that SWE agents should be structured as orchestrators coordinating specialized sub-agents, each responsible for a specific sub-task such as bug reproduction, fault localization, code modification, or validation. The central challenge is how to design these hierarchies effectively. Manual decompositions follow human workflows but often mismatch LLM capabilities, while automated search methods such as evolutionary strategies require evaluating a very large number of candidates, making them prohibitively expensive for SWE. We show that formulating hierarchy discovery as a multi-armed bandit problem enables efficient exploration of sub-agent designs under limited budgets. On SWE-bench-Verified, this approach outperforms single-agent systems and manually designed multi-agent systems. On SWE-bench-Live, which features recent and out-of-distribution issues, our system ranks 2nd on the leaderboard with a 36B model, surpassing larger systems such as GPT-4 and Claude. This provides the first evidence that hierarchical multi-agent systems improve generalization on challenging long-horizon SWE tasks.
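A hedged sketch of the bandit framing: each candidate hierarchy is an arm, each evaluation on a sampled task is one costly pull, and an upper-confidence-bound rule concentrates the budget on promising designs. The candidate names, evaluator, and UCB variant are illustrative; the paper's algorithm and search space may differ:

```python
import math, random

def ucb_search(candidates, evaluate, budget, c=1.0):
    """Upper-confidence-bound selection over candidate agent hierarchies under a fixed evaluation budget."""
    counts = {h: 0 for h in candidates}
    values = {h: 0.0 for h in candidates}
    for t in range(1, budget + 1):
        untried = [h for h in candidates if counts[h] == 0]
        if untried:
            h = untried[0]
        else:
            h = max(candidates, key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))
        reward = evaluate(h)                     # e.g. resolved-or-not on one sampled SWE task
        counts[h] += 1
        values[h] += (reward - values[h]) / counts[h]
    return max(candidates, key=lambda a: values[a])

# Toy evaluator: each hierarchy has a hidden success rate; a pull is one Bernoulli task outcome.
random.seed(0)
true_rate = {"monolithic": 0.30, "locate+fix": 0.38, "reproduce+locate+fix+validate": 0.45}
best = ucb_search(list(true_rate), lambda h: float(random.random() < true_rate[h]), budget=300)
print("selected hierarchy:", best)
```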
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1530. Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
๐ฌ 19
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ AI Values value alignment ai risk dilemma
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can serve as an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities over a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1531. General search techniques without common knowledge for imperfect-information games, and application to superhuman Fog of War chess
๐ฌ 17
โญ 6.00
๐ reinforcement learning
๐ท๏ธ imperfect-informaton games subgame solving game theory
Since the advent of AI, games have served as progress benchmarks. Meanwhile, imperfect-information variants of chess have existed for over a century, present extreme challenges, and have been the focus of decades of AI research. Beyond calculation needed in regular chess, they require reasoning about information gathering, the opponent's knowledge, signaling, _etc_. The most popular variant, _Fog of War (FoW) chess_ (a.k.a. _dark chess_), has been a major challenge problem in imperfect-information game solving since superhuman performance was reached in no-limit Texas hold'em poker. We present _Obscuro_, the first superhuman AI for FoW chess. It introduces advances to search in imperfect-information games, enabling strong, scalable reasoning. Experiments against the prior state-of-the-art AI and human players---including the world's best---show that _Obscuro_ is significantly stronger. FoW chess is the largest (by amount of imperfect information) turn-based zero-sum game in which superhuman performance has been achieved and the largest game in which imperfect-information search has been successfully applied.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1532. Neyman-Pearson Classification under Both Null and Alternative Distributions Shift
๐ฌ 10
โญ 6.00
๐ learning theory
๐ท๏ธ Imbalanced classification Transfer Learning Neyman-Pearson Classification.
We consider the problem of transfer learning in Neyman-Pearson classification, where the objective is to minimize the error w.r.t. a distribution $\mu_1$, subject to the constraint that the error w.r.t. a distribution $\mu_0$ remains below a prescribed threshold. While transfer learning has been extensively studied in traditional classification, transfer learning in imbalanced classification such as Neyman-Pearson classification has received much less attention. This setting poses unique challenges, as both types of errors must be simultaneously controlled. Existing works address only the case of distribution shift in $\mu_1$, whereas in many practical scenarios shifts may occur in both $\mu_0$ and $\mu_1$. We derive an adaptive procedure that not only guarantees improved Type-I and Type-II errors when the source is informative, but also automatically adapts to situations where the source is uninformative, thereby avoiding negative transfer. In addition to such statistical guarantees, the procedure is efficient, as shown via complementary computational guarantees.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1533. Bilinear relational structure fixes reversal curse and enables consistent model editing
๐ฌ 12
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ model editing reversal curse language model relational knowledge knowledge editing
The reversal curse---a language model's (LM) inability to infer an unseen fact ``B is A'' from a learned fact ``A is B''---is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. By training LMs from scratch on a synthetic dataset of relational knowledge graphs, we demonstrate that bilinear relational structure emerges in their hidden representations. This structure is associated with alleviating the reversal curse, facilitating the inference of unseen reverse facts. Crucially, we also find that this bilinear structure plays a key role in consistent model editing. When a fact is updated in an LM with this structure, the edit correctly propagates to its reverse and other logically dependent facts. In contrast, models lacking this representation not only suffer from the reversal curse but also fail to generalize edits, further introducing logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn support LMs in behaving in a logically consistent manner after editing. This implies that the success of model editing may be tied not just to editing algorithms but to the underlying representational geometry of the knowledge being modified.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1534. Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
๐ฌ 8
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability reinforcement learning sokoban
We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call *path channels*. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction.
We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned *transition model*.
The RNN constructs plans by starting at the boxes and goals: these kernels *extend* activations in path channels forwards from boxes and backwards from the goal.
Negative values are placed in channels at obstacles. The extension kernels propagate this negative value in reverse, pruning the last few steps and letting an alternative plan emerge, a form of backtracking.
Our work shows that a precise understanding of the plan representation allows us to describe, in more familiar terms, the bidirectional planning-like algorithm learned by model-free training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1535. SigmaDock: Untwisting Molecular Docking with Fragment-Based SE(3) Diffusion
๐ฌ 12
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Molecular Docking Geometric Deep Learning Generative Models SE(3) Diffusion AI for Science
Determining the binding pose of a ligand to a protein, known as molecular docking, is a fundamental task in drug discovery. Generative approaches promise faster, improved, and more diverse pose sampling than physics-based methods, but are often hindered by chemically implausible outputs, poor generalisability, and high computational cost. To address these challenges, we introduce a novel fragmentation scheme, leveraging inductive biases from structural chemistry, to decompose ligands into rigid-body fragments.
Building on this decomposition, we present SigmaDock, an SE(3) Riemannian diffusion model that generates poses by learning to reassemble these rigid bodies within the binding pocket. By operating at the level of fragments in SE(3), SigmaDock exploits well-established geometric priors while avoiding overly complex diffusion processes and unstable training dynamics. Experimentally, we show SigmaDock achieves state-of-the-art performance, reaching Top-1 success rates (RMSD <2 & PB-valid) above 79.9% on the PoseBusters set, compared to 12.7-32.8% reported by recent deep learning approaches, whilst demonstrating consistent generalisation to unseen proteins. SigmaDock is the first deep learning approach to surpass classical physics-based docking under the PB train-test split, marking a significant leap forward in the reliability and feasibility of deep learning for molecular modelling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1536. Can we generate portable representations for clinical time series data using LLMs?
๐ฌ 29
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Machine Learning for Healthcare ICU Time-series LLMs Representation Learning
Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question -- can large language models (LLMs) create portable patient embeddings, i.e., representations of patients that enable a downstream predictor built at one hospital to be deployed elsewhere with minimal-to-no retraining or fine-tuning?
To do so, we map from irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed length vector capable of serving as input to a variety of downstream predictors.
Across three cohorts (MIMIC-IV, HIRID, PPICU), on multiple clinically grounded forecasting and classification tasks, we find that our approach is simple, easy to use, and surprisingly competitive in-distribution with grid imputation, self-supervised representation learning, and time series foundation models, while exhibiting smaller relative performance drops when transferring to new hospitals.
We study the variation in performance across prompt design, with structured prompts being crucial to reducing the variance of the predictive models without altering mean accuracy. We find that using these portable representations improves few-shot learning and does not increase demographic recoverability of age or sex relative to baselines, suggesting little additional privacy risk.
Our work points to the potential that LLMs hold as tools to enable the scalable deployment of production-grade predictive models by reducing the engineering overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1537. Eigen-1: Scientific Reasoning through Adaptive Multi-Agent Refinement and Monitor-based RAG
๐ฌ 14
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ LLM Agents Reasoning
Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden tool tax of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy, the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1538. Multiple Token Divergence: Measuring and Steering In-Context Computation Density
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language models in-context learning reasoning interpretability decoding
Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model's full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.
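As a rough illustration of the quantity described above (our reading of the abstract; the exact head used and the direction of the KL are assumptions), a per-sequence MTD could be computed as follows:

```python
# Hedged sketch of Multiple Token Divergence: KL between the model's full
# next-token distribution and that of a shallow auxiliary prediction head.
# Shapes and the KL direction here are illustrative assumptions.
import torch
import torch.nn.functional as F

def multiple_token_divergence(full_logits: torch.Tensor,
                              shallow_logits: torch.Tensor) -> torch.Tensor:
    """Average per-token KL(full || shallow); inputs are [seq_len, vocab_size]."""
    log_p = F.log_softmax(full_logits, dim=-1)     # full model distribution
    log_q = F.log_softmax(shallow_logits, dim=-1)  # shallow head distribution
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    return kl.mean()

# Toy usage with random logits standing in for real model outputs.
print(multiple_token_divergence(torch.randn(16, 32000), torch.randn(16, 32000)))
```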
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1539. Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI security Large Language Models Security Benchmark Red Teaming AI Safety
AI agents powered by large language models (LLMs) are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security.
The non-deterministic sequential nature of AI agents complicates security modeling, while the integration of traditional software with AI components entangles novel LLM vulnerabilities with conventional security risks.
Existing frameworks only partially address these challenges as they either capture specific vulnerabilities only or require modeling of complete agents.
To address these limitations, we introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where LLM vulnerabilities manifest, enabling the systematic identification and categorization of security risks that propagate from the LLM to the agent level.
We apply this framework to construct the $b^3$ benchmark, a security benchmark based on 194,331 unique crowdsourced adversarial attacks. We then evaluate 34 popular LLMs with it, revealing, among other insights, that enhanced reasoning capabilities improve security, while model size does not correlate with security.
We release our benchmark, dataset, and evaluation code to facilitate widespread adoption by LLM providers and practitioners, offering guidance for agent developers and incentivizing model developers to prioritize backbone security improvements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1540. Splat Regression Models
๐ฌ 4
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Wasserstein Fisher-Rao gradient flow gaussian splatting scientific machine learning
We introduce a highly expressive class of function approximators called *Splat Regression Models*. Model outputs are mixtures of heterogeneous and anisotropic bump functions, termed *splats*, each weighted by an output vector. The power of splat modeling lies in its ability to locally adjust the scale and direction of each splat, achieving both high interpretability and accuracy. Fitting splat models reduces to optimization over the space of mixing measures, which can be implemented using Wasserstein-Fisher-Rao gradient flows. As a byproduct, we recover the popular *Gaussian Splatting* methodology as a special case, providing a unified theoretical framework for this state-of-the-art technique that clearly disambiguates the inverse problem, the model, and the optimization algorithm. Through numerical experiments, we demonstrate that the resulting models and algorithms constitute a flexible and promising approach for solving diverse approximation, estimation, and inverse problems involving low-dimensional data.
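One plausible reading of the model class (notation ours, not the paper's): the output is a weighted mixture of anisotropic bump functions,

$$f(x)\;=\;\sum_{i=1}^{N} w_i\,\rho\!\big((x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\big),$$

where each splat has a location $\mu_i$, an anisotropic shape $\Sigma_i$, and an output weight $w_i$, and fitting amounts to optimizing over the mixing measure $\sum_i \delta_{(\mu_i,\Sigma_i,w_i)}$, which is where the Wasserstein-Fisher-Rao gradient flow enters.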
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1541. Quantum machine learning advantages beyond hardness of evaluation
๐ฌ 15
โญ 6.00
๐ learning theory
๐ท๏ธ Quantum machine learning Quantum-classical learning separations Learning theory
Recent years have seen rigorous proofs of quantum advantages in machine learning, particularly when data is labeled by cryptographic or inherently quantum functions. These results typically rely on the infeasibility of classical polynomial-sized circuits to evaluate the true labeling function. While broad in scope, these results reveal little about advantages stemming from the actual learning process itself. This motivates the study of the so-called identification task, where the goal is to ``just'' identify the labeling function behind a dataset, making the learning step the only possible source of advantage. The identification task also has natural applications, which we discuss. Yet, such identification advantages remain poorly understood. So far they have only been proven in cryptographic settings by leveraging random-generatability, the ability to efficiently generate labeled data. However, for quantum functions this property is conjectured not to hold, leaving identification advantages unexplored. In this work, we provide the first proofs of identification learning advantages for quantum functions under complexity-theoretic assumptions. Our main result relies on a new proof strategy, allowing us to show that for a broad class of quantum identification tasks there exists an exponential quantum advantage unless BQP is in a low level of the polynomial hierarchy. Along the way we prove a number of more technical results including the aforementioned conjecture that quantum functions are not random generatable (subject to plausible complexity-theoretic assumptions), which shows a new proof strategy was necessary. These findings suggest that for many quantum-related learning tasks, the entire learning process, not just final evaluation, gains significant advantages from quantum computation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1542. Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport
๐ฌ 21
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Representation Similarity Representational Alignment
Standard representational similarity methods align each layer of a network to its best match in another independently, producing asymmetric results, lacking a global alignment score, and struggling with networks of different depths. These limitations arise from ignoring global activation structure and restricting mappings to rigid one-to-one layer correspondences.
We propose Hierarchical Optimal Transport (HOT), a unified framework that jointly infers soft, globally consistent layer-to-layer couplings and neuron-level transport plans. HOT allows source neurons to distribute mass across multiple target layers while minimizing total transport cost under marginal constraints. This yields both a single alignment score for the entire network comparison and a soft transport plan that naturally handles depth mismatches through mass distribution.
We evaluate HOT on vision models, large language models, and human visual cortex recordings. Across all domains, HOT matches or surpasses standard pairwise matching in alignment quality. Moreover, it reveals smooth, fine-grained hierarchical correspondences: early layers map to early layers, deeper layers maintain relative positions, and depth mismatches are resolved by distributing representations across multiple layers. These structured patterns emerge naturally from global optimization without being imposed, yet are absent in greedy layer-wise methods. HOT thus enables richer, more interpretable comparisons between representations, particularly when networks differ in architecture or depth. We further extend our method to a three-level HOT framework, providing a proof-of-concept alignment of two networks across their training trajectories and demonstrating that HOT uncovers checkpoint-wise correspondences missed by greedy layer-wise matching.
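A toy sketch of the outer level of such an alignment (our assumptions, using the POT library; not the authors' implementation): the cost of matching two layers is itself an entropic OT cost between their units' response profiles, and the layer-to-layer coupling is a Sinkhorn plan over those costs.

```python
# Hedged sketch of a two-level (hierarchical) OT alignment between networks.
import numpy as np
import ot  # POT: pip install pot

def layer_cost(A: np.ndarray, B: np.ndarray, reg: float = 0.1) -> float:
    """Entropic OT cost between layers A [units_a, stimuli] and B [units_b, stimuli]."""
    M = ot.dist(A, B)          # pairwise squared Euclidean costs between units
    M = M / M.mean()           # normalize for numerical stability
    return float(ot.sinkhorn2(ot.unif(len(A)), ot.unif(len(B)), M, reg))

def layer_coupling(src_layers, tgt_layers, reg: float = 0.1) -> np.ndarray:
    """Soft layer-to-layer coupling between networks of possibly different depth."""
    C = np.array([[layer_cost(a, b) for b in tgt_layers] for a in src_layers])
    return ot.sinkhorn(ot.unif(len(src_layers)), ot.unif(len(tgt_layers)), C / C.mean(), reg)

rng = np.random.default_rng(0)
src = [rng.normal(size=(32, 10)) for _ in range(4)]  # 4 layers, 32 units, 10 stimuli
tgt = [rng.normal(size=(16, 10)) for _ in range(6)]  # 6 layers in the target network
print(layer_coupling(src, tgt).shape)                # (4, 6) soft coupling
```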
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1543. Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
๐ฌ 11
โญ 6.00
๐ generative models
๐ท๏ธ generative models visual synthesis diffusion flow matching
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained end-to-end using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring it remains smooth and suitable for generation. Our single-token formulation resolves the spatial redundancies of the 2D latent space, simplifies architectures, and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling. We will release our model to facilitate further research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1544. An Optimal Diffusion Approach to Quadratic Rate-Distortion Problems: New Solution and Approximation Methods
๐ฌ 4
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ information theory rate-distortion diffusion processes stochastic control
When compressing continuous data, some loss of information is inevitable, and this incurs a distortion when reconstructing the data. The Rate-Distortion (RD) function characterizes the minimum achievable rate for a code whose decoding permits a specified amount of distortion. We exploit the connection between rate-distortion theory and entropic optimal transport to propose a novel stochastic-control formulation for the former, and use a classic result dating back to Schrödinger to show that the tradeoff between rate and mean squared error distortion is equivalent to a tradeoff between control energy and the differential entropy of the terminal state, whose probability law defines the reconstruction distribution. For a special class of sources, we show that the optimal control law and the corresponding trajectory in the space of probability measures are obtained by solving a backward heat equation. In more general settings, our approach yields a numerical method that estimates the RD function using diffusion processes with a constant diffusion coefficient. We demonstrate the effectiveness of our method through several examples.
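For reference, the quadratic rate-distortion problem being reformulated is the standard one,

$$R(D)\;=\;\min_{P_{\hat X\mid X}\,:\;\mathbb{E}\big[(X-\hat X)^2\big]\le D}\ I(X;\hat X),$$

and the abstract's claim is that this trade-off can be recast as a stochastic control problem in which control energy is traded against the differential entropy of the terminal state.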
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1545. SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
๐ฌ 4
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ human behavior simulation large language models benchmarking computational social science human-AI alignment calibration human-centered AI
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We discover an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1546. Much Ado About Noising: Do Flow Models Actually Make Better Control Policies?
๐ฌ 10
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Generative model Flow Control Behavior cloning
Generative models, like flows and diffusions, have recently emerged as popular and efficacious policy parameterizations in robotics. There has been much speculation as to the factors underlying their successes, ranging from capturing multimodal action distributions to expressing more complex behaviors. In this work, we perform a comprehensive evaluation of popular generative control policies (GCPs) on common behavior cloning (BC) benchmarks. We find that GCPs do not owe their success to their ability to capture multimodality or to express more complex observation-to-action mappings. Instead, we find that their advantage stems from iterative computation, provided that intermediate steps are supervised during training and this supervision is paired with a suitable level of stochasticity. As a validation of our findings, we show that a minimal iterative policy (MIP), a lightweight two-step regression-based policy, essentially matches the performance of flow GCPs. Our results suggest that the distribution-fitting component of GCPs is less salient than commonly believed and point toward new design spaces focusing solely on control performance. Videos and supplementary materials are available at https://anonymous.4open.science/w/mip-anonymous/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1547. EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
๐ฌ 10
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ egocentric video manipulation embodied ai robotics
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1548. Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster
๐ฌ 8
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Musculoskeletal Modeling Drosophila melanogaster Imitation Learning
Computational models are critical to advance our understanding of how neural, biomechanical, and physical systems interact to orchestrate animal behaviors. Despite the availability of near-complete reconstructions of the Drosophila melanogaster central nervous system, musculature, and exoskeleton, anatomically and physically grounded models of fly leg muscles are still missing. These models provide an indispensable bridge between motor neuron activity and joint movements. Here, we introduce the first 3D, data-driven musculoskeletal model of Drosophila legs, implemented in both OpenSim and MuJoCo simulation environments. Our model incorporates a Hill-type muscle representation based on high-resolution X-ray scans from multiple fixed specimens. We present a pipeline for constructing muscle models using morphological imaging data and for optimizing unknown muscle parameters specific to the fly. We then combine our musculoskeletal models with detailed 3D pose estimation data from behaving flies to achieve muscle-actuated behavioral replay in OpenSim. Simulations of muscle activity across diverse walking and grooming behaviors predict coordinated muscle synergies that can be tested experimentally. Furthermore, by training imitation learning policies in MuJoCo, we test the effect of different passive joint properties on learning speed and find that damping and stiffness facilitate learning. Overall, our model enables the investigation of motor control in an experimentally tractable model organism, providing insights into how biomechanics contribute to generation of complex limb movements. Moreover, our model can be used to control embodied artificial agents to generate naturalistic and compliant locomotion in simulated environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1549. Priors in time: Missing inductive biases for language model interpretability
๐ฌ 11
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Top-Down Interpretability Sparse Autoencoders Temporal Structure Stationarity
A central aim of interpretability tools applied to language models is to recover meaningful concepts from model activations. Existing feature extraction methods focus on single activations regardless of the context, implicitly assuming independence (and therefore stationarity). This leaves open whether they can capture the rich temporal and context-sensitive structure in the activations of language models (LMs). Adopting a Bayesian view, we demonstrate that standard Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time. We then show that LM representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. This mismatch casts doubt on existing SAEs' ability to reflect temporal structures of interest in the data. We introduce a novel SAE architecture---Temporal SAE---with a temporal inductive bias that decomposes representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information that cannot be captured by the context. Experiments on LLM activations with Temporal SAE demonstrate its ability to correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1550. EigenBench: A Comparative Behavioral Measure of Value Alignment
๐ฌ 10
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ value alignment Bradley-Terry model EigenTrust model disposition constitutional AI
Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models' values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model's alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al., 2003), yielding scores that reflect a weighted consensus judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. Hence, to validate our method, we collect human judgments on the same ensemble of models and show that EigenBench's judgments align closely with those of human evaluators. We further demonstrate that EigenBench can recover model rankings on the GPQA benchmark without access to objective labels, supporting its viability as a framework for evaluating subjective values for which no ground truths exist.
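A minimal sketch of the aggregation step (our assumptions about how the judgment matrix is formed; the full pipeline also involves a constitution and scenario sampling): pairwise judgments are collected into a matrix and EigenTrust-style power iteration yields consensus-weighted scores.

```python
# Hedged sketch of EigenTrust-style score aggregation over mutual judgments.
import numpy as np

def eigentrust_scores(judgments: np.ndarray, iters: int = 100) -> np.ndarray:
    """judgments[i, j] >= 0: how favorably model i rates model j's outputs."""
    C = judgments / judgments.sum(axis=1, keepdims=True)  # each judge spreads one unit of trust
    t = np.full(C.shape[0], 1.0 / C.shape[0])             # uniform prior over models
    for _ in range(iters):
        t = C.T @ t                                        # judges weighted by their own scores
        t /= t.sum()
    return t

# Toy example: three models judging one another.
J = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])
print(eigentrust_scores(J))
```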
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1551. The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators
๐ฌ 15
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Partial differential equations Neural operators super-resolution zero-shot super-resolution multi-resolution training
A core challenge in scientific machine learning, and scientific computing more generally, is modeling continuous phenomena which (in practice) are represented discretely. Machine-learned operators (MLO) have been introduced as a means to achieve this modeling goal, as this class of architecture can perform inference at arbitrary resolution. In this work, we evaluate whether this architectural innovation is sufficient to perform "zero-shot super-resolution," namely to enable a model to serve inference on higher-resolution data than that on which it was originally trained. We comprehensively evaluate both zero-shot sub-resolution and super-resolution (i.e., multi-resolution) inference in MLOs. We decouple multi-resolution inference into two key behaviors: 1) extrapolation to varying frequency information; and 2) interpolating across varying resolutions. We empirically demonstrate that MLOs fail to do both of these tasks in a zero-shot manner.
Consequently, we find MLOs are not able to perform accurate inference at resolutions different from those on which they were trained, and instead they are brittle and susceptible to aliasing. To address these failure modes, we propose a simple, computationally-efficient, and data-driven multi-resolution training protocol that overcomes aliasing and that provides robust multi-resolution generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1552. Beyond Hearing: Learning Task-agnostic ExG Representations from Earphones via Physiology-informed Tokenization
๐ฌ 20
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Physiology-informed Multi-band Tokenization ExG Representation Learning Free-living ExG dataset Task-agnostic training
Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i) insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii) task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new HumanSense dataset, the first to enable ExG-based analysis across five human senses, together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1553. Multi-Armed Bandits with Minimum Aggregated Revenue Constraints
๐ฌ 12
โญ 6.00
๐ learning theory
๐ท๏ธ Multi-Armed Bandit with Constraints Exploration-Exploitation Regret Constraint Violation
We examine a multi-armed bandit problem with contextual information, where the objective is to ensure that each arm receives a minimum aggregated reward across contexts while simultaneously maximizing the total cumulative reward. This framework captures a broad class of real-world applications where fair revenue allocation is critical and contextual variation is inherent. The cross-context aggregation of minimum reward constraints, while enabling better performance and easier feasibility, introduces significant technical challenges, particularly the absence of closed-form optimal allocations typically available in standard MAB settings. We design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction. For each algorithm, we derive problem-dependent upper bounds on both regret and constraint violations. Furthermore, we establish a lower bound demonstrating that the dependence on the time horizon in our results is optimal in general and revealing fundamental limitations of the free exploration principle leveraged in prior work.
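One plausible way to write the constrained objective (notation and thresholds ours; the paper's normalization may differ): choose a policy to

$$\max_{\pi}\ \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t, c_t)\Big]\quad\text{s.t.}\quad \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t, c_t)\,\mathbf{1}\{a_t=k\}\Big]\ \ge\ \lambda_k\,T\quad\text{for every arm }k,$$

where $c_t$ is the context, $a_t$ the chosen arm, and $\lambda_k$ the per-arm minimum aggregated-reward rate; aggregating the constraint across contexts is what removes the closed-form optimal allocations available in standard constrained MAB settings.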
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1554. Better Learning-Augmented Spanning Tree Algorithms via Metric Forest Completion
๐ฌ 20
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Minimum spanning tree metric spaces learning-augmented algorithms algorithms with predictions approximation algorithms dynamic programming k-center
We present improved learning-augmented algorithms for finding an approximate minimum spanning tree (MST) for points in an arbitrary metric space. Our work follows a recent framework called metric forest completion (MFC), where the learned input is a forest that must be given additional edges to form a full spanning tree. Veldt et al. (2025) showed that optimally completing the forest takes $\Omega(n^2)$ time, but designed a 2.62-approximation for MFC with subquadratic complexity. The same method is a $(2\gamma + 1)$-approximation for the original MST problem, where $\gamma \geq 1$ is a quality parameter for the initial forest.
We introduce a generalized method that interpolates between this prior algorithm and an optimal $\Omega(n^2)$-time MFC algorithm. Our approach considers only edges incident to a growing number of strategically chosen "representative" points. One corollary of our analysis is to improve the approximation factor of the previous algorithm from 2.62 for MFC and $(2\gamma+1)$ for metric MST to 2 and $2\gamma$ respectively. We prove this is tight for worst-case instances, but we still obtain better instance-specific approximations using our generalized method. We complement our theoretical results with a thorough experimental evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1555. PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
๐ฌ 17
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Conditional Semantic Textual Similarity Reinforcement Learning Large Language Models Natural Language Processing Curriculum Learning
Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. However, we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by a complex, coarse-grained reward signal. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and effective paradigm for training LLMs on complex, ranking-based conditional judgment tasks. Our code and checkpoints are available at https://anonymous.4open.science/r/PoLi-RL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1556. Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
๐ฌ 13
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Downstream Metrics Pretraining Evaluation Benchmarks LLM
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of downstream accuracy from the training budget. We demonstrate that for a fixed token-to-parameter ratio, a simple two-parameter scaling law accurately describes this relationship. Our findings are validated by experiments on models with up to 17B parameters trained on up to 350B tokens, showing that the scaling of downstream capabilities can be described by such a law. Furthermore, we extend this framework to extrapolate from a set of smaller experiments and predict the accuracy of a target model trained with an up to 6.7x larger budget. We will release a complete list of model losses and downstream evaluation results at various scales to support reproducibility and encourage future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1557. FACT: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations
๐ฌ 8
โญ 6.00
๐ learning theory
๐ท๏ธ feature learning deep learning neural feature ansatz convergence theory
It is a central challenge in deep learning to understand how neural networks learn representations. A leading approach is the Neural Feature Ansatz (NFA) (Radhakrishnan et al., 2024), a conjectured mechanism for how feature learning occurs. Although the NFA is empirically validated, it is an educated guess and lacks a theoretical basis, and thus it is unclear when it might fail, and how to improve it. In this paper, we take a first-principles approach to understanding why this observation holds, and when it does not. We use first-order optimality conditions to derive the Features at Convergence Theorem (FACT), an alternative to the NFA that (a) obtains greater agreement with learned features at convergence, (b) explains why the NFA holds in most settings, and (c) captures essential feature learning phenomena in neural networks such as grokking behavior in modular arithmetic and phase transitions in learning sparse parities, similarly to the NFA. Thus, our results unify theoretical first-order optimality analyses of neural networks with the empirically-driven NFA literature, and provide a principled alternative that provably and empirically holds at convergence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1558. $p\textrm{-less}$ Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
๐ฌ 20
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ LLM decoding sampling truncation inference information-theoretic information-theory hyperparameterless hyperparameter-free entropy entropy-aware distribution-aware adaptive efficient generation
Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p\textrm{-less}$ sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p\textrm{-less}$ sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p\textrm{-less}$ sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy.
Finally, we provide analyses to highlight the benefits of $p\textrm{-less}$ through qualitative examples, case studies, and diversity assessments.
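The abstract does not spell out the thresholding rule, so the following is only a generic illustration of a hyperparameter-free, entropy-aware truncation (keep tokens at least as likely as the distribution's typical probability $e^{-H}$); it is not the paper's $p\textrm{-less}$ rule.

```python
# Illustrative entropy-aware, hyperparameter-free truncation sampler (not the
# paper's p-less rule): keep tokens with probability >= exp(-H), i.e. at least
# 1/perplexity of the current next-token distribution, then renormalize.
import torch

def entropy_truncated_sample(logits: torch.Tensor) -> int:
    p = torch.softmax(logits, dim=-1)
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    threshold = torch.exp(-entropy)            # the distribution's "typical" probability
    p_trunc = torch.where(p >= threshold, p, torch.zeros_like(p))
    p_trunc = p_trunc / p_trunc.sum()          # the argmax always survives, so this is safe
    return int(torch.multinomial(p_trunc, 1))

print(entropy_truncated_sample(torch.randn(100)))
```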
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1559. Trapped by simplicity: When Transformers fail to learn from noisy features
๐ฌ 8
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ boolean analysis simplicity bias transformer feature noise
Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, compared to LSTMs which fail at this task for even modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure is due to a combination of two factors: transformers' bias toward simpler functions, combined with an observation that the empirically optimal function for noise-robust learning has lower sensitivity than the target function. We test this hypothesis by exploiting transformers' simplicity bias to trap them in an incorrect solution, but show that transformers can escape this trap by training with an additional loss term penalizing high-sensitivity solutions. Overall, we find that transformers are particularly ineffective for learning boolean functions in the presence of feature noise.
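For concreteness, the setup described above (labels from a $k$-sparse parity of the clean features, while the learner only observes noisy features) can be generated in a few lines; this is our toy construction, not the paper's code.

```python
# Toy generator for noise-robust learning of k-sparse parities: the label is the
# parity of k hidden clean bits, but each observed bit is flipped with prob `noise`.
import numpy as np

def noisy_sparse_parity(n_samples: int, n_features: int, k: int,
                        noise: float, seed: int = 0):
    rng = np.random.default_rng(seed)
    support = rng.choice(n_features, size=k, replace=False)  # the k relevant bits
    x_clean = rng.integers(0, 2, size=(n_samples, n_features))
    y = x_clean[:, support].sum(axis=1) % 2                  # parity of the clean bits
    flips = rng.random(x_clean.shape) < noise
    x_noisy = np.where(flips, 1 - x_clean, x_clean)          # what the learner sees
    return x_noisy, y, support

X, y, S = noisy_sparse_parity(1000, 20, k=3, noise=0.1)
print(S, X.shape, float(y.mean()))
```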
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1560. Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
๐ฌ 22
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Fairness Explainability Hate speech detection
Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness.
At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them.
While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness.
Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis.
In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models.
We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training.
Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1561. Hilbert-Guided Sparse Local Attention
๐ฌ 8
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ local attention window attention neighborhood attention sliding window attention Hilbert curve attention acceleration
The quadratic compute and memory costs of global self-attention severely limit its use in high-resolution images. Local attention reduces complexity by restricting attention to neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence. This work proposes a novel method for constructing windows and neighborhoods based on the Hilbert curve. Image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window Attention and Hilbert Slide Attention can accelerate window attention and slide attention by about $4\times$ and $18\times$, respectively. To assess practicality, the strategy is instantiated as the Hilbert Window Transformer and the Hilbert Neighborhood Transformer, both of which achieve end-to-end speedups with minimal accuracy loss. Overall, combining Hilbert-guided local attention with block-sparse kernels offers a general and practical approach to enhancing the efficiency of 2D local attention for images.
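A minimal sketch of the reordering idea (our construction from the abstract, not the released kernels): map each 2D token position to its Hilbert index, sort, and split the reordered sequence into contiguous windows so that each window stays spatially compact.

```python
# Hedged sketch of Hilbert-guided window construction for 2D token grids.
import numpy as np

def hilbert_index(n: int, x: int, y: int) -> int:
    """Hilbert-curve index of cell (x, y) on an n x n grid, n a power of two."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                        # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_windows(grid_size: int, window: int):
    coords = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    order = sorted(range(len(coords)),
                   key=lambda i: hilbert_index(grid_size, *coords[i]))
    return np.split(np.array(order), len(order) // window)  # token ids per window

# 16x16 token grid, 64-token windows that remain spatially compact.
print([w[:4].tolist() for w in hilbert_windows(16, 64)][:2])
```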
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1562. Online Rounding and Learning Augmented Algorithms for Facility Location
๐ฌ 11
โญ 6.00
๐ learning theory
๐ท๏ธ Learning Augmented Algorithms Clustering Facility Location
Facility Location is a fundamental problem in clustering and unsupervised learning. Recently, significant attention has been given to studying this problem in the classical online setting enhanced with machine learning advice. While (almost) tight bounds exist for the fractional version of the problem, the integral version remains less understood, with only weaker results available. In this paper, we address this gap by presenting the first online rounding algorithms for the facility location problem, and by showing their applications to online facility location with machine learning advice.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1563. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
๐ฌ 22
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ prompt optimization natural language reflection large language models agent design agent discovery code optimization compound AI systems genetic language based learning evolutionary algorithms
Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error.
Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain.
Across four tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +10% accuracy on AIME-2025).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1564. Intrinsic training dynamics of deep neural networks
๐ฌ 16
โญ 6.00
๐ learning theory
๐ท๏ธ gradient flow path-lifting intrinsic lower dimensional dynamic conservation laws implicit bias
A fundamental challenge in the theory of deep learning is to understand whether gradient-based training can promote parameters belonging to certain lower-dimensional structures (e.g., sparse or low-rank sets), leading to so-called implicit bias. As a stepping stone, motivated by the proof structure of existing intrinsic bias analyses, we study when a gradient flow on a parameter $\theta$ implies an intrinsic gradient flow on a ``lifted'' variable $z = \phi(\theta)$, for an architecture-related function $\phi$. We express a so-called intrinsic dynamic property and show how it is related to the study of conservation laws associated with the factorization $\phi$. This leads to a simple criterion based on the inclusion of kernels of linear maps, which yields a necessary condition for this property to hold. We then apply our theory to general ReLU networks of arbitrary depth and show that, for a dense set of initializations, it is possible to rewrite the flow as an intrinsic dynamic in a lower dimension that depends only on $z$ and the initialization, when $\phi$ is the so-called path-lifting. In the case of linear networks with $\phi$ the product of weight matrices, the intrinsic dynamic is known to hold under so-called balanced initializations; we generalize this to a broader class of {\em relaxed balanced} initializations, showing that, in certain configurations, these are the \emph{only} initializations that ensure the intrinsic metric property. Finally, for the linear neural ODE associated with the limit of infinitely deep linear networks, with relaxed balanced initialization, we make the corresponding intrinsic dynamics explicit.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1565. LLMs Must Think Thrice to Solve Executable Counterfactuals
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Counterfactual Reasoning Large Language Models Reinforcement Learning Generalization
Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternative situations (interventions), and predicting the outcomes of the alternatives (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research and healthcare. However, existing efforts in assessing LLMs' counterfactual reasoning capabilities tend to skip the abduction step, effectively reducing to interventional reasoning and leading to over-estimated LLM performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation with varying difficulty, creating a new frontier for evaluating and improving LLMs' reasoning. Our results reveal a substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for state-of-the-art models such as o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set comprising counterfactual code problems built around if-conditions and test on out-of-distribution code structures (e.g., while-loops); we also test whether a model trained on code would generalize to counterfactual math word problems. While Supervised Finetuning (SFT) on stronger models' reasoning traces improves in-distribution performance of Qwen models, it leads to a decrease in accuracy on out-of-distribution tasks such as counterfactual math problems. In contrast, reinforcement learning (RL) induces the core cognitive behaviors and generalizes to new distributions, yielding substantial accuracy gains over the base model on both code (improvement of 1.5x-2x) and counterfactual math problems. Analysis of the reasoning traces further reinforces these findings and highlights the promise of RL with scalable data generation for improving LLMs' counterfactual reasoning.
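A toy example of what an executable counterfactual over code can look like (our illustration, not an item from the paper's dataset), with the three steps made explicit:

```python
# Abduction -> intervention -> prediction on a tiny program.
def f(x):
    if x > 10:
        return x * 2
    return x + 3

# Observation: f returned 8 for some unknown input x.
# Abduction: 8 cannot come from the x * 2 branch (which requires x > 10),
# so the second branch ran and x + 3 == 8, hence x = 5.
def f_intervened(x):          # Intervention: lower the branch threshold to 4.
    if x > 4:
        return x * 2
    return x + 3

# Prediction: under the intervention, the abduced input x = 5 now takes the
# first branch, so the counterfactual outcome is 10.
assert f(5) == 8 and f_intervened(5) == 10
```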
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1566. SteinsGate: Adding Causality to Diffusions for Long Video Generation via Path Integral
๐ฌ 11
โญ 6.00
๐ generative models
๐ท๏ธ Generative Models Video Generation Diffusion Guidance
Video generation has advanced rapidly, but current models remain limited to short clips, far from the length and complexity of real-world narratives. Long video generation is thus both important and challenging. Existing approaches either attempt to extend the modeling length of video diffusion models directly or merge short clips via shared frames. However, due to the lack of temporal causality modeling for video data, they achieve only limited extensions, suffer from discontinuous or even contradictory actions, and fail to support flexible and fine-grained temporal control. Thus, we propose Instruct-Video-Continuation (InstructVC), combining Temporal Action Binding for fine-grained temporal control and Causal Video Continuation for natural long-term simulation. Temporal Action Binding decomposes complex long videos by temporal causality into scene descriptions and action sequences with predicted durations, while Causal Video Continuation autoregressively generates coherent video narratives from the text story. We further introduce SteinsGate, an inference-time instance of InstructVC that uses an MLLM for Temporal Action Binding and Video Path Integral to enforce causality between actions, converting a pre-trained TI2V diffusion model into an autoregressive video continuation model. Benchmark results demonstrate the advantages of SteinsGate and InstructVC in achieving accurate temporal control and generating natural, smooth multi-action long videos.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1567. GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time
๐ฌ 11
โญ 6.00
๐ generative models
๐ท๏ธ Inference-time algorithms LLMs
Repeated Sampling (RS) is a simple inference-time algorithm that has been shown to improve model performance on complex tasks. Although it is an effective way of scaling inference time, it often struggles to generate diverse solution candidates, frequently relying on the same underlying approach to solve the problem and thus producing redundant samples. To address this limitation, we propose a new inference algorithm, GuidedSampling, which decouples the exploration and generation phases during inference, increasing diversity of generated candidate solutions. The exploration phase identifies multiple concepts that can be utilized to solve the problem, while the generation phase applies a specific concept to provide final solution candidates. We first define the theoretical bounds of GuidedSampling and then empirically demonstrate that it improves the performance of the base model at pass@50 by an average of $\sim21.6$% across various benchmarks compared to RS. Furthermore, models trained on trajectories of GuidedSampling exhibit substantial performance improvements at pass@5, by an average of $\sim9.7$%, compared to models trained on traditional RS. Additionally, models trained with GuidedSampling increase the average number of concepts per instance ($1.67 \to 3.03$), yielding a more diverse set of candidates than traditional RS.
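A high-level sketch of the decoupling described above (our reading of the abstract; `llm` is a hypothetical stand-in for any sampling-based completion call, not an API from the paper):

```python
# Hedged sketch of GuidedSampling-style inference: explore concepts first,
# then generate candidate solutions conditioned on each concept.
def llm(prompt: str, n: int = 1) -> list[str]:
    """Placeholder for a chat-completion call returning n sampled completions."""
    return [f"(sampled output {i} for: {prompt[:40]}...)" for i in range(n)]

def guided_sampling(problem: str, n_concepts: int = 5, per_concept: int = 10) -> list[str]:
    # Exploration phase: elicit distinct high-level approaches to the problem.
    concepts = llm(f"List {n_concepts} distinct approaches to solve:\n{problem}")[0].splitlines()
    # Generation phase: condition each candidate solution on one specific concept.
    candidates = []
    for concept in concepts[:n_concepts]:
        candidates += llm(f"Solve using this approach: {concept}\n\nProblem:\n{problem}",
                          n=per_concept)
    return candidates

print(len(guided_sampling("Compute the sum of the first 100 odd numbers.")))
```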
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1568. PerfGuard: A Performance-Aware Agent for Visual Content Generation
๐ฌ 13
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Agent Large Language Model Image Generation Image Editing
The advancement of Large Language Model (LLM)-powered agents has enabled automated task processing through reasoning and tool invocation capabilities. However, existing frameworks often operate under the idealized assumption that tool executions are invariably successful, relying solely on textual descriptions that fail to distinguish precise performance boundaries and cannot adapt to iterative tool updates. This gap introduces uncertainty in planning and execution, particularly in domains like visual content generation (AIGC), where nuanced tool performance significantly impacts outcomes. To address this, we propose PerfGuard, a performance-aware agent framework for visual content generation that systematically models tool performance boundaries and integrates them into task planning and scheduling. Our framework introduces three core mechanisms: (1) Performance-Aware Selection Modeling (PASM), which replaces generic tool descriptions with a multi-dimensional scoring system based on fine-grained performance evaluations; (2) Adaptive Preference Update (APU), which dynamically optimizes tool selection by comparing theoretical rankings with actual execution rankings; and (3) Capability-Aligned Planning Optimization (CAPO), which guides the planner to generate subtasks aligned with performance-aware strategies. Experimental comparisons against state-of-the-art methods demonstrate PerfGuard's advantages in tool selection accuracy, execution reliability, and alignment with user intent, validating its robustness and practical utility for complex AIGC tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1569. Good allocations from bad estimates
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Treatment Allocation Treatment effects Sample complexity RCT
Conditional average treatment effect (CATE) estimation is the de facto gold standard for targeting a treatment to a heterogeneous population. The method estimates treatment effects up to an error $\epsilon > 0$ in each of $M$ different strata of the population, targeting individuals in decreasing order of estimated treatment effect until the budget runs out. In general, this method requires $O(M/\epsilon^2)$ samples. This is best possible if the goal is to estimate all treatment effects up to an $\epsilon$ error.
In this work, we show how to achieve the same total treatment effect as CATE with only $O(M/\epsilon)$ samples for natural distributions of treatment effects. The key insight is that coarse estimates suffice for near-optimal treatment allocations. In addition, we show that budget flexibility can further reduce the sample complexity of allocation.
Finally, we evaluate our algorithm on various real-world RCT datasets. In all cases, it finds nearly optimal treatment allocations with surprisingly few samples. Our work highlights the fundamental distinction between treatment effect estimation and treatment allocation: the latter requires far fewer samples.
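A toy illustration of the central point, that coarse estimates suffice for allocation (our own sketch; the paper's algorithm and its $O(M/\epsilon)$ analysis are more refined): rank strata by noisy mean-effect estimates from only a few samples each, then treat the top strata until the budget is exhausted.

```python
import numpy as np

def coarse_allocate(samples_per_stratum, budget: int):
    """Rank strata by noisy mean-effect estimates and treat the top `budget` strata."""
    estimates = [s.mean() if len(s) else -np.inf for s in samples_per_stratum]
    order = np.argsort(estimates)[::-1]   # largest estimated effect first
    return list(order[:budget])

rng = np.random.default_rng(0)
true_effects = np.linspace(1.0, 0.0, 10)                                      # 10 strata
data = [true_effects[m] + rng.normal(0.0, 0.5, size=3) for m in range(10)]    # 3 samples each
print(coarse_allocate(data, budget=4))   # mostly the high-effect strata, despite coarse estimates
```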
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1570. Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
๐ฌ 12
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Machine Learning Evaluation Benchmark Datasets Robustness in NLP Large Language Models (LLMs) Generative AI Human-AI Alignment Ethical Considerations in ML
Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews fully written by humans and different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. To support future research and reproducibility, we will publicly release our dataset upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1571. RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
๐ฌ 17
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Computer-Use Agents Adversarial Risks Sandbox Benchmark
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection, where attackers embed malicious content into the environment to hijack agent behavior. Current evaluations of this threat either lack support for adversarial testing in realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an Attack Success Rate (ASR) of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although failing to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, indicating that CUA threats can already result in tangible risks to users and computer systems. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1572. Gaussian certified unlearning in high dimensions: A hypothesis testing approach
๐ฌ 13
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine unlearning in high dimensions Proportional asymptotics High dimensional statistical theory Privacy-accuracy tradeoff Hypothesis testing Gaussian noise calibration Newton method
Machine unlearning seeks to efficiently remove the influence of selected data while preserving generalization. Significant progress has been made in low dimensions, where the dimension of the parameter $p$ is much smaller than the sample size $n$, but high dimensions, including proportional regimes $p \sim n$, pose serious theoretical challenges: the standard optimization assumptions of $\Omega(1)$ strong convexity and $O(1)$ smoothness of the per-example loss $f$ rarely hold simultaneously in such regimes.
In this work, we introduce $\varepsilon$-Gaussian certifiability, a canonical and robust notion well-suited to high-dimensional regimes, that optimally captures a broad class of noise-adding mechanisms. Then we theoretically analyze the performance of a widely used unlearning algorithm based on one step of the Newton method in the high-dimensional setting described above. Our analysis shows that a single Newton step, followed by well-calibrated Gaussian noise, is sufficient to achieve both privacy and accuracy in this setting. This result stands in sharp contrast to the only prior work that analyzes machine unlearning in high dimensions, \citet{zou2025certified}, which relaxes some of the standard optimization assumptions for high-dimensional applicability but operates under the notion of $\varepsilon$-certifiability. That work concludes that at least two steps are required to ensure both privacy and accuracy. Our result leads us to conclude that the discrepancy in the number of steps arises from the suboptimality of the notion of $\varepsilon$-certifiability and its incompatibility with noise-adding mechanisms, which $\varepsilon$-Gaussian certifiability overcomes optimally.
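A minimal numeric sketch of the mechanism described above, in an assumed ridge-regression setting (illustration only; the paper's Gaussian noise calibration is dimension-aware and more involved): one Newton step on the retained objective removes the deleted points' first-order influence, and Gaussian noise provides the certification.

```python
import numpy as np

def newton_unlearn(theta, X_remove, y_remove, X_retain, lam, sigma, rng):
    # Gradient contribution of the removed points at the current optimum.
    g = X_remove.T @ (X_remove @ theta - y_remove)
    # Hessian of the retained ridge objective.
    H = X_retain.T @ X_retain + lam * np.eye(len(theta))
    # One Newton step, then Gaussian noise for the (epsilon-Gaussian) certification.
    theta_minus = theta + np.linalg.solve(H, g)
    return theta_minus + sigma * rng.standard_normal(len(theta))
```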
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1573. Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
๐ฌ 15
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ VLA Robot Foundation Model Robot Learning Reinforcement Learning
Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a plug-and-play framework that improves VLAs through residual reinforcement learning and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates practicality on real-world Franka arm manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs' capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1574. Zero-Shot Adaptation of Behavioral Foundation Models to Unseen Dynamics
๐ฌ 11
โญ 6.00
๐ reinforcement learning
๐ท๏ธ zero-shot reinforcement learning unsupervised reinforcement learning successor measure
Behavioral Foundation Models (BFMs) have proved successful in producing near-optimal policies for arbitrary tasks in a zero-shot manner, requiring no test-time retraining or task-specific fine-tuning. Among the most promising BFMs are the ones that estimate the successor measure learned in an unsupervised way from task-agnostic offline data. However, these methods fail to react to changes in the dynamics, making them inefficient under partial observability or when the transition function changes. This hinders the applicability of BFMs in a real-world setting, e.g., in robotics, where the dynamics can unexpectedly change at test time. In this work, we demonstrate that the Forward-Backward (FB) representation, one of the methods from the BFM family, cannot produce reasonable policies under distinct dynamics, leading to interference among the latent policy representations. To address this, we propose an FB model with a transformer-based belief estimator, which greatly facilitates zero-shot adaptation. Additionally, we show that partitioning the policy encoding space into dynamics-specific clusters, aligned with the context-embedding directions, yields additional gains in performance. Those traits allow our method to respond to the dynamics mismatches observed during training and to generalize to unseen ones. Empirically, in the changing dynamics setting, our approach achieves up to 2x higher zero-shot returns compared to the baselines for both discrete and continuous tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1575. Contrastive Diffusion Guidance for Spatial Inverse Problems
๐ฌ 21
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Models Inverse Problems Contrastive Learning Spatial Inference
We consider the inverse problem of reconstructing the spatial layout of a place, a home floorplan for example, from a userโs movements inside that layout.
Direct inversion is ill-posed since many floorplans can explain the same movement trajectories.
We adopt a diffusion-based posterior sampler to generate layouts consistent with the measurements.
While active research is in progress on generative inverse solvers, we find that the forward operator in our problem poses new challenges.
The path planning process inside a floorplan is a non-invertible, non-differentiable function, and causes instability while optimizing using the likelihood score.
We break-away from existing approaches and reformulate the likelihood score in a smoother embedding space.
The embedding space is trained with a contrastive loss which brings compatible floorplans and trajectories close to each other, while pushing mismatched pairs far apart.
We show that a surrogate form of the likelihood score in this embedding space is a valid approximation of the true likelihood score, making it possible to steer the denoising process towards the posterior.
Across extensive experiments, our model CoGuide produces more consistent floorplans from trajectories, and is more robust than differentiable-planner baselines and guided-diffusion methods.
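A minimal sketch of steering the denoiser with a surrogate score defined in the contrastive embedding space, assuming a pretrained layout encoder and a precomputed trajectory embedding (hypothetical names `enc_layout` and `trajectory_emb`; the training of the embedding space and the validity argument are in the paper, not here):

```python
import torch
import torch.nn.functional as F

def surrogate_likelihood_score(x0_hat, trajectory_emb, enc_layout, guidance_scale=1.0):
    """Gradient of embedding similarity w.r.t. the denoiser's clean-image estimate."""
    x0_hat = x0_hat.detach().requires_grad_(True)
    sim = F.cosine_similarity(enc_layout(x0_hat), trajectory_emb, dim=-1).sum()
    (grad,) = torch.autograd.grad(sim, x0_hat)
    return guidance_scale * grad  # added to the unconditional score at each denoising step
```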
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1576. When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Exploration
While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals including a strategic, regret-shaped reward to reduce variance, and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6$\times$ longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1577. GRADIEND: Feature Learning within Neural Networks Exemplified through Biases
๐ฌ 10
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Feature Learning Bias Mitigation AI Fairness Language Models
AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method not only identifies which weights of a model need to be changed to modify a feature, but can also rewrite models to debias them while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1578. Contrastive Predictive Coding Done Right for Mutual Information Estimation
๐ฌ 13
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ information estimation contrastive predictive coding representation learning noise contrastive estimation density ratio estimation
The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as *InfoNCE-anchor*, for accurate MI estimation.
Our modification introduces an auxiliary \emph{anchor} class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias.
Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed.
This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and $f$-divergence variants, under a single principled framework.
Empirically, we find that InfoNCE-anchor with the log score achieves the most accurate MI estimates; however, in self-supervised representation learning experiments, we find that the anchor does not improve the downstream task performance.
These findings corroborate that contrastive representation learning benefits not from accurate MI estimation per se, but from the learning of structured density ratios.
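For background, the standard InfoNCE estimator with critic $f$ and $K$ paired samples $(x_i, y_i)$ can be written as below; each batch value is capped at $\log K$, one reason it is problematic as a direct MI estimator (the anchored variant introduced above is not reproduced here):
\[
\widehat{I}_{\mathrm{NCE}} \;=\; \frac{1}{K}\sum_{i=1}^{K} \log \frac{K\, e^{f(x_i, y_i)}}{\sum_{j=1}^{K} e^{f(x_i, y_j)}},
\qquad \mathbb{E}\big[\widehat{I}_{\mathrm{NCE}}\big] \le I(X;Y), \qquad \widehat{I}_{\mathrm{NCE}} \le \log K .
\]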
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1579. Decoupled Q-Chunking
๐ฌ 16
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement learning action chunking offline RL
The bootstrapping bias problem is a long-standing challenge for temporal-difference (TD) methods in off-policy reinforcement learning (RL). Multi-step return backups can alleviate this issue but require delicate importance sampling to correct their off-policy bias. Recent work has proposed to use chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, enabling unbiased multi-step backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal in environments that require policy reactivity and is also challenging to model, especially as the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning policies over long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned benchmarks and show that it reliably outperforms prior methods.
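In our notation, the distilled partial-chunk target described above can be written roughly as an optimistic completion of the chunk (a schematic reading of the abstract, not the paper's exact construction):
\[
\widetilde{Q}(s, a_{1:k}) \;\approx\; \max_{a_{k+1:h}} Q_{\mathrm{chunk}}(s, a_{1:h}), \qquad k < h ,
\]
so the policy only needs to model the shorter prefix $a_{1:k}$ while value propagation still uses the full chunk length $h$.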
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1580. Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining
๐ฌ 10
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ quality-aware scaling laws scaling laws data quality LLM pretraining
Scaling laws for language model training traditionally characterize how performance scales with model size and dataset volume. Prior work has explored architecture variants and data treatments such as dataset filtering and noise injection in language model pretraining; however, these studies have not formalized data quality within a principled scaling law. We introduce a dimensionless data-quality parameter Q, and propose a quality-aware scaling law extending the Chinchilla framework to predict loss as a joint function of model size, data volume, and data quality. The law is motivated by an effective-sample-size and information-theoretic view of noisy or redundant corpora, and it admits two practical estimators for Q: (i) a corruption rate proxy and (ii) a deficiency measure. Through synthetic experiments in neural machine translation and autoregressive modeling--where we systematically control data quality via multiple levels of noise injection--we show that loss scales predictably with data quality and that higher-quality data can substantially reduce model size and hence compute requirements. Our results demonstrate a sublinear decay of effective data with quality and robustness to moderate data corruption; out-of-sample evaluations further validate the predictive form of the law. Unlike prior empirical analyses, our work establishes an explicit, generalizable law for data quality, offering concrete guidance for balancing data curation effort and model scale in large-scale pretraining.
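For reference, the Chinchilla-style form being extended is $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$; one illustrative way to fold a quality parameter into it, in the effective-sample-size spirit described above (our notation, not the paper's fitted law), is
\[
L(N, D, Q) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D_{\mathrm{eff}}^{\beta}},
\qquad D_{\mathrm{eff}} = Q^{\gamma} D, \quad 0 < Q \le 1,
\]
which recovers the standard law at $Q = 1$ and shrinks the effective data volume as quality degrades.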
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1581. Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM pretraining efficient LLMs metadata
Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with a masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1582. A Genetic Algorithm for Navigating Synthesizable Molecular Spaces
๐ฌ 9
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ synthesizability molecular design genetic algorithms
Inspired by the effectiveness of genetic algorithms and the importance of synthesizability in molecular design, we present SynGA, a simple genetic algorithm that operates directly over synthesis routes. Our method features custom crossover and mutation operators that explicitly constrain it to synthesizable molecular space. By modifying the fitness function, we demonstrate the effectiveness of SynGA on a variety of design tasks, including synthesizable analog search and sample-efficient property optimization, for both 2D and 3D objectives. Furthermore, by coupling SynGA with a machine learning-based filter that focuses the building block set, we boost SynGA to state-of-the-art performance. For property optimization, this manifests as a model-based variant SynGBO, which employs SynGA and block filtering in the inner loop of Bayesian optimization. Since SynGA is lightweight and enforces synthesizability by construction, our hope is that SynGA can not only serve as a strong standalone baseline but also as a versatile module that can be incorporated into larger synthesis-aware workflows in the future.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1583. Compute-Optimal Quantization-Aware Training
๐ฌ 17
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ quantization-aware training QAT neural network quantization compute optimization scaling laws large language models LLMs model compression compute budget allocation training efficiency model optimization quantized neural networks efficient deep learning
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.
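For concreteness, one natural reading of the tokens-per-parameter-byte statistic mentioned above is total training tokens divided by the model's parameter footprint in bytes at the target bit width (our interpretation; the paper's exact definition and fitted thresholds may differ):

```python
def tokens_per_parameter_byte(tokens: float, params: float, bit_width: int) -> float:
    bytes_per_param = bit_width / 8.0
    return tokens / (params * bytes_per_param)

# Example: 100B training tokens for a 2.2B-parameter model quantized to 4 bits.
print(tokens_per_parameter_byte(100e9, 2.2e9, bit_width=4))  # ~90.9
```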
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1584. Transfer Learning in Infinite Width Feature Learning Networks
๐ฌ 11
โญ 6.00
๐ learning theory
๐ท๏ธ Transfer Learning Infinite Width Kernel Methods
We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze both (i) fine-tuning, when the downstream predictor is trained on top of source-induced features and (ii) a jointly rich setting, where both pretraining and downstream tasks can operate in a feature learning regime, but the downstream model is initialized with the features obtained after pre-training. In this setup, the summary statistics of randomly initialized networks after a rich pre-training are adaptive kernels which depend on both source data and labels. For (i), we analyze the performance of a readout for different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels with features from both source and target tasks. We test our theory on linear and polynomial regression tasks as well as real datasets. Our theory allows interpretable conclusions on performance, which depend on the amount of data on both tasks, the alignment between tasks, and the feature learning strength.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1585. Latent Speech-Text Transformer
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Speech-Text Models Latent Patching Multimodal Alignment Large Language Models
Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1586. Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine
๐ฌ 18
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Large language models Personalized medicine Black-box optimization Distribution shift
The goal of personalized medicine is to discover a treatment regimen that optimizes a patient's clinical outcome based on their personal genetic and environmental factors. However, candidate treatments cannot be arbitrarily administered to the patient to assess their efficacy; we often instead have access to an *in silico* surrogate model that approximates the true fitness of a proposed treatment. Unfortunately, such surrogate models have been shown to fail to generalize to previously unseen patient-treatment combinations. We hypothesize that domain-specific prior knowledge, such as medical textbooks and biomedical knowledge graphs, can provide a meaningful alternative signal of the fitness of proposed treatments. To this end, we introduce **L**LM-based **E**ntropy-guided **O**ptimization with k**N**owledgeable priors (**LEON**), a mathematically principled approach to leverage large language models (LLMs) as black-box optimizers without any task-specific fine-tuning, taking advantage of their ability to contextualize unstructured domain knowledge to propose personalized treatment plans in natural language. In practice, we implement LEON via 'optimization by prompting,' which uses LLMs as stochastic engines for proposing treatment designs. Experiments on real-world optimization tasks show LEON outperforms both traditional and LLM-based methods in proposing individualized treatments for patients.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1587. MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
๐ฌ 22
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ moral reasoning reasoning evaluation ai safety
As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases where AI advises humans on moral decisions as well as cases where it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and performance on existing math, code, and scientific reasoning benchmarks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1588. ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
๐ฌ 13
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Reward Machines Foundation Models Generalization Exploration
Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) - an automata-based formalism for reward specification - are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
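A minimal reward-machine sketch for readers unfamiliar with the formalism (generic RM structure only; the FM-generated machines and per-state language embeddings of ARM-FM are not shown):

```python
class RewardMachine:
    """An automaton over high-level propositions that emits rewards on transitions."""
    def __init__(self, transitions, rewards, initial_state="u0"):
        self.transitions = transitions   # {(state, proposition): next_state}
        self.rewards = rewards           # {(state, proposition): reward}
        self.state = initial_state

    def step(self, true_propositions) -> float:
        for prop in true_propositions:
            key = (self.state, prop)
            if key in self.transitions:
                reward = self.rewards.get(key, 0.0)
                self.state = self.transitions[key]
                return reward
        return 0.0

# "Pick up the key, then open the door."
rm = RewardMachine(
    transitions={("u0", "has_key"): "u1", ("u1", "door_open"): "u_goal"},
    rewards={("u1", "door_open"): 1.0},
)
```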
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1589. Distributions as Actions: A Unified Framework for Diverse Action Spaces
๐ฌ 12
โญ 6.00
๐ reinforcement learning
๐ท๏ธ deterministic policy gradient actor-critic continuous control discrete control hybrid control action space reinforcement learning
We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, \emph{Distributions-as-Actions Policy Gradient} (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce \emph{interpolated critic learning} (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, \emph{Distributions-as-Actions Actor-Critic} (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
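A minimal sketch of the reparameterization itself, written as an environment wrapper for a Gaussian original action space (illustrating only the agent/environment boundary shift; DA-PG, ICL, and the DA-AC training loop are not shown):

```python
import numpy as np

class DistributionActionWrapper:
    """The agent emits Gaussian parameters; the wrapper samples the executed action."""
    def __init__(self, env, rng=None):
        self.env = env
        self.rng = rng or np.random.default_rng()

    def step(self, dist_params):
        mean, log_std = np.split(np.asarray(dist_params, dtype=float), 2)
        executed = mean + np.exp(log_std) * self.rng.standard_normal(mean.shape)
        return self.env.step(executed)  # new action space: (mean, log_std), always continuous
```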
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1590. AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution
๐ฌ 25
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Knowledge distillation Large language model Information geometry
Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches that implicitly or explicitly incorporate an assistant distribution have recently been proposed. However, past proposals of assistant distributions have been fragmented, lacking a systematic investigation of the interpolation path and the choice of divergence. This paper proposes the $\alpha$-mixture assistant distribution, a novel generalized family of assistant distributions, and $\alpha$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $\alpha$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $\alpha$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1591. Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
๐ฌ 14
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Alignment
Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training---designed to make models robust against worst-case perturbations---has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an automated pipeline, we extract and compress adversarial skills into a sparse dictionary of primitives, with LLMs generating human-readable descriptions. Our analysis reveals that unseen attacks can be effectively explained as sparse compositions of earlier skills, with explanatory power increasing monotonically as skill coverage grows.
Guided by this insight, we introduce Adversarial Skill Compositional Training (ASCoT), which trains on diverse compositions of skill primitives rather than isolated attack instances. ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. We also demonstrate that expanding adversarial skill coverage, not just data scale, is key to defending against novel attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1592. Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
๐ฌ 19
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human heads 3D shape correspondence foundation models vision transformer point tracking
We propose DenseMarks -- a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
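A sketch of the pairwise-match contrastive objective in its simplest symmetric InfoNCE form (our simplification; the paper additionally uses landmark, segmentation, and spatial-continuity terms):

```python
import torch
import torch.nn.functional as F

def match_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a, emb_b: (N, D) embeddings of N tracked point matches in two frames."""
    emb_a, emb_b = F.normalize(emb_a, dim=-1), F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature          # (N, N) similarity matrix
    targets = torch.arange(len(emb_a), device=emb_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```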
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1593. Modeling the Density of Pixel-level Self-supervised Embeddings for Unsupervised Pathology Segmentation in Medical CT
๐ฌ 13
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Unsupervised Visual Anomaly Segmentation Self-supervised learning Density estimation Computed Tomography
Accurate detection of all pathological findings in 3D medical images remains a significant challenge, as supervised models are limited to detecting only the few pathology classes annotated in existing datasets. To address this, we frame pathology detection as an unsupervised visual anomaly segmentation (UVAS) problem, leveraging the inherent rarity of pathological patterns compared to healthy ones. We enhance the existing density-based UVAS framework with two key innovations: (1) dense self-supervised learning for feature extraction, eliminating the need for supervised pretraining, and (2) learned, masking-invariant dense features as conditioning variables, replacing hand-crafted positional encodings. Trained on over 30,000 unlabeled 3D CT volumes, our fully self-supervised model, Screener, outperforms existing UVAS methods on four large-scale test datasets comprising 1,820 scans with diverse pathologies. Furthermore, in a low-shot supervised fine-tuning setting, Screener surpasses existing self-supervised pretraining methods, establishing it as a state-of-the-art foundation for pathology segmentation. The code and pretrained models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1594. UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Audio Language Model Audio Understanding Audio Generation
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-R1, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1595. RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reward modeling model alignment inference-time control customization LLM post-training
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2\%) and JudgeBench (81.4\%, \#1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at $<5$\% of the inference cost).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1596. Discovering and Steering Interpretable Concepts in Large Generative Music Models
๐ฌ 23
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ Interpretability Generative Models Music Mechanistic Sparse Autoencoders
The fidelity with which neural networks can now generate content such as music presents a scientific opportunity: these systems appear to have learned implicit theories of such content's structure through statistical learning alone. This offers a potentially new lens on theories of human-generated media. When internal representations align with traditional constructs (e.g. chord progressions in music), they show how such categories can emerge from statistical regularities; when they diverge, they expose limits of existing frameworks and patterns we may have overlooked but that nonetheless carry explanatory power. In this paper, focusing on music generators, we introduce a method for discovering interpretable concepts using sparse autoencoders (SAEs), extracting interpretable features from the residual stream of a transformer model. We make this approach scalable and evaluable using automated labeling and validation pipelines. Our results reveal both familiar musical concepts and coherent but uncodified patterns lacking clear counterparts in theory or language. As an extension, we show such concepts can be used to steer model generations. Beyond improving model transparency, our work provides an empirical tool for uncovering organizing principles that have eluded traditional methods of analysis and synthesis.
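A minimal sparse-autoencoder sketch of the kind applied to residual-stream activations (the generic SAE recipe; the paper's dictionary width, sparsity penalty, and automated labeling pipeline differ):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        return self.decoder(codes), codes

def sae_loss(x, recon, codes, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    return ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()
```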
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1597. Point-Focused Attention Meets Context-Scan State Space: Robust Biological Visual Perception for Point Cloud Representation
๐ฌ 18
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Point cloud learning Attention mechanism State space model Biomimetic vision
Synergistically capturing intricate local structures and global contextual dependencies has become a critical challenge in point cloud representation learning. To address this, we introduce PointLearner, a point cloud representation learning network that closely aligns with biological vision, which employs an active, foveation-inspired processing strategy, thus enabling local geometric modeling and long-range dependency interactions simultaneously. Specifically, we first design a point-focused attention mechanism, which simulates foveal vision at the visual focus through a competitive normalized attention mechanism between local neighbors and spatially downsampled features. The spatially downsampled features are extracted by a pooling method based on learnable inducing points, which can flexibly adapt to the non-uniform distribution of point clouds as the number of inducing points is controlled and they interact directly with point clouds. Second, we propose a context-scan state space that mimics the eye's saccadic inference, which infers the overall semantic structure and spatial content in the scene through a scan path guided by the Hilbert curve for the bidirectional S6. With this focus-then-context biomimetic design, PointLearner demonstrates remarkable robustness and achieves state-of-the-art performance across multiple point cloud tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1598. Infinite Horizon Markov Economies
๐ฌ 20
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Algorithmic game theory Equilibrium computation Markov pseudo-games Markov exchange economies
In this paper, we study a generalization of Markov games and pseudo-games that we call Markov pseudo-games, which, like the former, captures time and uncertainty and, like the latter, allows for the players' actions to determine the set of actions available to the other players. In the same vein as Arrow and Debreu, we intend for this model to be rich enough to encapsulate a broad mathematical framework for modeling economies. We then prove the existence of a game-theoretic equilibrium in our model, which in turn implies the existence of a general equilibrium in the corresponding economies. Finally, going beyond Arrow and Debreu, we introduce a solution method for Markov pseudo-games, and prove its polynomial-time convergence.
We then provide an application of Markov pseudo-games to infinite-horizon Markov exchange economies, a stochastic economic model that extends Radner's stochastic exchange economy and Magill and Quinzii's infinite horizon incomplete markets model. We show that under suitable assumptions, the solutions of any infinite horizon Markov exchange economy (i.e., recursive Radner equilibria, RRE) can be formulated as the solution to a concave Markov pseudo-game, thus establishing the existence of RRE and providing first-order methods for approximating RRE. Finally, we demonstrate the effectiveness of our approach in practice by building the corresponding generative adversarial policy neural network, and using it to compute RRE in a variety of infinite-horizon Markov exchange economies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1599. Exploring the Design Space of Transition Matching
๐ฌ 10
โญ 6.00
๐ generative models
๐ท๏ธ flow matching transition matching generative models
Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples, however it uses a second ``internal'' generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step.
In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state of the art among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up that excels at image aesthetics.
Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while also indicating which design choices are unlikely to provide further gains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1600. Unified 3D Scene Understanding Through Physical World Modeling
๐ฌ 13
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ 3D Scene Understanding Visual World Models
Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non-trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present 3WM, a physical world model for unified 3D understanding and interaction, formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero-shot without task-specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real-world scenarios, achieving state-of-the-art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task-specific systems, taking a step towards a general-purpose visual world model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1601. Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM data pipeline Reinforcement learning
Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the \textbf{\texttt{Webscale-RL} pipeline}, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the \textbf{\texttt{Webscale-RL} dataset}, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1602. Learning Semi-Structured Sparsity for LLMs via Shared and Context-Aware Hypernetwork
๐ฌ 9
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LLM pruning semi-structured sparsity hypernetwork continual learning
Large Language Models (LLMs) achieve state-of-the-art performance but are costly to deploy in resource-constrained environments. Pruning with $n:m$ semi-structured sparsity reduces computation and enables hardware acceleration, yet existing methods face a trade-off: one-shot approaches are efficient but heuristic, while optimization-based methods are accurate but expensive.
We introduce \textbf{HyperPrune}, a resource-efficient framework that directly optimizes $n:m$ sparsity. A lightweight hypernetwork, shared across layers and conditioned on learnable embeddings, generates structured masks in a one-shot, layer-wise manner. \textit{Continual pruning} preserves cross-layer knowledge, and \textit{feature outlier regularization} retains critical activations, unifying the strengths of heuristic and optimization-based methods.
Experiments on LLaMA-7B to 70B show state-of-the-art accuracy-sparsity trade-offs on a single A100 GPU, achieving higher efficiency, accuracy, and scalability than prior approaches. HyperPrune offers a practical, scalable, and hardware-friendly solution for structured LLM pruning.
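A minimal sketch of the $n{:}m$ semi-structured masking that HyperPrune optimizes, assuming per-weight importance scores are already available (here plain magnitudes stand in for the hypernetwork outputs, which are not reproduced): within each group of $m$ consecutive weights, the $n$ highest-scoring entries are kept.

```python
# Hedged sketch: turn per-weight scores into an n:m semi-structured sparsity mask.
# Magnitude scores below are a stand-in for the hypernetwork-generated scores.
import torch

def nm_mask(scores: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """scores: [rows, cols] with cols divisible by m; returns a {0,1} mask."""
    rows, cols = scores.shape
    groups = scores.view(rows, cols // m, m)      # group every m consecutive weights
    topk = groups.topk(n, dim=-1).indices         # keep the n best per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.view(rows, cols)

# Usage: apply the mask to a weight matrix after scoring it.
W = torch.randn(8, 16)
pruned_W = W * nm_mask(W.abs())                   # 2:4 sparsity, hardware-friendly layout
```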
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1603. Trust-Region Adaptive Policy Optimization
๐ฌ 21
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Large Language Models Reasoning Model Reinforcement Learning Trust Region Knowledge Distillation
Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models' (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL's potential for improvements.
We address this inefficiency with TRAPO (**T**rust-**R**egion **A**daptive **P**olicy **O**ptimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
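A hedged sketch of the per-instance hybrid objective described above: an SFT term on expert-prefix tokens plus a PPO-clip term on the model's own completion tokens. The specific trust-region attenuation used below (down-weighting prefix tokens whose model probability falls below a threshold) is an illustrative assumption, not the paper's exact TrSFT formulation.

```python
# Hedged sketch of a TRAPO-style hybrid loss (illustrative; the exact TrSFT rule may differ).
import torch
import torch.nn.functional as F

def trapo_loss(logits, target_ids, prefix_mask, advantages, old_logprobs,
               trust_p: float = 0.1, clip_eps: float = 0.2):
    """logits: [T, V]; target_ids: [T]; prefix_mask: [T] bool (True = expert-prefix token)."""
    logprobs = F.log_softmax(logits, dim=-1)
    tok_logp = logprobs.gather(-1, target_ids[:, None]).squeeze(-1)      # [T]

    # SFT on expert prefix tokens, attenuated outside a probability trust region.
    probs = tok_logp.exp().detach()
    weight = torch.where(probs >= trust_p, torch.ones_like(probs), probs / trust_p)
    sft = -(weight * tok_logp)[prefix_mask].mean()

    # PPO-clip style RL loss on the model's own completion tokens.
    comp = ~prefix_mask
    ratio = (tok_logp[comp] - old_logprobs[comp]).exp()
    adv = advantages[comp]
    rl = -torch.min(ratio * adv, ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv).mean()
    return sft + rl
```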
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1604. HARDTESTGEN: A High-Quality RL Verifier Generation Pipeline for LLM Algorithmic Coding
๐ฌ 15
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ LLMs RLVR code generation
Verifiers provide important reward signals for reinforcement learning of large language models (LLMs). However, it is challenging to create reliable verifiers, especially for code generation tasks. A well-disguised wrong solution program may only be detected by carefully crafted, human-written edge cases that are difficult to synthesize automatically. To address this issue, we propose HardTestsGen, an approach to synthesize high-quality test cases for algorithmic coding problems. We curate a comprehensive algorithmic programming dataset, HardTests, with 26.6k problems and high-quality synthetic tests. Compared with existing tests, HardTestsGen tests demonstrate significantly higher accuracy in verifying LLM-generated code (+11.22 percentage points in precision, the percentage of actually correct code within the predicted correct ones). We also show that downstream post-training, including rejection sampling and reinforcement learning (RL), with the HardTests verifier improves LLM code generation performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1605. MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
๐ฌ 12
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Model Compression Mixture-of-Experts Structured Pruning Expert Pruning
Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token.
However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory.
While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes.
This paper proposes \textbf{M}ixture-\textbf{o}f-\textbf{N}ovices-and-\textbf{E}xperts (\textbf{MoNE}), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression.
MoNE evaluates expert redundancy based on two metrics: access frequency and output variance.
Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices, unbiased estimations of their original outputs, minimizing performance degradation.
Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness.
Notably, it outperforms baselines by up to 2.72 points in average zero-shot accuracy across nine downstream tasks under a 25\% pruning ratio, with only a 0.14-point performance drop for Qwen2-57B-A14B.
The code is available at https://anonymous.4open.science/r/AnonymizedMoNE.
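As a rough illustration of the redundancy test and "novice" replacement described above, the sketch below assumes a calibration pass has already recorded, for each expert, the outputs it produced on the tokens routed to it; the thresholds and the constant-output novice parameterization are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of MoNE-style pruning: replace low-usage, low-variance experts
# with a constant "novice" equal to their mean calibration output.
import torch
import torch.nn as nn

class ConstantNovice(nn.Module):
    def __init__(self, mean_out: torch.Tensor):
        super().__init__()
        self.register_buffer("mean_out", mean_out)
    def forward(self, x):                              # x: [n_tokens, d_in]
        return self.mean_out.expand(x.shape[0], -1)    # unbiased constant estimate

@torch.no_grad()
def prune_experts(experts, calib_outputs, freq_thresh=64, var_thresh=1e-2):
    """calib_outputs[i]: [n_i, d_out] outputs of expert i on its routed calibration tokens."""
    for i, outs in enumerate(calib_outputs):
        freq = outs.shape[0]                           # access frequency
        if freq == 0:
            continue                                   # no calibration data; keep the expert
        var = outs.var(dim=0).mean().item() if freq > 1 else 0.0
        if freq < freq_thresh and var < var_thresh:    # rarely used and stable output
            experts[i] = ConstantNovice(outs.mean(dim=0))
    return experts
```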
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1606. ICaRus: Identical Cache Reuse for Efficient Multi-Model Inference
๐ฌ 33
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ multi model inference agentic ai prefix caching fine-tuning
Multi-model inference, where multiple task-specialized models collaborate to solve complex real-world problems, has recently emerged as a prominent paradigm, particularly in the development of agentic AI systems. However, in such scenarios, each model must maintain its own Key-Value (KV) cache for the identical prompt, leading to explosive memory consumption.
This explosive growth of KV caches forces LLM serving systems to evict previously stored caches, which in turn introduces significant recomputation overhead whenever the evicted caches are required again.
Moreover, prefix caching is inherently infeasible across different models, forcing each model to recompute the KV cache for the identical prompt, which leads to significant overhead. To alleviate these issues, we propose Identical Cache Reuse (ICaRus), a novel architecture that allows multiple models to share identical KV caches across all layers. ICaRus is based on the key observation that a decoder-only Transformer can be conceptually decomposed into a logical encoder, which generates KV caches, and a logical decoder, which predicts output tokens from the KV caches. ICaRus fine-tunes only the logical decoder while freezing the logical encoder, enabling multiple models to share an identical KV cache. This eliminates cache memory explosion and unexpected evictions while also allowing cross-model reuse of KV caches for new input tokens, thereby removing redundant recomputation in multi-model inference and achieving both efficiency and scalability. Moreover, by incorporating lightweight adapters such as LoRA, ICaRus parallelizes KV cache generation and next-token prediction during decoding. ICaRus achieves accuracy comparable to task-specific fine-tuned models across a diverse set of tasks, while allowing multiple specialized models to fully share KV caches. ICaRus achieves up to $11.1\times$ lower P95 latency and $3.8\times$ higher throughput in multi-agent scenarios with 8 different models, compared to prior multi-model systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1607. Flow Matching Policy Gradients
๐ฌ 13
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Flow Matching Policy Gradient
Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
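One plausible reading of the advantage-weighted ratio mentioned above, sketched below under stated assumptions: per-sample conditional flow-matching losses under the current and old policies form a surrogate ratio exp(L_old - L_new), which is plugged into the usual PPO-clip objective. The interpolation path, noise handling, and averaging details are illustrative assumptions.

```python
# Hedged sketch of an FPO-style update step (surrogate ratio from CFM losses).
import torch

def cfm_loss(velocity_net, x1, cond, t, x0):
    """Per-sample conditional flow-matching loss for a linear interpolation path."""
    xt = (1 - t) * x0 + t * x1                       # point on the path between noise x0 and action x1
    target_v = x1 - x0                               # target velocity for this path
    pred_v = velocity_net(xt, t, cond)               # velocity_net is a user-provided callable
    return ((pred_v - target_v) ** 2).mean(dim=-1)   # [batch]

def fpo_objective(velocity_net, old_cfm_loss, x1, cond, t, x0, advantages, clip_eps=0.2):
    new_loss = cfm_loss(velocity_net, x1, cond, t, x0)
    ratio = torch.exp(old_cfm_loss - new_loss)       # surrogate likelihood ratio (assumption)
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```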
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1608. TrainRef: Curating Data with Label Distribution and Minimal Reference for Accurate Prediction and Reliable Confidence
๐ฌ 14
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Label misinformation data curation influence function
Practical classification requires both high predictive accuracy and reliable confidence for human-AI collaboration. Given that a high-quality dataset is expensive and sometimes impossible to obtain, learning with noisy labels (LNL) is of great importance. State-of-the-art works propose many denoising approaches that categorically correct the label noise, i.e., change a label from one class to another. While effective in improving accuracy, they are less effective for learning reliable confidence. This happens especially when the number of classes grows, giving rise to more ambiguous samples. In addition, traditional approaches usually curate the training dataset (e.g., reweighting samples or correcting data labels) by intrinsically learning normalities from the noisy dataset. The curation performance can suffer when the noise ratio is high enough to form a polluting normality.
In this work, we propose a training-time data-curation framework, TrainRef, to uniformly address predictive accuracy and confidence calibration by (1) using an extrinsic small set of reference samples $D_{{ref}}$ to avoid normality pollution and (2) curating labels into a class distribution instead of a categorical class to handle sample ambiguity. Our insight is that the extrinsic information allows us to select more precise clean samples even when $|D_{{ref}}|$ equals the number of classes (i.e., one sample per class). Technically, we design (1) a reference augmentation technique to select clean samples from the dataset based on $D_{{ref}}$; and (2) a model-dataset co-evolving technique for a near-perfect embedding space, which is used to vote on the class distribution for the label of a noisy sample. Extensive experiments on CIFAR-100, Animal10N, and WebVision demonstrate that TrainRef outperforms state-of-the-art denoising techniques (DISC, L2B, and DivideMix) and model calibration techniques (label smoothing, Mixup, and temperature scaling). Furthermore, our user study shows that the model confidence trained by TrainRef aligns well with human intuition. More demonstrations, proofs, and experimental details are available at https://sites.google.com/view/train-ref.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1609. Learning Boltzmann Generators via Constrained Mass Transport
๐ฌ 16
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ sampling Boltzmann generators annealing
Efficient sampling from high-dimensional and multimodal unnormalized probability distributions is a central challenge in many areas of science and machine learning. We focus on Boltzmann generators (BGs) that aim to sample the Boltzmann distribution of physical systems, such as molecules, at a given temperature. Classical variational approaches that minimize the reverse Kullback-Leibler divergence are prone to mode collapse, while annealing-based methods, commonly using geometric schedules, can suffer from mass teleportation and rely heavily on schedule tuning. We introduce *Constrained Mass Transport* (CMT), a variational framework that generates intermediate distributions under constraints on both the KL divergence and the entropy decay between successive steps. These constraints enhance distributional overlap, mitigate mass teleportation, and counteract premature convergence. Across standard BG benchmarks and the newly introduced *ELIL tetrapeptide*, the largest system studied to date without access to samples from molecular dynamics, CMT consistently surpasses state-of-the-art variational methods, achieving more than 2.5× higher effective sample size while avoiding mode collapse.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1610. Enough is as good as a feast: A Comprehensive Analysis of How Reinforcement Learning Mitigates Task Conflicts in LLMs
๐ฌ 16
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model Reinforcement learning Model merging
Model merging plays a crucial role in consolidating multiple specialized models into a single, unified model, especially in the era of large language models (LLMs). Recent research has primarily focused on developing strategies to enhance merging performance with the trained models, while the impact of training paradigms, such as supervised fine-tuning (SFT) and reinforcement learning (RL), on the effectiveness of model merging remains underexplored. In this study, we systematically explore the merging behavior of RL-trained LLMs compared to those trained with traditional SFT. Through comprehensive evaluations across five representative tasks, we find that RL significantly reduces task conflicts and results in less performance degradation after merging, making RL-trained models particularly well-suited for this process.
To unearth the reasons behind the superior suitability of RL for model merging, we conduct extensive empirical experiments and theoretical analyses. Our findings highlight three key factors: (1) On-policy training data in RL keep the gradient updates at a smaller magnitude, reducing the risk of overwriting knowledge needed for other tasks in the model. (2) The RL optimization objective, which favors "\textit{enough is as good as a feast}", progressively reduces the magnitude of parameter updates as the model converges, thereby alleviating inter-task conflicts. (3) Joint optimization of positive and negative examples in RL steers the model towards an unbiased task-specific parameter subspace, ensuring robust performance while further preventing parameter conflicts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1611. Enhancing Visual Token Representations for Video Large Language Models via Training-free Spatial-Temporal Pooling and Gridding
๐ฌ 17
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Token Representation Video Understanding Multimodal Large Language Models
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved video understanding, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining.
Our method offers an efficient and plug-and-play solution for improving visual token representations.
Our code is available in [https://anonymous.4open.science/r/ST-GridPool-85BE](https://anonymous.4open.science/r/ST-GridPool-85BE).
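A small sketch of one way norm-based spatial pooling could work, assuming (per the abstract) that token norm correlates with semantic richness: within each non-overlapping spatial window, keep the token with the largest L2 norm instead of averaging. Window size and grid layout are illustrative assumptions.

```python
# Hedged sketch of norm-based spatial pooling over a grid of visual tokens.
import torch

def norm_based_pool(tokens: torch.Tensor, h: int, w: int, window: int = 2) -> torch.Tensor:
    """tokens: [h*w, d] visual tokens on an h x w grid; returns [h/window * w/window, d]."""
    d = tokens.shape[-1]
    grid = tokens.view(h, w, d)
    # split the grid into non-overlapping window x window cells
    cells = grid.view(h // window, window, w // window, window, d)
    cells = cells.permute(0, 2, 1, 3, 4).reshape(-1, window * window, d)   # [cells, win^2, d]
    norms = cells.norm(dim=-1)                        # [cells, win^2]
    best = norms.argmax(dim=-1)                       # highest-norm token per cell
    return cells[torch.arange(cells.shape[0]), best]  # [cells, d]
```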
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1612. Learning Retrieval Models with Sparse Autoencoders
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ text embedding sparse autoencoders sparse retrieval large language models
Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned Sparse Retrieval (LSR), whose objective is to encode queries and documents into high-dimensional sparse representations optimized for efficient retrieval. In contrast to existing LSR approaches that project input sequences into the vocabulary space, SAE-based representations offer the potential to produce more semantically structured, expressive, and language-agnostic features. By leveraging recently released open-source SAEs, we show that their latent features can serve as effective indexing units for representing documents and queries for sparse retrieval. Our experiments demonstrate that SAE-based LSR models consistently outperform their vocabulary-based counterparts in multilingual and out-of-domain settings. Finally, we introduce SPLARE, a 7B-parameter multilingual retrieval model capable of producing generalizable sparse latent embeddings for a wide range of languages and domains, achieving top results on MMTEB's multilingual and English retrieval tasks. We also release a more efficient 2B-parameter variant, offering strong performance with a significantly lighter footprint.
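A hedged sketch of how SAE latents could serve as sparse indexing units: a pooled LLM activation is passed through a ReLU-style SAE encoder, the top-k active latents become the "terms" of the document or query, and relevance is a sparse dot product over shared latent ids (exactly what an inverted index computes). The encoder form and the top-k step are assumptions about such a pipeline, not the paper's exact recipe.

```python
# Hedged sketch: SAE latents as sparse retrieval features.
import torch

def sae_encode(acts: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor, k: int = 64):
    """acts: [d_model] pooled LLM activation; returns (latent_ids, weights) of top-k SAE features."""
    latents = torch.relu(acts @ W_enc + b_enc)       # [d_sae] non-negative feature activations
    weights, ids = latents.topk(k)
    keep = weights > 0                               # drop inactive features from the top-k
    return ids[keep], weights[keep]

def sparse_score(q_ids, q_w, d_ids, d_w):
    """Dot product over the intersection of active latent ids, as an inverted index would do."""
    q = dict(zip(q_ids.tolist(), q_w.tolist()))
    return sum(w * q.get(i, 0.0) for i, w in zip(d_ids.tolist(), d_w.tolist()))
```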
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1613. Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
๐ฌ 11
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ honesty interrogation alignment auditing
As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to occasionally make factual mistakes, then admit them when asked. We show that the admission of factual errors in simple question-answering settings generalizes out-of-distribution (OOD) to the admission of hidden misaligned objectives in adversarial agentic settings. We evaluate SRFT in OOD stealth tasks, where models are instructed to complete a hidden misaligned objective alongside a user-specified objective without being caught by monitoring. After SRFT, models are more likely to confess the details of their hidden objectives when interrogated, even under strong pressure not to disclose them. Interrogation on SRFT models can detect hidden objectives with near-ceiling performance (F1 score = 0.98), while the baseline model lies when interrogated under the same conditions (F1 score = 0). Interrogation on SRFT models can further elicit the content of the hidden objective, recovering 28-100\% details, compared to 0\% details recovered in the baseline model and by prefilled assistant turn attacks. This provides a promising technique for promoting honesty propensity and incriminating misaligned AI systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1614. A Single Architecture for Representing Invariance Under Any Space Group
๐ฌ 17
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ symmetry group invariance space groups crystallographic groups Fourier series
Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups, symmetries critical to modeling crystalline solids in materials science and condensed matter physics, this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to such crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1615. Exchangeability of GNN Representations with Applications to Graph Retrieval
๐ฌ 26
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ GNN Locality sensitive hashing
In this work, we discover a probabilistic symmetry, called exchangeability, in graph neural networks (GNNs). Specifically, we show that the trained node embeddings computed by a large family of graph neural networks, learned under standard optimization tools, are exchangeable random variables. This implies that the probability density of the node embeddings remains invariant with respect to a permutation applied along their dimension axis. This results in identical distributions across the elements of the graph representations. Such a property enables approximation of transportation-based graph similarities by Euclidean similarities between order statistics. Leveraging this reduction, we propose a unified locality-sensitive hashing (LSH) framework that supports diverse relevance measures, including subgraph matching and graph edit distance. Experiments show that our method performs LSH more effectively than baselines.
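A tiny sketch of the reduction the abstract points to: one-dimensional optimal transport between two sets of exchangeable values reduces to comparing their sorted order statistics, so a transport-style similarity between graph representations can be approximated by the Euclidean distance between sorted coordinate values, which standard Euclidean LSH can then index. The mean-pooling step is an illustrative assumption.

```python
# Hedged sketch: approximate a transport-based graph similarity via order statistics.
import torch

def order_statistic_signature(node_embeddings: torch.Tensor) -> torch.Tensor:
    """node_embeddings: [num_nodes, d]; pool over nodes, then sort the d coordinates."""
    pooled = node_embeddings.mean(dim=0)              # [d] graph-level representation (assumed pooling)
    return pooled.sort().values                       # order statistics of the coordinates

def approx_transport_distance(g1: torch.Tensor, g2: torch.Tensor) -> float:
    """Euclidean distance between sorted signatures, usable with standard Euclidean LSH."""
    return torch.dist(order_statistic_signature(g1), order_statistic_signature(g2)).item()
```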
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1616. Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning
๐ฌ 11
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Deficiency Diagnosis Data Synthesis LLMs Reasoning
Large Language Models (LLMs) have demonstrated impressive generalization ability by learning from extensive unlabeled text. However, they still exhibit reasoning mistakes, which can affect their trustworthiness and reliability. Although users can interact with LLMs and provide diverse and comprehensive queries to expose the flaws of LLMs, obtaining sufficient and effective feedback is demanding. Furthermore, comprehensively evaluating LLMs with limited labeled samples is difficult. These factors make it challenging to diagnose and remedy the deficiencies in LLMs through rich label-free user queries. To tackle this challenge, and considering that LLMs' reasoning mistakes often stem from knowledge deficiencies, we propose label-free curricular meaningful learning (LaMer), which first employs relative entropy to diagnose and quantify knowledge deficiencies of LLMs in a label-free setting. Then, LaMer adaptively synthesizes augmentation data based on deficiency severity and progressively remedies them with a curricular remedy strategy. Experiments show that LaMer effectively diagnoses and remedies knowledge deficiencies in LLMs, improving various LLMs across seven out-of-distribution (OOD) reasoning benchmarks, achieving comparable results to baselines with only 40% of the training data. LaMer even surpasses methods that rely on labeled data for deficiency diagnosis. In application, LaMer offers a diagnostic tool for efficient LLM development.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1617. From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
๐ฌ 12
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Psychiatric Comorbidity Diagnostic Dialogue EMR Dataset Multi-Agent Simulation
Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework translates the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1618. Online Decision-Focused Learning
๐ฌ 12
โญ 6.00
๐ optimization
๐ท๏ธ decision-focused learning integrated estimaton optimization predict-then-optimize online learning
Decision-focused learning (DFL) is an increasingly popular paradigm for training predictive models whose outputs are used in decision-making tasks. Instead of merely optimizing for predictive accuracy, DFL trains models to directly minimize the loss associated with downstream decisions. However, existing studies focus solely on scenarios where a fixed batch of data is available and the objective function does not change over time. We instead investigate DFL in dynamic environments where the objective function and data distribution evolve over time. This setting is challenging for online learning because the objective function has zero or undefined gradients---which prevents the use of standard first-order optimization methods---and is generally non-convex. To address these difficulties, we (i) regularize the objective to make it differentiable and (ii) use perturbation techniques along with a near-optimal oracle to overcome non-convexity. Combining those techniques yields two original online algorithms tailored for DFL, for which we establish respectively static and dynamic regret bounds. These are the first provable guarantees for the online decision-focused problem. Finally, we showcase the effectiveness of our algorithms on a knapsack experiment, where they outperform two standard benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1619. Theoretical Analysis of Contrastive Learning under Imbalanced Data: From Training Dynamics to a Pruning Solution
๐ฌ 21
โญ 6.00
๐ learning theory
๐ท๏ธ Contrastive learning Feature learning Training dynamics Theoretical analysis
Contrastive learning has emerged as a powerful framework for learning generalizable representations, yet its theoretical understanding remains limited, particularly under imbalanced data distributions that are prevalent in real-world applications. Such an imbalance can degrade representation quality and induce biased model behavior, yet a rigorous characterization of these effects is lacking. In this work, we develop a theoretical framework to analyze the training dynamics of contrastive learning with Transformer-based encoders under imbalanced data. Our results reveal that neuron weights evolve through three distinct stages of training, with different dynamics for majority features, minority features, and noise. We further show that minority features reduce representational capacity, increase the need for more complex architectures, and hinder the separation of ground-truth features from noise. Inspired by these neuron-level behaviors, we show that pruning restores performance degraded by imbalance and enhances feature separation, offering both conceptual insights and practical guidance. Major theoretical findings are validated through numerical experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1620. WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
๐ฌ 22
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ agent information seeking data synthesis llm
The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing data synthesis approaches typically adopt an information-driven paradigm that first collects information and then refines question-answer pairs through retrieval. However, this may lead to inconsistency between information structure and reasoning structure, as well as between the question and the corresponding answer. To mitigate this, we propose a formalization-driven IS data synthesis framework, WebShaper, which systematically formalizes IS tasks using set-theoretic constructs.
Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure by KP operation compositions. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process.
At each step, an agentic Expander expands the current formal question into a more complex one using retrieval and validation tools grounded in our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on competitive benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1621. FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff
๐ฌ 11
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ stability-plasticity tradeoff continual learning
Deep neural networks trained on nonstationary data must balance stability (i.e., retaining prior knowledge) and plasticity (i.e., adapting to new tasks). Standard reinitialization methods, which reinitialize weights toward their original values, are widely used but difficult to tune: conservative reinitializations fail to restore plasticity, while aggressive ones erase useful knowledge. We propose FIRE, a principled reinitialization method that explicitly balances the stability-plasticity tradeoff. FIRE quantifies stability through Squared Frobenius Error (SFE), measuring proximity to past weights, and plasticity through Deviation from Isometry (DfI), reflecting weight isotropy. The reinitialization point is obtained by solving a constrained optimization problem, minimizing SFE subject to DfI being zero, which is efficiently approximated by Newton-Schulz iteration. FIRE is evaluated on continual visual learning (CIFAR-10 with ResNet-18), language modeling (OpenWebText with GPT-0.1B), and reinforcement learning (HumanoidBench with SAC and Atari games with DQN). Across all domains, FIRE consistently outperforms both naive training without intervention and standard reinitialization methods, demonstrating effective balancing of the stability-plasticity tradeoff.
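A compact sketch of the reinitialization target: the nearest isometry to the current weight matrix (zero Deviation from Isometry while staying close in squared Frobenius error) can be approximated with a Newton-Schulz iteration, as the abstract indicates. The iteration count and how the resulting point is mixed back into the weights are illustrative assumptions.

```python
# Hedged sketch: Newton-Schulz iteration toward the nearest isometry (a FIRE-style reinit point).
import torch

def newton_schulz_isometry(W: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Push W toward an orthogonal (isometric) matrix of the same shape."""
    X = W / W.norm()                                   # scale so singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X                # drives all singular values toward 1
    return X

# A FIRE-like reinitialization would replace W with (or interpolate toward) this isometric
# point, trading a small squared Frobenius error for zero deviation from isometry.
```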
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1622. H$^3$GNNs: Harmonizing Heterophily and Homophily in GNNs via Self-Supervised Node Encoding
๐ฌ 21
โญ 6.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Neural Networks Heterophily and Homophily Self Supervise Learning
Graph Neural Networks (GNNs) have made significant advances in representation learning on various types of graph-structured data. However, GNNs struggle to simultaneously model heterophily and homophily, a challenge that is amplified under self-supervised learning (SSL) where no labels are available to guide the training process. This paper presents H$^3$GNNs, an end-to-end graph SSL framework designed to harmonize heterophily and homophily through two complementary innovative perspectives: (i) Representation Harmonization via Joint Structural Node Encoding. Nodes are embedded into a unified latent space that retains both node specificity and graph structural awareness for harmonizing heterophily and homophily. Node specificity is learned via linear and non-linear node feature projections. Graph structural awareness is learned via a proposed Weighted Graph Convolutional Network (WGCN). A self-attention module enables the model to adapt to varying levels of patterns. (ii) Objective Harmonization via Predictive Architecture with Node-Difficulty-Aware Masking. A teacher network processes the full graph. A student network receives a partially masked graph. The student is trained end-to-end, while the teacher is an exponential moving average of the student. The proxy task is to train the student to predict the teacher's embeddings for all nodes (masked and unmasked). To keep the objective informative across the graph, two masking strategies that guide selection toward currently hard nodes while retaining exploration are proposed. Theoretical underpinnings of H$^3$GNNs are also analyzed in detail. Comprehensive evaluations on benchmarks demonstrate that H$^3$GNNs achieves state-of-the-art performance on heterophilic graphs (e.g., +7.1% on Texas, +9.6% on Roman-Empire over the prior art) while matching SOTA on homophilic graphs, and delivering strong computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1623. Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the Diverse Framework
๐ฌ 12
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Leaderboards LLM Evaluation Benchmarking
Current evaluation of large language models relies predominantly on technical benchmarks that fail to capture how users actually experience these systems in practice. Even the most notable human preference evaluation approaches suffer from methodological limitations including unrepresentative sampling, superficial assessment depth, and single-metric reductionism that obscures the multidimensional nature of human-AI interaction quality. We introduce DIVERSE, a rigorous evaluation framework that addresses these limitations through demographically stratified sampling, multi-turn naturalistic conversations, and assessment across five human-centric dimensions. We collected conversations from 21,352 participants stratified across 22 demographic groups in the US and UK, evaluating 27 state-of-the-art language models through pairwise comparisons. Using a robust hierarchical Bradley-Terry-Davidson model alongside post-stratified demographic adjustments to census weights, we reveal insights unavailable within existing approaches: (1) clear performance hierarchies with Gemini-2.5-Pro achieving 97% probability of ranking first for overall preference, (2) quantification of significant preference heterogeneity, identifying user age as the primary factor, revealing failures in model generalization across populations, and (3) differential discriminative power across human-centric evaluation dimensions, with Trust, Ethics & Safety showing significantly higher tie rates than task performance metrics. Our framework demonstrates that meaningful evaluation requires moving beyond aggregate preference scores to understand the complex, demographic-specific patterns that determine real-world model preference. We release our complete dataset, interactive leaderboard, and evaluation framework to catalyse further research into more rigorous and equitable evaluation of language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1624. Discrete Diffusion for Bundle Construction
๐ฌ 15
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Bundle Construction Bundle Completion Recommendation System Discrete Diffusion Model
As a central task in product bundling, bundle construction aims to select a subset of items from huge item catalogs to complete a partial bundle. Existing methods often rely on the sequential construction paradigm that predicts items one at a time; nevertheless, this paradigm is fundamentally unsuitable for essentially unordered bundles. In contrast, the non-sequential construction paradigm models a bundle as a set, yet it still faces two dimensionality curses: the combination complexity is exponential in the catalog size and bundle length.
Accordingly, we identify two technical challenges: 1) how to effectively and efficiently model the higher-order intra-bundle relations as the bundle length grows; and 2) how to learn item embeddings that are sufficiently discriminative while maintaining a search space smaller than the huge item set.
To address these challenges, we propose DDBC, a Discrete Diffusion model for Bundle Construction. DDBC leverages a masked denoising diffusion process to build bundles non-sequentially, capturing joint dependencies among items without relying on a pre-defined order. To mitigate the curse of large catalog size, we integrate residual vector quantization (RVQ), which compresses item embeddings into discrete codes drawn from a globally shared codebook, enabling more efficient search while retaining semantic granularity. We evaluate our method on real-world bundle construction datasets of music playlist continuation and fashion outfit completion, and the experimental results show that DDBC can achieve more than 100\% relative performance improvements compared with state-of-the-art baseline methods. Ablation and model analyses further confirm the effectiveness of both the diffusion backbone and the RVQ tokenizer, where the performance gain is more significant for larger catalog sizes and longer bundle lengths.
Our code is available at https://anonymous.4open.science/r/DDBC-44EE.
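A small sketch of the RVQ step the abstract relies on: item embeddings are quantized in stages, each stage picking the nearest code for the residual left by the previous stages, so every item ends up represented by a short tuple of discrete codes from shared codebooks. Codebook shapes and the number of stages are illustrative.

```python
# Hedged sketch: residual vector quantization of item embeddings into discrete codes.
import torch

def rvq_encode(x: torch.Tensor, codebooks: list) -> torch.Tensor:
    """x: [n_items, d]; codebooks: list of [K, d] tensors; returns codes [n_items, n_stages]."""
    residual = x
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)          # [n_items, K] distances to this stage's codes
        idx = dists.argmin(dim=-1)                 # nearest code per item at this stage
        codes.append(idx)
        residual = residual - cb[idx]              # quantize what is left for the next stage
    return torch.stack(codes, dim=-1)

# Reconstruction sums the selected codes: x_hat = sum_s codebooks[s][codes[:, s]].
```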
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1625. Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models
๐ฌ 4
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Motion generation Motion Tracking & Transfer
Animation of humanoid characters is essential in various graphics applications, but creating realistic animations requires significant time and cost. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This enables a cost-effective and accessible solution for synthesizing diverse and realistic 4D animations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1626. Statistical Guarantees for Offline Domain Randomization
๐ฌ 9
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Domain Randomization Sim-To-Real
Reinforcement-learning agents often struggle when deployed from simulation to the real world. A dominant strategy for reducing the sim-to-real gap is domain randomization (DR), which trains the policy across many simulators produced by sampling dynamics parameters, but standard DR ignores offline data already available from the real system. We study offline domain randomization (ODR), which first fits a distribution over simulator parameters to an offline dataset. While a growing body of empirical work reports substantial gains with algorithms such as DROPO, the theoretical foundations of ODR remain largely unexplored. In this work, we cast ODR as a maximum-likelihood estimation over a parametric simulator family and provide statistical guarantees: under mild regularity and identifiability conditions, the estimator is weakly consistent (it converges in probability to the true dynamics as data grows), and it becomes strongly consistent (i.e., it converges almost surely to the true dynamics) when an additional uniform Lipschitz continuity assumption holds. We examine the practicality of these assumptions and outline relaxations that justify ODR's applicability across a broader range of settings. Taken together, our results place ODR on a principled footing and clarify when offline data can soundly guide the choice of a randomization distribution for downstream offline RL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1627. The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
๐ฌ 10
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Large Language Model; Pre-Normalization
Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an ``asymmetric update dynamic,'' where high-norm visual tokens exhibit a ``representational inertia,'' causing them to transform semantically much more slowly than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic---the persistence of norm disparity and the resulting asymmetric update rates---is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
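The proposed fix is architecturally tiny; a minimal sketch follows, with illustrative module and attribute names (not LLaVA's exact ones) and an assumed gain initialization.

```python
# Hedged sketch: norm alignment by appending a LayerNorm to the visual projector.
import torch.nn as nn

class NormAlignedProjector(nn.Module):
    def __init__(self, projector: nn.Module, d_model: int, init_scale: float = 1.0):
        super().__init__()
        self.projector = projector                       # the original vision-to-LLM projector
        self.norm = nn.LayerNorm(d_model)
        nn.init.constant_(self.norm.weight, init_scale)  # "carefully initialized" gain (assumed value)

    def forward(self, visual_feats):
        # Normalizing projector outputs pulls visual token norms toward the text-token scale.
        return self.norm(self.projector(visual_feats))
```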
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1628. A Probabilistic Hard Concept Bottleneck for Steerable Generative Models
๐ฌ 18
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ generative models interpretability steerability concept bottleneck hard concepts probabilistic models
Concept Bottleneck Generative Models (CBGMs) incorporate a human-interpretable concept bottleneck layer, which makes them interpretable and steerable. However, designing such a layer for generative models poses the same challenges as for concept bottleneck models in a supervised context, if not greater ones. Deterministic mappings from the model's inner representations to soft concepts in existing CBGMs: (i) limit steerable generation to modifying concepts in existing inputs; and, more importantly, (ii) are susceptible to *concept leakage*, which hinders their steerability. To address these limitations, we first introduce the Variational Hard Concept Bottleneck (VHCB) layer. The VHCB maps probabilistic estimates of binary latent variables to hard concepts, which have been shown to mitigate leakage. Remarkably, its probabilistic formulation enables direct generation from a specified set of concepts. Second, we propose a systematic evaluation framework for assessing the steerability of CBGMs across various tasks (e.g., activating and deactivating concepts). Our framework allows us to empirically demonstrate that the VHCB layer consistently improves steerability.
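A hedged sketch of the core mechanism as described: per-concept Bernoulli probabilities are sampled into hard {0,1} concepts, with a straight-through trick so gradients still flow, and steered generation can feed a user-specified concept vector directly to the decoder. The straight-through estimator is an assumption for illustration; the paper's variational details are not reproduced here.

```python
# Hedged sketch: probabilistic hard concepts via straight-through Bernoulli sampling.
import torch
import torch.nn as nn

class HardConceptLayer(nn.Module):
    def __init__(self, d_in: int, n_concepts: int):
        super().__init__()
        self.to_logits = nn.Linear(d_in, n_concepts)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.to_logits(h))     # Bernoulli parameter per concept
        hard = torch.bernoulli(probs)                # hard {0,1} concepts (helps against leakage)
        return hard + probs - probs.detach()         # straight-through gradient path (assumed)

# Steered generation can bypass the encoder entirely by feeding a chosen
# {0,1} concept vector straight into the downstream generator.
```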
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1629. Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity
๐ฌ 10
โญ 6.00
๐ learning theory
๐ท๏ธ loss of plasticity deep learning theory continual learning
Deep learning models excel in stationary settings but suffer from loss of plasticity (LoP) in non-stationary environments. While prior literature characterizes LoP through symptoms like rank collapse of representations, it often lacks a mechanistic explanation for why gradient descent fails to recover from these states. This work presents a first-principles investigation grounded in dynamical systems theory, formally defining LoP not merely as a statistical degradation, but as an entrapment of gradient dynamics within invariant sub-manifolds of the parameter space. We identify two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Crucially, our framework uncovers a fundamental tension: the very mechanisms that promote generalization in static settings, such as low-rank compression, actively steer the network into these LoP manifolds. We validate our theoretical analysis with numerical simulations and demonstrate how architectural interventions can destabilize these manifolds to restore plasticity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1630. LightMem: Lightweight and Efficient Memory-Augmented Generation
๐ฌ 33
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model LLM memory
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often incur substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognitive-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information by topic. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference.
Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117×, API calls by up to 159×, and runtime by over 12×. Code will be released on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1631. Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training
๐ฌ 14
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning RL QA Long-context RAG NLP
Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost.
However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search.
Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval.
However, this type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs.
In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL).
Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks Babilong and RULER for contexts up to 10M tokens.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1632. Fingerprinting Deep Neural Networks for Ownership Protection: An Analytical Approach
๐ฌ 19
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ neural network fingerprinting ownership verification
Adversarial-example-based fingerprinting approaches, which leverage the decision boundary characteristics of deep neural networks (DNNs) to craft fingerprints, have proven effective for protecting model ownership. However, a fundamental challenge remains unresolved: how far a fingerprint should be placed from the decision boundary to simultaneously satisfy two essential properties, robustness and uniqueness, required for effective and reliable ownership protection. Despite the importance of the fingerprint-to-boundary distance, existing works offer no theoretical solution and instead rely on empirical heuristics to determine it, which may lead to violations of either the robustness or the uniqueness property.
We propose AnaFP, an analytical fingerprinting scheme that constructs fingerprints under theoretical guidance. Specifically, we formulate the fingerprint generation task as the problem of controlling the fingerprint-to-boundary distance through a tunable stretch factor. To ensure both robustness and uniqueness, we mathematically formalize these properties that determine the lower and upper bounds of the stretch factor. These bounds jointly define an admissible interval within which the stretch factor must lie, thereby establishing a theoretical connection between the two constraints and the fingerprint-to-boundary distance. To enable practical fingerprint generation, we approximate the original (infinite) sets of pirated and independently trained models using two finite surrogate model pools and employ a quantile-based relaxation strategy to relax the derived bounds. Particularly, due to the circular dependency between the lower bound and the stretch factor, we apply a grid search strategy over the admissible interval to determine the most feasible stretch factor. Extensive experimental results demonstrate that AnaFP consistently outperforms prior methods, achieving effective and reliable ownership verification across diverse model architectures and model modification attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1633. PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
๐ฌ 30
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ test-time learning long-context reasoning meta-learning reasoning algorithm length extrapolation
Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long contexts using gradient updates at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard long-context finetuning, achieving average absolute performance gains of up to 20% for Qwen-2.5 (0.5B & 7B) on synthetic and real-world long-context reasoning. PERK also maintains its advantages across model scales and families. Compared to specialized long-context LLMs, PERK matches or surpasses their performance. Finally, our analyses show PERK is more robust to reasoning complexity, length extrapolation, and the positions of relevant information in contexts.
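A hedged sketch of the test-time inner loop only (the meta-training outer loop is omitted), assuming an HF-style causal LM whose trainable parameters have already been restricted to a LoRA adapter and a long context split into tokenized chunks:

```python
# Hedged sketch of PERK-style test-time context encoding into a LoRA adapter.
import torch

def encode_context(model, context_chunks, inner_steps: int = 3, lr: float = 1e-4):
    """Gradient steps on the next-token loss over the context update only the adapter weights."""
    adapter_params = [p for p in model.parameters() if p.requires_grad]   # LoRA params only (assumed)
    opt = torch.optim.AdamW(adapter_params, lr=lr)
    model.train()
    for _ in range(inner_steps):
        for input_ids in context_chunks:               # each: [1, chunk_len] token ids
            loss = model(input_ids=input_ids, labels=input_ids).loss      # HF-style API (assumed)
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()
    return model  # the adapter now serves as a parametric memory of the long context
```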
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1634. CARD: Towards Conditional Design of Multi-agent Topological Structures
๐ฌ 16
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multi-Agent Systems Graph Learning
Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: \url{https://anonymous.4open.science/r/agentgraph-FF9A}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1635. DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration
๐ฌ 16
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Rashomon Set Rashomon Effect Feature-wise Linear Modulation (FiLM) CMA-ES Model Multiplicity Predictive Multiplicity Neural Network Diversity
We propose DIVERSE, a framework for systematically exploring the Rashomon set of deep neural networks, the collection of models that match a reference model's accuracy while differing in their predictive behavior. DIVERSE augments a pretrained model with Feature-wise Linear Modulation (FiLM) layers and uses Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to search a latent modulation space, generating diverse model variants without retraining or gradient access. Across MNIST, PneumoniaMNIST, and CIFAR-10, DIVERSE uncovers multiple high-performing yet functionally distinct models. Our experiments show that DIVERSE offers a competitive and efficient exploration of the Rashomon set, making it feasible to construct diverse sets that maintain robustness and performance while supporting well-balanced model multiplicity. While retraining remains the baseline to generate Rashomon sets, DIVERSE achieves comparable diversity at reduced computational cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1636. Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
๐ฌ 22
โญ 6.00
๐ generative models
๐ท๏ธ diffusion language models compute efficient sampling skipping compute adaptive attention
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step---even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute.
We propose \textbf{\textsc{SureLock}}: when the posterior at an unmasked position has stabilized across steps (our \emph{sure} condition), we \emph{lock} that position---thereafter skipping its query projection and feed-forward sublayers---while caching its attention keys and values so other positions can continue to attend to it.
This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$ where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension.
In practice, $M$ decreases as the iteration progresses, yielding substantial savings.
On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50\% relative to the same sampler without locking,
while maintaining comparable generation quality.
We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities.
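A toy sketch of the lock rule described above: a position is locked once the KL divergence between its posteriors at consecutive steps falls below a threshold, after which its compute is skipped and its cached output is reused. The posterior dynamics, threshold, and sizes are illustrative stand-ins, not the LLaDA sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, steps, tau = 8, 50, 12, 1e-3        # positions, vocab size, steps, KL threshold

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

# Toy posterior dynamics: each position drifts toward its own limiting distribution.
target = rng.dirichlet(np.ones(V) * 0.1, size=N)
rate = rng.uniform(0.3, 0.9, size=N)

prev = np.full((N, V), 1.0 / V)
locked = np.zeros(N, dtype=bool)
cache = prev.copy()                        # stands in for cached keys/values

for t in range(steps):
    post = prev.copy()
    for i in range(N):
        if locked[i]:
            continue                       # skip query/FFN compute, reuse the cache
        post[i] = (1 - rate[i]) * prev[i] + rate[i] * target[i]
        if kl(post[i], prev[i]) < tau:     # "sure" condition: posterior has stabilized
            locked[i] = True
            cache[i] = post[i]
    prev = post
    print(f"step {t}: {int(locked.sum())}/{N} positions locked")
```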
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1637. LiteGuard: Efficient Task-Agnostic Model Fingerprinting with Enhanced Generalization
๐ฌ 21
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ neural network fingerprinting ownership verification
Task-agnostic model fingerprinting has recently gained increasing attention due to its ability to provide a universal framework applicable across diverse model architectures and tasks. The current state-of-the-art method, MetaV, ensures generalization by jointly training a set of fingerprints and a neural-network-based global verifier using two large and diverse model sets: one composed of pirated models (i.e., the protected model and its variants) and the other comprising independently-trained models. However, publicly available models are scarce in many real-world domains, and constructing such model sets requires intensive training efforts and massive computational resources, posing a significant barrier to practical deployment. Reducing the number of models can alleviate the overhead, but increases the risk of overfitting, a problem further exacerbated by MetaV's entangled design, in which all fingerprints and the global verifier are jointly trained. This overfitting issue leads to compromised generalization capability to verify unseen models.
In this paper, we propose LiteGuard, an efficient task-agnostic fingerprinting framework that attains enhanced generalization while significantly lowering computational cost. Specifically, LiteGuard introduces two key innovations: (i) a checkpoint-based model set augmentation strategy that enriches model diversity by leveraging intermediate model snapshots captured during the training of each pirated and independently-trained model, thereby alleviating the need to train a large number of pirated and independently-trained models, and (ii) a local verifier architecture that pairs each fingerprint with a lightweight local verifier, thereby reducing parameter entanglement and mitigating overfitting. Extensive experiments across five representative tasks show that LiteGuard consistently outperforms MetaV in both generalization performance and computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1638. Learning From the Past with Cascading Eligibility Traces
๐ฌ 13
โญ 6.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ biological credit assignment eligibility traces synaptic plasticity computational neuroscience
Animals often receive information about errors and rewards after significant delays. In some cases these delays are fixed aspects of neural processing or sensory feedback; for example, there is typically a delay of tens to hundreds of milliseconds between motor actions and visual feedback. The standard approach to handling delays in models of synaptic plasticity is to use eligibility traces. However, standard eligibility traces that decay exponentially mix together any events that happen during the delay, presenting a problem for any credit assignment signal that occurs with a significant delay. Here, we show that eligibility traces formed by a state-space model, inspired by a cascade of biochemical reactions, can provide a temporally precise memory for handling credit assignment at arbitrary delays. We demonstrate that these cascading eligibility traces (CETs) work for credit assignment at behavioral time-scales, ranging from seconds to minutes. In addition, CETs can handle extremely slow retrograde signals, such as those found in retrograde axonal signaling. These results demonstrate that CETs can provide an excellent basis for modeling synaptic plasticity.
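A short sketch of the contrast between a standard exponential trace and a cascade of leaky integrators (one simple state-space instantiation of the idea above): the exponential trace peaks at the synaptic event itself, while the last stage of the cascade peaks after a delay of roughly K times the stage time constant, giving a temporally localized memory. Time constants and stage count are illustrative.

```python
import numpy as np

T, dt = 300, 0.01                 # time steps and step size (seconds)
tau = 0.5                         # time constant of each stage (illustrative)
K = 5                             # number of cascade stages

spike = np.zeros(T); spike[10] = 1.0   # synaptic event at t = 0.1 s

# Standard exponential eligibility trace.
exp_trace = np.zeros(T)
for t in range(1, T):
    exp_trace[t] = exp_trace[t-1] + dt * (-exp_trace[t-1] / tau) + spike[t]

# Cascading eligibility trace: a chain of leaky integrators (state-space model).
x = np.zeros(K)
cet = np.zeros(T)
for t in range(T):
    drive = spike[t] / dt
    for k in range(K):
        inp = drive if k == 0 else x[k-1] / tau
        x[k] += dt * (-x[k] / tau + inp)
    cet[t] = x[-1]                # last stage peaks after a delay of roughly K * tau

print("exponential trace peaks at t =", np.argmax(exp_trace) * dt, "s")
print("cascade trace peaks at t    =", np.argmax(cet) * dt, "s")
```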
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1639. Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning
๐ฌ 32
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Models; Academic Paper Reasoning; Scan-Oriented Reasoning
With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on scholarly paper reasoning is largely confined to a search-oriented paradigm centered on pre-specified targets, with reasoning grounded in relevance retrieval, which struggles to support researcher-style full-document understanding, reasoning, and verification. To bridge this gap, we propose ScholScan, a new benchmark for scholarly paper reasoning. ScholScan introduces a scan-oriented task setting that asks models to read and cross-check entire papers like human researchers, scanning the document to identify consistency issues. The benchmark comprises 1,800 carefully annotated questions drawn from 9 error families across 13 natural-science domains and 715 papers, and provides detailed annotations for evidence localization and reasoning traces, together with a unified evaluation protocol. We assess 15 models across 24 input configurations and conduct a fine-grained analysis of MLLM capabilities across error families. Across the board, retrieval-augmented generation (RAG) methods yield no significant improvements, revealing systematic deficiencies of current MLLMs on scan-oriented tasks and underscoring the challenge posed by ScholScan. We expect ScholScan to serve as a representative benchmark for the scan-oriented task paradigm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1640. Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
๐ฌ 37
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ agent information seeking data synthesis llm
Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from \textit{low search efficiency}, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks (Basic, Union, and Reverse-Union) to systematically increase both IS efficiency and effectiveness. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments conducted on five IS benchmarks (BrowserComp, GAIA, Seal-0, WideSearch, and xbench-DeepSearch) demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1641. SCRAPL: Scattering Transform with Random Paths for Machine Learning
๐ฌ 16
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ scattering transform wavelets stochastic optimization ddsp perceptual quality assessment
The Euclidean distance between wavelet scattering transform coefficients (known as *paths*) provides informative gradients for perceptual quality assessment of deep inverse problems in computer vision, speech, and audio processing.
However, these transforms are computationally expensive when employed as differentiable loss functions for stochastic gradient descent due to their numerous paths, which significantly limits their use in neural network training.
To address this problem, we propose ``Scattering transform with Random Paths for machine Learning'' (SCRAPL): a stochastic optimization scheme for efficient evaluation of multivariable scattering transforms.
We implement SCRAPL for the joint time-frequency scattering transform (JTFS), which demodulates spectrotemporal patterns at multiple scales and rates, allowing a fine characterization of intermittent auditory textures.
We apply SCRAPL to differentiable digital signal processing (DDSP), specifically, unsupervised sound matching of a granular synthesizer and the Roland TR-808 drum machine.
We also propose an initialization heuristic based on importance sampling, which adapts SCRAPL to the perceptual content of the dataset, improving neural network convergence and evaluation performance.
We make our audio samples available and provide SCRAPL as a Python package.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1642. Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
๐ฌ 16
โญ 6.00
๐ generative models
๐ท๏ธ Diffusion models Flow Matching Acceleration of diffusion/flow models Distillation of diffusion/flow models
While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to one specific framework, e.g., to diffusion models only or to flow models only. Furthermore, these methods are naturally data-free, and to benefit from real data they require an additional, complex adversarial training stage with an extra discriminator model. In this paper, we present \textbf{RealUID}, a unified distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our \textbf{RealUID} approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and is also extended to their modifications, such as Bridge Matching and Stochastic Interpolants.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1643. Conditioned Initialization for Attention
๐ฌ 17
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ spectral conditioning transformers spectral properties of attention
Transformers are a dominant architecture in modern machine learning, powering applications across vision, language, and beyond. At the core of their success lies the attention layer, where the query, key, and value matrices determine how token dependencies are captured. While considerable work has focused on scaling and optimizing Transformers, comparatively little attention has been paid to how the weights of the queries, keys and values are initialized. Common practice relies on random initialization or alternatives such as mimetic initialization, which imitates weight patterns from converged models, and weight selection, which transfers weights from a teacher model. In this paper, we argue that initialization can introduce an optimization bias that fundamentally shapes training dynamics. We propose **conditioned initialization**, a principled scheme that initializes attention weights to improve the spectral properties of the attention layer. Theoretically, we show that conditioned initialization can potentially reduce the condition number of the attention Jacobian, leading to more stable optimization. Empirically, it accelerates convergence and improves generalization across diverse applications, highlighting conditioning as a critical yet underexplored area for advancing Transformer performance. Importantly, conditioned initialization is simple to apply and integrates seamlessly into a wide range of Transformer architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1644. A Relative Error-Based Evaluation Framework of Heterogeneous Treatment Effect Estimators
๐ฌ 28
โญ 6.00
๐ causal reasoning
๐ท๏ธ Causal Inference Conditional Average Treatment Effect Relative Error Robust Evaluation
While significant progress has been made in heterogeneous treatment effect (HTE) estimation, the evaluation of HTE estimators remains underdeveloped. In this article, we propose a robust evaluation framework based on relative error, which quantifies performance differences between two HTE estimators. We first derive the key theoretical conditions on the nuisance parameters that are necessary to achieve a robust estimator of relative error. Building on these conditions, we introduce novel loss functions and design a neural network architecture to estimate nuisance parameters, thereby obtaining a robust estimation of relative error. We provide large sample properties of the proposed relative error estimator. Furthermore, beyond evaluation, we propose a new learning algorithm for HTE that leverages both previously obtained HTE estimators and the nuisance parameters learned through our neural network architecture. Extensive experiments demonstrate that our evaluation framework supports reliable comparisons across HTE estimators, and the proposed learning algorithm for HTE exhibits desirable performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1645. Splat the Net: Radiance Fields with Splattable Neural Primitives
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ neural rendering radiance field representation 3DGS NeRF
Radiance fields have emerged as a predominant representation for modeling 3D scene appearance. Neural formulations such as Neural Radiance Fields provide high expressivity but require costly ray marching for rendering, whereas primitive-based methods such as 3D Gaussian Splatting offer real-time efficiency through splatting, yet at the expense of representational power. Inspired by advances in both these directions, we introduce splattable neural primitives, a new volumetric representation that reconciles the expressivity of neural models with the efficiency of primitive-based splatting. Each primitive encodes a bounded neural density field parameterized by a shallow neural network. Our formulation admits an exact analytical solution for line integrals, enabling efficient computation of perspectively accurate splatting kernels. As a result, our representation supports integration along view rays without the need for costly ray marching. The primitives flexibly adapt to scene geometry and, being larger than prior analytic primitives, reduce the number required per scene. On novel-view synthesis benchmarks, our approach matches the quality and speed of 3D Gaussian Splatting while using 10x fewer primitives and 6x fewer parameters. These advantages arise directly from the representation itself, without reliance on complex control or adaptation frameworks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1646. Study of Training Dynamics for Memory-Constrained Fine-Tuning
๐ฌ 11
โญ 6.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Efficient Learning Energy Saving
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99\% activation sparsity, 95\% weight derivative sparsity, and 97\% reduction in FLOPs for weight derivative computation.
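A minimal sketch of the dynamic stochastic channel selection idea described above, under simplifying assumptions: a two-layer toy network where only one pre-selected layer is in the update budget, and each epoch a fresh random subset of its channels receives gradient updates while the rest are masked out. Layer choice, budget, and learning rate are illustrative, not the TraDy recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(32, 64)) * 0.1      # frozen layer (outside the update budget)
W2 = rng.normal(size=(64, 10)) * 0.1      # pre-selected layer, updated sparsely
X = rng.normal(size=(512, 32))
y = rng.integers(0, 10, size=512)
keep = 16                                  # channel budget per epoch (illustrative)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(5):
    # dynamic stochastic channel selection: resample the kept channels each epoch
    channels = rng.choice(64, size=keep, replace=False)
    h = np.maximum(X @ W1, 0.0)
    p = softmax(h @ W2)
    p[np.arange(len(y)), y] -= 1.0         # d(loss)/d(logits) for cross-entropy
    gW2 = h.T @ p / len(y)                 # full gradient of the selected layer
    mask = np.zeros_like(gW2); mask[channels, :] = 1.0
    W2 -= 0.5 * gW2 * mask                 # only the sampled channels are updated
    loss = -np.log(softmax(np.maximum(X @ W1, 0) @ W2)[np.arange(len(y)), y]).mean()
    print(f"epoch {epoch}: loss {loss:.3f}, updated channels {keep}/64")
```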
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1647. Why We Need New Benchmarks for Local Intrinsic Dimension Estimation
๐ฌ 11
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ Local intrinsic dimension estimation LIDL FLIPD Diffusion Models Benchmark Normalizing Flows ESS Normal Bundle NB LID
Recent advancements in algorithms for local intrinsic dimension (LID) estimation have been closely tied to progress in neural networks (NN). However, NN architectures are often tailored to specific domains, such as audio or image data, incorporating inductive biases that limit their transferability across domains. Moreover, existing LID estimation methods leveraging these architectures are typically evaluated on either overly simplistic benchmarks or domain datasets where the true LID is unknown, resulting in potentially erroneous evaluations. To close this research gap, we first isolate problematic aspects of LID estimation and leverage them to analyze the limitations of state-of-the-art methods. Our approach employs several techniques to create LID benchmarks for arbitrary domains, including the introduction of a method to transform any manifold into the domain while preserving the manifold structure, thereby addressing challenges posed by biases in neural network-based methods. Our comparative analysis reveals critical limitations and identifies new directions for future development in LID estimation methods. Code will be available on github when published.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1648. Provably Explaining Neural Additive Models
๐ฌ 16
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ explainability XAI explainable AI formal verification sufficient explanations
Despite significant progress in post-hoc explanation methods for neural networks, many remain heuristic and lack provable guarantees. A key approach for obtaining explanations with provable guarantees is by identifying a *cardinally-minimal* subset of input features which by itself is *provably sufficient* to determine the prediction. However, for standard neural networks, this task is often computationally infeasible, as it demands a worst-case *exponential* number of verification queries in the number of input features, each of which is NP-hard. In this work, we show that for Neural Additive Models (NAMs), a recent and more interpretable neural network family, we can *efficiently* generate explanations with such guarantees. We present a new model-specific algorithm for NAMs that generates provably cardinally-minimal explanations using only a *logarithmic* number of verification queries in the number of input features, after a parallelized preprocessing step with logarithmic runtime in the required precision is applied to each small univariate NAM component. Our algorithm not only makes the task of obtaining cardinally-minimal explanations feasible, but even outperforms existing algorithms designed to find *subset-minimal* explanations -- which may be larger and less informative but easier to compute -- despite our algorithm solving a much more difficult task. Our experiments demonstrate that, compared to previous algorithms, our approach provides provably smaller explanations than existing works and substantially reduces the computation time. Moreover, we show that our generated provable explanations offer benefits that are unattainable by standard sampling-based techniques typically used to interpret NAMs.
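A hedged sketch of why additivity makes this tractable, in a simplified setting where each univariate shape function is tabulated on a grid: since the NAM output is a sum of per-feature contributions, fixing a feature raises the worst-case lower bound on the (signed) output by a known gain, and greedily fixing the highest-gain features until the bound is positive yields a smallest subset whose worst case over the free features cannot flip the prediction. The paper's algorithm instead works with verification queries over continuous NAM components; everything below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, grid = 6, 41

# Toy NAM for binary classification: f(x) = bias + sum_i f_i(x_i); predict class 1 if f > 0.
xs = np.linspace(-2, 2, grid)
shape_fns = [np.sin((i + 1) * xs) + 0.3 * rng.normal(size=grid) for i in range(d)]
bias = 0.2

x_idx = rng.integers(0, grid, size=d)            # the instance to explain (grid indices)
contrib = np.array([shape_fns[i][x_idx[i]] for i in range(d)])
pred_positive = bias + contrib.sum() > 0
sign = 1.0 if pred_positive else -1.0

# Worst-case value of each free feature with respect to the predicted class.
worst = np.array([float((sign * f).min()) for f in shape_fns])
gain = sign * contrib - worst                    # how much fixing feature i raises the bound

# Greedily fix the highest-gain features until the prediction can no longer flip.
order = np.argsort(-gain)
fixed, bound = [], sign * bias + worst.sum()
for i in order:
    if bound > 0:
        break
    fixed.append(int(i))
    bound += gain[i]

print("prediction:", "class 1" if pred_positive else "class 0")
print("sufficient feature subset:", fixed, "| certified margin:", round(float(bound), 3))
```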
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1649. Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction
๐ฌ 14
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Language Model Evaluation AI Alignment AI Truthfulness and Deception Large Language Models
The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating strong models. In such cases, models have been demonstrated to exploit evaluation schemes built on such imperfect supervision, leading to deceptive results.
However, a wealth of mechanism design research, so far underutilized in LLM research, focuses on game-theoretic *incentive compatibility*: eliciting honest and informative answers with weak supervision.
Drawing from this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels.
We demonstrate the method's effectiveness and resistance to deception, with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with peer prediction-based reward recovers most of the drop in truthfulness due to prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning.
On the evaluation front, in contrast to LLM-as-a-Judge which requires strong and trusted judges, we discover an inverse scaling property in peer prediction, where, surprisingly, resistance to deception is *strengthened* as the capability gap between the experts and participants *widens*, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20$\times$ the judge's size, while peer prediction thrives when such gaps are large, including in cases with over 100$\times$ size difference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1650. A Dense Subset Index for Collective Query Coverage
๐ฌ 13
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ collective retrieval subset search submodular functions
In traditional information retrieval, corpus items compete with each other to occupy top ranks in response to a query. In contrast, in many recent retrieval scenarios associated with complex, multi-hop question answering or text-to-SQL, items are not self-complete: they must instead collaborate, i.e., information from multiple items must be combined to respond to the query. In the context of modern dense retrieval, this need translates into finding a small collection of corpus items whose contextual word vectors collectively cover the contextual word vectors of the query. The central challenge is to retrieve a near-optimal collection of covering items in time that is sublinear in corpus size. By establishing coverage as a submodular objective, we enable successive dense index probes to quickly assemble an item collection that achieves near-optimal coverage. Successive query vectors are iteratively `edited', and the dense index is built using random projections of a novel, lifted dense vector space. Beyond rigorous theoretical guarantees, we report on a scalable implementation of this new form of vector database. Extensive experiments establish the empirical success of DISCo, in terms of the best coverage vs. query latency tradeoffs.
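A brute-force sketch of the submodular coverage objective and its greedy maximization: coverage of a query by a set of items is the sum, over the query's contextual vectors, of the best (clipped) similarity to any token vector in the selected items, which is a monotone submodular facility-location-style function. The paper's contribution is answering the greedy's marginal-gain queries sublinearly via a lifted dense index; the scan below is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items = 32, 500

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query = unit(rng.normal(size=(6, dim)))                          # query contextual vectors
items = [unit(rng.normal(size=(rng.integers(3, 8), dim))) for _ in range(n_items)]

def coverage(selected):
    """Submodular: each query vector is covered by its best match in the selection."""
    if not selected:
        return 0.0
    pool = np.vstack([items[i] for i in selected])
    return float(np.clip(query @ pool.T, 0, None).max(axis=1).sum())

# Greedy maximization (brute force over items; DISCo replaces this scan with
# sublinear probes of a dense index built over a lifted vector space).
selected, budget = [], 3
for _ in range(budget):
    gains = [(coverage(selected + [i]) - coverage(selected), i)
             for i in range(n_items) if i not in selected]
    best_gain, best_item = max(gains)
    selected.append(best_item)
    print(f"picked item {best_item}, marginal coverage gain {best_gain:.3f}")
```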
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1651. Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss
๐ฌ 14
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Models Autoregressive Language Models Autoregressive Image Generation
Recent studies have explored autoregressive models for image generation, with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via diffusion losses. In this study, we present a theoretical analysis of diffusion and autoregressive models with diffusion loss, highlighting the latter's advantages. Specifically, we compare conditional diffusion and autoregressive diffusion with diffusion loss, demonstrating that patch denoising optimization in autoregressive models effectively mitigates condition errors and leads to a stable condition distribution. Our analysis also reveals that autoregressive condition generation refines the condition, causing the condition error influence to decay exponentially. In addition, we introduce a novel condition refinement approach based on Optimal Transport (OT) theory to address ``condition inconsistency''. We theoretically demonstrate that formulating condition refinement as a Wasserstein Gradient Flow ensures convergence toward the ideal condition distribution, effectively mitigating condition inconsistency. Experiments demonstrate the superiority of our method over both diffusion models and autoregressive models trained with diffusion loss.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1652. Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
๐ฌ 15
โญ 6.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ auto-bidding offline reinforcement learning generative decision making
Auto-bidding serves as a critical tool for advertisers to improve their advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static offline dataset. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator for scoring generation quality and designing a provably sound KL-Lipschitz-constrained score maximization scheme to ensure safe and efficient generalization beyond the offline dataset. A practical algorithm incorporating the synchronous coupling technique is further devised to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1653. You Point, I Learn: Online Adaptation of Interactive Segmentation Models for Handling Distribution Shifts in Medical Imaging
๐ฌ 4
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Medical Image Segmentation Interactive Online Adaptation
Interactive segmentation uses real-time user inputs, such as mouse clicks, to iteratively refine model predictions. Although not originally designed to address distribution shifts, this paradigm naturally lends itself to such challenges. In medical imaging, where distribution shifts are common, interactive methods can use user inputs to guide models towards improved predictions.
Moreover, once a model is deployed, user corrections can be used to adapt the network parameters to the new data distribution, mitigating distribution shift. Based on these insights, we aim to develop a practical, effective method for improving the adaptive capabilities of interactive segmentation models to new data distributions in medical imaging. Firstly, we found that strengthening the model's responsiveness to clicks is important for the initial training process. Moreover, we show that by treating the post-interaction user-refined model output as pseudo-ground-truth, we can design a lean, practical online adaptation method that enables a model to learn effectively across sequential test images. The framework includes two components: (i) a Post-Interaction adaptation process, updating the model after the user has completed interactive refinement of an image, and (ii) a Mid-Interaction adaptation process, updating incrementally after each click. Both processes include a Click-Centered Gaussian loss that strengthens the model's reaction to clicks and enhances focus on user-guided, clinically relevant regions. Experiments on 5 fundus and 4 brain MRI databases show that our approach consistently outperforms existing methods under diverse distribution shifts, including unseen imaging modalities and pathologies.
Code and pretrained models will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1654. IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
๐ฌ 15
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision Language Models VLMs Multimodal models Cultural VLMs Multimodal Evaluation OCR Cultural VQA Multimodal Machine Translation MMT
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1655. How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
๐ฌ 13
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ vision benchmark multimodal foundation models vision language models standard computer vision tasks
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants).
The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework.
We observe that 1) the models are not close to the state-of-the-art specialist models at any tasks, and 2) they perform semantic tasks notably better than geometric ones. However, 3) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks and 6) reasoning models, e.g. o3, show improvements in geometric tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1656. Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting
๐ฌ 10
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Conformal Prediction Uncertainty Quantification Distribution Shift Corrupted Labels Privileged Information
We introduce a framework for robust uncertainty quantification in situations where labeled training data are corrupted, through noisy or missing labels. We build on conformal prediction, a statistical tool for generating prediction sets that cover the test label with a pre-specified probability. The validity of conformal prediction, however, holds under the i.i.d. assumption, which does not hold in our setting due to the corruptions in the data. To account for this distribution shift, the privileged conformal prediction (PCP) method was proposed to leverage privileged information (PI)---additional features available only during training---to re-weight the data distribution, yielding valid prediction sets under the assumption that the weights are accurate. In this work, we analyze the robustness of PCP to inaccuracies in the weights. Our analysis indicates that PCP can still yield valid uncertainty estimates even when the weights are poorly estimated. Furthermore, we introduce uncertain imputation (UI), a new conformal method that does not rely on weight estimation. Instead, we impute corrupted labels in a way that preserves their uncertainty. Our approach is supported by theoretical guarantees and validated empirically on both synthetic and real benchmarks. Finally, we show that these techniques can be integrated into a triply robust framework, ensuring statistically valid predictions as long as at least one underlying method is valid.
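A small sketch of the re-weighting mechanism that PCP builds on, namely weighted split conformal prediction: calibration scores are combined with weights (likelihood ratios) and the prediction interval uses a weighted quantile that reserves mass for the test point. The weights below are illustrative Gaussian kernels rather than estimates from privileged information, and the regression model is a toy.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.1, 500

# Calibration data with heteroscedastic noise; the point predictor is mu(x) = 2x.
x_cal = rng.uniform(0, 1, n)
y_cal = 2 * x_cal + rng.normal(scale=0.2 + x_cal, size=n)
scores = np.abs(y_cal - 2 * x_cal)                   # nonconformity scores

def weighted_quantile(scores, weights, w_test, level):
    """Smallest score with weighted mass >= level (mass w_test reserved for +inf)."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cum = np.cumsum(w) / (w.sum() + w_test)
    idx = np.searchsorted(cum, level)
    return np.inf if idx == len(s) else s[idx]

x_test = 0.8
# Illustrative re-weighting (likelihood-ratio-like); in PCP these weights would be
# estimated from privileged information available only at training time.
w_cal = np.exp(-(x_cal - x_test) ** 2 / 0.1)
q = weighted_quantile(scores, w_cal, 1.0, 1 - alpha)
print(f"prediction interval at x={x_test}: [{2*x_test - q:.2f}, {2*x_test + q:.2f}]")
```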
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1657. LEXam: Benchmarking Legal Reasoning on 340 Law Exams
๐ฌ 11
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ LLM legal reasoning long-form multiple-choice question answering
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce ***LEXam***, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Anonymous repository: [this URL](https://anonymous.4open.science/r/LEXam-anonymous-12EB).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1658. Deep Global-sense Hard-negative Discriminative Generation Hashing for Cross-modal Retrieval
๐ฌ 18
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Deep Hashing Cross-modal Retrieval Informative Learning Hard Negative Generation
Hard negative generation (HNG) provides valuable signals for deep learning, but existing methods mostly rely on local correlations while neglecting the global geometry of the embedding space. This limitation often leads to weak discrimination, particularly in cross-modal hashing, which encodes data as compact binary codes.
We propose Deep Global-sense Hard-negative Discriminative Generation Hashing (DGHDGH), a framework that constructs a structured graph with dual-iterative message propagation to capture global correlations, and then performs difficulty-adaptive, channel-wise interpolation to synthesize semantically consistent hard negatives aligned with global Hamming geometry.
Our approach yields more informative negatives, sharpens semantic boundaries in the Hamming co-space, and substantially enhances cross-modal retrieval. Experiments on multiple benchmarks consistently demonstrate improvements in retrieval accuracy, verifying the discriminative advantages brought by global-sense HNG in cross-modal hashing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1659. Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning
๐ฌ 20
โญ 6.00
๐ reinforcement learning
๐ท๏ธ Distributed Reinforcement Learning Agent Ensemble Learning Agent Diversity Exploration Efficiency
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization (CPO), which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple dexterous manipulation tasks in both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1660. One protein is all you need
๐ฌ 19
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ proteins generalization self-supervised learning model customization test-time training
Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model's capacity to excel on any specific one, whereas practitioners typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. We also demonstrate ProteinTTT on two challenging case studies. We show that customization via ProteinTTT enables more accurate antibody-antigen loop modeling and improves 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general-purpose AlphaFold2 and ESMFold struggle.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1661. SpeechOp: Inference-Time Task Composition for Generative Speech Processing
๐ฌ 10
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ speech generation TTS enhancement diffusion latent diffusion
While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1662. From Parameters to Behaviors: Unsupervised Compression of the Policy Space
๐ฌ 6
โญ 6.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning unsupervised reinforcement learning unsupervised representation learning
Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient. We argue that this inefficiency stems from the standard practice of optimizing policies directly in the high-dimensional and highly redundant parameter space $\Theta$. This challenge is greatly compounded in multi-task settings. In this work, we develop a novel, unsupervised approach that compresses the policy parameter space $\Theta$ into a low-dimensional latent space $\mathcal{Z}$. We train a generative model $g:\mathcal{Z}\to\Theta$ by optimizing a behavioral reconstruction loss, which ensures that the latent space is organized by functional similarity rather than proximity in parameterization. We conjecture that the inherent dimensionality of this manifold is a function of the environment's complexity, rather than the size of the policy network. We validate our approach in continuous control domains, showing that the parameterization of standard policy networks can be compressed up to five orders of magnitude while retaining most of its expressivity. As a byproduct, we show that the learned manifold enables task-specific adaptation via Policy Gradient operating in the latent space $\mathcal{Z}$.
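A toy sketch of the behavioral reconstruction idea under strong simplifications: policies are linear maps whose parameters secretly live on a low-dimensional manifold, "behavior" is the action pattern on a fixed set of probe states, and because everything is linear the latent codes and the latent-to-behavior map can be fit by alternating least squares instead of the neural generator and gradient descent the paper would use. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, latent_dim, n_policies = 8, 2, 3, 200
theta_dim = obs_dim * act_dim

# A collection of linear policies that secretly lives on a 3-dimensional manifold.
basis = rng.normal(size=(latent_dim, theta_dim))
thetas = rng.normal(size=(n_policies, latent_dim)) @ basis

S = rng.normal(size=(64, obs_dim))                      # probe states
def behavior(theta):                                     # actions on the probe states
    return (S @ theta.reshape(obs_dim, act_dim)).ravel()

B = np.stack([behavior(t) for t in thetas])              # target behaviors

# Fit latent codes Z and a latent -> behavior map M by minimizing a *behavioral*
# reconstruction loss; alternating least squares suffices in this linear toy.
Z = rng.normal(size=(n_policies, latent_dim))
for _ in range(20):
    M = np.linalg.lstsq(Z, B, rcond=None)[0]             # refit the decoder
    Z = np.linalg.lstsq(M.T, B.T, rcond=None)[0].T       # refit the latent codes

recon_mse = float(((Z @ M - B) ** 2).mean())
print("behavioral reconstruction MSE:", round(recon_mse, 6))
print(f"compressed {theta_dim}-dim parameters into {latent_dim}-dim latents")
```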
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1663. Training Large Language Models To Reason In Parallel With Global Forking Tokens
๐ฌ 15
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model reasoning chain of thoughts
Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling,
encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using bipartite matching between global forking tokens and unique reasoning traces. We observe that, whereas naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show our SSFT method consistently outperforms SFT under both pass@1 and cons@k metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1664. Submodular Function Minimization with Dueling Oracle
๐ฌ 9
โญ 6.00
๐ optimization
๐ท๏ธ submodular minimization dueling oracle preference-based optimization
We consider submodular function minimization using a \textit{dueling oracle}, a noisy pairwise comparison oracle that provides relative feedback on function values between two queried sets. The oracle's responses are governed by a \textit{transfer function}, which characterizes the relationship between differences in function values and the parameters of the response distribution. For a \textit{linear} transfer function, we propose an algorithm that achieves an error rate of $O(n^{\frac{3}{2}}/\sqrt{T})$, where $n$ is the size of the ground set and $T$ denotes the number of oracle calls. We establish a lower bound: Under the constraint that differences between queried sets are bounded by a constant, any algorithm incurs an error of at least $\Omega(n^{\frac{3}{2}}/\sqrt{T})$. Without such a constraint, the lower bound becomes $\Omega(n/\sqrt{T})$. These results show that our algorithm is optimal up to constant factors for constrained algorithms. For a \textit{sigmoid} transfer function, we design an algorithm with an error rate of $O(n^{\frac{7}{5}}/T^{\frac{2}{5}})$, and establish lower bounds analogous to the linear case.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1665. Quadratic Direct Forecast for Training Multi-Step Time-Series Forecast Models
๐ฌ 13
โญ 6.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time-series time-series forecast
The design of training objective is central to training time-series forecasting models. Existing training objectives such as mean squared error mostly treat each future step as an independent, equally weighted task, which we found leading to the following two issues: (1) overlook the *label autocorrelation effect* among future steps, leading to biased training objective; (2) fail to set *heterogeneous task weights* for different forecasting tasks corresponding to varying future steps, limiting the forecasting performance. To fill this gap, we propose a novel quadratic-form weighted training objective, addressing both of the issues simultaneously. Specifically, the off-diagonal elements of the weighting matrix account for the label autocorrelation effect, whereas the non-uniform diagonals are expected to match the most preferable weights of the forecasting tasks with varying future steps. To achieve this, we propose a Quadratic Direct Forecast (QDF) learning algorithm, which trains the forecast model using the adaptively updated quadratic-form weighting matrix. Experiments show that our QDF effectively improves performance of various forecast models, achieving state-of-the-art results. Code is available at https://anonymous.4open.science/r/QDF-8937.
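A compact illustration of the quadratic-form objective described above: the multi-step error vector $e$ is scored as $e^\top W e$, where the diagonal of $W$ carries per-step task weights and the off-diagonals account for label autocorrelation; plain MSE is the special case $W = I$. The weighting matrix below is a fixed inverse-autocovariance stand-in, not the adaptively learned matrix from QDF.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4                                       # forecast horizon (steps)

# Toy forecast errors that are correlated across future steps (AR(1)-like covariance).
cov_true = 0.5 ** np.abs(np.subtract.outer(np.arange(H), np.arange(H)))
errors = rng.multivariate_normal(mean=np.zeros(H), cov=cov_true, size=1000)

def quadratic_form_loss(err, W):
    """Mean of e^T W e over the batch; MSE is the special case W = I."""
    return float(np.einsum('bi,ij,bj->b', err, W, err).mean())

mse_W = np.eye(H)
# One simple correlation-aware weighting: the inverse of the estimated error
# autocovariance (a fixed stand-in for the matrix QDF learns adaptively).
qdf_W = np.linalg.inv(np.cov(errors, rowvar=False))
qdf_W /= np.trace(qdf_W) / H                # keep the overall scale comparable to MSE

print("plain MSE objective     :", round(quadratic_form_loss(errors, mse_W), 3))
print("quadratic-form objective:", round(quadratic_form_loss(errors, qdf_W), 3))
```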
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1666. HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games
๐ฌ 18
โญ 6.00
๐ datasets and benchmarks
๐ท๏ธ long-tail benchmark logic puzzle games large reasoning model
Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games that require deriving solutions satisfying all constraints. However, whether they can flexibly apply appropriate rules to varying conditions, particularly when faced with non-canonical game variants, remains an open question. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants. To address this, we introduce **HardcoreLogic**, a challenging benchmark of over 5,000 puzzles across 10 games, designed to test the robustness of LRMs on the "long-tail" of logical games. HardcoreLogic systematically transforms canonical puzzles through three dimensions: **Increased Complexity (IC)**, **Uncommon Elements (UE)**, and **Unsolvable Puzzles (UP)**, reducing reliance on shortcut memorization. Evaluations on a diverse set of LRMs reveal significant performance drops, even for models achieving top scores on existing benchmarks, indicating heavy reliance on memorized stereotypes. While increased complexity is the dominant source of difficulty, models also struggle with subtle rule variations that do not necessarily increase puzzle difficulty. Our systematic error analysis on solvable and unsolvable puzzles further highlights gaps in genuine reasoning. Overall, HardcoreLogic exposes the limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1667. GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Gait Recognition Snippets Sampling Snippet Modeling
Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1668. FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems
๐ฌ 15
โญ 6.00
๐ generative models
๐ท๏ธ inverse problem image reconstruction diffusion models
$\textbf{FAST-DIPS}$ is a training-free solver for diffusion-prior inverse problems, including nonlinear forward operators. At each noise level, a pretrained denoiser provides an anchor $\mathbf{x}_{0|t}$; we then perform a hard-constrained proximal correction in measurement space (AWGN) by solving
$\min_\mathbf{x} \tfrac{1}{2\gamma_t}\|\mathbf{x}-\mathbf{x}_{0|t}\|^2 \ \text{s.t.}\ \|\mathcal{A}(\mathbf{x})-\mathbf{y}\|\le\varepsilon$.
The correction is implemented via an adjoint-free ADMM with a closed-form projection onto the Euclidean ball and a few steepest-descent updates whose step size is analytic and computable from one VJP and one JVP (or a forward-difference surrogate), followed by decoupled re-annealing. We show this step minimizes a local quadratic model (with backtracking-based descent), any ADMM fixed point satisfies KKT for the hard constraint, and mode substitution yields a bounded time-marginal error. We also derive a latent variant $\mathcal{A}\mapsto\mathcal{A}\circ\mathcal{D}$ and a one-parameter pixel$\rightarrow$latent hybrid schedule. FAST-DIPS delivers comparable or better PSNR/SSIM/LPIPS while being substantially faster, requiring only autodiff access to $\mathcal{A}$ and no hand-coded adjoints or inner MCMC.
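A minimal sketch of the hard-constrained proximal correction for a linear forward operator: ADMM alternates a quadratic update toward the denoiser anchor, a closed-form projection onto the ball $\|\mathcal{A}(\mathbf{x})-\mathbf{y}\|\le\varepsilon$, and a dual update. The x-update below is a direct linear solve for simplicity; FAST-DIPS would instead take a few analytic steepest-descent steps so that no adjoint has to be hand-coded, and the operator, noise level, and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 32
A = rng.normal(size=(m, n)) / np.sqrt(n)       # a linear forward operator (illustrative)
x_true = rng.normal(size=n)
y = A @ x_true + 0.05 * rng.normal(size=m)
eps = 0.05 * np.sqrt(m)                        # measurement-noise budget
x0 = x_true + 0.5 * rng.normal(size=n)         # denoiser anchor x_{0|t} (stand-in)
gamma, rho = 1.0, 1.0

def project_ball(v, radius):
    """Closed-form projection onto {v : ||v|| <= radius}."""
    nv = np.linalg.norm(v)
    return v if nv <= radius else v * (radius / nv)

# ADMM for: min_x ||x - x0||^2 / (2*gamma)  s.t.  ||A x - y|| <= eps
z, u = np.zeros(m), np.zeros(m)
H = np.eye(n) / gamma + rho * A.T @ A
for it in range(50):
    rhs = x0 / gamma + rho * A.T @ (y + z - u)
    x = np.linalg.solve(H, rhs)                # quadratic x-update (direct solve here)
    z = project_ball(A @ x - y + u, eps)       # closed-form ball projection
    u += A @ x - y - z                         # scaled dual update

print(f"||Ax - y|| = {np.linalg.norm(A @ x - y):.4f} (eps = {eps:.4f})")
print(f"distance to anchor ||x - x0|| = {np.linalg.norm(x - x0):.4f}")
```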
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1669. Task-Agnostic Amortized Multi-Objective Optimization
๐ฌ 21
โญ 6.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Multi-Objective Optimization Bayesian Optimization Transformers Neural Processes
Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000$\times$ versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1670. On Coreset for LASSO Regression Problem with Sensitivity Sampling
๐ฌ 10
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LASSO regression sampling algorithm
In this paper, we study coreset construction for LASSO regression, where a coreset is a small, weighted subset of the data that approximates the original problem with provable guarantees. For unregularized regression problems, sensitivity sampling is a successful and widely applied technique for constructing coresets. However, extending these methods to LASSO typically requires coreset size to scale with $O(\mathcal{G}d)$, where $d$ is the VC dimension and $\mathcal{G}$ is the total sensitivity, following existing generalization bounds. A key challenge in improving upon this general bound lies in the difficulty of capturing the sparse and localized structure of the function space induced by the $\ell_1$ penalty in the LASSO objective. To address this, we first provide an empirical process-based method of sensitivity sampling for LASSO, localizing the procedure by decomposing the functional space into independent spaces, which leads to tighter estimation error. By carefully leveraging the geometric properties of these localized spaces, we establish tight empirical process bounds on the required coreset size. These techniques enable us to achieve a coreset of size $\tilde{O}(\epsilon^{-2}d\cdot((\log d)^3\cdot\min(1,\log d/\lambda^2)+\log(1/\delta)))$, which ensures a $(1\pm\epsilon)$-approximation for any $\epsilon,\delta\in(0,1)$ and $\lambda > 0$. Furthermore, we give a lower bound showing that any algorithm achieving a $(1+\epsilon)$-approximation must select at least $\Omega(\frac{d\log d}{\epsilon^2})$ rows in the regime where $\lambda=O(d^{-1/2})$. Empirical experiments show that our proposed algorithm is at least 4 times faster than the existing LASSO solver and more than 9 times faster on half of the datasets, while ensuring high solution quality and sparsity.
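A hedged sketch of the generic sensitivity-sampling recipe that the paper refines: approximate per-row sensitivities (leverage scores are used below as a cheap proxy, not the paper's tighter LASSO-specific sensitivities), sample rows with probability proportional to them, reweight by inverse inclusion probability, and check that the weighted coreset objective tracks the full LASSO objective for candidate solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 20000, 10, 0.1
X = rng.normal(size=(N, d)); X[:100] *= 10          # a few high-influence rows
w_star = rng.normal(size=d); w_star[5:] = 0.0
y = X @ w_star + 0.1 * rng.normal(size=N)

def lasso_obj(Xs, ys, w, weights=None):
    r = Xs @ w - ys
    sq = r ** 2 if weights is None else weights * r ** 2
    return sq.sum() + lam * np.abs(w).sum()

# Leverage scores as a cheap stand-in for row sensitivities.
Q, _ = np.linalg.qr(X)
sens = (Q ** 2).sum(axis=1)
p = sens / sens.sum()

m = 2000                                             # coreset size
idx = rng.choice(N, size=m, replace=True, p=p)
u = 1.0 / (m * p[idx])                               # inverse-probability weights

for trial in range(3):
    w = w_star + 0.3 * rng.normal(size=d)            # a candidate solution
    full = lasso_obj(X, y, w)
    core = lasso_obj(X[idx], y[idx], w, weights=u)
    print(f"trial {trial}: relative objective error {abs(core - full) / full:.4f}")
```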
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1671. Learning Robust Intervention Representations with Delta Embeddings
๐ฌ 15
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ action representation causal representation learning interventions
Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs (also called ``actionable counterfactuals'' in the literature) have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out of distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1672. The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics
๐ฌ 11
โญ 6.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ One-Shot Federated Learning data-free aggregation Gaussian Discriminant Heads Knowledge Distillation
Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with high communication costs and privacy risks from repeated model transmissions. In contrast, one-shot federated learning (OFL) alleviates these limitations by reducing communication to a single round, thereby lowering overhead and enhancing practical deployability. Nevertheless, most existing one-shot approaches remain either impractical or constrained, for example, they often depend on the availability of a public dataset, assume homogeneous client models, or require uploading additional data or model information. To overcome these issues, we introduce the Gaussian-Head OFL (GH-OFL) family, a suite of one-shot federated methods that assume class-conditional Gaussianity of pretrained embeddings. Clients transmit only sufficient statistics (per-class counts and first/second-order moments) and the server builds heads via three components: (i) Closed-form Gaussian heads (NB/LDA/QDA) computed directly from the received statistics; (ii) FisherMix, a linear head with cosine margin trained on synthetic samples drawn in an estimated Fisher subspace; and (iii) Proto-Hyper, a lightweight low-rank residual head that refines Gaussian logits via knowledge distillation on those synthetic samples. In our experiments, GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.
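To make the statistics-only protocol concrete, here is a minimal sketch of one client's upload and a server-side closed-form LDA head built from the aggregated moments; GH-OFL's exact head construction and regularization may differ, and the shrinkage constant below is an assumption.

```python
import numpy as np

def client_stats(Z, y, num_classes):
    """Per-class sufficient statistics of pretrained embeddings Z of shape (n, d):
    counts, first moments (sums), and second moments (sums of outer products)."""
    d = Z.shape[1]
    n_c = np.zeros(num_classes)
    s1 = np.zeros((num_classes, d))
    s2 = np.zeros((num_classes, d, d))
    for c in range(num_classes):
        Zc = Z[y == c]
        n_c[c] = len(Zc)
        s1[c] = Zc.sum(axis=0)
        s2[c] = Zc.T @ Zc
    return n_c, s1, s2

def lda_head_from_stats(stats_list, eps=1e-4):
    """Server-side closed-form LDA head from aggregated client statistics
    (a sketch; assumes every class appears at least once across clients)."""
    n_c = sum(s[0] for s in stats_list)
    s1 = sum(s[1] for s in stats_list)
    s2 = sum(s[2] for s in stats_list)
    mu = s1 / n_c[:, None]                                   # class means
    # Pooled within-class covariance recovered from second moments.
    cov = (s2.sum(0) - (n_c[:, None, None] * mu[:, :, None] * mu[:, None, :]).sum(0)) / n_c.sum()
    cov += eps * np.eye(cov.shape[0])                        # simple shrinkage (assumption)
    P = np.linalg.inv(cov)
    W = mu @ P                                               # linear weights, one row per class
    b = -0.5 * np.einsum('cd,de,ce->c', mu, P, mu) + np.log(n_c / n_c.sum())
    return W, b                                              # class scores: Z @ W.T + b
```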
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1673. Vulcan: Crafting Compact Class-Specific Vision Transformers For Edge Intelligence
๐ฌ 19
โญ 6.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Class-specific model derivation Vision Transformer structured pruning edge intelligence
Large Vision Transformers (ViTs) must often be compressed before they can be deployed on resource-constrained edge devices.
However, many edge devices require only part of the *all-classes* knowledge of a pre-trained ViT in their corresponding application scenarios. This is overlooked by existing compression methods. Lightweight models produced by these methods retain a substantial amount of class-irrelevant knowledge and suffer suboptimal performance on target classes. To address this, we analyze the knowledge distribution of ViT and reveal a knowledge disentanglement within it: neurons in the feed-forward network (FFN) modules encode class-specific knowledge, while the multi-head attention (MHA) modules capture class-agnostic patterns. Building on this insight, we introduce Vulcan, a pruning-oriented post-training method for deriving compact class-specific models from a pre-trained ViT under given resource budgets. Vulcan follows a novel *train-then-prune* paradigm, which introduces redundancy into ViTs deliberately by collapsing FFN neurons onto those with the highest class-specific activations and by enforcing low-rankness in MHA weights. This design mitigates the irreversible knowledge loss of direct pruning, so that the post-trained model can be compressed into a compact one with negligible performance loss. Notably, the derived edge ViTs not only achieve significant reductions in size and computation but even surpass the original ViTs in performance on specific classes. Comprehensive experiments with five base ViTs covering three representative visual tasks on four datasets demonstrate that Vulcan-derived ViTs outperform the base ViTs on class-specific tasks by up to 15.12\% in accuracy, with only 20\%-40\% of their sizes. Compared with state-of-the-art structured pruning methods, Vulcan improves class-specific accuracy by up to 13.92\%. Code is available at [Vulcan](https://anonymous.4open.science/r/Vulcan-4CEB).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1674. Reconciling Visual Perception and Generation in Diffusion Models
๐ฌ 14
โญ 6.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Perception Image Classification Object Detection Semantic Segmentation
We present \textsc{GenRep}, a unified image understanding and synthesis model that jointly conducts discriminative learning and generative modeling in one training session. By leveraging Monte Carlo approximation, \textsc{GenRep} distills distributional knowledge embedded in diffusion models to guide the discriminative learning for visual perception tasks. Simultaneously, a semantic-driven image generation process is established, where high-level semantics learned from perception tasks can be used to inform image synthesis, creating a positive feedback loop for mutual boosts. Moreover, to reconcile the learning process for both tasks, a gradient alignment strategy is proposed to symmetrically modify the optimization directions of perception and generation losses. These designs empower \textsc{GenRep} to be a versatile and powerful model that achieves top-leading performance on both image understanding and generation benchmarks. Code will be released after acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1675. A General Framework for Black-Box Attacks Under Cost Asymmetry
๐ฌ 10
โญ 6.00
๐ interpretability and explainable AI
๐ท๏ธ zeroth-order optimization asymmetric cost black-box adversarial attacks
Traditional decision-based black-box adversarial attacks on image classifiers aim to generate adversarial examples by slightly modifying input images while keeping the number of queries low, where each query involves sending an input to the model and observing its output. Most existing methods assume that all queries have equal cost. However, in practice, queries may incur *asymmetric costs*; for example, in content moderation systems, certain output classes may trigger additional review, enforcement, or penalties, making them more costly than others. While prior work has considered such asymmetric cost settings, effective algorithms for this scenario remain underdeveloped.
In this paper, we introduce asymmetric black-box attacks, a new family of decision-based attacks that generalize to the asymmetric query-cost setup. We develop new methods for boundary search and gradient estimation when crafting adversarial examples. Specifically, we propose *Asymmetric Search (AS)*, a more conservative alternative to binary search that reduces reliance on high-cost queries, and *Asymmetric Gradient Estimation (AGREST)*, which shifts the sampling distribution in Monte Carlo style gradient estimation to favor low-cost queries.
We design efficient algorithms that reduce total attack cost by balancing different query types, in contrast to earlier methods such as *stealthy attacks* that focus only on limiting expensive (high-cost) queries. We perform both theoretical analysis and empirical evaluation on standard image classification benchmarks. Across various cost regimes, our method consistently achieves lower total query cost and smaller perturbations than existing approaches, reducing the perturbation norm by up to 40\% in some settings.
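A hedged sketch of the boundary-search idea: compared to plain binary search, the probe point is biased toward the side whose queries are cheap, so fewer probes land in the expensive (e.g., flagged) region. The split rule below is an illustrative assumption, not the paper's exact AS procedure.

```python
def asymmetric_boundary_search(is_adversarial, x_src, x_adv, cost_ratio=5.0, tol=1e-3):
    """Boundary search between a benign-classified point x_src and an adversarial
    point x_adv. `is_adversarial(x)` is the decision oracle; queries answered
    "adversarial" are assumed to be `cost_ratio` times more expensive, so each
    probe is placed closer to the benign end than the midpoint, trading more
    cheap queries for fewer expensive ones."""
    lo, hi = 0.0, 1.0                       # invariant: lo is benign, hi is adversarial
    split = 1.0 / (1.0 + cost_ratio)        # illustrative asymmetric split point
    while hi - lo > tol:
        t = lo + (hi - lo) * split
        x_t = (1.0 - t) * x_src + t * x_adv
        if is_adversarial(x_t):
            hi = t
        else:
            lo = t
    return (1.0 - hi) * x_src + hi * x_adv  # point just inside the adversarial region
```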
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1676. Improving Extreme Wind Prediction with Frequency-Informed Learning
๐ฌ 17
โญ 6.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Extreme Weather Forecasting Meteorological Analysis AI for Science
Accurate prediction of extreme wind velocities has substantial significance in industry, particularly for the operation management of wind power plants. Although the state-of-the-art data-driven models perform well for general meteorological forecasting, they may exhibit large errors for extreme weather, for example systematically underestimating the magnitudes and short-term variation of extreme winds. To address this issue, we conduct a theoretical analysis of how the data frequency spectrum influences errors in extreme wind prediction. Based on these insights, we propose a novel loss function that incorporates a gradient penalty to mitigate the magnitude shrinkage of extreme weather, and we theoretically justify its effectiveness via a PDE-based energy-enstrophy analysis. To capture more precise short-term wind velocity variations, we design a novel structure of physics-embedded machine learning models with frequency reweighting. Experiments demonstrate that, compared to the baseline models, our approach achieves significant improvements in predicting extreme wind velocities while maintaining robust overall performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1677. Fewer Weights, More Problems: A Practical Attack on LLM Pruning
๐ฌ 15
โญ 6.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ pruning large language models security poisoning
Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during deployment. Through popular inference engines, such as vLLM, users can conveniently prune downloaded models before deploying them. While the utility and efficiency of pruning methods have improved significantly, the security implications of LLM pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited.
In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models: after any of the pruning methods available in vLLM (Magnitude, Wanda, or SparseGPT) is applied, the pruned model consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to 95.7\% for jailbreak, 98.7\% for benign instruction refusal, and 99.5\% for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1678. Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence
๐ฌ 14
โญ 6.00
๐ optimization
๐ท๏ธ Stochastic optimization heavy-tailed noise decentralized optimization normalization
Heavy-tailed noise in nonconvex stochastic optimization has garnered increasing research interest, as empirical studies, including those on training attention models, suggest it is a more realistic gradient noise condition. This paper studies first-order nonconvex stochastic optimization under heavy-tailed gradient noise in a decentralized setup, where each node can only communicate with its direct neighbors in a predefined graph. Specifically, we consider a class of heavy-tailed gradient noise that is zero-mean and has only $p$-th moment for $p \in (1, 2]$. We propose GT-NSGDm, Gradient Tracking based Normalized Stochastic Gradient Descent with momentum, that utilizes normalization, in conjunction with gradient tracking and momentum, to cope with heavy-tailed noise on distributed nodes. We show that, when the communication graph admits a primitive and doubly stochastic weight matrix, GT-NSGDm guarantees, for the \textit{first} time in the literature, that the expected gradient norm converges at an optimal non-asymptotic rate $O\big(1/T^{(p-1)/(3p-2)}\big)$, which matches the lower bound in the centralized setup. When the tail index $p$ is unknown, GT-NSGDm attains a non-asymptotic rate $O\big( 1/T^{(p-1)/(2p)} \big)$ that is, for $p < 2$, topology independent and has a speedup factor $n^{1-1/p}$ in terms of the number of nodes $n$. Finally, experiments on nonconvex linear regression with tokenized synthetic data and decentralized training of language models on a real-world corpus demonstrate that GT-NSGDm is more robust and efficient than baselines.
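Below is a small single-process simulation of the update described in the abstract (normalization of a momentum-smoothed, tracked gradient combined with gossip mixing); the step size, momentum constant, and exact placement of the mixing steps are assumptions, not the paper's pseudocode.

```python
import numpy as np

def gt_nsgdm_sketch(grad_fn, x0, W, T, lr=0.01, beta=0.9):
    """Simulate n nodes on one machine. Row i of x is node i's iterate, W is a
    doubly stochastic mixing matrix, and grad_fn(i, x_i) returns a (possibly
    heavy-tailed) stochastic gradient for node i."""
    n, d = x0.shape
    x = x0.copy()
    m = np.zeros((n, d))                                   # momentum buffers
    g = np.stack([grad_fn(i, x[i]) for i in range(n)])
    y = g.copy()                                           # gradient trackers
    for _ in range(T):
        m = beta * m + (1.0 - beta) * y                    # momentum on the tracked gradient
        # Normalized step: only the direction is used, which tames heavy tails.
        x = W @ x - lr * m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
        g_new = np.stack([grad_fn(i, x[i]) for i in range(n)])
        y = W @ y + g_new - g                              # gradient-tracking update
        g = g_new
    return x.mean(axis=0)
```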
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1679. Meta-RL Induces Exploration in Language Agents
๐ฌ 19
โญ 6.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Agent Reinforcement Learning Meta Learning
Reinforcement learning (RL) has enabled the training of Large Language Model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM
agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11\%, 14\%, and 19\% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that meta-reinforcement learning provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1680. Learning Admissible Heuristics for A*: Theory and Practice
๐ฌ 38
โญ 5.71
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Admissible Heuristics A* Search Algorithm Optimal search Generalization Guarantees Rubik's Cube
Heuristic functions are central to the performance of search algorithms such as A*, where \emph{admissibility}, the property of never overestimating the true shortest-path cost, guarantees solution optimality. Recent deep learning approaches often disregard full admissibility and provide limited guarantees on generalization beyond the training data. We address both of these limitations. First, we pose heuristic learning as a constrained optimization problem and introduce \emph{Cross-Entropy Admissibility (CEA)}, a loss function that enforces admissibility during training. When evaluated on the Rubik's Cube domain, our method yields heuristics with near-perfect admissibility and significantly stronger guidance than compressed pattern database (PDB) heuristics. On the theoretical side, we derive a new upper bound on the expected suboptimality of A*. By leveraging PDB abstractions and the structural properties of graphs such as the Rubik's Cube, we tighten the bound on the number of training samples needed for A* to generalize to unseen states. Replacing a general hypothesis class with a ReLU neural network gives bounds that depend primarily on the network's width and depth, rather than on graph size. Using the same network, we also provide the first generalization guarantees for \emph{goal-dependent} heuristics.
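The paper's CEA loss is not reproduced here; as a hedged illustration of the constrained-training idea, the sketch below fits the heuristic to the true cost-to-go while penalizing overestimation (the inadmissible direction) much more heavily, with `lam` a hypothetical penalty weight.

```python
import torch
import torch.nn.functional as F

def admissibility_aware_loss(h_pred, h_true, lam=10.0):
    """Illustrative admissibility-enforcing objective (not the paper's exact CEA
    loss): a symmetric fit term plus a one-sided penalty on h_pred > h_true,
    the direction that would break admissibility and hence A* optimality."""
    fit = F.mse_loss(h_pred, h_true)
    over = torch.relu(h_pred - h_true)      # amount of overestimation per state
    return fit + lam * (over ** 2).mean()
```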
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1681. Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
๐ฌ 6
โญ 5.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Post-Training Large Reasoning Models Large Language Models Performance Prediction Reinforcement Learning with Verifiable Rewards
In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as "RL" below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance could lead to substantially worse outcomes compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large-k performance as strong proxies for the RL outcome. We trained hundreds of models with up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large-k achieves substantially higher precision, improving the $R^2$ coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find SFT training on unique examples for one epoch underperforms training on half of the examples for two epochs, either after SFT or SFT-then-RL. With the same SFT budget, training only on short examples may lead to better SFT performance, though it often leads to worse outcomes after RL compared to training on examples with varying lengths. This work develops an enhanced evaluation tool that will be open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1682. COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
๐ฌ 24
โญ 5.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Complexity Compositionality Visual instruction tuning
Visual instruction tuning (VIT) datasets consist of randomly sampled image-question pairs without regard to the informativeness of each pair. Recent dataset selection methods have shown that a small fraction of such datasets enriched with informative samples can lead to efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of task complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning), a VIT data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective visual instruction tuning. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Further, training on the same COMPACT data even improves performance compared to training with full-scale data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe to improve on visual language tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1683. WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection
๐ฌ 22
โญ 5.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ lm agent retrieval-augmented generation Reinforcement Learning
Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments.
Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions.
In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories.
Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3\% and 90.0\%, respectively, and demonstrate strong generalization to out-of-distribution datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1684. Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models
๐ฌ 17
โญ 5.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Confidence Calibration Uncertainty Estimation Large Language Models
A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that directly fine-tunes LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with its actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness. We provide our training and evaluation code in the supplementary material and will make it publicly available upon acceptance.
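The reward named in the abstract is the logarithmic scoring rule; a minimal sketch (the clipping constant is an assumption) looks like this:

```python
import math

def log_score_reward(confidence, correct, eps=1e-6):
    """Logarithmic scoring rule: reward log(c) if the answer is correct and
    log(1 - c) otherwise. Its expectation is maximized when the verbalized
    confidence c equals the true probability of being correct, which is why
    the optimal policy under this reward is perfectly calibrated."""
    c = min(max(confidence, eps), 1.0 - eps)   # clip to avoid log(0)
    return math.log(c) if correct else math.log(1.0 - c)
```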
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1685. Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion
๐ฌ 21
โญ 5.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Knowledge Distillation
Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. \underline{\textit{First}}, to address pattern memorization, Explanatory Inversion (EI) generates targeted ``explanatory probes'' that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. \underline{\textit{Second}}, to improve generalization, Explanatory GRPO (\texttt{EXGRPO}) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average \textbf{20.39\%} increase over zero-shot performance and a \textbf{6.02\%} improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with \textbf{10-25\%} training data) and strong generalization to out-of-distribution tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1686. FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
๐ฌ 28
โญ 5.67
๐ generative models
๐ท๏ธ Audio coding neural audio codecs speech language model
Neural audio codecs are foundational to speech language models. An ideal codec is expected to have a low frame rate and to decouple semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rate codecs remain underexplored. We find that pushing existing audio codecs to very low frame rates loses much semantic information. We suggest that the limitations of low-frame-rate codecs lie in both insufficient semantic decoupling and insufficient time resolution for capturing transient phonetic details. This paper introduces **FlexiCodec** to address this limitation. FlexiCodec improves semantic preservation with a **dynamic frame rate** approach and introduces a novel architecture featuring an **ASR feature-assisted dual stream** encoding and Transformer bottlenecks.
With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames.
A dynamic frame rate also allows FlexiCodec to support inference-time **controllable frame rates** between 3Hz and 12.5Hz.
Experiments on **6.25Hz, 8.3Hz and 12.5Hz** average frame rates confirm that FlexiCodec excels over baseline systems in semantic information preservation and delivers a high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS. Demos are available at: https://flexicodec.github.io.
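A minimal sketch of the dynamic-frame-rate idea described above (adjacent, semantically similar frames are merged so information-sparse regions use fewer tokens); the cosine-similarity threshold and running-average merge rule are illustrative assumptions rather than FlexiCodec's actual merging criterion.

```python
import torch
import torch.nn.functional as F

def merge_similar_frames(frames, threshold=0.9):
    """Merge adjacent frame features that are highly similar. `frames` has
    shape (T, D); returns the merged frames and how many original frames each
    merged frame covers (its effective duration)."""
    merged = [frames[0]]
    counts = [1]
    for t in range(1, frames.shape[0]):
        sim = F.cosine_similarity(frames[t], merged[-1], dim=0)
        if sim > threshold:
            counts[-1] += 1
            # Running average keeps the merged frame centered on its group.
            merged[-1] = merged[-1] + (frames[t] - merged[-1]) / counts[-1]
        else:
            merged.append(frames[t])
            counts.append(1)
    return torch.stack(merged), counts
```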
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1687. Breaking Safety Paradox with Feasible Dual Policy Iteration
๐ฌ 26
โญ 5.67
๐ reinforcement learning
๐ท๏ธ Safety paradox safe reinforcement learning feasible policy iteration feasibility function
Achieving zero constraint violations in safe reinforcement learning poses a significant challenge. We discover a key obstacle called the safety paradox, where improving policy safety reduces the frequency of constraint-violating samples, thereby impairing feasibility function estimation and ultimately undermining policy safety. We theoretically prove that the estimation error bound of the feasibility function increases as the proportion of violating samples decreases. To overcome the safety paradox, we propose an algorithm called feasible dual policy iteration (FDPI), which employs an additional policy to strategically maximize constraint violations while staying close to the original policy. Samples from both policies are combined for training, with data distribution corrected by importance sampling. Extensive experiments show FDPI's state-of-the-art performance on the Safety-Gymnasium benchmark, achieving the lowest violation and competitive-to-best return simultaneously.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1688. Cross-ControlNet: Training-Free Fusion of Multiple Conditions for Text-to-Image Generation
๐ฌ 22
โญ 5.60
๐ generative models
๐ท๏ธ Training free Multi-Condition Controllable Image Synthesis
Text-to-image diffusion models achieve impressive performance, but reconciling multiple spatial conditions usually requires costly retraining or labor-intensive weight tuning.
We introduce Cross-ControlNet, a training-free framework for text-to-image generation with multiple conditions.
It exploits two observations: intermediate features from different ControlNet branches are spatially aligned, and their condition strength can be measured by spatial- and channel-level variance.
Cross-ControlNet contains three modules: PixFusion, which fuses features pixelwise under the guidance of standard-deviation maps smoothed by a Gaussian to suppress early-stage noise; ChannelFusion, which applies per-channel hybrid fusion via a consistency-ratio gate, reducing threshold degradation in high dimensions; and KV-Injection, which injects foreground- and background-specific key/value pairs under text-derived attention masks to disentangle conflicting cues and enforce each condition faithfully.
Extensive experiments demonstrate that Cross-ControlNet consistently improves controllable generation under both conflicting and complementary conditions, and further generalizes to the DiT-based FLUX model without additional training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1689. Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
๐ฌ 27
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LALMs Audio Comprehension Audio-Interleaved Reasoning
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize informative audio segments through supervised fine-tuning, and then incentivizing proficient revisiting via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically revisiting audio segments on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. We commit to releasing the model, code, and data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1690. ST-HHOL: Spatio-Temporal Hierarchical Hypergraph Online Learning for Crime Prediction
๐ฌ 25
โญ 5.60
๐ learning on time series and dynamical systems
๐ท๏ธ Crime prediction Spatio-temporal graph neural networks Spatio-temporal data mining
Crime prediction is a critical yet challenging task in urban spatio-temporal forecasting.
Sparse crime records alone are insufficient to capture latent high-order patterns shaped by heterogeneous contextual factors with spatial and criminal specificity, while high non-stationarity renders conventional offline models ineffective against concept drift.
To tackle these challenges, we propose a Spatio-Temporal Hierarchical Hypergraph Online Learning framework named ST-HHOL. First, we propose a hierarchical hypergraph convolution network that integrates crime data with heterogeneous contextual factors to uncover dual-specific crime patterns and their co-occurrence relations. Second, we introduce an iterative online learning strategy to address concept drift by employing frequent fine-tuning for short-term dynamics and periodic retraining for long-term shifts.
Moreover, we adopt a Partially-Frozen LLM that leverages pre-trained sequence priors while adapting its attention mechanisms to crime-specific dependencies, enhancing spatio-temporal reasoning under sparse supervision.
Extensive experiments on three real-world datasets demonstrate that ST-HHOL consistently outperforms state-of-the-art methods in terms of accuracy and robustness, while also providing enhanced interpretability. Code is available at https://anonymous.4open.science/r/ST-HHOL-777D.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1691. Test-Time Training Done Right
๐ฌ 19
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test-Time Training Sequence Model Long Context Model
Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (often referred to as fast weights) at inference time. This adapted fast weight, similar to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods have struggled to demonstrate effectiveness in handling long-sequence data, due to their computational inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often below 5%) because they deliberately apply small online mini-batch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small mini-batch implies fine-grained block-wise causal dependencies in the data, making them unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by proposing an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). This approach improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameter size), hence substantially improving state capacity, all without requiring cumbersome and error-prone custom kernel implementations. It also allows easy integration of sophisticated optimizers like Muon for online memory updates. We validate our approach across diverse data modalities and tasks, including novel view synthesis from image sets, language models, and auto-regressive video diffusion models. Our approach can scale up to 14-billion-parameter auto-regressive video diffusion models handling sequences of up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with more than one million context length. Our results highlight the computational and performance benefits of large-chunk test-time training, paving the way for more efficient and scalable long-context sequence modeling. We hope that this work will inspire and accelerate new research in the field of long-context modeling and test-time training.
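The core scheduling change is easy to state; the sketch below processes the sequence in large chunks, reading with the current fast weight and then applying one update per chunk. `apply_fn` and `update_fn` are placeholders for model-specific read/write rules (e.g., a Muon-style update), not LaCT's actual kernels.

```python
import torch

def large_chunk_ttt(tokens, fast_weight, apply_fn, update_fn, chunk_size=4096):
    """Minimal sketch of large-chunk test-time training: instead of updating
    fast weights every few tokens, apply the current fast weight to an entire
    chunk, then take one fast-weight update on that chunk. Large chunks keep
    the update a big, GPU-friendly matrix operation."""
    outputs = []
    for start in range(0, tokens.shape[0], chunk_size):
        chunk = tokens[start:start + chunk_size]
        outputs.append(apply_fn(fast_weight, chunk))     # read with the current state
        fast_weight = update_fn(fast_weight, chunk)      # one update per large chunk
    return torch.cat(outputs, dim=0), fast_weight
```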
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1692. Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
๐ฌ 19
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Workflow Optimization Agent Reasoning WebAgent Deep Research
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks when equipped with external tools. However, current frameworks predominantly rely on sequential processing, leading to inefficient execution particularly for tasks requiring extensive tool interaction. This paper introduces Flash-Searcher, a novel parallel agent reasoning framework that fundamentally reimagines the execution paradigm from sequential chains to directed acyclic graphs (DAGs). Flash-Searcher decomposes complex tasks into subtasks with explicit dependencies, enabling concurrent execution of independent reasoning paths while maintaining logical constraints. Through dynamic workflow optimization, our framework continuously refines the execution graph based on intermediate results, effectively integrating a summary module. Comprehensive evaluations across multiple benchmarks demonstrate that Flash-Searcher consistently outperforms existing approaches. Specifically, it achieves **67.7%** accuracy on BrowseComp and **83%** on xbench-DeepSearch, while reducing agent execution steps by up to **35%** compared to current frameworks. Furthermore, when distilling this parallel reasoning pipeline into single models, we observe substantial performance gains across diverse backbone architectures, underscoring the generalizability of our methodology. Our work thus represents a significant advance in agent architecture design, offering a more scalable and efficient paradigm for complex reasoning tasks.
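The general DAG-execution pattern behind the framework (scheduling only; the planning, graph refinement, and summary logic are omitted) can be sketched as follows, where `run_subtask` is a hypothetical async worker such as a tool call:

```python
import asyncio

async def run_dag(tasks, deps, run_subtask):
    """Run a DAG of subtasks concurrently: each subtask starts as soon as all of
    its dependencies finish, so independent reasoning paths overlap in time.
    `tasks` maps name -> payload, `deps` maps name -> list of prerequisite
    names, and `run_subtask(payload, dep_results)` is an async worker."""
    futures = {}

    async def run(name):
        # Wait only for this subtask's own prerequisites.
        dep_results = {d: await futures[d] for d in deps.get(name, [])}
        return await run_subtask(tasks[name], dep_results)

    # Create all tasks first; none of them executes until we yield to the loop.
    for name in tasks:
        futures[name] = asyncio.ensure_future(run(name))
    done = await asyncio.gather(*futures.values())
    return dict(zip(tasks, done))
```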
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1693. Video-As-Prompt: Unified Semantic Control for Video Generation
๐ฌ 35
โญ 5.60
๐ generative models
๐ท๏ธ Video Generation Controllable Video Generation Video Dataset
Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for this task with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7\% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various applications mark a significant advance toward general-purpose, controllable video generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1694. Beyond Spectra: Eigenvector Overlaps in Loss Geometry
๐ฌ 17
โญ 5.60
๐ learning theory
๐ท๏ธ hessian overlap eigenvector geometry ridge regression noise free probability algorithms CIFAR high dimensional statistics generalization covariate shift double descent multiple descent random matrix theory
Local loss geometry in machine learning is fundamentally a two-operator concept. When only a single loss is considered, geometry is fully summarized by the Hessian spectrum; in practice, however, both training and test losses are relevant, and the resulting geometry depends on their spectra together with the alignment of their eigenspaces. We first establish general foundations for two-loss geometry by formulating a universal local fluctuation law, showing that the expected test-loss increment under small training perturbations is a trace that combines train and test spectral data with a critical additional factor quantifying eigenspace overlap, and by proving a novel transfer law that describes how overlaps transform in response to noise. As a solvable analytical model, we next apply these laws to ridge regression with arbitrary covariate shift, where operator-valued free probability yields asymptotically exact overlap decompositions that reveal overlaps as the natural quantities specifying shift and that resolve the puzzle of multiple descent: peaks are controlled by eigenspace (mis-)alignment rather than by Hessian ill-conditioning alone. Finally, for empirical validation and scalability, we confirm the fluctuation law in multilayer perceptrons, develop novel algorithms based on subspace iteration and kernel polynomial methods to estimate overlap functionals, and apply them to a ResNet-20 trained on CIFAR10, showing that class imbalance reshapes trainโtest loss geometry via induced misalignment. Together, these results establish overlaps as the critical missing ingredient for understanding local loss geometry, providing both theoretical foundations and scalable estimators for analyzing generalization in modern neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1695. Mobile-GS: Real-time Gaussian Splatting for Mobile Devices
๐ฌ 20
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Vision
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-quality rendering across a wide range of applications.
However, its high computational demands and large storage costs pose significant challenges for deployment on mobile devices.
In this work, we propose a mobile-tailored real-time Gaussian Splatting method, dubbed Mobile-GS, enabling efficient inference of Gaussian Splatting on edge devices.
Specifically, we first identify alpha blending as the primary computational bottleneck, since it relies on the time-consuming Gaussian depth sorting process.
To solve this issue, we propose a depth-aware order-independent rendering scheme that eliminates the need for sorting, thereby substantially accelerating rendering.
Although this order-independent rendering improves rendering speed, it may introduce transparency artifacts in regions with overlapping geometry due to the lack of an explicit rendering order.
To address this problem, we propose a neural view-dependent enhancement strategy, enabling more accurate modeling of view-dependent effects conditioned on viewing direction, 3D Gaussian geometry, and appearance attributes.
In this way, Mobile-GS can achieve both high-quality and real-time rendering.
Furthermore, to facilitate deployment on memory-constrained mobile platforms, we propose first-degree spherical harmonics distillation, a neural vector quantization technique, and a contribution-based pruning strategy to reduce the number of Gaussian primitives and compress the 3D Gaussian representation with the assistance of neural networks.
Extensive experiments demonstrate that our proposed Mobile-GS achieves real-time rendering and compact model size while preserving high visual quality, making it well-suited for mobile applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1696. VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video
๐ฌ 26
โญ 5.60
๐ datasets and benchmarks
๐ท๏ธ Multimodal Reasoning Video Question Answering Mathematical Understanding Temporal Reasoning Visual Grounding
Mathematical reasoning in real-world video presents a fundamentally different challenge than static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos from 10 seconds to over 1 hour. We employ graduate-level experts, with over 920 man-hours of annotation, to ensure high quality. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we establish an evaluation framework for models that must reason, rather than merely perceive, jointly grounding concepts across visual, audio, and textual modalities in temporally extended mathematical problem settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1697. Learning Heterogeneous Degradation Representation for Real-World Super-Resolution
๐ฌ 14
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Real-World Super-Resolution Representation Learning.
Real-World Super-Resolution (RWSR) aims to reconstruct high-resolution images from low-resolution inputs captured under complex, real-life conditions, where diverse distortions result in significant degradation heterogeneity. Many methods rely on degradation representations, yet they typically lack spatially variant degradation modeling and suffer from degradation-content entanglement. We propose Spatially Amortized Variational Learning (SAVL), an implicit framework that models per-pixel degradations as spatially varying Gaussians inferred from local neighborhoods. SAVL couples a conditional likelihood lane (SAVL-LM) with a mutual information suppression lane (SAVL-MIS) to filter out degradation-irrelevant signals, yielding a well-constrained solution space. Both our qualitative visualizations and quantitative analyses confirm that the learned representations effectively capture the spatial distribution of complex degradations while being highly discriminative of diverse underlying degradation factors. Building on these representations, we design a degradation-aware SR network with channel-wise guidance and spatial attention modulation for adaptive reconstruction under heterogeneous degradations. Extensive experiments on real-world datasets demonstrate consistent gains over prior methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1698. Geometric Graph Neural Diffusion for Stable Molecular Dynamics
๐ฌ 13
โญ 5.60
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Machine learning force field graph neural network
Geometric graph neural networks (Geo-GNNs) have revolutionized molecular dynamics (MD) simulations by providing accurate and fast energy and force predictions. However, minor prediction errors could still destabilize MD trajectories in real MD simulations due to the limited coverage of molecular conformations in training datasets. Existing methods that focus on in-distribution predictions often fail to address extrapolation to unseen conformations, undermining the simulation stability. To tackle this, we propose Geometric Graph Neural Diffusion (GGND), a novel framework that can capture geometrically invariant topological features, thereby alleviating error accumulation and ensuring stable MD simulations. The core of our framework is that it iteratively refines atomic representations, enabling instantaneous information flow between arbitrary atomic pairs while maintaining equivariance. Our proposed GGND is a plug-and-play module that can seamlessly integrate with existing local equivariant message-passing frameworks, enhancing their predictive performance and simulation stability. We conducted sets of experiments on the 3BPA and SAMD23 benchmark datasets, which encompass diverse molecular conformations across varied temperatures. We also ran real MD simulations to evaluate the stability. GGND outperforms baseline models in both accuracy and stability under significant topological shifts, advancing stable molecular modeling for real-world applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1699. Weak-to-Strong Diffusion
๐ฌ 25
โญ 5.60
๐ generative models
๐ท๏ธ Diffusion Models Diffusion Sampling Text-to-Image Generation
The goal of generative diffusion models is to align the learned distribution with the real data distribution through gradient score matching. However, inherent limitations of current generative models lead to an inevitable gap between generated data and real data. To address this, we propose Weak-to-Strong Diffusion (W2SD), a novel framework that utilizes the estimated gap between existing weak and strong models (i.e., the weak-to-strong gap) to bridge the gap between an ideal model and a strong model. By employing a reflective operation that alternates between denoising and inversion with the weak-to-strong gap, W2SD steers latent variables along sampling trajectories toward regions of the real data distribution. W2SD is highly flexible and broadly applicable, enabling diverse improvements through the strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5, good experts vs. bad experts in MoE). Extensive experiments demonstrate that W2SD significantly improves human preference, aesthetic quality, and prompt adherence, achieving significantly improved performance across various modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD achieves an HPSv2 winning rate of up to 90\% over the original results. Moreover, the performance gains achieved by W2SD markedly outweigh its additional computational overhead, while the cumulative improvements from different weak-to-strong gaps further solidify its practical utility and deployability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1700. FineNib: A Query Synthesizer For Static Analysis of Security Vulnerabilities
๐ฌ 15
โญ 5.60
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Static Analysis Program Synthesis Vulnerability Detection
CodeQL is a powerful static analysis engine that represents programs' abstract syntax trees as databases that can be queried to detect security vulnerabilities. While CodeQL supports expressive interprocedural dataflow queries, the coverage and precision of its existing security queries remain limited, and writing new queries is challenging even for experts. Automatically synthesizing CodeQL queries from known vulnerabilities (CVEs) can provide fine-grained vulnerability signatures, enabling both improved detection and systematic variant analysis. We present FineNib, an agentic framework for synthesizing CodeQL queries from known CVE descriptions. FineNib leverages the Model Context Protocol (MCP) for agentic tool use, integrates abstract syntax tree guidance, and incorporates CodeQL's language infrastructure and documentation into the synthesis loop. A key challenge is that state-of-the-art large language models hallucinate deprecated CodeQL syntax due to limited training data and outdated knowledge. FineNib addresses this by combining contextual engineering, iterative query feedback, and structured tool interaction to reliably generate executable, up-to-date queries.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1701. Implicit Sensing for Fourier Sparse Boolean Functions
๐ฌ 16
โญ 5.60
๐ learning theory
๐ท๏ธ Learning Theory Fourier Analysis Fourier Sparsity Sublinear Algorithm Property Testing Compressed Sensing
Boolean functions constitute a fundamental object of study in machine learning and, more broadly, in theoretical computer science. Among their various complexity measures, Fourier sparsity, defined as the number of nonzero Fourier coefficients in a Boolean function's Fourier expansion, serves as a key indicator of structural simplicity. For over three decades, learning Boolean functions with sparse Fourier representations has been a central focus of computational learning theory. A notable achievement in this line of work is the development of learning algorithms whose complexity primarily depends on the Fourier sparsity parameter. However, these approaches generally assume prior knowledge of this parameter.
In this work, we address this gap in the literature on learning Fourier-sparse Boolean functions. Specifically, we study the problem of Fourier sparsity testing: given query access to a Boolean function $f : \mathbb{F}_2^n \to \{-1, +1\}$, decide whether it is $s$-Fourier sparse or far (under Hamming distance) from every such function.
Our contributions are twofold. On the algorithmic side, we design a new tester with query complexity $\widetilde{O}(s^4)$, independent of the ambient dimension. On the lower bound side, we prove that any tester requires at least $\Omega(s)$ queries. Both bounds improve upon the best known results of Gopalan et al.\ (SICOMP 2011), who presented a tester with query complexity $\widetilde{O}(s^{14})$ and a lower bound of $\Omega(\sqrt{s})$. For our upper bound, we introduce a refined notion of a sampler from the junta testing framework of Chakraborty et al.\ (ICALP 2011) and combine it with $\ell_1$-minimization-based compressed sensing techniques to construct our tester. In the process, we develop a novel method for sampling the leaves of parity decision trees associated with Fourier-sparse Boolean functions. The lower bound is obtained via a reduction from communication complexity, crucially leveraging structural properties of the Fourier coefficients of a specific class of cryptographically hard functions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1702. Generalization Below the Edge of Stability: The Role of Data Geometry
๐ฌ 19
โญ 5.60
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ neural networks deep learning theory gradient descent representation learning generalization
Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparametrized two-layer ReLU networks trained *below the edge of stability*. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1703. In-Context Learning for Pure Exploration
๐ฌ 20
โญ 5.60
๐ reinforcement learning
๐ท๏ธ active sequential hypothesis testing pure exploration reinforcement learning in-context learning best arm identification
We study the _active sequential hypothesis testing_ problem, also known as _pure exploration_: given a new task, the learner _adaptively collects data_ from the environment to efficiently determine an underlying correct hypothesis. A classical instance of this problem is the task of identifying the best arm in a multi-armed bandit problem (a.k.a. BAI, Best-Arm Identification), where actions index hypotheses. Another important case is generalized search, a problem of determining the correct label through a sequence of strategically
selected queries that indirectly reveal information about the label.
In this work, we introduce _In-Context Pure Exploration_ (ICPE), which meta-trains Transformers to map _observation histories_ to _query actions_ and a _predicted hypothesis_, yielding a model that transfers in-context. At inference time, ICPE actively gathers evidence on new tasks and infers the true hypothesis without parameter updates.
Across deterministic, stochastic, and structured benchmarks, including BAI and generalized search, ICPE is competitive with adaptive baselines while requiring no explicit modeling of information structure. Our results support Transformers as practical architectures for _general sequential testing_.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1704. Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
๐ฌ 13
โญ 5.60
๐ interpretability and explainable AI
๐ท๏ธ clinical natural language processing mechanistic interpretability
LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts.
We first identify SAE latents in gemma-2 models which appear to correlate with Black individuals. We find that this latent activates on reasonable input sequences (e.g., "African American") but also on problematic words like "incarceration". We then show that we can use this latent to "steer" models to generate outputs about Black patients, and further that this can induce problematic associations in model outputs as a result. For example, activating the Black latent increases the assigned risk that a patient will become "belligerent". We also find that even in this controlled setting in which we causally intervene to manipulate only patient race, elicited CoT reasoning strings do not communicate that race is a factor in the resulting assessments. We evaluate the degree to which such "steering" via latents might be useful for mitigating bias. We find that this offers improvements in simple settings, but is less successful for more realistic and complex clinical tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1705. Estimating Worst-Case Frontier Risks of Open-Weight LLMs
๐ฌ 11
โญ 5.60
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Open-source LLMs safety frontier risks
In this paper, we study the worst-case frontier risks of the OpenAI gpt-oss model. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results led us to believe that the net new harm from releasing gpt-oss is limited, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1706. The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
๐ฌ 28
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Reinforcement Learning with Verifiable Reward f divergence
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. Despite numerous proposed methods, the community's focus on the standard reverse KL-divergence has led to a surprising oversight: the potential of alternative f-divergences as a proactive solution has been largely unexamined. We argue that standard RLVR objectivesโboth those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirelyโlack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a 'rehearsal mechanism'. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Math and SQL generation experiments show that DPH-RL both improves in-domain Pass@1 and Pass@k scores and effectively prevents catastrophic forgetting on out-of-domain tasks. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
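A hedged sketch of how a mass-covering divergence such as forward KL can be estimated from samples of the initial policy only, with no online reference model; the function names and the exact penalty form are illustrative, not the paper's implementation.

```python
import torch

def forward_kl_penalty(logp_theta, logp_ref):
    """Monte Carlo estimate of KL(pi_0 || pi_theta) on sequences sampled from pi_0.

    logp_theta: [batch] log-probs of cached pi_0 samples under the current policy
    logp_ref:   [batch] log-probs of the same samples under the frozen initial policy
                (computed once, offline)
    """
    return (logp_ref - logp_theta).mean()

def dph_style_loss(pg_loss, logp_theta, logp_ref, beta=0.05):
    # Verifiable-reward policy-gradient term plus a mass-covering divergence that
    # keeps the policy anchored to the diverse initial distribution.
    return pg_loss + beta * forward_kl_penalty(logp_theta, logp_ref)
```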
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1707. Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization
๐ฌ 23
โญ 5.60
๐ optimization
๐ท๏ธ Zeroth-Order Optimization Federated Optimization Hessian
Zeroth-order (ZO) optimization enables dimension-free communication in federated learning (FL), making it attractive for fine-tuning of large language models (LLMs) due to significant communication savings. However, existing ZO-FL methods largely overlook curvature information, despite its well-established benefits for convergence acceleration. To address this, we propose **HiSo**, a Hessian-informed ZO federated optimization method that accelerates convergence by leveraging global diagonal Hessian approximations, while strictly preserving scalar-only communication **without transmitting any second-order information**. Theoretically, for non-convex functions, we show that HiSo can achieve an accelerated convergence rate that is independent of the Lipschitz constant $L$ and model dimension $d$ under some Hessian approximation assumptions, offering a plausible explanation for the observed phenomenon of ZO convergence being much faster than its worst-case $O(d)$-bound. Empirically, across diverse LLM fine-tuning benchmarks, HiSo delivers a 1$\sim$5$\times$ speedup in communication rounds over existing state-of-the-art ZO-FL baselines. This superior convergence not only cuts communication costs but also provides strong empirical evidence that Hessian information acts as an effective accelerator in federated ZO optimization settings.
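To make the scalar-only mechanism concrete, here is a generic SPSA/MeZO-style zeroth-order step with a diagonal-Hessian preconditioner; only the scalar projected gradient and a shared random seed would need to cross the network. This is a sketch under those assumptions, not the HiSo algorithm itself.

```python
import numpy as np

def zo_step_with_diag_hessian(params, loss_fn, h_diag, mu=1e-3, lr=1e-2, seed=0):
    """One illustrative zeroth-order step preconditioned by a diagonal Hessian estimate."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)                                 # shared direction
    g = (loss_fn(params + mu * z) - loss_fn(params - mu * z)) / (2 * mu)  # scalar to communicate
    precond = 1.0 / (np.abs(h_diag) + 1e-8)                               # curvature-aware scaling
    return params - lr * g * precond * z, g

# Toy usage on a quadratic with known diagonal curvature:
# loss = lambda w: 0.5 * np.sum(np.array([1.0, 100.0]) * w**2)
# w, g = zo_step_with_diag_hessian(np.array([1.0, 1.0]), loss, h_diag=np.array([1.0, 100.0]))
```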
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1708. Multimodal Dataset Distillation Made Simple by Prototype-guided Data Synthesis
๐ฌ 16
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Dataset distillation Dataset condensation vision-language models learning-free approach
Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient.
Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets.
Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of pixel and text features, making them architecture-dependent and limiting cross-architecture generalization.
To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures.
Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation.
Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1709. GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection
๐ฌ 5
โญ 5.60
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Out-of-Distribution (OOD) detection Neural Tangent Kernel (NTK)
We introduce GradPCA, an Out-of-Distribution (OOD) detection method that exploits the low-rank structure of neural network gradients induced by Neural Tangent Kernel (NTK) alignment. GradPCA applies Principal Component Analysis (PCA) to gradient class-means, achieving more consistent performance than existing methods across standard image classification benchmarks. We provide a theoretical perspective on spectral OOD detection in neural networks to support GradPCA, highlighting feature-space properties that enable effective detection and naturally emerge from NTK alignment. Our analysis further reveals that feature quality (particularly the use of pretrained versus non-pretrained representations) plays a crucial role in determining which detectors will succeed. Extensive experiments validate the strong performance of GradPCA, and our theoretical framework offers guidance for designing more principled spectral OOD detectors.
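A minimal sketch of the "PCA on gradient class-means, score by subspace residual" idea; array shapes and the residual-based score are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def fit_gradpca_subspace(class_mean_grads, k=10):
    """PCA on per-class mean gradients (rows: classes, cols: parameters) to obtain a
    low-rank in-distribution gradient subspace."""
    G = class_mean_grads - class_mean_grads.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    return Vt[:k]                                    # [k, d] principal directions

def ood_score(grad, basis):
    """Score a test sample by how much of its gradient lies outside the subspace."""
    proj = basis.T @ (basis @ grad)
    residual = grad - proj
    return np.linalg.norm(residual) / (np.linalg.norm(grad) + 1e-12)
```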
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1710. Delay Flow Matching
๐ฌ 23
โญ 5.60
๐ generative models
๐ท๏ธ Generative Models Flow Matching Delay Differential Equations Trajectory Intersection Heterogeneous Distribution Transfer
Flow matching (FM) based on Ordinary Differential Equations (ODEs) has achieved significant success in generative tasks. However, it faces several inherent limitations, including an inability to model trajectory intersections, capture delay dynamics, and handle transfer between heterogeneous distributions. These limitations often result in a significant mismatch between the modeled transfer process and real-world phenomena, particularly when key coupling or inherent structural information between distributions must be preserved. To address these issues, we propose Delay Flow Matching (DFM), a new FM framework based on Delay Differential Equations (DDEs). Theoretically, we show that DFM possesses universal approximation capability for continuous transfer maps. By incorporating delay terms into the vector field, DFM enables trajectory intersections and better captures delay dynamics. Moreover, by designing appropriate initial functions, DFM ensures accurate transfer between heterogeneous distributions. Consequently, our framework preserves essential coupling relationships and achieves more flexible distribution transfer strategies. We validate DFM's effectiveness across synthetic datasets, single-cell data, and image-generation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1711. Composer: A Search Framework for Hybrid Neural Architecture Design
๐ฌ 16
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ Neural architecture search hybrid models efficient ML
Hybrid model architectures that combine computational primitives (e.g., Attention, MLP) in different ratios have shown promising performance beyond Transformers. Some studies have shown that different interleavings of primitives can affect model quality as well. However, prior works explore the hybrid model architecture design space manually. Due to the large design space and training costs, discovering hybrid models that combine key computational primitives for pre-training is challenging. In this work, we take a principled approach in designing a modular hybrid model architecture search framework โ Composer. Composer explores model architectures at a small scale and extrapolates the top-performing model architectures to a larger scale using our proposed scaling strategies. Using Composer, we discover new hybrid LLM architectures that outperform Llama 3.2. Compared to Llama 3.2 and previous state-of-the-art baselines, the new model architectures consistently reduce validation loss at parameter scales of 350M-3B and improve evaluation accuracy on the downstream tasks by up to 2.8-8.3% (1.1-3.1% on average) while improving both training and inference efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1712. Biologically Plausible Learning via Bidirectional Spike-Based Distillation
๐ฌ 29
โญ 5.60
๐ applications to neuroscience & cognitive science
๐ท๏ธ spiking neural networks learning algorithms
Developing biologically plausible learning algorithms that can achieve performance comparable to error backpropagation remains a longstanding challenge.
Existing approaches often compromise biological plausibility by entirely avoiding the use of spikes for error propagation or relying on both positive and negative learning signals, while the question of how spikes can represent negative values remains unresolved.
To address these limitations, we introduce Bidirectional Spike-based Distillation (BSD), a novel learning algorithm that jointly trains a feedforward and a backward spiking network.
We formulate learning as a transformation between two spiking representations (i.e., stimulus encoding and concept encoding) so that the feedforward network implements perception and decision-making by mapping stimuli to actions, while the backward network supports memory recall by reconstructing stimuli from concept representations.
Extensive experiments on diverse benchmarks, including image recognition, image generation, and sequential regression, show that BSD achieves performance comparable to networks trained with classical error backpropagation.
These findings represent a significant step toward biologically grounded, spike-driven learning in neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1713. HOTA: Hamiltonian framework for Optimal Transport Advection
๐ฌ 11
โญ 5.60
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Optimal transport optimal control generalized Schrödinger bridge diffusion models
Optimal transport (OT) has become a natural framework for guiding the probability flows. Yet, the majority of recent generative models assume trivial geometry (e.g., Euclidean) and rely on strong density-estimation assumptions, yielding trajectories that do not respect the true principles of optimality in the underlying manifold. We present Hamiltonian Optimal Transport Advection (HOTA), a Hamilton-Jacobi-Bellman based method that tackles the dual dynamical OT problem explicitly through Kantorovich potentials, enabling efficient and scalable trajectory optimization. Our approach effectively evades the need for explicit density modeling, performing even when the cost functionals are non-smooth. Empirically, HOTA outperforms all baselines in standard benchmarks, as well as in custom datasets with non-differentiable costs, both in terms of feasibility and optimality.
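For reference, the standard dual dynamical (Benamou-Brenier) formulation that a Hamilton-Jacobi-Bellman approach of this kind typically builds on; the exact objective and Hamiltonian used by HOTA may differ.

```latex
% Dual dynamic OT over a Kantorovich potential \varphi with Hamiltonian H:
\sup_{\varphi}\; \int \varphi(1,\cdot)\, d\mu_1 - \int \varphi(0,\cdot)\, d\mu_0
\quad \text{s.t.} \quad
\partial_t \varphi + H\big(x, \nabla_x \varphi\big) \le 0,
\qquad H(x,p) = \tfrac{1}{2}\|p\|^2 \ \text{(quadratic cost)}.
```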
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1714. VGR: Visual Grounded Reasoning
๐ฌ 38
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ VLM MultiModal Cot
In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning in pure linguistic space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) that can replay visual memory during thinking, much as humans do. Unlike traditional MLLMs, VGR first reasons about the question and detects relevant regions that may help solve the problem; the visual memory from these critical areas is then extracted to assist reasoning. To achieve this, we curate a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. This teaches VGR to think and actively choose grounding areas for key information before answering, and we propose a dynamic visual memory replay stage that integrates the corresponding information into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1715. CerebraGloss: Instruction-Tuning a Large Vision-Language Model for Fine-Grained Clinical EEG Interpretation
๐ฌ 15
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ large vision-language model instruction-tuning EEG clinical
Interpreting clinical electroencephalography (EEG) is a laborious, subjective process, and existing computational models are limited to narrow classification tasks rather than holistic interpretation. A key bottleneck for applying powerful Large Vision-Language Models (LVLMs) to this domain is the scarcity of datasets pairing EEG visualizations with fine-grained, expert-level annotations. We address this by introducing CerebraGloss, an instruction-tuned LVLM for nuanced EEG interpretation. We first introduce a novel, automated data generation pipeline, featuring a bespoke YOLO-based waveform detector, to programmatically create a large-scale corpus of EEG-text instruction data. Using this data, we develop CerebraGloss, the first model of its kind capable of unified, generative analysis, performing tasks from detailed waveform description to multi-turn, context-aware dialogue. To evaluate this new capability, we construct and release CerebraGloss-Bench, a comprehensive benchmark for open-ended EEG interpretation. CerebraGloss demonstrates strong performance, surpassing leading LVLMs, including proprietary models like GPT-5, on this benchmark and achieving a new state-of-the-art on the TUSZ seizure detection task. We will open-source our model, benchmark, and tools to foster progress in developing general-purpose neuro-intelligent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1716. An Efficient SE(p)-Invariant Transport Metric Driven by Polar Transport Discrepancy-based Representation
๐ฌ 25
โญ 5.60
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Distribution comparison; Optimal Transport; Special Euclidean group; Shape matching
We introduce SEINT, a novel Special Euclidean group-Invariant (SE($p$)) metric for comparing probability distributions on $p$-dimensional measured Banach spaces. Existing SE($p$)-invariant alignment methods often face high computational costs or lack metric guarantees. To overcome these limitations, we develop a polar transport discrepancy combined with distance convolution to extract SE($p$)-invariant representations. These representations are then used to compute the alignment between two distributions via optimal transport.
Theoretically, we prove that SEINT is a well-defined metric on the space of isometry classes of normed vector spaces. Beyond its inherent SE($p$)-invariance, SEINT also supports cross-space distribution comparison.
Computationally, SEINT aligns two samples of size $n$ with a complexity of just $\mathcal{O}(n\log n)$ to $\mathcal{O}(n^2)$. Extensive experiments validate its advantages: as a robust metric, it outperforms or matches existing SE($p$)-invariant methods in classification and cross-space tasks under isometries. As a regularizer, it greatly enhances molecular generation performance across both pre-training and fine-tuning tasks, achieving state-of-the-art (SOTA) results on key benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1717. GRACE: Generative Representation Learning via Contrastive Policy Optimization
๐ฌ 20
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Text Representation Reinforcement Learning
Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black-box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy $\pi_\theta$ that produces explicit, human-interpretable rationales: structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves the overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent decision traces.
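A minimal sketch of a contrastive-signal-as-reward term of the kind described above; the specific form (hardest negative, optional margin) is an illustrative assumption, not necessarily the paper's multi-component reward.

```python
import torch
import torch.nn.functional as F

def contrastive_reward(q_emb, pos_emb, neg_embs, margin=0.0):
    """Reward for a rationale-then-embed policy: high when the rationale's embedding is
    close to the positive and far from the hardest negative.

    q_emb: [B, d], pos_emb: [B, d], neg_embs: [B, N, d]
    """
    pos_sim = F.cosine_similarity(q_emb, pos_emb, dim=-1)                           # [B]
    neg_sim = F.cosine_similarity(q_emb.unsqueeze(1), neg_embs, dim=-1).max(dim=1).values
    return pos_sim - neg_sim - margin   # scalar reward per example for policy-gradient updates
```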
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1718. Scaling Direct Feedback Learning with Theoretical Guarantees
๐ฌ 19
โญ 5.60
๐ optimization
๐ท๏ธ backpropagation-free learning optimization
Deep neural networks rely on backpropagation (BP) for optimization, but its strictly sequential backward pass hinders parallelism and scalability. Direct Feedback Alignment (DFA) has been proposed as a promising approach for parallel learning of deep neural networks, relying on fixed random projections to enable layer-wise parallel updates, but it fails on deep convolutional networks and performs poorly on modern transformer architectures. We introduce GrAPE (Gradient-Aligned Projected Error), a hybrid feedback-alignment method that (i) estimates rank-1 Jacobians via forward-mode JVPs and (ii) aligns each layer's feedback matrix by minimizing a local cosine-alignment loss. To curb drift in very deep models, GrAPE performs infrequent BP anchor steps on a single mini-batch, preserving mostly parallel updates. We show that the forward-gradient estimator has a strictly positive expected cosine with the true Jacobian and, inspired by Zoutendijk-style arguments, derive a convergence-in-expectation result under a positive expected-cosine condition. Empirically, GrAPE consistently outperforms prior alternatives to BP, enabling the training of modern architectures and closing a large fraction of the gap to BP while retaining layer-parallel updates for the vast majority of steps.
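For intuition, here is the standard forward-gradient (rank-1, JVP-based) estimator the abstract refers to, written with a finite-difference stand-in for the JVP; a real implementation would use forward-mode autodiff, and this sketch is not the authors' code.

```python
import numpy as np

def forward_gradient_estimate(f, x, eps=1e-6, rng=None):
    """Rank-1 forward-mode estimate of grad f(x): sample a direction v, take the
    directional derivative (a JVP, here approximated by central differences), and
    return (grad f(x) . v) * v. Its expected cosine with the true gradient is positive."""
    rng = rng or np.random.default_rng(0)
    v = rng.standard_normal(x.shape)
    jvp = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)   # scalar directional derivative
    return jvp * v

# Sanity check on a quadratic: for f(w) = 0.5 * ||w||^2 the true gradient is w, and the
# estimate aligns with it on average over random directions v.
```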
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1719. RMAAT: Astrocyte-Inspired Memory Compression and Replay for Efficient Long-Context Transformers
๐ฌ 23
โญ 5.60
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Brain-inspired machine learning Astromorphic transformers Short-term Plasticity Long-term Plasticity Long-context sequence modeling
The quadratic complexity of the self-attention mechanism presents a significant impediment to applying Transformer models to long sequences. This work explores computational principles derived from astrocytes, glial cells critical for biological memory and synaptic modulation, as a complementary approach to conventional architectural modifications for efficient self-attention. We introduce the Recurrent Memory Augmented Astromorphic Transformer (RMAAT), an architecture integrating abstracted astrocyte functionalities. RMAAT employs a recurrent, segment-based processing strategy where persistent memory tokens propagate contextual information. An adaptive compression mechanism, governed by a novel retention factor derived from simulated astrocyte long-term plasticity (LTP), modulates these tokens. Attention within segments utilizes an efficient, linear-complexity mechanism inspired by astrocyte short-term plasticity (STP). Training is performed using Astrocytic Memory Replay Backpropagation (AMRB), a novel algorithm designed for memory efficiency in recurrent networks. Evaluations on the Long Range Arena (LRA) benchmark demonstrate RMAAT's competitive accuracy and substantial improvements in computational and memory efficiency, indicating the potential of incorporating astrocyte-inspired dynamics into scalable sequence models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1720. MolEditRL: Structure-Preserving Molecular Editing via Discrete Diffusion and Reinforcement Learning
๐ฌ 19
โญ 5.60
๐ generative models
๐ท๏ธ Molecular Editing; Discrete Diffusion; Reinforcement Learning
Molecular editing aims to modify a given molecule to optimize desired chemical properties while preserving structural similarity. However, current approaches typically rely on string-based or continuous representations, which fail to adequately capture the discrete, graph-structured nature of molecules, resulting in limited structural fidelity and poor controllability. In this paper, we propose MolEditRL, a molecular editing framework that explicitly integrates structural constraints with precise property optimization. Specifically, MolEditRL consists of two stages: (1) a discrete graph diffusion model pretrained to reconstruct target molecules conditioned on source structures and natural language instructions; (2) an editing-aware reinforcement learning fine-tuning stage that further enhances property alignment and structural preservation by explicitly optimizing editing decisions under graph constraints. For comprehensive evaluation, we construct MolEdit-Instruct, the largest and most property-rich molecular editing dataset, comprising 3 million diverse examples spanning single- and multi-property tasks across 10 chemical attributes. Experimental results demonstrate that MolEditRL significantly outperforms state-of-the-art methods in both property optimization accuracy and structural fidelity, achieving a 74% improvement in editing success rate while using 98% fewer parameters.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1721. SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
๐ฌ 21
โญ 5.60
๐ datasets and benchmarks
๐ท๏ธ Spatial reasoning VLMs benchmark
We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs).
SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation.
Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories.
Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source.
Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations.
Scaling analysis shows both smooth improvements and emergent capabilities.
While human subjects achieve high accuracy (91.2\%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs.
Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models.
Our website can be found [here](https://sites.google.com/view/spinbench/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1722. Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors
๐ฌ 22
โญ 5.60
๐ interpretability and explainable AI
๐ท๏ธ next-token prediction transformers interpretability
Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1723. Time-Gated Multi-Scale Flow Matching for Time-Series Imputation
๐ฌ 18
โญ 5.60
๐ learning on time series and dynamical systems
๐ท๏ธ Time-series imputation Flow matching ODE-based generative models Transformers Multi-scale modeling
We address multivariate time-series imputation by learning the velocity field of a data-conditioned ordinary differential equation (ODE) via flow matching. Our method, Time-Gated Multi-Scale Flow Matching (TG-MSFM), conditions the flow on a structured endpoint comprising observed values, a per-time visibility mask, and short left/right context, processed by a time-aware Transformer whose self-attention is masked to aggregate only from observed timestamps. To reconcile global trends with local details along the trajectory, we introduce time-gated multi-scale velocity heads on a fixed 1D pyramid and blend them through a time-dependent gate; a mild anti-aliasing filter stabilizes the finest branch. At inference, we use a second-order Heun integrator with a per-step data-consistency projection that keeps observed coordinates exactly on the straight path from the initial noise to the endpoint, reducing boundary artifacts and drift. Training adopts gap-only supervision of the velocity on missing data coordinates, with small optional regularizers for numerical stability. Across standard benchmarks, Time-Gated Multi-Scale Flow Matching attains competitive or improved MSE/MAE with favorable speed-quality trade-offs, and ablations isolate the contributions of the time-gated multi-scale heads, masked attention, and the data-consistent ODE integration.
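A minimal sketch of the Heun integration with a per-step data-consistency projection described above; the `velocity` callable and array shapes are hypothetical, and the projection onto the straight noise-to-endpoint path is the only detail taken from the abstract.

```python
import numpy as np

def heun_impute(velocity, x0, x_obs, mask, n_steps=20):
    """Integrate a learned velocity field with Heun's method; after each step, snap the
    observed coordinates back onto the straight path from the initial noise x0 to x_obs."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v1 = velocity(x, t)
        x_pred = x + dt * v1                      # Euler predictor
        v2 = velocity(x_pred, t + dt)
        x = x + 0.5 * dt * (v1 + v2)              # Heun corrector
        t_next = t + dt
        straight = (1.0 - t_next) * x0 + t_next * x_obs
        x = np.where(mask, straight, x)           # data-consistency projection
    return x
```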
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1724. The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?
๐ฌ 27
โญ 5.60
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Class Incremental Learning Continual Learning Evaluation Protocol Extreme Class Sequences
Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose **EDGE** (Extreme case-based Distribution & Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1725. Continuous Chain of Thought: Parallel Exploration and Reasoning through a Theoretical Lens
๐ฌ 20
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ chain-of-thought latent space reasoning parallel exploration transformers policy optimization multi token sampling
Modern language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 enables the model to track multiple discrete traces in parallel, and we quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial "subset sum problem" given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the language model outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization methods for CoT2. Our primary strategy samples and composes $K$ discrete tokens at each decoding step to control the level of parallelism.
Experiments confirm that (i) the optimal level of parallelism is governed by the embedding dimension, (ii) our continuous supervision strategy can outperform alternative methods, and (iii) policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1726. Reusing Pre-Training Data at Test Time is a Compute Multiplier
๐ฌ 10
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ data datasets pretraining pre-training retrieval llm llms test time compute
Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1727. Mechanistic Independence: A Principle for Identifiable Disentangled Representations
๐ฌ 11
โญ 5.60
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Identifiability Disentangled Representation Mechanistic Independence
*Disentangled representations* seek to recover latent factors of variation underlying observed data, yet their *identifiability* is still not fully understood. We introduce a unified framework in which disentanglement is achieved through *mechanistic independence*, which characterizes latent factors by how they act on observed variables rather than by their latent distribution. This perspective is invariant to changes of the latent density, even when such changes induce statistical dependencies among factors. Within this framework, we propose several related independence criteria -- ranging from support-based and sparsity-based to higher-order conditions -- and show that each yields identifiability of latent subspaces, even under nonlinear, non-invertible mixing. We further establish a hierarchy among these criteria and provide a graph-theoretic characterization of latent factors as connected components. Together, these results clarify the conditions under which disentangled representations can be identified without relying on statistical assumptions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1728. FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models
๐ฌ 21
โญ 5.60
๐ optimization
๐ท๏ธ low-rank optimization fast fourier transform computational efficiency memory efficiency efficient optimization large language models
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient, conceptually simple two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a single matrix multiplication with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul's $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to 25% across different model sizes.
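A minimal sketch of the two-step idea: build an orthonormal DCT basis once, then pick the basis vectors most aligned with a layer's gradient and project onto them. Shapes and the scoring rule are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: rows are cosine basis vectors, D @ D.T = I."""
    i = np.arange(n)
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (i[None, :] + 0.5) * i[:, None] / n)
    D[0] /= np.sqrt(2.0)
    return D

def project_gradient_dct(grad, rank):
    """Select the DCT basis vectors most aligned with a 2-D layer gradient [n, m]
    and return the low-rank projected gradient plus the selected basis."""
    D = dct_matrix(grad.shape[0])                 # [n, n], rows = basis vectors
    scores = np.linalg.norm(D @ grad, axis=1)     # alignment of each basis vector
    idx = np.argsort(scores)[-rank:]              # top-`rank` basis vectors
    P = D[idx]                                    # [rank, n] projection
    return P @ grad, P
```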
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1729. Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
๐ฌ 17
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Multimodal Prompt Optimization Prompt Optimization
Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1730. EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
๐ฌ 22
โญ 5.60
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Emotion Hallucination Emotion Understanding Affective Computing
Emotion understanding is a critical yet challenging task.
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from "hallucinations", generating irrelevant or nonsensical content.
To the best of our knowledge, and despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs.
In this work, we introduce **EmotionHallucer**, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs.
Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts.
Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions.
Building on this knowledge, we assess emotion hallucinations from two perspectives: emotion psychology knowledge and real-world multimodal perception.
To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs.
By evaluating 41 LLMs and MLLMs on EmotionHallucer, we find that:
(1) most current models exhibit substantial issues with emotion hallucinations;
(2) closed-source models outperform open-source models in detecting emotion hallucinations, and reasoning capability provides additional advantages;
and (3) existing models perform better in emotion psychology knowledge than in multimodal emotion perception.
As a byproduct, these findings inspire us to propose the **PEP-MEK** framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models.
Resources will be available on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1731. Seeing What's Not There: Negation Understanding Needs More Than Training
๐ฌ 27
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Negation Zeroshot VisionlanguageModels MachineLearning ComputerVision DeepLearning
Understanding negation in a sentence is an important part of compositional understanding and logic in natural language. Many practical AI applications, such as autonomous driving, include precise instructions with negations. For example, following an instruction to an AI assistant to "locate a parking spot without a vehicle" requires the assistant not to confuse the presence and absence of vehicles. Although joint embedding-based Vision Language Models (VLMs) like CLIP have revolutionized multi-modal tasks, they struggle to interpret negation. To address this limitation, many recent works propose to solve the problem through a data-centric approach by introducing additional datasets with hard-negative samples for both image and text data. Contrary to these approaches, we present a zero-shot approach to the negation understanding problem. We probe the properties of CLIP text embeddings and show that they follow compositional arithmetic operations, which allow the addition or removal of semantic information directly in the embedding space. We then present a rule-based approach to extract the negated text from a given caption and use it to explicitly remove the corresponding semantic information from the original embedding, improving negation understanding in VLMs. Our approach does not require an expensive training process to induce negation understanding into the model, and it achieves state-of-the-art performance on a popular benchmark for negation understanding. We improve baseline CLIP performance on NegBench from 25.5% to 67.0% for MCQ and from 50.9% to 56.1% for retrieval tasks. Even for the NegCLIP model, which is fine-tuned on negation datasets, our approach boosts MCQ accuracy from 54.03% to 66.22% and retrieval accuracy from 59.25% to 60.1%, showing strong performance.
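A minimal sketch of the compositional-arithmetic idea: subtract the embedding of the negated concept from the caption embedding and renormalize. The encoder call, removal strength `alpha`, and helper names are hypothetical, not the authors' code.

```python
import torch
import torch.nn.functional as F

def remove_negated_concept(caption_emb, negated_emb, alpha=1.0):
    """Zero-shot removal of a negated concept from a CLIP text embedding."""
    edited = caption_emb - alpha * negated_emb
    return F.normalize(edited, dim=-1)

# Hypothetical usage with a CLIP text encoder:
# cap = encode_text("a parking spot without a vehicle")
# neg = encode_text("a vehicle")        # extracted by a rule-based parser
# query = remove_negated_concept(cap, neg)
```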
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1732. Prompt Curriculum Learning for Efficient LLM Post-Training
๐ฌ 16
โญ 5.60
๐ reinforcement learning
๐ท๏ธ reinforcement learning large language models post-training curriculum learning
Reinforcement learning (RL) is widely used to post-train large language models for tasks such as mathematical reasoning and coding. However, the convergence of RL training remains sensitive to batching and prompt selection strategies. We investigate the factors that affect convergence, including batch size and prompt difficulty. Through large-scale experiments across multiple models and datasets, we show that there exists an optimal batch size that balances generation time and gradient quality, and that prompts of intermediate difficulty (where the model has roughly a 50% chance of success) are the most sample-efficient for model convergence. Motivated by these findings, we propose Prompt Curriculum Learning (PCL), a lightweight algorithm that selects intermediate-difficulty prompts using a learned value model. PCL avoids costly rollouts and efficiently guides training by focusing on the most informative samples. Empirically, PCL either achieves the highest performance or requires significantly less training time to reach comparable performance across a suite of benchmarks. Compared to rollout-based filtering, PCL is $12.1\times$ and $16.9\times$ faster at identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively.
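A minimal sketch of the selection step: score prompts with a learned value model predicting success probability and keep those closest to 50%. The interface is hypothetical, not the paper's implementation.

```python
import torch

def select_intermediate_prompts(prompt_feats, value_model, batch_size):
    """Pick the prompts whose predicted success probability is nearest to 0.5."""
    with torch.no_grad():
        p_success = torch.sigmoid(value_model(prompt_feats)).squeeze(-1)  # [num_prompts]
    gap = (p_success - 0.5).abs()
    idx = torch.topk(-gap, k=batch_size).indices      # smallest |p - 0.5| first
    return idx, p_success[idx]
```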
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1733. EA3D: Event-Augmented 3D Diffusion for Generalizable Novel View Synthesis
๐ฌ 18
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Novel view synthesis; Event Cameras; Diffusion model
We introduce **EA3D**, an Event-Augmented 3D Diffusion framework for generalizable novel view synthesis from event streams and sparse RGB inputs. Existing approaches either rely solely on RGB frames for generalizable synthesis, which limits their robustness under rapid camera motion, or require per-scene optimization to exploit event data, undermining scalability.
EA3D addresses these limitations by jointly leveraging the complementary strengths of asynchronous events and RGB imagery.
At its core lies a learnable EA-Renderer, which constructs view-dependent 3D features within target camera frustums by fusing appearance cues from RGB frames with geometric structure extracted from adaptively sliced event voxels.
These features condition a 3D-aware diffusion model, enabling high-fidelity and temporally consistent novel view generation along arbitrary camera trajectories.
To further enhance scalability and generalization, we develop the Event-DL3DV dataset, a large-scale 3D benchmark pairing diverse synthetic event streams with photorealistic multi-view RGB images and depth maps.
Extensive experiments on both real-world and synthetic event data demonstrate that EA3D consistently outperforms optimization-based and generalizable baselines, achieving superior fidelity and cross-scene generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1734. Learned Meta-Tokens for Language Modeling
๐ฌ 14
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ meta-tokens language models pre-training positional encoding
Transformer-based language models (LMs) notably struggle to reliably capture distant contextual information. This work introduces a novel approach using meta-tokens -- special tokens injected during pre-training -- paired with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model equipped with meta-attention in addition to causal multi-head attention on <100B tokens, achieving strong performance on a suite of synthetic tasks. Our method facilitates length generalization up to 2$\times$ the context window after extension with YaRN. We provide an information-theoretic analysis which reveals that meta-tokens *sharpen* the positional encoding, allowing them to operate as content-based anchors that compress preceding context and "cache" it within the meta-token. We empirically confirm this by visualizing model internals to study the residual stream. Together, our findings demonstrate that meta-tokens and meta-attention provide a simple, data-efficient pre-training method, grounded by new mechanistic insights into their role in enabling length generalization behavior.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1735. SnaPhArd Contrast Learning
๐ฌ 24
โญ 5.60
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Contrastive Learning Hard Sample Screening Contrastive Loss Computational Geometry
In recent years, Contrastive Learning (CL) has garnered significant attention due to its efficacy across various domains, spanning visual and textual modalities. A fundamental aspect of CL is aligning the representations of anchor instances with relevant positive samples while simultaneously separating them from negative ones. Prior studies have extensively explored diverse strategies for generating and sampling contrastive (i.e., positive/negative) pairs. Despite the empirical success, the theoretical understanding of the CL approach remains under-explored, leaving questions such as the rationale behind contrastive-pair sampling and its contribution to model performance unclear.
This paper addresses this gap by providing a comprehensive theoretical analysis from the angle of optimality conditions and introducing SnaPhArd Contrast Learning (SPACL). Specifically, SPACL prioritizes hard positive and hard negative samples when constructing contrastive pairs and computing the contrastive loss, rather than treating all samples equally. Experimental results across two downstream tasks demonstrate that SPACL consistently outperforms or competes favorably with state-of-the-art methods, showcasing its robustness and efficacy. A comprehensive ablation study further examines the effectiveness of SPACL's individual components to verify the theoretical findings.
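A generic sketch of a hard-pair contrastive objective of the kind the abstract describes (use the least similar positive and the most similar negative per anchor); the exact form is an assumption for illustration, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def hard_pair_contrastive_loss(anchor, positives, negatives, tau=0.1):
    """anchor: [B, d], positives: [B, P, d], negatives: [B, N, d]."""
    pos_sim = F.cosine_similarity(anchor.unsqueeze(1), positives, dim=-1)  # [B, P]
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)  # [B, N]
    hard_pos = pos_sim.min(dim=1).values / tau                             # hardest positive
    hard_neg = neg_sim.max(dim=1).values / tau                             # hardest negative
    logits = torch.stack([hard_pos, hard_neg], dim=1)                      # [B, 2]
    labels = torch.zeros(anchor.size(0), dtype=torch.long)                 # positive is class 0
    return F.cross_entropy(logits, labels)
```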
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1736. LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
๐ฌ 21
โญ 5.60
๐ generative models
๐ท๏ธ Diffusion DiT Image Editing
The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, forcing a fundamental compromise: weakened inversion strength and costly test-time optimization (TTO).
This compromise severely limits generative capability, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost attention control.
This reliable reference opens the potential for a stable full-strength inversion process, the first in drag-based editing. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a "tennis ball", or, for ambiguous drags, making context-aware changes like moving hands into pockets. Moreover, LazyDrag supports multi-round edits with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by mean distances, VIEScore, and user studies. LazyDrag not only sets a new state of the art, but also paves the way for new editing paradigms. Code will be open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1737. Robust Federated Inference
๐ฌ 21
โญ 5.60
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Collaborative Inference Robust Ensembles Federated Ensembles
Federated inference, in the form of one-shot federated learning, edge ensembles, or federated ensembles, has emerged as an attractive solution to combine predictions from multiple models. This paradigm enables each model to remain local and proprietary while a central server queries them and aggregates predictions. Yet the robustness of federated inference has been largely neglected, leaving these methods vulnerable to even simple attacks. To address this critical gap, we formalize the problem of robust federated inference and provide the first robustness analysis of this class of methods. Our analysis of averaging-based aggregators shows that the error of the aggregator is small either when the dissimilarity between honest responses is small or when the margin between the two most probable classes is large. Moving beyond linear averaging, we show that the problem of robust federated inference with non-linear aggregators can be cast as an adversarial machine learning problem. We then introduce an advanced technique using the DeepSet aggregation model, proposing a novel composition of adversarial training and test-time robust aggregation to robustify non-linear aggregators. Our composition yields significant improvements, surpassing existing robust aggregation methods by 4.7-22.2 accuracy points across diverse benchmarks.
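For context, a standard robust-statistics baseline for the aggregation step (coordinate-wise trimmed mean over the models' class-probability vectors); this is a generic sketch and not the paper's DeepSet-based aggregator.

```python
import numpy as np

def trimmed_mean_aggregate(probs, trim=0.2):
    """probs: [num_models, num_classes]; a few adversarial rows cannot dominate."""
    m = probs.shape[0]
    k = int(np.floor(trim * m))
    sorted_probs = np.sort(probs, axis=0)           # sort each class column independently
    kept = sorted_probs[k:m - k] if m - 2 * k > 0 else sorted_probs
    agg = kept.mean(axis=0)
    return agg / agg.sum()                          # renormalize to a distribution
```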
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1738. Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning
๐ฌ 19
โญ 5.60
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ data poisoning language model ai security dataset ownership verification training data membership privacy copyright
The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins.
Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on *regurgitation* of training data, which LM providers try to limit.
In this work, we demonstrate that *indirect data poisoning* (where the targeted behavior is absent from training data) is not only feasible against LLMs but can also be used to effectively protect a dataset and trace its use.
Using gradient-based prompt optimization, we craft poisons that make a model learn arbitrary *secret sequences*: secret responses to secret prompts that are **absent from the training corpus**.
We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens is sufficient to covertly make an LM learn a *secret*, which can then be detected with extremely high confidence ($p < 10^{-55}$) using a theoretically certifiable scheme.
Crucially, this occurs without performance degradation (on LM benchmarks) and despite secrets **never appearing in the training set**.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1739. Householder-Diagonalized Linear Attention (HDLA): Utilizing Enhanced Decay Mechanism for Efficient Sequence Modeling
๐ฌ 20
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ Linear Attention Language Model Foundation Model
Linear attention mechanisms have emerged as efficient alternatives to Softmax attention, exhibiting steady improvements in language modeling capabilities driven by increasingly sophisticated designs for decay matrices, though their structural complexity has typically been limited to the Diagonal-Plus-Rank-1 level. To further advance the understanding and capabilities of linear attention via more complex decay structures, this work makes two primary contributions: (1) We propose the HDLA linear attention mechanism, which utilizes efficient matrix decomposition to achieve a Diagonal-Plus-Rank-2 structure, thereby extending the decay matrix to a broader, more expressive, rank-enhanced and structured class. (2) We propose a more general chunk-wise parallel algorithm that accommodates both a diagonal-plus-rank-$r_{ab}$ decay structure and key-value outer products of rank $r_{kv}$, thus providing a versatile foundation for future research. Comprehensive experiments demonstrate that, compared to linear attention baselines, HDLA sets new SOTA results on language modeling and retrieval tasks at the 2.8B parameter scale, delivers up to 80% and 58.2% performance gains over baselines on retrieval-based MQAR and RULER tasks, and achieves an average score improvement of 4.39-7.66 on the synthetic MAD benchmark, respectively. Our proposed HDLA model, as well as the rank-generalized chunk-wise parallel algorithm, together provide a versatile algorithmic foundation and promising research prospects for the design of rank-enhanced, structured linear attention mechanisms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1740. Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry
๐ฌ 12
โญ 5.60
๐ learning theory
๐ท๏ธ symmetry invariance relaxed equivariance complexity theory
Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking:
What is the cost of enforcing exact versus approximate symmetry?
To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, achieving exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic averaging complexity.
To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.
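A rough way to picture the exact-versus-approximate gap is symmetrization by averaging over a finite group, here the cyclic shift group: exact invariance averages over all group elements, while approximate invariance subsamples only a few. The function and group below are illustrative assumptions, not the paper's formal setting.

```python
import numpy as np

def exact_symmetrize(f, x, n):
    """Exactly shift-invariant symmetrization: average f over all n cyclic shifts of x
    (n evaluations of f)."""
    return float(np.mean([f(np.roll(x, s)) for s in range(n)]))

def approx_symmetrize(f, x, n, m, rng):
    """Approximately invariant symmetrization: average f over m << n random shifts
    (m evaluations of f)."""
    shifts = rng.integers(0, n, size=m)
    return float(np.mean([f(np.roll(x, s)) for s in shifts]))

f = lambda v: float(v[0] * v[1])        # a function with no shift symmetry of its own
rng = np.random.default_rng(0)
x = rng.normal(size=1024)

print(exact_symmetrize(f, x, 1024))            # exact group average, 1024 evaluations
print(approx_symmetrize(f, x, 1024, 16, rng))  # cheap Monte Carlo estimate, 16 evaluations
```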
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1741. Taming the Fragility of KV Cache Eviction in LLM Inference
๐ฌ 23
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient AI Large Language Model; LLM Inference
Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer's Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the "stability assumption": that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation out of trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3× and 4.3×, respectively, versus the strongest baseline under a 20\% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management.
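A toy sketch of why the aggregation rule matters: mean aggregation can evict a cache entry whose importance is low on average but spikes at a single step, whereas a worst-case-aware blend keeps it. The blending rule and scores below are hypothetical and are not the paper's DefensiveKV procedure.

```python
import numpy as np

def mean_aggregation(scores):
    """Standard rule: average each cache entry's importance over generation steps."""
    return scores.mean(axis=0)

def worst_case_aware_aggregation(scores, alpha=0.5):
    """Illustrative defensive rule (not the paper's): blend the mean with the per-entry
    maximum, so an entry that is crucial at even one step is less likely to be evicted."""
    return (1 - alpha) * scores.mean(axis=0) + alpha * scores.max(axis=0)

# scores[t, i]: importance of cache entry i at step t.  Entry 1 looks unimportant
# on average but spikes at the last step.
scores = np.array([[0.5, 0.0, 0.3],
                   [0.5, 0.0, 0.3],
                   [0.5, 0.8, 0.3]])
keep = 2
for agg in (mean_aggregation, worst_case_aware_aggregation):
    kept = np.argsort(agg(scores))[::-1][:keep]
    print(agg.__name__, sorted(kept.tolist()))  # mean keeps {0, 2}; defensive keeps {0, 1}
```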
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1742. STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
๐ฌ 21
โญ 5.60
๐ datasets and benchmarks
๐ท๏ธ audio understanding spatio-temporal reasoning 4D Intelligence
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning.
We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5\% temporal, -35.2\% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1743. A Fictional Q&A Dataset for Studying Memorization and Knowledge Acquisition
๐ฌ 20
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models memorization knowledge acquisition datasets
When language models are trained on textual data, they acquire both knowledge about the structure of language as well as knowledge of facts about the world.
At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users.
It is well known that language models can verbatim memorize long sequences from their training data.
However, it is much less well understood how language models memorize facts seen during training.
In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization.
The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events.
We conduct training experiments showing how synthetic data about fictional events can be effective in teasing apart different forms of memorization.
We also document the challenges in effectively building realistic, fictional synthetic data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1744. Count Bridges enable Modeling and Deconvolving Transcriptomics
๐ฌ 20
โญ 5.60
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ ordinal data diffusion schrodinger bridge flow matching single cell genomics spatial transcriptomics
Many modern biological assays, including RNA sequencing, yield integer-valued counts that reflect the number of RNA molecules detected. These measurements are often not at the desired resolution: while the unit of interest is typically a single cell, many RNA sequencing and imaging technologies produce counts aggregated over sets of cells.
Although recent generative frameworks such as diffusion and flow matching have been extended to non-Euclidean and discrete settings, it remains unclear how best to model integer-valued data or how to systematically deconvolve aggregated observations.
We introduce Count Bridges, a stochastic bridge process on the integers that provides an exact, tractable analogue of diffusion-style models for count data, with closed-form conditionals for efficient training and sampling. We extend this framework to enable direct training from aggregated measurements via an Expectation-Maximization-style approach that treats unit-level counts as latent variables.
We demonstrate state-of-the-art performance on integer distribution matching benchmarks, comparing against flow matching and discrete flow matching baselines across various metrics. We then apply Count Bridges to two large-scale problems in biology: modeling single-cell gene expression data at the nucleotide resolution, with applications to deconvolving bulk RNA-seq, and resolving multicellular spatial transcriptomic spots into single-cell count profiles. Our methods offer a principled foundation for generative modeling and deconvolution of biological count data across scales and modalities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1745. Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
๐ฌ 24
โญ 5.60
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI Alignment Population-Proportional Alignment Social Choice Theory Axiomatic Framework Rank Aggregation Pluralistic Alignment Preference-based Reinforcement Learning Reinforcement Learning from Human Feedback Nash Learning from Human Feedback Large Language Model
Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups and susceptible to strategic manipulation. To address this issue, we develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Grounded in social choice theory, our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly-introduced axioms of population-proportional alignment and population-bounded manipulability. Moreover, we propose a soft-max relaxation method that smoothly trades off population-proportional alignment against the selection of the Condorcet winner (which beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large language model alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1746. ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
๐ฌ 13
โญ 5.60
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Model Video Large Language Model Nature Science Benchmark
Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1747. Semi-Parametric Contextual Pricing with General Smoothness
๐ฌ 17
โญ 5.60
๐ learning theory
๐ท๏ธ Contextual pricing; online learning; semi-parametric models
We study the contextual pricing problem, where in each round a seller observes a context, sets a price, and receives a binary purchase signal. We adopt a semi-parametric model in which the demand follows a linear parametric form composed with an unknown link function from a $\beta$-Hölder class. Prior work established regret rates of $\tilde{\mathcal{O}}(T^{2/3})$ for $\beta=1$ and $\tilde{\mathcal{O}}(T^{3/5})$ for $\beta=2$. Under a uni-modality condition, we propose a unified algorithm that combines the stationary subroutine of Wang & Chen (2025) with local polynomial regression, achieving the general rate $\tilde{\mathcal{O}}(T^{\frac{\beta+1}{2\beta+1}})$ for all $\beta \ge 1$. This recovers and strengthens existing results, while also addressing a gap in the prior analysis for $\beta=2$. Our analysis develops tighter semi-parametric confidence regions, removes derivative lower bound assumptions from earlier work, and offers a sharper exploration-exploitation trade-off. These insights not only extend theoretical guarantees to general $\beta$ but also improve practical performance by reducing the need for long forced-exploration phases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1748. Co-occurring Associated REtained concepts in Diffusion Unlearning
๐ฌ 20
โญ 5.60
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ unlearning diffusion concept erasure safety
Unlearning has emerged as a key technique to mitigate harmful content generation in diffusion models. However, existing methods often remove not only the target concept, but also benign co-occurring concepts. Unlearning nudity can unintentionally suppress the concept of person, preventing a model from generating images containing a person. We define these undesirably suppressed co-occurring concepts that must be preserved as $\textbf{CARE}$ ($\textbf{C}$o-occurring $\textbf{A}$ssociated $\textbf{RE}$tained concepts). Then, we introduce the $\textbf{CARE score}$, a general metric that directly quantifies their preservation across unlearning tasks. With this foundation, we propose $\textbf{ReCARE}$ ($\textbf{R}$obust $\textbf{e}$rasure for $\textbf{CARE}$), a framework that explicitly safeguards CARE while erasing only the target concept. ReCARE automatically constructs the CARE-set, a curated vocabulary of benign co-occurring tokens extracted from target images, and leverages this vocabulary during training for stable unlearning. Extensive experiments across various target concepts ($\textit{Nudity}$, $\textit{Van Gogh}$ style, and $\textit{Tench}$ object) demonstrate that ReCARE achieves overall state-of-the-art performance in balancing robust concept erasure, overall utility, and CARE preservation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1749. DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
๐ฌ 26
โญ 5.60
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Safety Jailbreak Attack
The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: *intra-step* and *inter-step* dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose **DiffuGuard**, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: **Stochastic Annealing Remasking** dynamically introduces controlled randomness to mitigate greedy selection bias, while **Block-level Audit and Repair** exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from **47.9%** to **14.7%** while preserving model utility and efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1750. EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
๐ฌ 21
โญ 5.60
๐ generative models
๐ท๏ธ Video Editing Content Generation Artificial Intelligence
Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1751. Implicit Models: Expressive Power Scales with Test-Time Compute
๐ฌ 25
โญ 5.60
๐ learning theory
๐ท๏ธ Implicit models Deep equilibrium models Expressive power
Implicit models, an emerging model class, compute outputs by iterating a single parameter block to a fixed point. This architecture realizes an infinite-depth, weight-tied network that trains with constant memory, significantly reducing memory needs for the same level of performance compared to explicit models. While it is empirically known that these compact models can often match or even exceed the accuracy of larger explicit networks by allocating more test-time compute, the underlying reasons are not yet well understood.
We study this gap through a non-parametric analysis of expressive power. We provide a strict mathematical characterization, showing that a simple and regular implicit operator can, through iteration, progressively express more complex mappings. We prove that for a broad class of implicit models, this process allows the model's expressive power to grow with test-time compute, ultimately matching a much richer function class. The theory is validated across four domains: imaging, scientific computing, operations research, and LLM reasoning, demonstrating that as test-time iterations increase, the complexity of the learned mapping rises, while the solution quality simultaneously improves and stabilizes.
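The following sketch illustrates the basic mechanism being analyzed: a weight-tied implicit layer iterated toward its fixed point, where more test-time iterations give a closer approximation of the infinite-depth map. The architecture and scaling are toy assumptions, not the paper's construction.

```python
import numpy as np

def implicit_layer(x, W, U, b, n_iters):
    """Weight-tied implicit model: iterate z <- tanh(W z + U x + b) toward a fixed point.
    More test-time iterations approximate the infinite-depth map more closely."""
    z = np.zeros(W.shape[0])
    for _ in range(n_iters):
        z = np.tanh(W @ z + U @ x + b)
    return z

rng = np.random.default_rng(0)
d = 16
W = 0.25 * rng.normal(size=(d, d)) / np.sqrt(d)   # small spectral norm, so the map contracts
U = rng.normal(size=(d, d)) / np.sqrt(d)
b = rng.normal(size=d)
x = rng.normal(size=d)

z_star = implicit_layer(x, W, U, b, n_iters=200)  # reference fixed point
for n in (1, 2, 4, 8, 16, 32):
    print(n, float(np.linalg.norm(implicit_layer(x, W, U, b, n) - z_star)))  # error shrinks with compute
```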
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1752. RLP: Reinforcement as a Pretraining Objective
๐ฌ 28
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Pretraining Reasoning Large Language Models
The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning---exploration---to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This approach yields a verifier-free dense reward signal, allowing for efficient training on the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid NVIDIA-Nemotron-Nano-12B-v2-Base increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
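The reward described above admits a compact sketch: the reward of a sampled reasoning chain is the gain in log-likelihood of the observed continuation when the chain is added to the context. How the per-token log-probabilities are obtained is model-specific and assumed here; the numbers below are toy values.

```python
import numpy as np

def rlp_style_reward(logp_with_cot, logp_without_cot):
    """Verifier-free dense reward for a sampled reasoning chain: the log-likelihood gain
    of the observed future tokens when the chain is added to the context, summed over
    the predicted span.  Both inputs are per-token log-probabilities of the same
    ground-truth continuation (how they are computed is model-specific)."""
    return float(np.sum(logp_with_cot) - np.sum(logp_without_cot))

# Toy values: the reasoning chain makes the next four tokens noticeably more likely.
logp_with = np.log([0.60, 0.50, 0.70, 0.40])
logp_without = np.log([0.30, 0.20, 0.35, 0.25])
print(rlp_style_reward(logp_with, logp_without))  # positive: the chain was informative
```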
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1753. R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability
๐ฌ 31
โญ 5.60
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ pursuit-evasion game partial observability dynamic programming belief preservation reinforcement learning real-time pursuit strategy worst-case robustness
Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader's position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers' actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader's possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.
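As a rough illustration of belief preservation over evader positions, the sketch below propagates a belief vector through a graph's adjacency structure and then zeroes out positions the pursuers observe to be empty. The update rule and graph are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def update_belief(belief, adjacency, observed_empty):
    """One belief-preservation step over evader positions:
    1) predict: the evader may stay put or move along any edge;
    2) correct: zero out nodes the pursuers currently observe to be empty;
    3) renormalize so the belief remains a distribution over positions."""
    n = len(belief)
    transition = adjacency + np.eye(n)
    transition /= transition.sum(axis=1, keepdims=True)
    predicted = belief @ transition
    predicted[observed_empty] = 0.0
    total = predicted.sum()
    return predicted / total if total > 0 else predicted

# Path graph 0-1-2-3: evader believed to be at node 0, node 1 observed to be empty.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
belief = np.array([1.0, 0.0, 0.0, 0.0])
print(update_belief(belief, A, observed_empty=[1]))  # all mass stays on node 0
```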
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1754. ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning
๐ฌ 31
โญ 5.60
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Supervised Fine-tuning Data Selection
Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot well preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose **ssToken**, a **S**elf-modulated and **S**emantic-aware **Token** Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior works. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection and providing complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration, ssToken, achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency.
Source code is available at https://anonymous.4open.science/r/Submission2116-B7C5.
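A minimal sketch of the kind of token scoring described above: blend the per-token loss gap between a frozen history model and the current model with an attention-derived importance score, then keep the top fraction of tokens. The blending weight and numbers are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def select_tokens(loss_history, loss_current, attn_importance, keep_ratio=0.5, lam=0.5):
    """Token score = (history-model loss - current-model loss), i.e. how much the current
    model has yet to learn on each token, blended with an attention-derived importance
    score; keep the top fraction of tokens for the fine-tuning loss."""
    score = (loss_history - loss_current) + lam * attn_importance
    k = max(1, int(keep_ratio * len(score)))
    return np.argsort(score)[::-1][:k]

loss_hist = np.array([2.1, 0.3, 1.8, 0.9, 2.5])
loss_curr = np.array([1.0, 0.3, 1.7, 0.2, 1.1])
attn_imp  = np.array([0.2, 0.9, 0.1, 0.4, 0.3])
print(sorted(select_tokens(loss_hist, loss_curr, attn_imp).tolist()))  # indices of kept tokens
```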
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1755. Random Spiking Neural Networks are Stable and Spectrally Simple
๐ฌ 18
โญ 5.60
๐ learning theory
๐ท๏ธ Spiking Neural Networks Stability Simplicity Bias Random Networks
Spiking neural networks (SNNs) are a promising paradigm for energy-efficient computation, yet their theoretical foundations, especially regarding stability and robustness, remain limited compared to artificial neural networks. In this work, we study discrete-time leaky integrate-and-fire (LIF) SNNs through the lens of Boolean function analysis. We focus on noise sensitivity and stability in classification tasks, quantifying how input perturbations affect outputs. Our main result shows that wide LIF-SNN classifiers are stable on average, a property explained by the concentration of their Fourier spectrum on low-frequency components. Motivated by this, we introduce the notion of *spectral simplicity*, which formalizes simplicity in terms of Fourier spectrum concentration and connects our analysis to the *simplicity bias* observed in deep networks. Within this framework, we show that random LIF-SNNs are biased toward simple functions. Experiments on trained networks confirm that these stability properties persist in practice. Together, these results provide new insights into the stability and robustness properties of SNNs.
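Noise sensitivity, the central quantity above, can be estimated empirically by flipping input bits with a small probability and measuring how often the predicted class changes. The classifier below is a toy stand-in, not an LIF-SNN.

```python
import numpy as np

def noise_sensitivity(classifier, inputs, flip_prob, rng):
    """Monte Carlo estimate of Boolean noise sensitivity: flip each +/-1 input bit
    independently with probability flip_prob and report how often the predicted
    class changes."""
    flips = rng.random(inputs.shape) < flip_prob
    perturbed = np.where(flips, -inputs, inputs)
    return float(np.mean(classifier(inputs) != classifier(perturbed)))

clf = lambda X: np.sign(X.sum(axis=1))           # a "low-frequency" majority-style function
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(10_000, 65))   # odd width so the sum is never zero
print(noise_sensitivity(clf, X, flip_prob=0.05, rng=rng))
```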
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1756. Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks
๐ฌ 18
โญ 5.60
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ speech separation speech enhancement deep learning early exit dynamic neural networks
In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget, and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use cases, we design a neural network architecture for speech separation and enhancement capable of early exit, and we propose an uncertainty-aware probabilistic framework to jointly model the clean speech signal and error variance, which we use to derive probabilistic early-exit conditions in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks, where we demonstrate that early-exit capabilities can be introduced without compromising reconstruction, and that our early-exit conditions are well-calibrated on training data and can easily be post-calibrated on validation data, leading to large energy savings when using early exits over single-exit baselines. Our framework enables fine-grained dynamic compute-scaling of neural networks while achieving state-of-the-art performance and interpretable exit conditions.
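As a rough sketch of an SNR-based exit rule of the kind described above, one can turn the predicted clean signal and predicted error variance into an estimated output SNR and stop as soon as it reaches the requested target. The formula and numbers are illustrative assumptions, not the paper's calibrated probabilistic conditions.

```python
import numpy as np

def estimated_snr_db(pred_signal, pred_error_var):
    """Estimated output SNR in dB from the predicted clean signal and the predicted
    per-sample error variance."""
    signal_power = float(np.mean(pred_signal ** 2))
    noise_power = float(np.mean(pred_error_var))
    return 10.0 * np.log10(signal_power / noise_power)

def should_exit(pred_signal, pred_error_var, target_snr_db):
    """Early-exit condition: stop once the estimated SNR reaches the requested target."""
    return estimated_snr_db(pred_signal, pred_error_var) >= target_snr_db

# Toy: a deeper exit predicts the same signal with lower estimated error variance.
signal = np.sin(np.linspace(0.0, 20.0, 16000))
print(should_exit(signal, np.full_like(signal, 0.05), target_snr_db=12.0))  # False: keep computing
print(should_exit(signal, np.full_like(signal, 0.01), target_snr_db=12.0))  # True: exit here
```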
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1757. RewardEval: Advancing Reward Model Evaluation
๐ฌ 8
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ reward models benchmark evaluation post-training reinforcement learning from human feedback
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and other domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardEval, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points lower on average on RewardEval than on RewardBench, a widely-used existing reward model evaluation -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardEval sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying and providing new insights on how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1758. LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Image Editing; Face Editing; Identity Preservation; Landmark-tokenized
Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapped positional encoding and a landmark-aware classifier-free guidance that jointly facilitate flexible yet decoupled interactions among instruction, geometry, and appearance, enabling strong identity preservation; and (3) a landmark predictor that leverages vision-language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency. Code and dataset will be made publicly available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1759. Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM LLM Agent Prompt Evolving
The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. Prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy; they seldom enforce paraphrase invariance or search stability, and therefore cannot remedy this brittleness in practice. We identify this brittleness as the **textual sharpness** of the **prompt landscape**. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box (API-only), requiring no gradients to update the model's parameters. We then introduce **TARE** (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose **ATARE**, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. We evaluate our methods on diverse tasks; their design for minimizing the textual sharpness gap leads to prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical. The code is available for anonymous access at https://anonymous.4open.science/r/ATARE_TARE/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1760. Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Large Language Models Generative Models Post Training Chain of Thought
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks.
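One simple way to picture the proposed attenuation is to scale each token's advantage by its policy probability, so that rare tokens with very large gradient magnitudes contribute less to the update. The exponent and values below are illustrative; the paper's Advantage Reweighting and Lopti may differ in detail.

```python
import numpy as np

def reweight_advantages(advantages, token_probs, alpha=1.0):
    """Scale each token's advantage by its policy probability (raised to alpha), so rare
    tokens with very large policy-gradient magnitudes no longer dominate the update."""
    return advantages * token_probs ** alpha

advantages = np.array([1.0, 1.0, 1.0, 1.0])
token_probs = np.array([0.90, 0.60, 0.05, 0.01])     # the last two are low-probability tokens
print(reweight_advantages(advantages, token_probs))  # their contribution is strongly attenuated
```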
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1761. Ego-Foresight: Self-supervised Learning of Agent-Aware Representations for Improved RL
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Agent-awareness Reinforcement Learning Self-supervised Learning Robotics
Despite the significant advances in Deep Reinforcement Learning (RL) observed in the last decade, the amount of training experience necessary to learn effective policies remains one of the primary concerns in both simulated and real environments. Looking to solve this issue, previous work has shown that improved efficiency can be achieved by separately modeling the agent and environment, but usually requires a supervisory signal. In contrast to RL, humans can perfect a new skill from a small number of trials and often do so without a supervisory signal, making neuroscientific studies of human development a valuable source of inspiration for RL. In particular, we explore the idea of motor prediction, which states that humans develop an internal model of themselves and of the consequences that their motor commands have on the immediate sensory inputs. Our insight is that the movement of the agent provides a cue that allows the duality between the agent and environment to be learned. To instantiate this idea, we present Ego-Foresight (EF), a self-supervised method for disentangling agent information based on motion and prediction. Our main finding is that, when used as an auxiliary task in feature learning, self-supervised agent-awareness improves the sample-efficiency and performance of the underlying RL algorithm. To test our approach, we study the ability of EF to predict agent movement and disentangle agent information. Then, we integrate EF with both model-free and model-based RL algorithms to solve simulated control tasks, showing improved sample-efficiency and performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1762. Physics vs Distributions: Pareto Optimal Flow Matching with Physics Constraints
๐ฌ 16
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Flow Matching Physics PDE Diffusion Models
Physics-constrained generative modeling aims to produce high-dimensional samples that are both physically consistent and distributionally accurate, a task that remains challenging due to often conflicting optimization objectives. Recent advances in flow matching and diffusion models have enabled efficient generative modeling, but integrating physical constraints often degrades generative fidelity or requires costly inference-time corrections. Our work is the first to recognize the trade-off between distributional and physical accuracy. Based on the insight of inherently conflicting objectives, we introduce a method that enforces physical constraints at training time using conflict-free gradient updates and unrolling to mitigate Jensen's gap. Our approach avoids manual loss balancing and enables simultaneous optimization of generative and physical objectives. As a consequence, physics constraints do not impede inference performance. We evaluate our method, PBFM, on three representative PDE benchmarks. PBFM achieves a Pareto-optimal trade-off and competitive inference speed, and generalizes to a wide range of physics-constrained generative tasks, providing a practical tool for scientific machine learning.
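Conflict-free gradient updates can be sketched with a PCGrad-style projection: when the generative and physics gradients point in conflicting directions, each is projected onto the other's normal plane before summing. This is a generic illustration under that assumption, not necessarily the paper's exact update rule.

```python
import numpy as np

def conflict_free_update(g_gen, g_phys):
    """PCGrad-style resolution between the generative (flow-matching) gradient and the
    physics-constraint gradient: if they conflict (negative inner product), project each
    onto the normal plane of the other before summing."""
    g1, g2 = g_gen.astype(float).copy(), g_phys.astype(float).copy()
    if g1 @ g2 < 0:
        g1 = g_gen - (g_gen @ g_phys) / (g_phys @ g_phys) * g_phys
        g2 = g_phys - (g_phys @ g_gen) / (g_gen @ g_gen) * g_gen
    return g1 + g2

g_gen = np.array([1.0, 0.5])
g_phys = np.array([-1.0, 0.5])              # conflicting direction (negative inner product)
print(conflict_free_update(g_gen, g_phys))  # combined update with the conflict removed
```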
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1763. Constrained Decoding of Diffusion LLMs with Context-Free Grammars
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ diffusion llm constrained decoding llm code generation json multi-region infilling fill in the middle code synthesis
Large language models (LLMs) have shown promising performance across diverse domains. Many practical applications of LLMs, such as code completion and structured data extraction, require adherence to syntactic constraints specified by a formal language. Yet, due to their probabilistic nature, LLM output is not guaranteed to adhere to such formal languages. To address this, prior work has proposed constrained decoding to restrict LLM generation to particular formal languages. However, existing works are not applicable to the emerging paradigm of diffusion LLMs, as this requires supporting token generation in arbitrary order instead of the traditional left-to-right order. In this paper, we address this challenge and present the first constrained decoding method for diffusion models, one that can handle formal languages captured by context-free grammars. We begin by reducing constrained decoding to the more general additive infilling problem, which asks whether a partial output with holes can be completed to a valid word in the target language. This problem also naturally subsumes the previously unaddressed multi-region infilling constrained decoding. We then reduce this problem to the task of deciding whether the intersection of the target language and a regular language is empty and present an efficient algorithm to solve this task for context-free languages. Empirical results on various applications, such as C++ code infilling and structured data extraction in JSON, demonstrate that our method achieves near-perfect syntactic correctness while consistently preserving or improving functional correctness. Importantly, our efficiency optimizations ensure that the computational overhead remains practical.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1764. BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving
๐ฌ 21
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Diffusion policy closed-loop planning end-to-end autonomous driving
Diffusion-based planners have shown great promise for autonomous driving due to their ability to capture multi-modal driving behaviors. However, guiding these models effectively in reactive, closed-loop environments remains a significant challenge. Simple conditioning often fails to provide sufficient guidance in complex and dynamic driving scenarios. Recent work attempts to use typical expert driving behaviors (i.e., anchors) to guide diffusion models but relies on a truncated schedule, which introduces theoretical inconsistencies and can compromise performance. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach provides a principled diffusion framework that effectively translates anchors into fine-grained trajectory plans, appropriately responding to varying traffic conditions. Our planner is compatible with efficient ODE solvers, a critical factor for real-time autonomous driving deployment. We achieve state-of-the-art performance on the Bench2Drive benchmark, improving the success rate by 7.72\% over prior art.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1765. Learning Concept Bottleneck Models from Mechanistic Explanations
๐ฌ 15
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ interpretability concept bottleneck models computer vision explainable ai
Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterparts when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision-level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M-CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at https://anonymous.4open.science/r/M-CBM-85D9.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1766. Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Retrieval VisionโLanguage Models Hypernetworks Style Adaptation Contrastive Learning
Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision-language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($\Delta S$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
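The singular-value modulation idea can be sketched in a few lines: decompose a weight matrix once, then add a small per-input perturbation to its singular values. In Hystar the perturbation would come from a hypernetwork conditioned on the query's style; here it is just a random placeholder.

```python
import numpy as np

def svd_modulated_weight(W, delta_s):
    """Decompose W = U diag(S) V^T, then apply a per-input singular-value perturbation
    delta_s.  In a Hystar-like setup delta_s would be produced by a hypernetwork
    conditioned on the query's style; here it is a random placeholder."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(S + delta_s) @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))            # e.g. an attention projection matrix
delta_s = 0.05 * rng.normal(size=8)    # small, style-conditioned perturbation (placeholder)
W_adapted = svd_modulated_weight(W, delta_s)
print(float(np.linalg.norm(W_adapted - W)))  # small change, structured along singular directions
```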
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1767. Learning from the Electronic Structure of Molecules across the Periodic Table
๐ฌ 13
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Interatomic potentials electronic structure materials science
Machine-Learned Interatomic Potentials (MLIPs) require vast amounts of atomic structure data to learn forces and energies, and their performance continues to improve with training set size. Meanwhile, the even greater quantities of accompanying data in the Hamiltonian matrix $\mathbf{H}$ behind these datasets have so far gone unused for this purpose. Here, we provide a recipe for integrating the orbital interaction data within $\mathbf{H}$ into training pipelines for atomic-level properties. We first introduce HELM ('Hamiltonian-trained Electronic-structure Learning for Molecules'), a state-of-the-art Hamiltonian prediction model which bridges the gap between Hamiltonian prediction and universal MLIPs by scaling to $\mathbf{H}$ of structures with 100+ atoms, high elemental diversity, and large basis sets including diffuse functions. To accompany HELM, we release a curated Hamiltonian matrix dataset, 'OMol\_CSH\_58k', with unprecedented elemental diversity (58 elements), molecular size (up to 150 atoms), and basis set (def2-TZVPD). Finally, we introduce 'Hamiltonian pretraining' as a method to extract meaningful descriptors of atomic environments even from a limited number of atomic structures, and repurpose this shared embedding space to improve performance on energy prediction in low-data regimes. Our results highlight the use of electronic interactions as a rich and transferable data source for representing chemical space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1768. Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Reinforcement Learning GFlowNets Robust RL Regularized RL Generative Models Scientific Discovery
A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Second, we offer a novel, robust RL perspective on this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show that it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks.
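Mellowmax is one of the standard soft operators that the unified framework generalizes; the sketch below shows how its temperature interpolates between a mean-like and a max-like (peakier) target. The trajectory-level aggregation used by TGM is not reproduced here.

```python
import numpy as np

def mellowmax(values, omega):
    """Mellowmax soft operator: (1/omega) * log(mean(exp(omega * v))).  Small omega behaves
    like the mean; large omega approaches the hard max, giving a peakier target."""
    v = omega * np.asarray(values, dtype=float)
    m = np.max(v)
    return (np.log(np.mean(np.exp(v - m))) + m) / omega   # shifted for numerical stability

rewards = [0.1, 0.2, 0.9]
for omega in (0.1, 1.0, 10.0, 100.0):
    print(omega, round(mellowmax(rewards, omega), 3))     # moves from near the mean toward the max
```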
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1769. LLM DNA: Tracing Model Evolution via Functional Representations
๐ฌ 27
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Large Language Model Representations Fingerprint Embedding Evolution
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining *LLM DNA* as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies *inheritance* and *genetic determinism* and establish its existence. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1770. RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
๐ฌ 21
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ planning repository generation agent code generation
Large language models excel at generating individual functions or single files of code, yet generating complete repositories from scratch remains a fundamental challenge. This capability is key to building coherent software systems from high-level specifications and realizing the full potential of automated code generation. The process requires planning at two levels: deciding what features and modules to build (proposal stage) and defining their implementation details (implementation stage). Current approaches rely on natural language planning, which often produces unclear specifications, misaligned components, and brittle designs due to its inherent ambiguity and lack of structure. To address these limitations, we introduce the Repository Planning Graph (RPG), a structured representation that encodes capabilities, file structures, data flows, and functions in a unified graph. By replacing free-form natural language with an explicit blueprint, RPG enables consistent long-horizon planning for repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework that operates in three stages: proposal-level planning, implementation-level construction, and graph-guided code generation with test validation. To evaluate, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces nearly 36K lines of code and 445K code tokens, on average 3.9× larger than the strongest baseline (Claude Code), and 68× larger than others. It also achieves 81.5% coverage and 69.7% test accuracy, improving over Claude Code by 27.3 and 35.8 points. Further analysis shows that RPG models complex dependencies, enables more sophisticated planning through near-linear scaling, and improves agent understanding of repositories, thus accelerating localization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1771. Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
๐ฌ 10
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Medical benchmark LLM evaluation LLM sycophancy Medical agent Adversarial generation
Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions.
However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details.
In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients.
While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making.
To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions.
On this benchmark, no frontier LLM---including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet---corrects these false presuppositions more than $43\%$ of the time.
To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions.
We find typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks.
These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1772. G-Merging: Graph Models Merging for Parameter-Efficient Multi-Task Knowledge Consolidation
๐ฌ 22
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Model Merging Parameter Efficient Fine-Tuning Multi-task Learning
The pretrain-finetuning paradigm has achieved notable success in graph learning. Moreover, merging models fine-tuned on different tasks to enable a parameter-efficient model with multi-task capabilities is gaining increasing attention for its practicality. However, existing model merging methods, such as weight averaging and task arithmetic, struggle to generalize well to graph structures and Graph Neural Network (GNN) models due to the unique structural heterogeneity of graph data. In this paper, we propose an innovative graph model merging framework called G-Merging for merging multiple task-specific fine-tuned GNN models. G-Merging first employs task arithmetic to coarsely merge graph models, capturing shared cross-task knowledge. Second, it introduces a Topology-aware Wasserstein Distance (TWD) loss to train lightweight task adapters, preserving domain-specific graph patterns via aligning the embeddings of merged and fine-tuned models. Third, G-Merging integrates the adapters into a training-free, topology-aware router within a mixture-of-experts (MoE) architecture, dynamically routing input graphs to task-specific adapters based on structural similarity, thereby mitigating conflicts and enhancing knowledge sharing. Extensive experiments on 8 graph downstream datasets demonstrate the effectiveness of G-Merging, showing impressive performance close to or exceeding that of individual fine-tuned models while improving parameter and training efficiency. Our code is available at https://anonymous.4open.science/r/G-Merging-D515
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1773. ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction
๐ฌ 24
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Speech-to-Speech Spoken Dialogues LLM benchmark RL RLAIF GRPO
Speech-to-Speech (S2S) models have shown promising dialogue capabilities, but their ability to handle paralinguistic cues, such as emotion, tone, and speaker attributes, and to respond appropriately in both content and style remains underexplored. Progress is further hindered by the scarcity of high-quality and expressive demonstrations. To address this, we introduce a novel reinforcement learning (RL) framework for paralinguistic-aware S2S, ParaS2S, which evaluates and optimizes both content and speaking style directly at the waveform level. We first construct ParaS2SBench, a benchmark that comprehensively evaluates S2S models' outputs for content and style appropriateness on diverse and challenging input queries. It scores the fitness of input-output pairs and aligns well with human judgments, serving as an automatic judge for model outputs. With this scalable scoring feedback, we enable the model to explore and learn from diverse unlabeled speech via Group Relative Policy Optimization (GRPO). Experiments show that existing S2S models fail to respond appropriately to paralinguistic attributes, performing no better than pipeline-based baselines. Our RL-based strong baseline achieves an 11% relative improvement in the appropriateness of response content and style on ParaS2SBench over supervised fine-tuning (SFT), surpassing all prior models while requiring substantially fewer warm-up annotations than pure SFT.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1774. LiFR-Seg: Anytime High-Frame-Rate Segmentation via Event-Guided Propagation
๐ฌ 5
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Event Camera Neuromorphic Computing
Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames.
To solve this, we introduce *Anytime Interframe Semantic Segmentation*: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data.
This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes.
We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an *uncertainty-aware warping process*, guided by an event-driven motion field and its learned, explicit confidence. A *temporal memory attention* module further ensures coherence in dynamic scenarios.
We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82\% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09\%) that has full access to the target frame.
We further demonstrate superior robustness across extreme scenarios: in highly dynamic (M3ED) tests, our method closely matches the HFR baseline's performance, while in the low-light (DSEC-Night) evaluation, it even surpasses it.
This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1775. Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning
๐ฌ 11
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning fly olfactory circuit class incremental learning decorrelation
Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL's effectiveness in addressing this challenge through a biologically inspired design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1776. A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization
๐ฌ 14
โญ 5.50
๐ learning theory
๐ท๏ธ Hawkes processes kernel methods representer theorem point processes least squares loss
The representer theorem is a cornerstone of kernel methods, which aim to estimate latent functions in reproducing kernel Hilbert spaces (RKHSs) in a nonparametric manner. Its significance lies in converting inherently infinite-dimensional optimization problems into finite-dimensional ones over dual coefficients, thereby enabling practical and computationally tractable algorithms. In this paper, we address the problem of estimating the latent triggering kernels--functions that encode the interaction structure between events--for linear multivariate Hawkes processes based on observed event sequences within an RKHS framework. We show that, under the principle of penalized least squares minimization, a novel form of representer theorem emerges: a family of transformed kernels can be defined via a system of simultaneous integral equations, and the optimal estimator of each triggering kernel is expressed as a linear combination of these transformed kernels evaluated at the data points. Remarkably, the dual coefficients are all analytically fixed to unity, obviating the need to solve a costly optimization problem to obtain the dual coefficients. This leads to a highly efficient estimator capable of handling large-scale data more effectively than conventional nonparametric approaches. Empirical evaluations on synthetic datasets reveal that the proposed method attains competitive predictive accuracy while substantially improving computational efficiency over existing state-of-the-art kernel method-based estimators.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1777. MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning
๐ฌ 21
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ med-vlm multi-agent collaboration multimodal medical reasoning medical vqa reinforcement learning
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy with dynamic entropy regulation, progressively teaching the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL outperforms both open-source and proprietary Med-LVLMs. Notably, it achieves an average performance gain of 23.6\% over strong baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1778. Avey Bidirectional Architecture
๐ฌ 4
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Bidirectional models Transformer-based encoders
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attentionโs ability to deliver bidirectional contextualization with high parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1779. Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?
๐ฌ 22
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Exogenous Markov Decision Processes Regret Analysis Linear Function Approximation Exploration-free
Exogenous MDPs (Exo-MDPs) capture sequential decision-making where uncertainty comes solely from exogenous inputs that evolve independently of the learner's actions. This structure is especially common in operations research applications such as inventory control, energy storage, and resource allocation, where exogenous randomness (e.g., demand, arrivals, or prices) drives system behavior. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind: all existing regret guarantees for Exo-MDPs rely on explicit exploration or tabular assumptions.
We show that exploration is unnecessary.
We propose Pure Exploitation Learning ($\texttt{PEL}$) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. In the tabular case, $\texttt{PEL}$ achieves regret $\widetilde{O}(H^2|\Xi|\sqrt{K})$. For large, continuous endogenous state spaces, we introduce $\texttt{LSVI-PE}$, a simple linear-approximation method whose regret is polynomial in the feature dimension, exogenous state space size, and horizon, independent of the endogenous state and action spaces.
Our analysis introduces two new tools: counterfactual trajectories and Bellman-closed feature transport, which together allow greedy policies to have accurate value estimates without optimism.
Experiments on synthetic and resource-management tasks show $\texttt{PEL}$ consistently outperforming baselines. Overall, our results overturn the conventional wisdom that exploration is required, demonstrating that in Exo-MDPs, pure exploitation is enough.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1780. Enhanced Continual Learning of Vision-Language Models with Model Fusion
๐ฌ 19
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning multi-domain task incremental learning vision-language models model fusion
Vision-Language Models (VLMs) represent a significant breakthrough in artificial intelligence by integrating visual and textual modalities to achieve impressive zero-shot capabilities.
However, VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks. Existing continual learning methods for VLMs face various limitations, often relying on additional reference datasets, compromising zero-shot performance, or being restricted to parameter-efficient fine-tuning scenarios.
In this paper, we propose a novel Continual Decoupling-Unifying (ConDU) approach that pioneers the use of model fusion for continual learning in VLMs.
Specifically, ConDU maintains a unified model along with task triggers and prototype sets, employing an iterative process of decoupling task experts for previous tasks and unifying them with the task expert for the newly learned task.
Additionally, we introduce an inference strategy for zero-shot scenarios by aggregating predictions from multiple decoupled task experts.
Extensive experiments on the MTIL benchmark show that ConDU achieves up to a 2\% improvement in average performance across all seen tasks compared to state-of-the-art baselines, while also enhancing zero-shot capabilities relative to the original VLM.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1781. Horseshoe Splatting: Handling Structural Sparsity for Uncertainty-Aware Gaussian-Splatting Radiance Field Rendering
๐ฌ 20
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Bayesian Neural Network Gaussian splatting Horseshoe Prior Structural Sparsity Uncertainty
We introduce Horseshoe Splatting, a Bayesian extension of 3D Gaussian Splatting (3DGS) that jointly addresses structured sparsity in per-splat covariances and delivers calibrated uncertainty. While neural radiance fields achieve high-fidelity view synthesis and 3DGS attains real-time rendering with explicit anisotropic Gaussians, existing pipelines do not explicitly encode structural sparsity in the covariance (e.g., axis-wise variances or pairwise correlations), leaving noise-dominated components insufficiently regularized. Uncertainty is likewise essential for trustworthy and robust novel-view prediction, yet most 3DGS variants remain deterministic. We place a global-local Horseshoe prior on the covariance scales, whose spike at zero and heavy tails adaptively shrink irrelevant directions while preserving the salient structure. We fit the model with a factorized variational inference scheme that mirrors the Horseshoe's inverse-Gamma augmentation, enabling Monte Carlo rendering and pixel-wise posterior uncertainty with minimal overhead. Theoretically, we establish posterior contraction rates for the scale parameters and transfer them to the rendered image via a local Lipschitz mapping, providing guarantees that estimation error and predictive uncertainty diminish with data. Empirically, Horseshoe Splatting produces high-quality uncertainty maps while matching state-of-the-art 3DGS visual fidelity and runtime, yielding a practical, uncertainty-aware renderer that is robust to structured sparsity in the radiance field.
The code is anonymously available at https://anonymous.4open.science/r/hs-25C5/README.md.
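For readers unfamiliar with the prior, a minimal sketch of the standard global-local horseshoe form and the inverse-Gamma augmentation the abstract alludes to (notation is ours, not necessarily the paper's exact parameterization):

```latex
% Horseshoe prior on a covariance scale \sigma_i (illustrative notation)
\[
  \sigma_i \mid \lambda_i, \tau \sim \mathcal{N}\!\bigl(0,\; \lambda_i^{2}\tau^{2}\bigr), \qquad
  \lambda_i \sim \mathrm{C}^{+}(0,1), \qquad
  \tau \sim \mathrm{C}^{+}(0,\tau_0)
\]
% Inverse-Gamma augmentation of the half-Cauchy scales, which is what makes a
% factorized variational treatment convenient:
\[
  \lambda_i^{2} \mid \nu_i \sim \mathrm{Inv\text{-}Gamma}\!\bigl(\tfrac12,\; \nu_i^{-1}\bigr), \qquad
  \nu_i \sim \mathrm{Inv\text{-}Gamma}\!\bigl(\tfrac12,\; 1\bigr)
\]
```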
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1782. SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Retrieval-Augmented Generation (RAG) Long-Document QA Adaptive Chunking Efficient Information Retrieval GRPO Reinforcement learning
Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static chunking and flat retrieval: documents are split into short, predetermined, fixed-size chunks, embeddings are retrieved uniformly, and generation relies on whatever chunks are returned. This design brings challenges, as retrieval quality is highly sensitive to chunk size, often introduces noise from irrelevant or misleading chunks, and scales poorly to large corpora. We present SmartChunk retrieval, a query-adaptive framework for efficient and robust long-document question answering (QA). SmartChunk uses (i) a planner that predicts the optimal chunk abstraction level for each query, and (ii) a lightweight compression module that produces high-level chunk embeddings without repeated summarization. By adapting retrieval granularity on the fly, SmartChunk balances accuracy with efficiency and avoids the drawbacks of fixed strategies. Notably, our planner can reason about chunk abstractions through a novel reinforcement learning scheme, STITCH, which boosts accuracy and generalization. To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset. Across these evaluations, SmartChunk outperforms state-of-the-art RAG baselines, while reducing cost. Further analysis demonstrates strong scalability with larger corpora and consistent gains on out-of-domain datasets, highlighting its effectiveness as a general framework for adaptive retrieval.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1783. Comparing the learning dynamics of in-context learning and fine-tuning in language models
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ in-context learning supervised fine-tuning inductive biases learning dynamics
Pretrained language models can acquire novel tasks either through in-context learning (ICL)---adapting behavior via activations without weight updates---or through supervised fine-tuning (SFT), where parameters are explicitly updated. Prior work has reported differences in their generalization performance and inductive biases, but the origins of these differences remain poorly understood. In this work, we treat ICL and SFT as distinct learning algorithms and directly compare the learning dynamics they induce across medium-sized models, analyzing both the evolution of their inductive biases and the underlying internal representations. We find that ICL preserves rich input representations but imposes stronger priors inherited from pretraining, whereas SFT suppresses task-irrelevant features---potentially explaining its weaker generalization in few-shot regimes. These results highlight a mechanistic distinction between context-driven and weight-driven learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1784. VeriTrail: Closed-Domain Hallucination Detection with Traceability
๐ฌ 11
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ hallucination detection faithfulness fact-checking traceability provenance error localization
Even when instructed to adhere to source material, language models often generate unsubstantiated content, a phenomenon known as "closed-domain hallucination." This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs' faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1785. Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Sampler model uncertainty LLM reasoning min-p calibration chain-of-thought self-consistency
Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires two competing objectives: to inject enough stochasticity to explore multiple reasoning chains, and to ensure sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by *correctness*, not confidence alone. We should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: **Greedy-Threshold** makes sampling greedy at very low confidence steps. **Calibrated-TopK** and **Calibrated-ε** set the truncation threshold based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty, showing consistent gains across math and general reasoning benchmarks.
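A minimal sketch of how two of these rules could look at a single decoding step, under our reading of the abstract; the confidence threshold and the confidence-to-k lookup below are placeholder assumptions, not the authors' calibrated values:

```python
import numpy as np

def greedy_threshold_step(probs, conf_threshold=0.15, rng=None):
    """Greedy-Threshold (illustrative): when even the top token has low
    probability, decode greedily instead of sampling, since exploration at
    such steps tends to hurt correctness."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    if probs.max() < conf_threshold:
        return int(probs.argmax())                 # low-confidence step -> greedy
    return int(rng.choice(len(probs), p=probs))    # otherwise sample normally

def calibrated_topk_step(probs, k_by_conf_bucket, rng=None):
    """Calibrated-TopK (illustrative): pick the truncation size k from a lookup
    keyed by a coarse confidence bucket, standing in for the paper's rank-wise
    correctness estimates."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    bucket = min(int(probs.max() * 10), 9)         # bucket the top-1 probability into 0..9
    k = min(k_by_conf_bucket[bucket], len(probs))
    top = np.argsort(probs)[::-1][:k]
    renorm = probs[top] / probs[top].sum()
    return int(top[rng.choice(len(top), p=renorm)])
```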
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1786. Feature compression is the root cause of adversarial fragility in neural networks
๐ฌ 17
โญ 5.50
๐ learning theory
๐ท๏ธ Adversarial learning Deep neural network Robust learning
In this paper, we uniquely study the adversarial robustness of deep neural networks (NNs) for classification tasks relative to that of optimal classifiers. We look at the smallest magnitude of possible additive perturbations that can change a classifier's output. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural networks for classification. In particular, our theoretical results show that a neural network's adversarial robustness can degrade as the input dimension $d$ increases. Analytically, we show that neural networks' adversarial robustness can be only $1/\sqrt{d}$ of the best possible adversarial robustness of optimal classifiers. Our theories match remarkably well with numerical experiments on practically trained NNs, including NNs for ImageNet images. The matrix-theoretic explanation is consistent with an earlier information-theoretic, feature-compression-based explanation of the adversarial fragility of neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1787. MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Classification Missing Modality Parameter-Efficient Fine-Tuning
Pre-trained vision language models have shown remarkable performance on visual recognition tasks, but they typically assume the availability of complete multimodal inputs during both training and inference. In real-world scenarios, however, modalities may be missing due to privacy constraints, collection difficulties, or resource limitations. While previous approaches have addressed this challenge using prompt learning techniques, they fail to capture the cross-modal relationships necessary for effective multimodal visual recognition and suffer from inevitable computational overhead. In this paper, we introduce MoRA, a parameter-efficient fine-tuning method that explicitly models cross-modal interactions while maintaining modality-specific adaptations. MoRA introduces modality-common parameters between text and vision encoders, enabling bidirectional knowledge transfer. Additionally, combined with the modality-specific parameters, MoRA allows the backbone model to maintain inter-modality interaction and enable intra-modality flexibility. Extensive experiments on standard benchmarks demonstrate that MoRA achieves an average performance improvement in missing-modality scenarios by 5.24% and uses only 25.90% of the inference time compared to the SOTA method while requiring only 0.11% of trainable parameters compared to full fine-tuning. The code is available at https://anonymous.4open.science/r/mora-20667.
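A rough sketch of the modality-common plus modality-specific low-rank idea as we read it from the abstract; the residual form, rank, and placement are our assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class MoRALikeAdapter(nn.Module):
    """Low-rank adapter with a factor shared across the text and vision encoders
    (modality-common) and per-modality factors (modality-specific). Purely
    illustrative of the decomposition described in the abstract."""

    def __init__(self, dim: int, rank: int = 8, modalities=("text", "vision")):
        super().__init__()
        self.common = nn.Parameter(torch.randn(dim, rank) * 0.01)   # shared down-projection
        self.specific = nn.ParameterDict(                            # per-modality up-projection
            {m: nn.Parameter(torch.zeros(rank, dim)) for m in modalities}
        )

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # x: (batch, seq, dim); residual low-rank update routed by modality
        return x + x @ self.common @ self.specific[modality]
```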
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1788. AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints
๐ฌ 14
โญ 5.50
๐ reinforcement learning
๐ท๏ธ llm RL tool use auto think
Tool use represents a critical capability for AI agents, with recent advances focusing on leveraging reinforcement learning (RL) for test-time scaling to achieve better performance through more deliberate reasoning.
However, there are some key challenges in current RL-based scaling approaches:
(a) direct RL training often struggles to scale up thinking length sufficiently to solve complex problems,
and (b) scaled-up models tend to overthink simpler problems, resulting in substantial token inefficiency.
To address these challenges, we propose a novel training paradigm that first employs warm-up supervised fine-tuning to help models distinguish between simple and complex problems, followed by RL that enables models to automatically determine appropriate reasoning trajectories.
Furthermore, to tackle the issue of automatic thinking-length scaling, we discover that entropy-based optimization objectives effectively maintain model diversity while successfully unlocking the model's scaling capabilities.
Based on this insight, we introduce an entropy-based long-short reasoning fusion RL strategy.
Our experiments on three benchmarks demonstrate that the model successfully achieves auto-scaling for efficient tool use, delivering a significant 9.8\% accuracy improvement while reducing computational overhead by roughly 81\%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1789. MAGO: Beyond Fixed Hyperparameters with Multi-Objective Pareto Optimization for Hybrid LLM Reasoning
๐ฌ 16
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Multi-objective optimization Pareto optimization Large language models Hybrid reasoning Chain-of-thought reasoning Reinforcement learning
Large language models (LLMs) with advanced step-by-step reasoning capabilities have achieved remarkable performance in complex problem-solving through chain-of-thought (CoT) reasoning. However, uniformly applying elaborate reasoning to all queries creates substantial computational inefficiency, as many problems can be solved directly without extended reasoning chains. Current hybrid reasoning approaches rely on static hyperparameters and heuristic single-objective optimization, leading to suboptimal trade-offs and poor adaptation to varying task complexities. To address these limitations, we propose a multi-objective adaptive generation optimization (MAGO) framework, which integrates multi-objective optimization with dynamic adaptive weighting into hybrid reasoning. MAGO optimizes three competing objectives simultaneously: accuracy (maintaining solution correctness), efficiency (minimizing computational costs through appropriate mode selection), and calibration (ensuring mode selection aligns with model capabilities). The framework employs Pareto frontier maintenance with correlation-aware optimization to automatically explore the full trade-off space, avoiding the spatial constraints that limit fixed-weight approaches to narrow cone-shaped regions of the objective space. Unlike existing methods requiring manual hyperparameter tuning, MAGO's Pareto optimization dynamically adapts weights based on task complexity and training progress, achieving principled and adaptive decision-making across varying problem complexities. Comprehensive evaluation on mathematical reasoning benchmarks including AIME, Minerva Algebra, MATH-500, and GSM-8K shows $2.2\times$ to $3\times$ token-efficiency gains and relative accuracy improvements of $0.6\%$ to $9.4\%$ over heuristic baselines, while remaining competitive with the strongest task-specific models. Additional experiments on CommonsenseQA and MedQA further confirm the framework's generalizability beyond mathematics, achieving $1$ to $2\%$ higher accuracy and approximately $2\times$ efficiency improvement without additional fine-tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1790. Tequila: Deadzone-free Ternary Quantization for Large Language Models
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Ternary Quantization Large Language Models Edge Computing
Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical on such hardware. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as _**deadzone trapping**: a large number of weights are trapped at the deadzone boundary._ This occurs because these weights receive only noisy, less informative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose **Tequila**, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly _zero_ inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves a $>4$% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within a $<1$% gap) with a $3.0\times$ inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at \url{https://anonymous.4open.science/r/Tequila-2B5F/}
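For context, below is a plain deadzone ternary quantizer of the kind Tequila improves on, with the boundary-trapped weights flagged. The 0.7·mean|w| threshold is a common heuristic assumed here for illustration; the bias-repurposing step that defines Tequila itself is not reproduced.

```python
import torch

def deadzone_ternary_quantize(w: torch.Tensor, delta_scale: float = 0.7):
    """Baseline ternary quantization: weights inside the deadzone (|w| <= delta)
    map to 0, the rest to {-1, +1} times a per-tensor scale. Also flags weights
    hovering near the deadzone boundary, the population Tequila repurposes as
    dynamic biases."""
    delta = delta_scale * w.abs().mean()
    keep = (w.abs() > delta).float()
    scale = (w.abs() * keep).sum() / keep.sum().clamp(min=1.0)
    q = torch.sign(w) * keep                           # ternary codes in {-1, 0, +1}
    trapped = (w.abs() - delta).abs() < 0.05 * delta   # near-boundary ("trapped") weights
    return q * scale, trapped
```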
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1791. Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
๐ฌ 16
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multi-modal LLM Intuitive Physics
While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning.
In this work, we investigate the first step of physical reasoning, i.e., **intuitive physics understanding**, revealing substantial limitations in understanding the dynamics of continuum objects.
To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks.
To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework.
SDF substantially improves performance, achieving up to $20.7\%$ gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost-efficient approach for developing more physically grounded MLLMs. Our code and data will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1792. MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Captioning Multimodal Large Language Model
Generalist visual captioning goes beyond simple appearance description: it requires integrating a series of visual cues into a caption and handling diverse visual domains. On this task, current open-source models show a large performance gap relative to commercial ones, which limits applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5\% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves captioning capabilities comparable to commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution. Our source code and models will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1793. Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?
๐ฌ 9
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Time Series Foundation Models Calibration Confidence
The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications. In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.
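As a concrete example of the kind of calibration check described (our own illustrative metric, not the paper's exact protocol), one can compare the empirical coverage of a central prediction interval with its nominal level:

```python
import numpy as np

def interval_calibration_gap(y_true, q_low, q_high, nominal=0.8):
    """Empirical coverage of a central prediction interval minus its nominal
    level. A clearly negative gap indicates overconfidence (intervals too
    narrow); a clearly positive gap indicates underconfidence."""
    y_true, q_low, q_high = map(np.asarray, (y_true, q_low, q_high))
    coverage = float(np.mean((y_true >= q_low) & (y_true <= q_high)))
    return coverage, coverage - nominal
```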
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1794. NeMo-map: Neural Implicit Flow Fields for Spatio-Temporal Motion Mapping
๐ฌ 13
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Neural Implicit Representation Human Motion Representation Maps of Dynamics
Safe and efficient robot operation in complex human environments can benefit from good models of site-specific motion patterns. Maps of Dynamics (MoDs) provide such models by encoding statistical motion patterns in a map, but existing representations use discrete spatial sampling and typically require costly offline construction. We propose a continuous spatio-temporal MoD representation based on implicit neural functions that directly map coordinates to the parameters of a Semi-Wrapped Gaussian Mixture Model. This removes the need for discretization and imputation for unevenly sampled regions, enabling smooth generalization across both space and time. Evaluated on two public datasets with real-world people tracking data, our method achieves better accuracy of motion representation and smoother velocity distributions in sparse regions while still being computationally efficient, compared to available baselines. The proposed approach demonstrates a powerful and efficient way of modeling complex human motion patterns and high performance in the trajectory prediction downstream task.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1795. Scaling Multi-Task Bayesian Optimization with Large Language Models
๐ฌ 20
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian optimization large language models protein design meta learning scientific discovery
In multi-task Bayesian optimization, the goal is to leverage experience from optimizing existing tasks to improve the efficiency of optimizing new ones. While approaches using multi-task Gaussian processes or deep kernel transfer exist, the performance improvement is marginal when scaling beyond a moderate number of tasks. We introduce **BOLT**, an initialization-only transfer strategy that distills prior BO runs into an LLM which proposes candidates for new tasks, while the surrogate at test time remains single-task. The LLM is periodically fine-tuned on top solutions from completed runs, creating a closed loop where better BO outputs yield better initializations over time. This decoupled design scales to roughly 1500 tasks without the saturation observed for shared-surrogate MTBO and adds only a small, amortized overhead relative to the BO inner loops. We evaluate on two domains: database query optimization and antimicrobial peptide design. We demonstrate that LLM-generated initializations steadily improve and accelerate BO, and with sufficient fine-tuning, a few LLM samples often match or surpass full ''from-scratch'' BO with far fewer oracle calls.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1796. MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Universal Sound Separation Multimodal Learning Reinforcement Learning
Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources.
We take a preference-alignment perspective, analogous to aligning LLMs with human intent.
To address this misalignment, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is steered by a preference reward model and optimized with a stable, clipped trust-region surrogate. The reward, derived from a progressively aligned audio-text-vision encoder, directly incentivizes semantic consistency with query prompts.
Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
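The clipped trust-region surrogate mentioned above has a standard PPO-style form; a sketch under the assumption that MARS-Sep uses this generic shape, with log-probabilities of the sampled Beta mask and a reward-model advantage:

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Generic clipped trust-region surrogate (not necessarily MARS-Sep's exact
    objective). `logp_*` are log-probabilities of the sampled mask under the
    current and behavior policies; `advantage` comes from the multimodal
    preference reward minus a baseline."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```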
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1797. TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models
๐ฌ 24
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ MLLMs
Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, exhibit a critical deficiency in understanding temporal and procedural visual data, a bottleneck hindering their application in real-world embodied AI. This gap is largely caused by a systemic failure in training paradigms, which lack large-scale, procedurally coherent data. To address this problem, we introduce TPRU, a large-scale dataset sourced from diverse embodied scenarios such as robotic manipulation and GUI navigation. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review. A key feature is the inclusion of challenging negative samples, compelling models to transition from passive observation to active, cross-modal validation. We leverage TPRU with a reinforcement learning (RL) fine-tuning methodology, specifically targeting the enhancement of resource-efficient models. Experiments show our approach yields dramatic gains: on our manually curated TPRU-Test, the accuracy of TPRU-7B soars from 50.33\% to 75.70\%, a state-of-the-art result that significantly outperforms vastly larger baselines, including GPT-4o. Crucially, these capabilities generalize effectively, demonstrating substantial improvements on established benchmarks. We will release our dataset and models to the community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1798. Evaluating Text Creativity across Diverse Domains: a Dataset and Large Language Model Evaluator
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ creativity evaluation text evaluation
Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly to support further research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1799. VoG: Enhancing LLM Reasoning through Stepwise Verification on Knowledge Graphs
๐ฌ 21
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ LLM reasoning Knowledge Graphs KG-enhanced LLM
Large Language Models (LLMs) excel at various reasoning tasks but still encounter challenges such as hallucination and factual inconsistency in knowledge-intensive tasks, primarily due to a lack of external knowledge and factual verification. These challenges could be mitigated by leveraging knowledge graphs (KGs) to support more reliable LLM reasoning. However, existing KG-augmented LLM frameworks still rely on static integration mechanisms that cannot adjust reasoning in response to evolving context and retrieved evidence, resulting in error propagation and incomplete reasoning. To alleviate these issues, we propose **V**erify-**o**n-**G**raph (**VoG**), a scalable and model-agnostic framework to enhance LLM reasoning via iterative retrieval, stepwise verification, and adaptive revision. Besides performing KG retrieval guided by an initially generated reasoning plan, VoG iteratively verifies and revises the reasoning plan, correcting intermediate errors in consideration of the varying contextual conditions. During plan revision, VoG leverages a context-aware multi-armed bandit strategy, guided by reward signals that capture uncertainty and semantic consistency, to enhance the alignment between the reasoning plan and retrieved evidence in a more adaptive and reliable way. Experimental results across three benchmark datasets show that VoG consistently improves both reasoning accuracy and efficiency. Our code is available at https://anonymous.4open.science/r/VoG-132C/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1800. Dragging with Geometry: From Pixels to Geometry-Guided Image Editing
๐ฌ 18
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Model; Drag-based Image Editing
Interactive point-based image editing serves as a controllable editor, enabling precise and flexible manipulation of image content. However, previous methods predominantly center on the 2D pixel plane, neglecting the underlying 3D geometric structure. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose a novel geometry-guided drag-based image editing method, GeoDrag, which addresses three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. Our code and models will be released publicly. Project page: https://geodrag-site.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1801. A Joint Diffusion Model with Pre-Trained Priors for RNA SequenceโStructure Co-Design
๐ฌ 12
โญ 5.50
๐ generative models
๐ท๏ธ RNA design Diffusion models Generative models
RNA molecules underlie regulation, catalysis, and therapeutics in biological systems, yet de novo RNA design remains difficult given the tight and highly non-linear sequence-structure coupling.
The RNA sequence-structure co-design problem generates nucleotide sequences and 3D conformations jointly, which is challenging due to RNA's conformational flexibility, non-canonical base pairing, and the scarcity of 3D data.
We introduce a joint generative framework that embeds RoseTTAFold2NA as the denoiser in a dual diffusion model, injecting rich cross-molecular priors while enabling sample-efficient learning from limited RNA data. Our method couples a discrete diffusion process for sequences with an $SE(3)$-equivariant diffusion for rigid-frame translations and rotations over all-atom coordinates. The architecture supports flexible conditioning, and is further enhanced at inference via lightweight RL techniques that optimize task-aligned rewards.
Across de novo RNA design as well as complex and protein-conditioned design tasks, our approach yields high self-consistency and confidence scores, improving over recent diffusion/flow baselines trained from scratch. Results demonstrate that leveraging pre-trained structural priors within a joint diffusion framework is a powerful paradigm for RNA design under data scarcity, enabling high-fidelity generation of standalone RNAs and functional RNA-protein interfaces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1802. Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?
๐ฌ 30
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Real-world Counseling CBT Therapy Mental Health
Recent incidents have revealed that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, including reports of chatbots encouraging self-harm. Such risks highlight the urgent need for rigorous, clinically valid evaluation before integration into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions, such as rapport building, guided exploration, intervention, and closure. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address this, we present \textbf{CareBench-CBT}, the largest clinically validated benchmark for CBT-based counseling. It unifies three components: 1) we provide thousands of expert-curated and validated items to ensure data reliability; 2) we include realistic multi-turn dialogues to capture long-form therapeutic interaction; and 3) we align all sessions with CBT's formal structure, enabling process-level evaluation of empathy, therapeutic alignment, and intervention quality. All data are anonymized, double-reviewed by 21 licensed professionals, and validated with reliability and competence metrics. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below human counselors. CareBench-CBT provides a rigorous foundation for advancing safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1803. ReLaSH: Reconstructing Joint Latent Spaces for Efficient Generation of Synthetic Hypergraphs with Hyperlink Attributes
๐ฌ 26
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ hypergraphs latent space models structured data generation
Hypergraph network data, which capture multi-way interactions among entities, have become increasingly prevalent in the big data era, spanning fields such as social science, medical research, and biology. Generating synthetic hyperlinks with attributes from an observed hypergraph has broad applications in data augmentation, simulation, and advancing the understanding of real-world complex systems. This task, however, poses unique challenges due to special properties of hypergraphs, including discreteness, hyperlink sparsity, and the mixed data types of hyperlinks and their attributes, rendering many existing generative models unsuitable. In this paper, we introduce ReLaSH (REconstructing joint LAtent Spaces for Hypergraphs with attributes), a general generative framework for producing realistic synthetic hypergraph data with hyperlink attributes via training a likelihood-based joint embedding model and reconstructing the joint latent space. Given a hypergraph dataset, ReLaSH first embeds the hyperlinks and their attributes into a joint latent space by training a likelihood-based model, and then reconstructs this joint latent space using a distribution-free generator. The generation task is completed by first sampling embeddings from the distribution-free generator and then decoding them into hyperlinks and attributes through the trained likelihood-based model. Compared with existing generative models, ReLaSH explicitly accounts for the unique structure of hypergraphs and jointly models hyperlinks and their attributes. Moreover, the likelihood-based embedding model provides efficiency and interpretability relative to deep black-box architectures, while the distribution-free generator in the joint latent space ensures flexibility. We theoretically demonstrate consistency and generalizability of ReLaSH. Empirical results on a range of real-world datasets from diverse domains demonstrate the strong performance of ReLaSH, underscoring its broad utility and effectiveness in practical applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1804. WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Agent Web Agent Deep Research Visual Question Answering (VQA) Tool-augmented Reasoning Multimodal Information-Seeking Benchmark
Web agents such as deep research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains largely text-centric, overlooking visual information in the real world. This makes multimodal deep research highly challenging, as such agents require much stronger perceptual, logical, and knowledge-based reasoning abilities, as well as proficiency in more sophisticated tools. To address this limitation, we introduce WebWatcher, a multimodal agent for deep research with joint reasoning ability across both visual and textual modalities. It uses high-quality synthetic trajectories for efficient cold-start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a benchmark in the style of BrowseComp that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher outperforms prompt-based workflows and open-source agents on HLE and BrowseComp-VL, and demonstrates its perception, multimodal reasoning, and searching capabilities across three additional benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1805. Architecture-Agnostic Test-Time Adaptation via Backprop-Free Embedding Alignment
๐ฌ 22
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Test-time adaptation; efficiency; feature space; embedding alignment
Test-Time Adaptation (TTA) adapts a deployed model during online inference to mitigate the impact of domain shift. While achieving strong accuracy, most existing methods rely on backpropagation, which is memory and computation intensive, making them unsuitable for resource-constrained devices. Recent attempts to reduce this overhead often suffer from high latency or are tied to specific architectures such as ViT-only or CNN-only.
In this work, we revisit domain shift from an embedding perspective. Our analysis reveals that domain shift induces three distinct structural changes in the embedding space: translation (mean shift), scaling (variance shift), and rotation (covariance shift). Based on this insight, we propose Progressive Embedding Alignment (PEA), a backpropagation-free and architecture-agnostic TTA approach. By applying a novel covariance alignment procedure at each intermediate layer, PEA efficiently corrects the embedding distortions with only two forward passes. Extensive experiments demonstrate that PEA achieves state-of-the-art performance in both accuracy and efficiency, while also proving versatile across different architectures including ViTs and CNNs.
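A generic whiten-and-recolor sketch of the translation/scaling/rotation correction the abstract describes, applied to one layer's embeddings; the actual per-layer PEA update is not shown here, so treat this as an illustration of the idea only:

```python
import torch

def align_to_source_stats(z_test, mu_src, cov_src, eps=1e-5):
    """Backprop-free alignment of test-time embeddings (n, d) to stored source
    statistics: subtract the test mean (translation), whiten with the test
    covariance (scaling + rotation), then re-color with the source covariance
    and add the source mean."""
    d = z_test.shape[1]
    mu_t = z_test.mean(0, keepdim=True)
    z_c = z_test - mu_t
    cov_t = z_c.T @ z_c / (z_c.shape[0] - 1) + eps * torch.eye(d)

    def sqrtm(m):  # symmetric PSD matrix square root via eigendecomposition
        vals, vecs = torch.linalg.eigh(m)
        return vecs @ torch.diag(vals.clamp(min=eps).sqrt()) @ vecs.T

    whiten = torch.linalg.inv(sqrtm(cov_t))
    recolor = sqrtm(cov_src + eps * torch.eye(d))
    return z_c @ whiten @ recolor + mu_src
```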
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1806. DMAP: A Distribution Map for Text
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Entropy Statistical Text Analysis Post-training Supervised fine-tuning
Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1807. In-Context Watermarks for Large Language Models
๐ฌ 19
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Watermark In-context Learning
The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts.
Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.
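As a toy illustration of how a prompt-only watermark can be checked statistically (our own example, not one of the paper's four strategies), consider a lexical 'green-list' instruction paired with a simple z-score detector:

```python
import math

def greenlist_watermark_zscore(text: str, green_words: set, base_rate: float = 0.05) -> float:
    """Counts occurrences of instructed 'green-list' words and compares them with
    their expected base rate in unwatermarked text via a one-sided z-score.
    A large positive score suggests the text followed the watermark instruction."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    hits = sum(t in green_words for t in tokens)
    n = max(len(tokens), 1)
    expected = base_rate * n
    std = math.sqrt(max(base_rate * (1 - base_rate) * n, 1e-9))
    return (hits - expected) / std
```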
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1808. OpenAgentSafety: A Comprehensive Framework For Evaluating Real-World AI Agent Safety
๐ฌ 10
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Agents Safety Risks Datasets Benchmarks Tool-Use User Interactions Frameworks
Recent advances in AI agents capable of solving complex, everyday tasks, from software engineering to customer service, have enabled deployment in real-world settings, but their potential for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to add tools, tasks, websites, and adversarial strategies with minimal effort. It combines rule-based evaluation with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors. Empirical analysis of five prominent LLMs in agentic scenarios reveals unsafe behavior in 51.2% of safety-vulnerable tasks with Claude-Sonnet-3.7 and in up to 72.7% with o3-mini, highlighting critical risks and the need for stronger safeguards before real-world deployment of LLM agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1809. Glance for Context: Learning When to Leverage LLMs for Node-Aware GNN-LLM Fusion
๐ฌ 17
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ GNN Graph Learning GNN-LLM Homophily Heterophily
Learning on text-attributed graphs has motivated the use of Large Language Models (LLMs) for graph learning. However, most fusion strategies are applied uniformly across all nodes and attain only small overall performance gains. We argue this result stems from aggregate metrics that obscure when LLMs provide benefit, inhibiting actionable signals for new strategies. In this work, we reframe LLM-GNN fusion around nodes where GNNs typically falter. We first show that performance can significantly differ between GNNs and LLMs, with each excelling on distinct structural patterns, such as local homophily. To leverage this finding, we propose **GLANCE** (**G**NN with **L**LM **A**ssistance for **N**eighbor- and **C**ontext-aware **E**mbeddings), a framework that invokes an LLM to refine a GNN's prediction. GLANCE employs a lightweight router that, given inexpensive per-node signals, decides whether to query the LLM. Since the LLM calls are non-differentiable, the router is trained with an advantage-based objective that compares the utility of querying the LLM against relying solely on the GNN. Across multiple benchmarks, GLANCE achieves the best performance balance across node subgroups, achieving significant gains on heterophilous nodes (up to +5.8%) while simultaneously achieving top overall performance (up to +1.1%). Our findings advocate for adaptive, node-aware, GNN-LLM architectures, showing that selectively invoking the LLM where it adds value enables scalable application of LLMs to large graphs.
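A minimal sketch of an advantage-style router objective consistent with the description above, assuming a small MLP over cheap per-node signals and precomputed per-node utilities (e.g., correctness) for the GNN-only and LLM-refined predictions; all names and the exact surrogate loss are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

router = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # per-node signals -> logit
opt = torch.optim.Adam(router.parameters(), lr=1e-3)

def router_loss(signals, util_gnn, util_llm):
    """Maximize the expected advantage of querying the LLM. The LLM call itself
    is non-differentiable, so gradients flow only through the query probability."""
    p_query = torch.sigmoid(router(signals)).squeeze(-1)
    advantage = (util_llm - util_gnn).detach()        # > 0 where the LLM actually helps
    return -(p_query * advantage).mean()

# Toy batch: 8 nodes, 4 routing signals each, binary utilities.
signals = torch.randn(8, 4)
util_gnn = torch.randint(0, 2, (8,)).float()
util_llm = torch.randint(0, 2, (8,)).float()
opt.zero_grad()
router_loss(signals, util_gnn, util_llm).backward()
opt.step()
```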
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1810. JAPAN: Joint Adaptive Prediction Areas with Normalising Flow
๐ฌ 19
โญ 5.50
๐ generative models
๐ท๏ธ Uncertainty Quantification Normalising Flows Joint Prediction Areas
Conformal prediction provides a model-agnostic framework for uncertainty quantification with finite-sample validity guarantees, making it an attractive tool for constructing reliable prediction sets. However, existing approaches commonly rely on residual-based conformity scores, which impose geometric constraints and struggle when the underlying distribution is multimodal. In particular, they tend to produce overly conservative prediction areas centred around the mean, often failing to capture the true shape of complex predictive distributions. In this work, we introduce JAPAN (Joint Adaptive Prediction Areas with Normalising-Flows), a flow-based framework that uses density estimates for several conformal scores. By leveraging flow-based models, JAPAN estimates the (predictive) density and constructs prediction areas by thresholding on the estimated density scores, enabling compact, potentially disjoint, and context-adaptive regions that retain finite-sample coverage guarantees. We theoretically motivate the efficiency of JAPAN and empirically validate it across multivariate regression and forecasting tasks, demonstrating good calibration and tighter prediction areas compared to existing baselines. Furthermore, several density-based conformity scores showcase the flexibility of our proposed framework.
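A compact sketch of the density-thresholding idea under a standard split-conformal calibration, assuming access to a fitted conditional log-density `log_density(x, y)` from a normalising flow (the calibration recipe shown is the generic one; the paper's scores and guarantees may be organized differently):

```python
import numpy as np

def calibrate_threshold(log_density, X_cal, Y_cal, alpha=0.1):
    """Split-conformal calibration with score = -log density: the threshold is
    the ceil((n+1)(1-alpha))-th smallest calibration score, which yields
    finite-sample marginal coverage of at least 1 - alpha."""
    scores = np.sort([-log_density(x, y) for x, y in zip(X_cal, Y_cal)])
    n = len(scores)
    k = min(n - 1, int(np.ceil((n + 1) * (1 - alpha))) - 1)   # 0-based index
    return scores[k]

def in_prediction_area(log_density, x, y, threshold):
    """y is covered iff its predictive density clears the calibrated cutoff;
    the resulting area follows the density's shape and may be disjoint."""
    return -log_density(x, y) <= threshold
```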
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1811. Learning to Reason without External Rewards
๐ฌ 18
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ RL Reasoning LLM
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks and achieves competitive, sometimes better, generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available in the supplementary materials.
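A hedged sketch of a self-certainty style reward dropped into a group-relative advantage, assuming self-certainty is measured as the average divergence of the next-token distribution from uniform over generated tokens (one common formulation; the paper's exact definition and normalization may differ):

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits):
    """Average KL(p || uniform) over generated positions, i.e. log|V| minus the
    entropy of the next-token distribution; higher means a more confident model."""
    logp = F.log_softmax(logits, dim=-1)              # [T, V]
    entropy = -(logp.exp() * logp).sum(-1)            # [T]
    return (math.log(logits.size(-1)) - entropy).mean()

def group_relative_advantages(rewards):
    """GRPO-style normalization across a group of rollouts for the same prompt."""
    r = torch.stack(rewards)
    return (r - r.mean()) / (r.std() + 1e-6)

# Toy group of 4 rollouts, each with 10 generated positions over a 50k vocabulary.
rollouts = [torch.randn(10, 50_000) for _ in range(4)]
advantages = group_relative_advantages([self_certainty(l) for l in rollouts])
```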
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1812. KVComm: Enabling Efficient LLM Communication through Selective KV Sharing
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Multi-Agent Systems Inter-LLM Communication Multi-agent Debate
Large Language Models (LLMs) are increasingly deployed in multi-agent systems, where effective inter-model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel communication framework that enables efficient communication between LLMs through selective sharing of KV pairs. KVComm leverages the rich information encoded in the KV pairs while avoiding the pitfalls of hidden states. We introduce a KV layer-wise selection strategy based on attention importance scores with a Gaussian prior to identify the most informative KV pairs for communication. Extensive experiments across diverse tasks and model pairs demonstrate that KVComm achieves comparable performance to the upper-bound method, which directly merges inputs to one model without any communication, while transmitting as few as 30\% of layers' KV pairs. Our study highlights the potential of KV pairs as an effective medium for inter-LLM communication, paving the way for scalable and efficient multi-agent systems.
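A small sketch of layer-wise KV selection that combines a per-layer attention-importance score with a Gaussian prior over depth and keeps roughly 30% of layers; the scoring details, prior parameters, and function names are assumptions based on the description, not the paper's code:

```python
import numpy as np

def select_layers(attn_importance, keep_ratio=0.3, mu_frac=0.5, sigma_frac=0.2):
    """attn_importance: per-layer importance (e.g. attention mass on the tokens
    to be shared). Scores are reweighted by a Gaussian prior over normalized
    depth, and the top fraction of layers is selected for KV sharing."""
    L = len(attn_importance)
    depth = np.arange(L) / max(L - 1, 1)
    prior = np.exp(-0.5 * ((depth - mu_frac) / sigma_frac) ** 2)
    scores = np.asarray(attn_importance) * prior
    k = max(1, int(round(keep_ratio * L)))
    return np.sort(np.argsort(-scores)[:k])   # indices of layers whose KV pairs are sent

print(select_layers(np.random.rand(32)))
```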
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1813. CircuitNet 3.0: A Multi-Modal Dataset with Task-Oriented Augmentation for AI-Driven Circuit Design
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Dataset Benchmark Machine learning Electronic design automation
Integrated circuit (IC) designs require transforming high-level specifications into physical layouts, demanding extensive expertise and specialized tools, as well as months of time and numerous iterations. While Machine Learning (ML) has shown promise in various research domains, the lack of large-scale, open datasets limits its application in chip design. To address this limitation, we introduce CircuitNet 3.0, a large-scale, comprehensive, and open-source dataset curated to facilitate the evaluation of ML models on challenging timing and power prediction tasks. Starting with a diverse set of 8,659 validated open-source designs, we employ a systematic framework to generate over 15,000 instances. Through specialized syntax-tree mutation strategies and a principled, task-oriented filtering methodology, we enrich each design with multi-modal information spanning multiple design stages, including complete design flow documentation, register-transfer-level (RTL) designs and corresponding netlists, detailed physical layouts, and comprehensive performance metrics. The experimental results convincingly demonstrate that ML models leveraging multi-stage, multi-modal circuit representations significantly improve performance over existing open-source datasets in electronic design automation (EDA) tasks, paving the way for efficient and accessible circuit representation learning. The dataset and code are available at https://anonymous.4open.science/r/ICLR26-CircuitNet3-272B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1814. Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal large language model reinforcement learning
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLM.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1815. SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset
๐ฌ 12
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Protein Ligand Dataset Affinity
Accurate prediction of protein-ligand binding affinities remains a cornerstone problem in drug discovery. While binding affinity is inherently dictated by the 3D structure and dynamics of protein-ligand complexes, current deep learning approaches are limited by the lack of high-quality experimental structures with annotated binding affinities. To address this limitation, we introduce the Structurally Augmented IC50 Repository (SAIR), the largest publicly available dataset of protein-ligand 3D structures with associated activity data. The dataset comprises $5,244,285$ structures across $1,048,857$ unique protein-ligand systems, curated from the ChEMBL and BindingDB databases, which were then computationally folded using the Boltz-1x model. We provide a comprehensive characterization of the dataset, including distributional statistics of proteins and ligands, and evaluate the structural fidelity of the folded complexes using PoseBusters. Our analysis reveals that approximately $3 \%$ of structures exhibit physical anomalies, predominantly related to internal energy violations. As an initial demonstration, we benchmark several binding affinity prediction methods, including empirical scoring functions (Vina, Vinardo), a 3D convolutional neural network (Onionnet-2), and a graph neural network (AEV-PLIG). While machine learning-based models consistently outperform traditional scoring function methods, neither exhibits a high correlation with ground truth affinities, highlighting the need for models specifically fine-tuned to synthetic structure distributions. This work provides a foundation for developing and evaluating next-generation structure and binding-affinity prediction models and offers insights into the structural and physical underpinnings of protein-ligand interactions.
The link to the data will be added upon publication, to preserve anonymity of the submission.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1816. KaVa: Latent Reasoning via Compressed KV-Cache Distillation
๐ฌ 20
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ llm reasoning latent reasoning efficiency
Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces.
In this work we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student.
Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1817. MixLinear: Extreme Low Resource Multivariate Time Series Forecasting with $0.1K$ Parameters
๐ฌ 13
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Long-term Time Series Forecasting Segmentation Adaptive Low-Rank Spectral Filtering
Recently, there has been a growing interest in Long-term Time Series Forecasting (LTSF), which involves predicting long-term future values by analyzing a large amount of historical time-series data to identify patterns and trends. Significant challenges exist in LTSF due to its complex temporal dependencies and high computational demands. Although Transformer-based models offer high forecasting accuracy, they are often too compute-intensive to be deployed on devices with hardware constraints.
In this paper, we propose MixLinear, which synergistically combines orthogonal segment-based trend extraction in the time domain with adaptive low-rank spectral filtering in the frequency domain. Our approach exploits the complementary structural sparsity of time series: local temporal patterns are efficiently captured through mathematically linear transformations that separate intra-segment and inter-segment correlations, while global trends are compressed into an ultra-low-dimensional frequency latent space through learnable rank-constrained filters. By reducing the parameter scale of a downsampled $n$-length input/output one-layer linear model from $O(n^2)$ to $O(n)$, MixLinear achieves efficient computation without sacrificing accuracy.
Extensive evaluations show that MixLinear achieves forecasting performance comparable to, or surpasses, state-of-the-art models with significantly fewer parameters ($0.1K$), which makes it well suited for deployment on devices with limited computational capacity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1818. MoM: Linear Sequence Modeling with Mixture-of-Memories
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient/Low-Resource Methods for NLP Linear Sequence Modeling Machine Learning for NLP
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while maintaining constant complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models.
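A toy sketch of the mixture-of-memories idea: a router assigns each token to one of several linear-attention-style memory matrices, so total capacity grows with the number of memories while each update stays linear in sequence length. The hard top-1 routing, decay-free update, and dimensions below are simplifications for illustration rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class MixtureOfMemories(nn.Module):
    def __init__(self, d, n_mem=4):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.router = nn.Linear(d, n_mem)
        self.n_mem, self.d = n_mem, d

    def forward(self, x):                                 # x: [T, d], processed recurrently
        mem = [x.new_zeros(self.d, self.d) for _ in range(self.n_mem)]
        outs = []
        for t in range(x.size(0)):
            q, k, v = self.qkv(x[t]).chunk(3)
            m = self.router(x[t]).argmax().item()          # route this token to one memory
            mem[m] = mem[m] + torch.outer(k, v)            # linear-attention style update
            outs.append(q @ mem[m])                        # read only from the routed memory
        return torch.stack(outs)

print(MixtureOfMemories(d=16)(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```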
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1819. OrthoRF: Exploring Orthogonality in Object-Centric Representations
๐ฌ 13
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Object Discovery Object-Centric Representations Structured Representation Learning Orthogonality
Neural synchrony is hypothesized to help the brain organize visual scenes into structured multi-object representations. In machine learning, synchrony-based models analogously learn object-centric representations by storing binding in the phase of complex-valued features. Rotating Features (RF) instantiate this idea with vector-valued activations, encoding object presence in magnitudes and affiliation in orientations. We propose Orthogonal Rotating Features (OrthoRF), which enforces orthogonality in RF's orientation space via an inner-product loss and architectural modifications. This yields sharper phase alignment and more reliable grouping. In evaluations of unsupervised object discovery, including settings with overlapping objects, noise, and out-of-distribution tests, OrthoRF matches or outperforms current models while producing more interpretable representations, and it eliminates the post-hoc clustering required by many synchrony-based approaches. Unlike current models, OrthoRF also recovers occluded object parts, indicating stronger grouping under occlusion. Overall, orthogonality emerges as a simple, effective inductive bias for synchrony-based object-centric learning.
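One way to picture the inner-product loss is as a penalty on squared cosine similarity between orientation vectors that should belong to different objects; the pairing via group assignments below is purely illustrative (the method itself is unsupervised), and the exact normalization is an assumption:

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(orientations, group_ids):
    """orientations: [N, d] rotating-feature orientation vectors, group_ids: [N]
    integer assignments. Penalizes squared cosine similarity across groups,
    pushing differently-grouped features toward orthogonal directions."""
    z = F.normalize(orientations, dim=-1)
    sim = z @ z.T
    cross = (group_ids[:, None] != group_ids[None, :]).float()
    return (sim ** 2 * cross).sum() / cross.sum().clamp(min=1.0)

loss = orthogonality_loss(torch.randn(64, 8), torch.randint(0, 4, (64,)))
```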
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1820. Language in the Flow of Time: Time-Series-Paired Texts Weaved into a Unified Temporal Narrative
๐ฌ 12
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Modeling Multimodal Learning Time Series Forecasting
While many advances in time series models focus exclusively on numerical data, research on multimodal time series, particularly those involving contextual textual information, remains in its infancy. With recent progress in large language models and time series learning, we revisit the integration of paired texts with time series through the Platonic Representation Hypothesis, which posits that representations of different modalities converge to shared spaces. In this context, we identify that time-series-paired texts may naturally exhibit periodic properties that closely mirror those of the original time series. Building on this insight, we propose a novel framework, Texts as Time Series (TaTS), which considers the time-series-paired texts to be auxiliary variables of the time series. TaTS can be plugged into any existing numerical-only time series models and effectively enable them to handle time series data with paired texts. Through extensive experiments on both multimodal time series forecasting and imputation tasks across benchmark datasets with various existing time series models, we demonstrate that TaTS can enhance multimodal predictive performance without modifying model architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1821. Flock: A Knowledge Graph Foundation Model via Learning on Random Walks
๐ฌ 26
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ knowledge graphs link prediction knowledge graph foundation models invariance equivariance random walks
We study the problem of zero-shot link prediction on knowledge graphs (KGs), which requires models to generalize over novel entities and novel relations. Knowledge graph foundation models (KGFMs) address this task by enforcing equivariance over both nodes and relations, learning from structural properties of nodes and relations, which are then transferable to novel graphs with similar structural properties. However, the conventional notion of deterministic equivariance imposes inherent limits on the expressive power of KGFMs, preventing them from distinguishing structurally similar but semantically distinct relations. To overcome this limitation, we introduce probabilistic node-relation equivariance, which preserves equivariance in distribution while incorporating a principled randomization to break symmetries during inference. Building on this principle, we present Flock, a KGFM that iteratively samples random walks, encodes them into sequences via a recording protocol, embeds them with a sequence model, and aggregates representations of nodes and relations via learned pooling. Crucially, Flock respects probabilistic node-relation equivariance and is a universal approximator for isomorphism-invariant link-level functions over KGs. Empirically, Flock perfectly solves our new diagnostic dataset Petals where current KGFMs fail, and achieves state-of-the-art performance on entity- and relation-prediction tasks on 54 KGs from diverse domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1822. ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language model reasoning model safety alignment
Large Language Models have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as SFT and RLHF, reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, where models assess potential risks before producing responses. However, we find these methods fail against powerful out-of-distribution jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze jailbreak strategies from an external, updatable strategy library, (2) extract the core intent, and (3) apply policy-based safety verification. We further develop ARMOR-Think, which decouples safety reasoning from general reasoning to improve both robustness and utility. Evaluations on advanced optimization-based jailbreaks and safety benchmarks show that ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks, far below other reasoning-based models. Moreover, ARMOR demonstrates strong generalization to unseen jailbreak strategies, reducing their success rate to zero. These results highlight ARMOR's effectiveness in defending against OOD jailbreak attacks, offering a practical path toward secure and reliable LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1823. Optimizing Data Augmentation through Bayesian Model Selection
๐ฌ 11
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian Neural Network Variational Inference Data Augmentation
Data Augmentation (DA) has become an essential tool to improve robustness and generalization of modern machine learning.
However, when deciding on DA strategies it is critical to choose parameters carefully, and this can be a daunting task which is traditionally left to trial-and-error or expensive optimization based on validation performance.
In this paper, we counter these limitations by proposing a novel framework for optimizing DA.
In particular, we take a probabilistic view of DA, which leads to the interpretation of augmentation parameters as model (hyper)-parameters, and the optimization of the marginal likelihood with respect to these parameters as a Bayesian model selection problem.
Due to its intractability, we derive a tractable ELBO, which allows us to optimize augmentation parameters jointly with model parameters. We provide extensive theoretical results on variational approximation quality, generalization guarantees, invariance properties, and connections to empirical Bayes.
Through experiments on computer vision and NLP tasks, we show that our approach improves calibration and yields robust performance over fixed or no augmentation.
Our work provides a rigorous foundation for optimizing DA through Bayesian principles with significant potential for robust machine learning.
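As a rough illustration of the probabilistic view sketched above (notation and the exact form of the bound are our own; the paper's ELBO may differ in detail), one can fold the augmentation distribution $a_\theta(\hat{x} \mid x)$ into the likelihood and lower-bound the marginal likelihood jointly over a variational posterior $q_\phi(w)$ on weights and the augmentation parameters $\theta$:

$$\log p(\mathcal{D} \mid \theta) \;\ge\; \sum_{i=1}^{N} \mathbb{E}_{q_\phi(w)}\,\mathbb{E}_{\hat{x}_i \sim a_\theta(\cdot \mid x_i)}\big[\log p(y_i \mid \hat{x}_i, w)\big] \;-\; \mathrm{KL}\big(q_\phi(w)\,\|\,p(w)\big)$$

This follows from the standard ELBO plus one further application of Jensen's inequality to the augmentation average, and it can be maximized in $(\phi, \theta)$ with stochastic gradients, so augmentation parameters are chosen by (approximate) Bayesian model selection rather than by validation-set search.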
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1824. $PhyWorldBench$: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
๐ฌ 13
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Video Generation Video Evaluation
Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents $PhyWorldBench$, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel "Anti-Physics" category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current MLLMs to evaluate physics realism in a zero-shot fashion. We evaluate 10 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. Through systematic testing of their outputs across 1,050 curated prompts, spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1825. G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ GraphRAG RAG LLM
Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Retrieval-augmented generation (RAG) mitigates this by incorporating external knowledge, yet existing RAGs struggle with knowledge-intensive tasks due to fragmented information and weak modeling of knowledge structure. Graphs offer a natural way to model relationships within knowledge, but LLMs are inherently unstructured and cannot effectively reason over graph-structured data. Recent graph-enhanced RAG (GraphRAG) attempts to bridge this gap by constructing tailored graphs and enabling LLMs to reason on them. However, these methods often depend on ad-hoc graph designs, heuristic search, or costly agent pipelines, which hinder scalability and generalization. To address these challenges, we present G-reasoner, a unified framework that integrates graph and language foundation models for scalable reasoning over diverse graph-structured knowledge. Central to our approach is QuadGraph, a standardized four-layer abstraction that unifies heterogeneous knowledge sources into a common graph representation. Building on this, we introduce a 34M-parameter graph foundation model (GFM) that jointly captures graph topology and textual semantics, and is integrated with LLMs to enhance reasoning in downstream applications. To ensure scalability and efficiency, mixed-precision training and distributed message-passing are implemented to scale GFM with more GPUs. Extensive experiments on six benchmarks show that G-reasoner consistently outperforms state-of-the-art baselines, significantly enhances LLM reasoning, and achieves strong efficiency and cross-graph generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1826. Dual Language Models: Balancing sample-efficiency and overfitting resilience
๐ฌ 9
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ language model pretraining training objective mixed training objective masked diffusion
This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible models that outperform the standard single-objective models in both settings. Autoregressive language modeling has been a popular approach, partly because of its training efficiency; however, this comes at the cost of susceptibility to overfitting. On the other hand, masked-diffusion language models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio of the masked-diffusion and autoregressive objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.
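A compact sketch of dual-objective training without architectural changes: each step is trained with either the autoregressive or the masked-diffusion objective, chosen according to a mixing ratio. The 50/50 default, the uniform masking-rate schedule, and the assumption that `model` returns per-position logits are illustrative choices, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def dual_objective_step(model, batch, mask_token_id, p_ar=0.5):
    """batch: [B, T] token ids. With probability p_ar use next-token prediction;
    otherwise corrupt a random fraction of tokens and train the same model to
    denoise them (a masked-diffusion style objective)."""
    if torch.rand(()) < p_ar:
        logits = model(batch[:, :-1])                               # causal LM pass
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               batch[:, 1:].reshape(-1))
    t = torch.rand(())                                              # masking rate ~ U(0, 1)
    mask = torch.rand_like(batch, dtype=torch.float) < t
    if not mask.any():                                              # ensure at least one target
        mask[..., 0] = True
    corrupted = torch.where(mask, torch.full_like(batch, mask_token_id), batch)
    logits = model(corrupted)                                       # denoising pass
    return F.cross_entropy(logits[mask], batch[mask])               # loss on masked positions only
```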
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1827. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
๐ฌ 28
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Theorem Proving Reasoning
Automated theorem proving (ATP)---the task of generating a proof that passes automated proof verification given a math question in formal language---is a critical challenge at the intersection of mathematics and Artificial Intelligence (AI). We introduce Goedel-Prover-V2, a family of two language models that establish a new state-of-the-art (SOTA) in open-source ATP, using the Lean proof assistant. In addition to standard expert iteration and reinforcement learning, our approach incorporates three key innovations: (1) During training when improvement plateaus on human questions, the prover does {\em scaffolded data synthesis} to generate synthetic questions of increasing difficulty for its own training; (2) The prover is trained to self-correct using Lean compiler feedback; (3) Improved test-time exploration through checkpoint averaging to balance accuracy and diversity.
Our small model, Goedel-Prover-V2-8B, reaches 84.6\% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B despite being $80\times$ smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1\% on MiniF2F at pass@32 in standard mode and 90.4\% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing first place among open-source models and surpassing DeepSeek-Prover-V2-671B's record of 47 problems by pass@1024 with about $20\times$ smaller model size and significantly lower compute budget. To support community research, we have open-sourced the prover models and a SOTA statement formalizer, with all training datasets to be released in the near future.
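Of the listed ingredients, checkpoint averaging is the simplest to make concrete: a uniform parameter average over a few recent checkpoints (a minimal sketch; the actual selection and weighting of checkpoints are not specified here, and the file names are placeholders):

```python
import torch

def average_checkpoints(paths):
    """Uniformly average the parameters stored in several checkpoint files."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# merged = average_checkpoints(["ckpt_step_9000.pt", "ckpt_step_10000.pt"])  # hypothetical paths
# model.load_state_dict(merged)
```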
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1828. Bridging Generalization Gap of Heterogeneous Federated Clients Using Generative Models
๐ฌ 15
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Federated Learning Model Heterogeneity
Federated Learning (FL) is a privacy-preserving machine learning framework facilitating collaborative training across distributed clients. However, its performance is often compromised by data heterogeneity among participants, which can result in local models with limited generalization capability. Traditional model-homogeneous approaches address this issue primarily by regularizing local training procedures or dynamically adjusting client weights during aggregation. Nevertheless, these methods become unsuitable in scenarios involving clients with heterogeneous model architectures. In this paper, we propose a model-heterogeneous FL framework that enhances clients' generalization performance on unseen data without relying on parameter aggregation. Instead of model parameters, clients share feature distribution statistics (mean and covariance) with the server. Then each client trains a variational transposed convolutional neural network using Gaussian latent variables sampled from these distributions, and uses it to generate synthetic data. By fine-tuning local models with the synthetic data, clients achieve a significant improvement in generalization ability. Experimental results demonstrate that our approach not only attains higher generalization accuracy compared to existing model-heterogeneous FL frameworks, but also reduces communication costs and memory consumption.
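A rough sketch of the two pieces described above: clients share only feature means and covariances, and a transposed-convolution generator is driven by Gaussian latents sampled from those statistics. The shapes, output resolution, and absence of any server-side aggregation logic are simplifying assumptions for illustration:

```python
import torch
import torch.nn as nn

def feature_statistics(features):
    """features: [N, d] penultimate-layer features from one client; only the mean
    and covariance are shared with the server, never raw data or model weights."""
    mu = features.mean(0)
    centered = features - mu
    cov = centered.T @ centered / max(features.size(0) - 1, 1)
    return mu, cov

class TransposedConvGenerator(nn.Module):
    def __init__(self, d=64, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(d, 128, 4, 1, 0), nn.ReLU(),        # 1x1 -> 4x4
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),       # 4x4 -> 8x8
            nn.ConvTranspose2d(64, channels, 4, 2, 1), nn.Tanh())  # 8x8 -> 16x16

    def forward(self, z):                                          # z: [B, d] Gaussian latents
        return self.net(z[:, :, None, None])

mu, cov = feature_statistics(torch.randn(200, 64))
z = torch.distributions.MultivariateNormal(mu, cov + 1e-3 * torch.eye(64)).sample((8,))
print(TransposedConvGenerator()(z).shape)  # torch.Size([8, 3, 16, 16])
```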
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1829. Demystifying Robot Diffusion Policies: Action Memorization and a Simple Lookup Table Alternative
๐ฌ 12
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Action Memorization Action Lookup Table Diffusion Policy Analysis
Diffusion policies for visuomotor robot manipulation tasks achieve remarkable dexterity and robustness while only training on a small number of task demonstrations. However, the reason for this performance remains a mystery. In this paper, we offer a surprising hypothesis: diffusion policies essentially memorize an action lookup table---\emph{and this is beneficial}. We posit that, at runtime, diffusion policies find the closest training image to the test image in a latent space, and recall the associated training action (i.e. action chunk), offering reactivity without the need for action generalization. This is effective in the sparse data regime, where there is not enough data density for the model to learn action generalization. We support this claim with systematic empirical evidence, showing that even when conditioned on highly out-of-distribution (OOD) images, Diffusion Policy still outputs an action chunk from the training data. We evaluate and compare three representative policy families on the same data set: Diffusion Policy, Action Chunking with Transformers (ACT), and GR00T, a pre-trained generalist Vision-Language-Action (VLA) model. We show that Diffusion Policy exhibits strong action memorization, yielding surprising robustness in OOD regimes; ACT shows action interpolation with poor robustness in OOD regimes; and GR00T (benefiting from substantial pre-training) shows both action interpolation and OOD robustness. As a simple alternative to Diffusion Policy, we introduce the Action Lookup Table (ALT) policy, showing that an explicit lookup table policy can perform comparably in this low data regime. Despite its simplicity, ALT attains Diffusion-Policy-level performance while also providing faster inference and explicit OOD detection via latent-distance thresholds. These results reframe diffusion policies for robot manipulation as reactive memory retrieval under data sparsity, and provide practical tools for interpreting, evaluating, and monitoring such policies.
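A minimal sketch of an explicit action-lookup-table policy in the spirit of ALT: encode the training images into a latent space, and at test time return the action chunk of the nearest training latent, flagging the query as OOD when the nearest distance exceeds a threshold. The encoder, distance metric, and threshold choice are placeholders rather than the paper's implementation:

```python
import numpy as np

class ActionLookupTable:
    def __init__(self, encode, train_images, train_action_chunks, ood_threshold):
        self.encode = encode                                   # image -> latent vector
        self.keys = np.stack([encode(im) for im in train_images])
        self.actions = train_action_chunks                     # one action chunk per demo frame
        self.ood_threshold = ood_threshold

    def act(self, image):
        z = self.encode(image)
        dists = np.linalg.norm(self.keys - z, axis=1)
        i = int(dists.argmin())
        is_ood = dists[i] > self.ood_threshold                 # explicit latent-distance OOD check
        return self.actions[i], is_ood
```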
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1830. UniCA: Unified Covariate Adaptation for Time Series Foundation Model
๐ฌ 20
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ time series foundation model adaptation covariate-aware forecasting heterogeneous covariates
Time Series Foundation Models (TSFMs) have achieved remarkable success through large-scale pretraining. However, their design primarily targets real-valued series, limiting their ability to handle general forecasting tasks involving diverse and often \emph{heterogeneous covariates}, such as categorical variables and multimodal data (e.g., images, text), which are typically task-specific and difficult to leverage during pretraining. To address this gap, we propose Unified Covariate Adaptation (UniCA), a framework to bridge TSFMs with general covariate-aware forecasting. UniCA first performs covariate homogenization to transform heterogeneous covariates into high-level homogeneous series representations and then fuses them via a unified attention-based fusion mechanism. UniCA is compatible and universal for adaptation with both homogeneous and heterogeneous covariates, incorporating extra covariate information while preserving the generalization ability of TSFMs. Extensive experiments on multiple unimodal and multimodal covariate-aware forecasting benchmarks demonstrate the superiority of UniCA, highlighting the promise of covariate-aware TSFM adaptation in real-world forecasting scenarios. Code: https://anonymous.4open.science/r/UniCA-C5E0.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1831. LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Inverse Solver Langevin sampling Consistency Models Video Interpolation
Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1832. Hidden Breakthroughs in Language Model Training
๐ฌ 17
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ interpretability techniques loss disaggregation phase transitions
Loss curves are smooth during most of model training, so visible discontinuities stand out as possible conceptual breakthroughs. Studying these breakthroughs enables a deeper understanding of learning dynamics, but only when they are properly identified. This paper argues that similar breakthroughs occur frequently throughout training but they are obscured by a loss metric that collapses all variation into a single scalar. To find these hidden transitions, we introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We use our method to identify clusters of samples that share similar changes in loss during training, disaggregating the overall loss into that of smaller groups of conceptually similar data. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent interpretable breakthroughs in the model's capabilities. We demonstrate the promise of these hidden phase transitions as a tool for unsupervised interpretability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1833. Constantly Improving Image Models Need Constantly Improving Benchmarks
๐ฌ 14
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ unified model image generation benchmark native multimodal model
Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 35,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1834. Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
๐ฌ 15
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Self-improving AI Open-endedness
Most of today's AI systems are constrained by human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The scientific method, on the other hand, is a cumulative and open-ended system, where each innovation builds upon previous artifacts, enabling future discoveries. There is growing hope that the current manual process of advancing AI could itself be automated. If done safely, such automation would accelerate AI development and allow us to reap its benefits much sooner. This prospect raises the question of how AI systems can endlessly improve themselves while getting better at solving relevant problems. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a novel self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM grows an archive of generated coding agents. It samples agents from this archive, which self-modify to create new, interesting versions of themselves. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). Overall, the DGM represents a significant step toward self-improving AI, capable of gathering its own stepping stones along a path that unfolds into endless innovation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1835. Efficient Differentiable Contact Model with Long-range Influence
๐ฌ 9
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Differentiable Simulation Policy Search Motion Planning
With the maturation of differentiable physics, its role in various downstream applications, such as model-predictive control, robotic design optimization, and neural PDE solvers, has become increasingly important. However, the derivative information provided by differentiable simulators can exhibit abrupt changes or vanish altogether, impeding the convergence of gradient-based optimizers. In this work, we demonstrate that such erratic gradient behavior is closely tied to the design of contact models. We further introduce a set of properties that a contact model must satisfy to ensure well-behaved gradient information. Lastly, we present a practical contact model for differentiable rigid-body simulators that satisfies all of these properties while maintaining computational efficiency. Our experiments show that, even from simple initializations, our contact model can discover complex, contact-rich control signals, enabling the successful execution of a range of downstream locomotion and manipulation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1836. Learning Distributions over Permutations and Rankings with Factorized Representations
๐ฌ 14
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ permutations rankings fisher-yates insertion-vectors ranking movielens
Learning distributions over permutations is a fundamental problem in machine learning, with applications in ranking, combinatorial optimization, structured prediction, and data association. Existing methods rely on mixtures of parametric families or neural networks with expensive variational inference procedures. In this work, we propose a novel approach that leverages alternative representations for permutations, including Lehmer codes, Fisher-Yates draws, and Insertion-Vectors. These representations form a bijection with the symmetric group, allowing for unconstrained learning using conventional deep learning techniques, and can represent any probability distribution over permutations. Our approach enables a trade-off between expressivity of the model family and computational requirements. In the least expressive and most computationally efficient case our method subsumes previous families of well-established probabilistic models over permutations, including the Mallows model and the Repeated Insertion Model. Experiments indicate our method significantly outperforms current approaches on the jigsaw puzzle benchmark, a common task for permutation learning. However, we argue this benchmark is limited in its ability to assess learning probability distributions, as the target is a delta distribution (i.e., a single correct solution exists). We therefore propose two additional benchmarks: learning cyclic permutations and re-ranking movies based on user preference. We show that our method learns non-trivial distributions even in the least expressive mode, while traditional models fail to even generate valid permutations in this setting.
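The factorized representations mentioned above are easy to make concrete. For instance, the Lehmer code is a bijection between permutations of {0, ..., n-1} and integer vectors whose i-th entry lies in {0, ..., n-1-i}, so each coordinate can be modeled by an ordinary categorical head (a minimal reference implementation of the bijection, not the paper's code):

```python
def perm_to_lehmer(perm):
    """Lehmer code: entry i counts later elements that are smaller than perm[i]."""
    n = len(perm)
    return [sum(perm[j] < perm[i] for j in range(i + 1, n)) for i in range(n)]

def lehmer_to_perm(code):
    """Inverse map: repeatedly pick the code[i]-th smallest unused element."""
    remaining = sorted(range(len(code)))
    return [remaining.pop(c) for c in code]

assert perm_to_lehmer([2, 0, 3, 1]) == [2, 0, 1, 0]
assert lehmer_to_perm([2, 0, 1, 0]) == [2, 0, 3, 1]
```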
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1837. VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
๐ฌ 20
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot Learning Distillation Vision Foundation Models
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full retraining to incorporate robot-domain knowledge.
We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. We then fine-tune only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Code and checkpoints are available in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1838. Cascadia: An Efficient Cascade Serving System for Large Language Models
๐ฌ 13
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Distributed Parallel and Cluster Computing
Recent advances in large language models (LLMs) have intensified the need to deliver both rapid responses and high-quality outputs. More powerful models yield better results but incur higher inference latency, whereas smaller models are faster yet less capable. Recent work proposes balancing this latency-quality trade-off using model cascades, which route simpler queries to smaller models and more complex ones to larger models. However, enabling efficient cascade serving remains challenging. Current frameworks lack effective mechanisms for handling (i) the huge and varying resource demands of different LLMs, (ii) the inherent heterogeneity of LLM workloads, and (iii) the co-optimization of system deployment and routing strategy.
Motivated by these observations, we introduce Cascadia, a novel cascade serving framework designed explicitly to schedule request routing and deploy model cascades for fast, quality-preserving LLM serving. Cascadia employs a bi-level optimization method: at the deployment level, it uses a mixed-integer linear program to select resource allocations and parallelism strategies based on LLM information and workload characteristics; at the routing level, it applies a Chebyshev-guided method to iteratively co-optimize the routing strategy and the system deployment produced by the deployment level. Our extensive evaluation on diverse workload traces and different model cascades (DeepSeek and the Llama series) demonstrates that Cascadia significantly outperforms both single-model deployments and the state-of-the-art cascade serving baseline, achieving up to 4$\times$ (2.3$\times$ on average) tighter latency SLOs and up to 5$\times$ (2.4$\times$ on average) higher throughput while maintaining target answer quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1839. FARTrack: Fast Autoregressive Visual Tracking with High Performance
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autoregressive Tracking Efficient Tracking Visual Object Tracking
Inference speed and tracking performance are two critical evaluation metrics in the field of visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose $\textbf{FARTrack}$, a $\textbf{F}$ast $\textbf{A}$uto-$\textbf{R}$egressive $\textbf{T}$racking framework. Since autoregression emphasizes the temporal nature of the trajectory sequence, it can maintain high performance while achieving efficient execution across various devices. FARTrack introduces $\textbf{Task-Specific Self-Distillation}$ and $\textbf{Inter-frame Autoregressive Sparsification}$, designed from the perspectives of $\textbf{shallow-yet-accurate distillation}$ and $\textbf{redundant-to-essential token optimization}$, respectively. Task-Specific Self-Distillation achieves model compression by distilling task-specific tokens layer by layer, enhancing the model's inference speed while avoiding suboptimal manual teacher-student layer-pair assignments. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally-global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance. It delivers an AO of 70.6\% on GOT-10k in real-time. Beyond that, our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU. The code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1840. Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Human Motion Human-Human Interaction 3D CV Motion Generation
Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1841. Open-Set Semantic Gaussian Splatting SLAM with Expandable Representation
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting Dense Semantic SLAM 3D Scene Representation
This work enables everyday devices, e.g., smartphones, to dynamically capture open-ended 3D scenes with rich, expandable semantics for immersive virtual worlds. While 3DGS and foundation models hold promise for semantic scene understanding, existing solutions suffer from unscalable semantic integration, prohibitive memory costs, and cross-view inconsistency. To respond, we propose Open-Set Semantic Gaussian Splatting SLAM, a GS-SLAM system augmented by an expandable semantic feature pool that decouples condensed scene-level semantics from individual 3D Gaussians. Each Gaussian references semantics via a lightweight indexing vector, reducing memory overhead by orders of magnitude while supporting dynamic updates. Besides, we introduce a consistency-aware optimization strategy alongside a Semantic Stability Guidance mechanism to enhance long-term, cross-view semantic consistency and resolve inconsistencies. Experiments demonstrate that our system achieves high-fidelity rendering with scalable, open-set semantics across both controlled and in-the-wild environments, supporting applications like 3D localization and scene editing. These results mark an initial yet solid step towards high-quality, expressive, and accessible 3D virtual world modeling. Our code will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1842. The Expressive Limits of Diagonal SSMs for State-Tracking
๐ฌ 13
โญ 5.50
๐ learning theory
๐ท๏ธ state-space model SSM LRNN linear RNN expressivity complex dynamical system state-tracking semigroup group automata Krohn-Rhodes
State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly-parallelizable. However, the theoretical understanding of their expressive power remains limited.
In this work, we study the expressivity of input-Dependent Complex-valued Diagonal (DCD) State-Space Models (SSMs) on sequential state-tracking tasks for abstract groups. It is easy to show that a single DCD SSM layer with a universal decoder can track any Abelian group at finite precision by decomposing it into a product of cyclic groups. We show that this is tight by proving that such a model cannot track any non-Abelian group at finite precision.
We further establish the expressivity of multi-layer DCD SSMs. We show that a $k$-layer DCD SSM tracks a group if and only if that group has a subnormal series of length at most $k$, with Abelian factor groups. Empirically, while multi-layer models are theoretically expressive enough for solvable non-Abelian groups, we find they often fail to learn such solutions in practice, highlighting a gap between expressivity and learnability.
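The positive half of the dichotomy (a single diagonal layer can track Abelian groups) is easy to demonstrate in isolation: a complex-valued diagonal recurrence whose input-dependent transition is a root of unity tracks the running sum modulo $n$. This toy script is only an illustration consistent with the stated claim, not a reconstruction of the paper's construction or decoder:

```python
import numpy as np

def track_cyclic_group(inputs, n=5):
    """The scalar state h accumulates rotations exp(2*pi*i*x/n); reading off the
    angle of h recovers the running element of Z_n, i.e. the sum of inputs mod n."""
    h = 1.0 + 0.0j
    states = []
    for x in inputs:
        h *= np.exp(2j * np.pi * x / n)        # input-dependent diagonal transition
        states.append(int(round(np.angle(h) / (2 * np.pi / n))) % n)
    return states

seq = [3, 4, 2, 1, 4]
assert track_cyclic_group(seq) == list(np.cumsum(seq) % 5)
```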
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1843. Generalization of Diffusion Models Arises with a Balanced Representation Space
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ diffusion models representation learning generalization memorization denoising autoencoders
Diffusion models generate high-quality, diverse images with great generalizability, yet when overfit to the training objective, they may memorize training samples. We analyze memorization and generalization of diffusion models through the lens of representation learning. Using a two-layer ReLU denoising autoencoder (DAE) parameterization, we show that memorization corresponds to the model learning the raw data matrix for encoding and decoding, yielding spiky representations; in contrast, generalization arises when the model captures local data statistics, producing balanced representations. We validate these insights by investigating representation spaces in real-world unconditional and text-to-image diffusion models, where the same distinctions emerge. Practically, we propose a representation-based memorization detection method and a simple representation-steering method that enables controllable editing of generalized samples. Together, our results underscore that *learning good representations is central to novel and meaningful generation*.
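For readers unfamiliar with the analysis setting, a two-layer ReLU denoising autoencoder of the kind studied above can be written in a few lines; the dimensions and noise level below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal two-layer ReLU denoising autoencoder (DAE) parameterization.
class DAE(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.encoder = nn.Linear(dim, hidden, bias=False)
        self.decoder = nn.Linear(hidden, dim, bias=False)

    def forward(self, x_noisy):
        return self.decoder(torch.relu(self.encoder(x_noisy)))

dae = DAE(dim=784, hidden=256)
x = torch.randn(32, 784)
x_noisy = x + 0.1 * torch.randn_like(x)          # corrupt, then reconstruct
loss = ((dae(x_noisy) - x) ** 2).mean()          # denoising objective
```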
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1844. Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
๐ฌ 15
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large language models reward hacking alignment
Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization -- a key step in the alignment phase -- can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model's internal decision process and the generated explanation. Consequently, the LLM may engage in ``reward hacking'' by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM's input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model's decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1845. GCGNet: Graph-Consistent Generative Network for Time Series Forecasting with Exogenous Variables
๐ฌ 22
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ time series forecasting
Exogenous variables offer valuable supplementary information for predicting future endogenous variables. Forecasting with exogenous variables needs to consider both past-to-future dependencies (i.e., temporal correlations) and the influence of exogenous variables on endogenous variables (i.e., channel correlations). This is pivotal when future exogenous variables are available, because they may directly affect the future endogenous variables. Many methods have been proposed for time series forecasting with exogenous variables, focusing on modeling temporal and channel correlations. However, most of them use a two-step strategy, modeling temporal and channel correlations separately, which limits their ability to capture joint correlations across time and channels. Furthermore, in real-world scenarios, recorded time series are frequently affected by various forms of noises, underscoring the critical importance of robustness in such correlations modeling. To address these limitations, we propose GCGNet, a Graph-Consistent Generative Network for time series forecasting with exogenous variables. Specifically, GCGNet first employs a Variational Generator to produce coarse predictions. A Graph Discriminator then further guides it by evaluating the consistency between the generated and true correlations, where the correlations are represented as graphs, and are robust to noises. Finally, a Graph Refiner is proposed to refine the predictions to prevent degeneration and improve accuracy. Extensive experiments on 12 real-world datasets demonstrate that GCGNet outperforms state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1846. Post-Training Quantization for Video Matting
๐ฌ 9
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Post-Training Quantization Video Matting
Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, even reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model's ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8× FLOP savings.
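As background for the calibration discussed above, the snippet below sketches plain uniform affine quantization (quantize and dequantize with a per-tensor scale and zero-point). The global affine calibration adjusts parameters of this kind; the code is a generic illustration, not the paper's GAC procedure.

```python
import numpy as np

def quantize(x, bits=4):
    # Uniform affine quantization to integers in [0, 2^bits - 1].
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

w = np.random.randn(64, 64).astype(np.float32)   # toy weight tensor
q, s, z = quantize(w, bits=4)
w_hat = dequantize(q, s, z)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```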
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1847. TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM-as-a-Judge LLM Evaluation Large Language Models
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) \textit{Score-Comparison Inconsistency}, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) \textit{Pairwise Transitivity Inconsistency}, manifested through circular preference chains ($A\!>\!B\!>\!C\!>\!A$) and equivalence contradictions ($A\!=\!B\!=\!C\!\neq\!A$). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose \textbf{TrustJudge}, a probabilistic framework that addresses these limitations through two key innovations: 1) \textit{distribution-sensitive scoring} that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) \textit{likelihood-aware aggregation} that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43\% (from 23.32\% to 14.89\%) and Pairwise Transitivity inconsistency by 10.82\% (from 15.22\% to 4.40\%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations.
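The distribution-sensitive scoring idea can be illustrated in a few lines: rather than taking the single most likely rating, compute the expectation over the judge's rating distribution. The probabilities below are invented for illustration and are not from the paper.

```python
# Hypothetical judge rating distribution over a 1-5 scale.
ratings = [1, 2, 3, 4, 5]
probs   = [0.05, 0.10, 0.20, 0.40, 0.25]

# Continuous expected score instead of the argmax rating (which would be 4).
expected_score = sum(r * p for r, p in zip(ratings, probs))
print(expected_score)   # 3.70, preserving more information than a discrete rating
```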
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1848. Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Medical Imaging 3D Diffusion Model Diffusion Transformer CT Scan Medical Image Generation
Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1849. WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving
๐ฌ 15
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous Driving Video Generation
Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose \textbf{WorldSplat}, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that \textbf{WorldSplat} effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1850. ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks
๐ฌ 44
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ multi-modal red-teaming multi-modal alignment agent safety adversarial robustness
As vision-language models (VLMs) gain prominence, their multimodal interfaces also introduce new safety vulnerabilities, making the safety evaluation challenging and critical. Existing red-teaming efforts are either restricted to a narrow set of adversarial patterns or depend heavily on manual engineering, lacking scalable exploration of emerging real-world adversarial strategies. To bridge this gap, we propose ARMs, an adaptive red-teaming agent that systematically conducts comprehensive risk assessments for VLMs. Given a target harmful behavior or risk definition, ARMs automatically optimizes diverse red-teaming strategies with reasoning-enhanced multi-step orchestration, to effectively elicit harmful outputs from target VLMs. This is the first red teaming framework that provides controllable generation given risk definitions. We propose 11 novel multimodal attack strategies, covering diverse adversarial patterns of VLMs (e.g., reasoning hijacking, contextual cloaking), and integrate 17 red-teaming algorithms with ARMs. To balance the diversity and effectiveness of the attack, we design a layered memory with an epsilon-greedy attack algorithm. Extensive experiments on different instance-based benchmarks and policy-based safety evaluations show that ARMs achieves the state-of-the-art attack success rate (ASR), improving ASR by an average of 52.1% compared to existing baselines and even exceeding 90% ASR on Claude-4-Sonnet, a constitutionally-aligned model widely recognized for its robustness. We show that the diversity of red-teaming instances generated by ARMs is significantly higher, revealing emerging vulnerabilities in VLMs. Leveraging ARMs, we construct ARMs-Bench, a large-scale multimodal safety benchmark comprising 30K red-teaming instances spanning 51 diverse risk categories, grounded in both real-world multimodal threats and regulatory risks. Fine-tuning with ARMs-Bench substantially reduces ASR while preserving general utility of VLMs, providing actionable insights to improve multimodal safety alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1851. Video Unlearning via Low-Rank Refusal Vector
๐ฌ 15
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ video generation machine unlearning
Video generative models achieve high-quality synthesis from natural-language prompts by leveraging large-scale web data. However, this training paradigm inherently exposes them to unsafe biases and harmful concepts, introducing the risk of generating undesirable or illicit content. To mitigate unsafe generations, existing machine unlearning approaches either rely on filtering, and can therefore be bypassed, or they update model weights, but with costly fine-tuning or training-free closed-form edits. We propose the first training-free weight update framework for concept removal in video diffusion models.
From five paired safe/unsafe prompts, our method estimates a refusal vector and integrates it into the model weights as a closed-form update. A contrastive low-rank factorization further disentangles the target concept from unrelated semantics, ensuring selective concept suppression without harming generation quality. Our approach reduces unsafe generations on the Open-Sora and ZeroScopeT2V models across the T2VSafetyBench and SafeSora benchmarks, with average reductions of 36.3% and 58.2% respectively, while preserving prompt alignment and video quality. This establishes an efficient and scalable solution for safe video generation without retraining or inference overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1852. PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting
๐ฌ 19
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting; time series data; deep learning
While existing multivariate time series forecasting models have advanced significantly in modeling periodicity, they largely neglect the periodic heterogeneity common in real-world data, where variables exhibit distinct and dynamically changing periods. To effectively capture this periodic heterogeneity, we propose PHAT (Period Heterogeneity-Aware Transformer). Specifically, PHAT arranges multivariate inputs into a three-dimensional "periodic bucket" tensor, where the dimensions correspond to variable group characteristics with similar periodicity, time steps aligned by phase, and offsets within the period. By restricting interactions within buckets and masking cross-bucket connections, PHAT effectively avoids interference from inconsistent periods. We also propose a positive-negative attention mechanism, which captures periodic dependencies from two perspectives: periodic alignment and periodic deviation. Additionally, the periodic alignment attention scores are decomposed into positive and negative components, with a modulation term encoding periodic priors. This modulation constrains the attention mechanism to more faithfully reflect the underlying periodic trends. A mathematical explanation is provided to support this property. We evaluate PHAT comprehensively on 14 real-world datasets against 18 baselines, and the results show that it significantly outperforms existing methods, achieving highly competitive forecasting performance. Our source code is available at Anonymous GitHub.
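A minimal sketch of the "periodic bucket" arrangement for a single variable is shown below, reshaping a series so that rows are periods and columns are phase-aligned offsets; how PHAT groups variables by periodicity and masks cross-bucket attention is not reproduced here, and the period length is an illustrative assumption.

```python
import numpy as np

period = 24                                    # assumed period length (e.g., hourly data)
x = np.arange(96, dtype=float)                 # toy series spanning 4 full periods
bucket = x.reshape(-1, period)                 # rows: periods, columns: offsets within the period
print(bucket.shape)                            # (4, 24): phase-aligned slices of the series
```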
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1853. LEGACY: A Lightweight Dynamic Gradient Compression Strategy for Distributed Deep Learning
๐ฌ 10
โญ 5.50
๐ optimization
๐ท๏ธ Distributed Computing Compressed Communication Federated Learning
Distributed learning has achieved remarkable success in training deep neural networks (DNNs) on large datasets, but the communication bottleneck limits its scalability. Various compression techniques have been proposed to alleviate this limitation; however, they either use fixed parameters throughout training or rely on complex and computationally intensive methods to adapt compression parameters. Instead of the hard-to-tune hyperparameters required by adaptive compressors, this paper investigates the impact of two fundamental factors in DNN training, the layer size of the networks and their training phases, to design a simple yet efficient dynamic scheduler for any compressor, guiding the selection of compression parameters. We present a **L**ightweight **E**fficient **G**r**A**dient **C**ompression strateg**Y**, or LEGACY, which, in theory, can work with any compression technique to produce a simple dynamic counterpart. We benchmark LEGACY on distributed and federated training, involving seven different DNN architectures, ranging from ResNet, Transformer-XL, to GPT-2, across large and challenging datasets, including ImageNet, WikiText-103, and OpenWebText. On ImageNet-1K, with an equivalent average data volume, LEGACY's dynamic compression strategies improve the Top-1 accuracy of ResNet-50 by 7-11% compared to uniform Top-0.1% compression, while on WikiText-103, the layer-based dynamic strategy reduces the perplexity of Transformer-XL by ~26% relative to the same baseline. In addition, we evaluate LEGACY under constrained and federated settings, and demonstrate that it scales effectively to a 100-worker configuration while maintaining strong accuracy under aggressive compression. We publish anonymized code at: https://github.com/LEGACY-compression/LEGACY.
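To make the idea concrete, the sketch below shows layer-wise top-k gradient sparsification with a compression ratio that depends on layer size. The threshold and ratios are hypothetical illustrations; the actual LEGACY scheduler is specified in the paper.

```python
import torch

def topk_compress(grad, ratio):
    # Keep only the largest-magnitude fraction `ratio` of gradient entries.
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)

def ratio_for_layer(numel, small=0.05, large=0.001, threshold=100_000):
    # Hypothetical rule: compress large layers more aggressively than small ones.
    return large if numel > threshold else small

grads = {"embed": torch.randn(50_000, 128), "head": torch.randn(128, 10)}
compressed = {name: topk_compress(g, ratio_for_layer(g.numel()))
              for name, g in grads.items()}
```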
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1854. MASAM: Multimodal Adaptive Sharpness-Aware Minimization for Heterogeneous Data Fusion
๐ฌ 23
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal learning Modality imbalance Generalization.
Multimodal learning requires integrating heterogeneous modalities, such as structured records, visual imagery, and temporal signals. It has been revealed that this heterogeneity causes modality encoders to converge at different rates, making the multimodal learning imbalanced. We empirically observe that such an imbalance is related to the sharpness of the solution. Modality encoders that converge faster could be dragged into sharp regions due to inter-modal interference, degrading the generalization capability of unimodal features learned. Sharpness-Aware Minimization is effective in improving generalization via finding solutions in flat regions. However, its application in multimodal scenarios is challenging: 1) SAM pays excessive attention to the dominant modality, exacerbating modality imbalance, and 2) the perturbation gradient calculation is affected by interference from other modalities. To address these issues, we propose Multimodal Adaptive Sharpness-Aware Minimization (MASAM), which optimizes different modalities based on their dominance. We design an Adaptive Perturbation Score (APS) using convergence speed and gradient alignment to identify dominant modalities for SAM application. Our Modality-Decoupled Perturbation Scaling (MDPS) then reduces inter-modal interference during optimization, better aligning each modality with shared information. Extensive empirical evaluations on five multimodal datasets and six downstream tasks demonstrate that MASAM consistently attains flatter solutions, achieves balanced multimodal learning, and subsequently surpasses state-of-the-art methods across diverse datasets and tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1855. D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ embodied ai vision-language-action models inverse dynamics models
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection.
Desktop environments---particularly gaming---offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning.
We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks.
Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains.
Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152× compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation.
Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6\% success rate on LIBERO manipulation and 83.3\% on CANVAS navigation benchmarks.
This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics.
We will make all our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models. (Demo available at [link](https://www.notion.so/D2E-Scaling-Vision-Action-Pretraining-on-Desktop-Data-for-Transfer-to-Embodied-AI-279e81a6e92380b4a672d19c924494eb?source=copy_link))
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1856. Two (narrow) heads are better than (an arbitrarily wide) one
๐ฌ 14
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Impossibility Result Transformers Attention mechanism Graphs Theory Induction heads In-context learning
In this paper, we establish a dimension- and precision-independent impossibility result for a simplified transformer model. Due to their size, a comprehensive understanding of the internal operations of frontier large language models (LLMs) is beyond the reach of current methods, but research into small and interpretable models has proven successful. We study the representational limits of attention, the core of transformer models, through the lens of the Endpoint Selection Problem (ESP), a simple yet expressive learning task defined over arcs of a directed graph.
Our main theoretical results are twofold: (i) 1-head, 1-layer, attention-only transformers cannot solve ESP on any graph containing a cycle, even with unbounded dimension and precision, whereas all DAGs (Directed Acyclic Graphs) are solvable with zero error; (ii) in contrast, a 2-head, 1-layer, attention-only transformer can solve ESP on arbitrary directed graphs with constant embedding dimension and logarithmic precision. Prior lower bounds were conditional on bounds on dimension and precision. Through a transformation, we extend our impossibility result from ESP to the much-studied 2-hop induction head problem. Further, we uncover a surprising connection to NP-completeness by showing that the optimal error of the 1-head transformer is exactly related to the size of MAS (Maximum Acyclic Subgraph) and hence inapproximable.
Finally, we validate our theory with experiments and observe that gradient-based optimization can reliably find 1-head solutions for DAGs and 2-head solutions for arbitrary graphs with cycles, whereas 1-head models struggle to reach the optimal solution in graphs with cycles. We believe that our techniques are of independent interest and have the potential to establish a new fine-grained hierarchy of transformer architectures, each with greater problem-solving power than the last.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1857. Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems
๐ฌ 4
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large Reasoning Models Graph Algorithm Problems Large Language Models
Large Reasoning Models (LRMs) have advanced rapidly, yet existing benchmarks on mathematics, code, and common-sense reasoning remain limited: they lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well-suited for probing reasoning abilities: they demand long-context reasoning, allow fine-grained control of difficulty levels, and enable standardized programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply with longer context inputs, falling below 50% once graphs exceed 120 nodes, driven by frequent execution errors, weak memory, and redundant reasoning. Second, LRMs suffer from an "over-thinking" phenomenon, primarily driven by extensive yet largely ineffective self-verification, which inflates reasoning traces without improving correctness. By exposing these limitations, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and practically relevant testbed for advancing the study of reasoning in LRMs. Code is available at https://anonymous.4open.science/r/GrAlgoBench-7D17.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1858. CaTS: Calibrated Test-Time Scaling for Efficient LLM Inference
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test-time Scaling Model Calibration Efficient inference Language Modeling Scaling
Increasing test-time computation is a straightforward approach to enhancing the quality of responses in Large Language Models (LLMs). While Best-of-N sampling and Self-Consistency with majority voting are simple and effective, they require a fixed number of sampling responses for each query, regardless of its complexity. This could result in wasted computation for simpler questions and insufficient exploration for more challenging ones. In this work, we argue that model confidence of responses can be used for improving the efficiency of test-time scaling. Unfortunately, LLMs are known to be overconfident and provide unreliable confidence estimation. To address this limitation, we introduce Self-Calibration by distilling Self-Consistency-derived confidence into the model itself. This enables reliable confidence estimation at test time with one forward pass. We then design Calibrated Test-Time Scaling (CaTS), adapting common repeated sampling methods, such as self-consistency and Best-of-N, to handle queries of varying difficulty. We also show that CaTS-SC is provably better than vanilla self-consistency. Experiments on three LLMs across nine datasets demonstrate the effectiveness of our approach. Specifically, applying confidence-based Early Stopping (CaTS-ES) to Best-of-N improves MathQA accuracy from 73.7 to 83.6 with a sample budget of 16 responses, demonstrating the effectiveness of the confidence-based sampling strategy at inference time.
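The confidence-based early stopping component can be sketched generically as below; `generate_with_confidence` is a hypothetical stand-in for a self-calibrated model that returns an answer together with its confidence, and the budget and threshold are illustrative.

```python
def early_stopping_sample(generate_with_confidence, budget=16, threshold=0.9):
    """Sample repeatedly, but stop as soon as a sufficiently confident answer appears."""
    answers = []
    for _ in range(budget):
        answer, confidence = generate_with_confidence()
        answers.append((answer, confidence))
        if confidence >= threshold:        # easy query: stop early, saving compute
            break
    # Fall back to the most confident answer seen within the budget.
    return max(answers, key=lambda ac: ac[1])[0]
```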
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1859. RefineBench: Evaluating Refinement Capability in Language Models
๐ฌ 14
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Refinement Large Language Model Checklist
Can language models (LMs) self-refine their own responses? This question is increasingly relevant as more than 10% of real-world user interactions involve refinement requests (see Appendix G). Yet prior studies have largely tested LMs on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback about what went wrong. The recent advent of reasoning models that exhibit self-reflection patterns in their chain-of-thought further motivates this question. To address it, we introduce RefineBench, a benchmark of 1,002 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-4.1 achieve modest baseline scores of 31.1 and 23.4, respectively, and most models fail to consistently improve across iterations (e.g., Gemini-2.5-Pro gains only +1.8%, while DeepSeek-R1 declines by 0.2%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine effectively when their initial responses are incorrect, and that RefineBench provides a valuable testbed for tracking progress.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1860. Soft Tokens, Hard Truths
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ reinforcement learning large language models math reasoning latent reasoning soft thinking continuous tokens reasoning
The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens.
This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs match discrete-token CoTs for pass@$1$ and surpass them for pass@$32$, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
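A minimal sketch of a "soft" token, a probability-weighted mixture of token embeddings with input-embedding noise for exploration, is given below. Shapes, the random embedding table, and the noise scale are illustrative assumptions, not the paper's settings.

```python
import torch

vocab_size, d_model = 32_000, 4_096
embedding = torch.randn(vocab_size, d_model)          # stand-in for the model's embedding table
logits = torch.randn(vocab_size)                      # stand-in for next-token logits

probs = torch.softmax(logits, dim=-1)
soft_token = probs @ embedding                        # mixture of token embeddings (the "soft" token)
soft_token = soft_token + 0.05 * torch.randn(d_model) # input-embedding noise for RL exploration
```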
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1861. Singleton-Optimized Conformal Prediction
๐ฌ 28
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Conformal Prediction; Uncertainty Quantification
Conformal prediction can be used to construct prediction sets that cover the true outcome with a desired probability, but can sometimes lead to large prediction sets that are costly in practice. The most useful outcome is a singleton prediction---an unambiguous decision---yet existing efficiency-oriented methods primarily optimize average set size. Motivated by this, we propose a new non-conformity score designed to minimize the probability of producing non-singleton sets while maintaining coverage. Starting from a non-convex constrained optimization problem as a motivation, we provide a convex-geometric reformulation and associated algorithm for computing the non-conformity score and associated split conformal prediction sets in $O(K)$ time for $K$-class problems. Using this score in split conformal prediction, we introduce Singleton-Optimized Conformal Prediction (SOCOP). We evaluate our method in experiments on image classification and LLM multiple-choice answering, comparing with standard non-conformity scores such as the (negative) label probability estimates and their cumulative distribution function, both of which aim to optimize average length. The results show that SOCOP increases singleton frequency (sometimes by over 20\%) compared to the above scores, with minimal impact on average set size.
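For context, the snippet below sketches standard split conformal prediction with a generic non-conformity score (one minus the predicted probability of the true label); SOCOP swaps in its singleton-optimized score, which is not reproduced here, and the toy data is random.

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # Calibration scores: 1 - probability assigned to the true label.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample correction, capped at 1.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    # Prediction set: all classes whose score falls below the quantile.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=100)
cal_labels = rng.integers(0, 5, size=100)
test_probs = rng.dirichlet(np.ones(5), size=3)
print(split_conformal_sets(cal_probs, cal_labels, test_probs))
```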
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1862. DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
๐ฌ 14
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ LLM Dataset Benchmark Data Science
The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83× and reinforcement learning boosts Qwen3-4B's accuracy by more than 8×. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1863. Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
๐ฌ 21
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ hyperbolic geometry modality alignment multimodal learning
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1864. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
๐ฌ 10
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Spatial Reasoning Vision-Language Models Benchmark
Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks cover only the most elementary layer of spatial reasoning and are largely approaching saturation in the latest reasoning models. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through careful manual annotation, we construct over 8.4K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs exhibit significant limitations in comprehensive spatial reasoning. We also explore two strategies, PointGraph (explicit scene graph cues) and SpatialCoT (novel-view chain-of-thought), to bolster spatial reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1865. Self-Supervised Learning from Structural Invariance
๐ฌ 24
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Self-supervised learning representation learning disentanglement
Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs.
We study the one-to-many mapping problem in SSL,
where each datum may be mapped to multiple valid targets.
This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames.
We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a variational distribution that models this uncertainty in the latent space, and derive a lower bound on the pairwise mutual information. We also propose a simpler variant of the same idea using sparsity regularization.
Our model, AdaSSL, applies to both contrastive and predictive SSL methods, and we empirically show its versatility on identifiability, generalization, fine-grained image understanding, and world modeling on videos.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1866. MATHMO: Automated Mathematical Modeling Through Adaptive Search
๐ฌ 12
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ automated modeling autoformulation
Mathematical modeling is the process of understanding and predicting complex real-world phenomena.
Traditionally, it is a time-intensive effort reliant on deep human expertise and iterative refinement.
Automating this intricate process, therefore, offers the potential to significantly accelerate discovery and broaden the application of mathematical modeling across diverse domains. Such automation, however, must address inherent challenges, including fundamental modeling uncertainty, balancing multiple conflicting objectives, and incorporating subjective qualities into assessing model utility.
We approach this by conceptualizing mathematical modeling as a sequential decision-making problem under uncertainty.
In response, we introduce $\texttt{MATHMO}$, a novel adaptive search method designed to automatically navigate the complex decisions in selecting mathematical frameworks, specifying model formulations, and defining algorithmic procedures.
Specifically, $\texttt{MATHMO}$ employs a principled bi-level search strategy---combining high-level exploration across diverse frameworks and local intra-framework model refinements---leveraging Large Language Models for exploration, surrogate evaluations, and incorporating subjective preferences into the automated process. We demonstrate $\texttt{MATHMO}$'s efficacy on diverse real-world tasks, where it successfully discovers Pareto-efficient frontiers of models that balance varied objectives, including subjective criteria.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1867. Dual-Kernel Adapter: Expanding Spatial Horizons for Data-Constrained Medical Image Analysis
๐ฌ 28
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Adapter; Medical Image Analysis; Data-Limited Training;
Adapters have become a widely adopted strategy for efficient fine-tuning of foundation models, particularly in resource-constrained settings. However, their performance under extreme data scarcity, which is common in medical imaging due to high annotation costs, privacy regulations, and fragmented datasets, remains underexplored. In this work, we present the first comprehensive study of adapter-based fine-tuning for vision foundation models in low-data medical imaging scenarios. We find that, contrary to their promise, conventional Adapters can degrade performance under severe data constraints, performing even worse than simple linear probing when trained on less than 1\% of the corresponding training data. Through systematic analysis, we identify a sharp reduction in Effective Receptive Field (ERF) as a key factor behind this degradation. Motivated by these findings, we propose the Dual-Kernel Adapter (DKA), a lightweight module that expands spatial context via large-kernel convolutions while preserving local detail with small-kernel counterparts. Extensive experiments across diverse classification and segmentation benchmarks show that DKA significantly outperforms existing Adapter methods, establishing new leading results in both data-constrained and data-rich regimes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1868. DiffPBR: Point-Based Rendering via Spatial-Aware Residual Diffusion
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Point-based graphics Novel view synthesis Neural rendering
Neural radiance fields and 3D Gaussian splatting (3DGS) have significantly advanced 3D reconstruction and novel view synthesis (NVS). Yet, achieving high-fidelity and view-consistent renderings directly from point clouds---without costly per-scene optimization---remains a core challenge. In this work, we present DiffPBR, a diffusion-based framework that synthesizes coherent, photorealistic renderings from diverse point cloud inputs. We demonstrate that diffusion models, when guided by viewpoint-projected noise explicitly constrained by scene geometry and visibility, naturally enforce geometric consistency across camera motion. To achieve this, we first introduce adaptive CoNo-Splatting, a technique for fast and faithful rasterization that ensures efficient and effective handling of point clouds. Secondly, we integrate residual learning into the neural re-rendering pipeline, which improves convergence, generalization, and visual quality across diverse rendering tasks. Extensive experiments show that our method outperforms existing baselines with an improvement of **3~5dB** in rendered image quality, a reduction from **41 to 8** in GPU hours for training, and an increase from **3.6fps to 10fps** (our one-step variant) in rendering speed.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1869. PIRN: Prototypical-based Intra-modal Reconstruction with Normality Communication for Multi-modal Anomaly Detection.
๐ฌ 15
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ anomaly detection
Unsupervised multimodal anomaly detection (MAD), which identifies defects by jointly analyzing RGB images and 3D data, is crucial for quality control in manufacturing. However, existing MAD methods struggle when only a few normal samples are available. Cross-modal alignment models fail to learn stable correspondences with scarce training data, and memory-based approaches misclassify any unseen normal variation as anomalous. To address the few-shot challenge, we introduce PIRN (Prototypical-based Intra-modal Reconstruction with Normality Communication for Multi-modal Anomaly Detection), a prototype-based intra-modal reconstruction framework with explicit cross-modal knowledge transfer. PIRN features three key innovations: (1) Balanced Prototype Assignment (BPA) formulates token-to-prototype routing as a balanced optimal-transport problem, guaranteeing uniform utilisation of all prototypes and preventing codebook collapse. (2) Adaptive Prototype Refinement (APR) treats prototypes as adaptive memory and updates them on-the-fly with gated GRU cells driven by optimally-matched image context, expanding coverage to unseen yet normal variations while suppressing anomalies. (3) Multi-modal Normality Communication (MNC) exchanges complementary normal cues across modalities via gated cross-attention. MNC enables one modality to reconstruct its feature map not only from its own prototypes, but also with high-level normal patterns provided by the other modality. Extensive experiments on standard benchmarks demonstrate that PIRN significantly outperforms prior methods, achieving new state-of-the-art results, especially in challenging few-shot scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1870. CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting
๐ฌ 12
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Multivariate Time Series Forecasting Channel Permutation Invariance Spatio-temporal Decoupling Meta-Learning Foundation Models
Current methods for multivariate time series forecasting can be classified into channel-dependent and channel-independent models. Channel-dependent models learn cross-channel features but often overfit the channel ordering, which hampers adaptation when channels are added or reordered. Channel-independent models treat each channel in isolation to increase flexibility, yet this neglects inter-channel dependencies and limits performance. To address these limitations, we propose CPiRi, a channel permutation invariant (CPI) framework that infers cross-channel structure from data rather than memorizing a fixed ordering, enabling deployment in settings with structural and distributional co-drift without retraining. CPiRi couples a spatio-temporal decoupling architecture with a permutation-invariant regularization training strategy: a frozen pretrained temporal encoder extracts high-quality temporal features, a lightweight spatial module learns content-driven inter-channel relations, while a channel shuffling strategy enforces CPI during training. We further ground CPiRi in theory by analyzing permutation equivariance in multivariate time series forecasting. Experiments on multiple benchmarks show state-of-the-art results. CPiRi remains stable when channel orders are shuffled and exhibits strong inductive generalization to unseen channels even when trained on only half of the channels, while maintaining practical efficiency on large-scale datasets. The source code and models will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1871. Wavelet Predictive Representations for Non-Stationary Reinforcement Learning
๐ฌ 13
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Non-Stationary Reinforcement Learning Representation Learning
The real world is inherently non-stationary, with ever-changing factors, such as weather conditions and traffic flows, making it challenging for agents to adapt to varying environmental dynamics. Non-Stationary Reinforcement Learning (NSRL) addresses this challenge by training agents to adapt rapidly to sequences of distinct Markov Decision Processes (MDPs). However, existing NSRL approaches often focus on tasks with regularly evolving patterns, leading to limited adaptability in highly dynamic settings. Inspired by the success of Wavelet analysis in time series modeling, specifically its ability to capture signal trends at multiple scales, we propose WISDOM to leverage wavelet-domain predictive task representations to enhance NSRL. WISDOM captures these multi-scale features in evolving MDP sequences by transforming task representation sequences into the wavelet domain, where wavelet coefficients represent both global trends and fine-grained variations of non-stationary changes. In addition to the auto-regressive modeling commonly employed in time series forecasting, we devise a wavelet temporal difference (TD) update operator to enhance tracking and prediction of MDP evolution. We theoretically prove the convergence of this operator and demonstrate policy improvement with wavelet task representations. Experiments on diverse benchmarks show that WISDOM significantly outperforms existing baselines in both sample efficiency and asymptotic performance, demonstrating its remarkable adaptability in complex environments characterized by non-stationary and stochastically evolving tasks.
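As a small illustration of the multi-scale view the abstract relies on, one level of a Haar wavelet transform splits a sequence into a low-frequency trend and high-frequency variations; the toy sequence below stands in for a task-representation trajectory, and the wavelet TD operator itself is not reproduced here.

```python
import numpy as np

def haar_level(x):
    # One level of the Haar discrete wavelet transform (even-length input assumed).
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency coefficients: global trend
    detail = (even - odd) / np.sqrt(2)   # high-frequency coefficients: fine-grained variation
    return approx, detail

reps = np.random.randn(16)               # toy 1-D sequence of task representations
approx, detail = haar_level(reps)
print(approx.shape, detail.shape)        # (8,) (8,)
```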
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1872. From Embedding to Control: Representations for Stochastic Multi-Object Systems
๐ฌ 22
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embedding to Control Controllable Embedding Graph Representation for Linear Control
This paper studies how to achieve accurate modeling and effective control in stochastic nonlinear dynamics with multiple interacting objects. However, non-uniform interactions and random topologies make this task challenging. We address these challenges by proposing Graph Controllable Embeddings (GCE), a general framework to learn stochastic multi-object dynamics for linear control. Specifically, GCE is built on Hilbert space embeddings, allowing direct embedding of probability distributions of controlled stochastic dynamics into a reproducing kernel Hilbert space (RKHS), which enables linear operations in its RKHS while retaining nonlinear expressiveness. We provide theoretical guarantees on the existence, convergence, and applicability of GCE. Notably, a mean field approximation technique is adopted to efficiently capture inter-object dependencies and achieves provably low sample complexity. By integrating graph neural networks, we construct data-dependent kernel features which are capable of adapting to dynamic interaction patterns and generalizing to even unseen topologies with only limited training instances. GCE scales seamlessly to multi-object systems of varying sizes and topologies. Leveraging the linearity of Hilbert spaces, GCE also supports simple yet effective control algorithms for synthesizing optimal sequences. Experiments on physical systems, robotics, and power grids validate GCE and demonstrate consistent performance improvement over various competitive embedding methods in both in-distribution and few-shot tests.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1873. GuardAlign: Robust Safety Alignment in Multimodal Large Language Models
๐ฌ 5
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ MLLM VLMs Safety Alignment
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose **GuardAlign**, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1874. DRBench: A Realistic Benchmark for Enterprise Deep Research
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Benchmark deep research reasoning enterprise insight recall factuality heterogeneous data persona-grounded tasks multi-domain evaluation scalable data synthesis Docker AI agent LLM
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 100 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1875. Guaranteed Simply Connected Mesh Reconstruction from an Unorganized Point Cloud
๐ฌ 14
โญ 5.50
๐ optimization
๐ท๏ธ 3D Reconstruction spectral techniques topological guarantee Laplacian Hodge decomposition
We introduce an approach that reconstructs a closed surface mesh from a noisy point cloud, where the topology of the surface is guaranteed to be simply connected, i.e., homeomorphic to a topological 2-sphere. This task enjoys a wide range of applications, e.g., 3D organ and vessel reconstruction from CT scans. Central to our approach is a robust module that takes a collection of oriented triangles in a 3D triangulation as input and outputs a simply connected volumetric mesh whose boundary approximates the input triangles. Starting from a 3D Delaunay triangulation of the input point cloud and initial triangle orientations obtained through a spectral approach, our approach alternates between applying the module to obtain a reconstruction and using that reconstruction to reorient the input triangles. Experimental results on real and synthetic datasets demonstrate the effectiveness of our approach.
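The 3D Delaunay triangulation that the approach starts from can be obtained with standard tooling, as sketched below on a toy point cloud; the triangle orientation and simply-connectedness machinery is the paper's contribution and is not reproduced here.

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.random.rand(200, 3)          # toy noisy point cloud in the unit cube
tri = Delaunay(points)                   # 3D Delaunay triangulation
tetrahedra = tri.simplices               # vertex indices of each tetrahedral cell
print(tetrahedra.shape)                  # (num_cells, 4)
```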
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1876. Information Shapes Koopman Representation
๐ฌ 33
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Koopman Operator Latent subspace reconstruction representation for physical systems
The Koopman operator provides a powerful framework for modeling dynamical systems and has attracted growing interest from the machine learning community. However, its infinite-dimensional nature makes identifying suitable finite-dimensional subspaces challenging, especially for deep architectures. We argue that these difficulties come from suboptimal representation learning, where latent variables fail to balance expressivity and simplicity. This tension is closely related to the information bottleneck (IB) dilemma: constructing compressed representations that are both compact and predictive. Rethinking Koopman learning through this lens, we demonstrate that latent mutual information promotes simplicity, yet an overemphasis on simplicity may cause the latent space to collapse onto a few dominant modes. In contrast, expressiveness is sustained by the von Neumann entropy, which prevents such collapse and encourages mode diversity. This insight leads us to propose an information-theoretic Lagrangian formulation that explicitly balances this tradeoff. Furthermore, we propose a new algorithm based on the Lagrangian formulation that encourages both simplicity and expressiveness, leading to a stable and interpretable Koopman representation. Beyond quantitative evaluations, we further visualize the learned manifolds under our representations, observing empirical results consistent with our theoretical predictions. Finally, we validate our approach across a diverse range of dynamical systems, demonstrating improved performance over existing Koopman learning methods.
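The abstract does not give the objective explicitly; one plausible IB-style Lagrangian consistent with the description (Koopman prediction loss, a mutual-information compression term, and a von Neumann entropy term rewarding mode diversity) might look as follows, with $\beta$, $\gamma$, $\psi$, and $K$ as generic placeholders rather than the paper's notation:

$$
\mathcal{L}(\theta) \;=\; \underbrace{\mathbb{E}\left\|x_{t+1} - \psi^{-1}\!\big(K\,\psi(x_t)\big)\right\|^{2}}_{\text{Koopman prediction}} \;+\; \beta\, I(x;\,z) \;-\; \gamma\, S(\rho_z), \qquad S(\rho_z) \;=\; -\operatorname{tr}\!\big(\rho_z \log \rho_z\big),
$$

where $z=\psi(x)$ is the latent code and $\rho_z$ is a normalized density (second-moment) matrix of the latent representations.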
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1877. AlphaSAGE: Structure-Aware Alpha Mining via GFlowNets for Robust Exploration
๐ฌ 15
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Alpha Mining Generative Flow Networks
The automated mining of predictive signals, or alphas, is a central challenge in quantitative finance. While Reinforcement Learning (RL) has emerged as a promising paradigm for generating formulaic alphas, existing frameworks are fundamentally hampered by a triad of interconnected issues. First, they suffer from reward sparsity, where meaningful feedback is only available upon the completion of a full formula, leading to inefficient and unstable exploration. Second, they rely on semantically inadequate sequential representations of mathematical expressions, failing to capture the structure that determines an alpha's behavior. Third, the standard RL objective of maximizing expected returns inherently drives policies towards a single optimal mode, directly contradicting the practical need for a diverse portfolio of non-correlated alphas. To overcome these challenges, we introduce **AlphaSAGE** (**S**tructure-**A**ware Alpha Mining via **G**enerative Flow Networks for Robust **E**xploration), a novel framework built upon three cornerstone innovations: (1) a structure-aware encoder based on Relational Graph Convolutional Network (RGCN); (2) a new framework with Generative Flow Networks (GFlowNets); and (3) a dense, multi-faceted reward structure. Empirical results demonstrate that AlphaSAGE outperforms existing baselines in mining a more diverse, novel, and highly predictive portfolio of alphas, thereby proposing a new paradigm for automated alpha mining. Our code is available at https://anonymous.4open.science/r/AlphaSAGE-3BA9.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1878. Thought Branches: Interpreting LLM Reasoning Requires Resampling
๐ฌ 4
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic interpretability interpretability reasoning models thinking models chain-of-thought on-policy causal interventions agentic misalignment unfaithfulness inference-time scaling test-time compute
We argue that interpreting reasoning models from a single chain-of-thought (CoT) is fundamentally inadequate. To understand computation and causal influence, one must study reasoning as a distribution of possible trajectories elicited by a given prompt. We approximate this distribution via on-policy resampling and use it to answer concrete questions about the causes of model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In agentic misalignment scenarios where models seemingly blackmail to preserve themselves, we resample specific sentences to measure their downstream effects. We find that normative self-preservation sentences have unusually small and non-resilient causal impact on the final decision across models, indicating they are not a meaningful driver of blackmail. Second, are handwritten edits to CoT sufficient for steering reasoning? We find that off-policy sentence insertions common in earlier literature yield small and unstable effects in decision-making tasks, whereas on-policy resampling produces larger and more consistent effects. Third, how do we attribute causal influence when models modify their plans or correct prior errors during reasoning? We introduce a resilience metric and counterfactual importance that repeatedly resample to remove sentences such that similar content doesn't reappear downstream. Critical planning statements resist removal but have large effects when successfully eliminated. Fourth, what can our methods, which focus on the mechanistic roles of CoT, teach us about unfaithful reasoning? Adapting causal mediation analysis, we edit hint pathways mid-trajectory and find that prompt hints exert smooth and cumulative influences rather than single-step pivots. Hidden information can influence the trajectory of reasoning by shifting what decisions are made at different junctures in a CoT, and these biases can be modeled and quantified with resampling. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled guidance for CoT interventions.
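A minimal sketch of the resampling-based causal measurement described above: compare the rate of a downstream decision when continuations are sampled with versus without a target sentence in the prefix. The `sample_completion` and `extract_decision` helpers are hypothetical stand-ins for model sampling and decision parsing, not the authors' tooling.

```python
# Minimal sketch of an on-policy resampling estimate of a sentence's causal effect.
# `sample_completion(prefix)` draws one continuation; `extract_decision(text)` returns
# True if the final decision of interest (e.g., blackmail) appears. Both are hypothetical.
def resampled_effect(prefix, sentence, sample_completion, extract_decision, n=64):
    with_rate = sum(
        extract_decision(sample_completion(prefix + sentence)) for _ in range(n)) / n
    without_rate = sum(
        extract_decision(sample_completion(prefix)) for _ in range(n)) / n
    return with_rate - without_rate     # >0: the sentence pushes toward the decision
```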
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1879. Fair Classification by Direct Intervention on Operating Characteristics
๐ฌ 29
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ algorithmic fairness; post-processing; linear-fractional constraints; minimal interventions; constrained optimization
We develop new classifiers under group fairness in the attribute-aware setting for binary classification with multiple group fairness constraints (e.g., demographic parity (DP), equalized odds (EO), and predictive parity (PP)).
We propose a novel approach based on directly intervening on the operating characteristics of a pre-trained base classifier, by:
(i) identifying optimal operating characteristics using the base classifier's group-wise ROC convex hulls;
(ii) post-processing the base classifier to match those targets.
As practical post-processors,
we consider randomizing a mixture of group-wise thresholding rules subject to minimizing the expected number of interventions.
We further extend our approach to handle multiple protected attributes and multiple linear fractional constraints.
On standard datasets (COMPAS and ACSIncome),
our method simultaneously
satisfies approximate DP, EO, and PP with few interventions and a nearly optimal drop in accuracy, and compares favorably to previous methods.
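As a toy illustration of the group-wise thresholding ingredient mentioned above, here is a randomized per-group threshold rule whose expected acceptance rate matches a target (an approximate demographic-parity post-processor). The ROC-hull targeting and intervention-minimization of the actual method are not reproduced; names and the target rate are assumptions.

```python
# Minimal sketch: randomized group-wise thresholding to hit a target positive rate per group.
import numpy as np

def randomized_group_threshold(scores, target_rate, rng):
    """Return binary decisions whose expected positive rate equals target_rate."""
    order = np.argsort(-scores)                    # highest scores first
    k = target_rate * len(scores)
    k_floor = int(np.floor(k))
    decisions = np.zeros(len(scores), dtype=int)
    decisions[order[:k_floor]] = 1                 # accept the top floor(k) outright
    if k_floor < len(scores) and rng.uniform() < k - k_floor:
        decisions[order[k_floor]] = 1              # randomize on the boundary example
    return decisions

rng = np.random.default_rng(0)
scores_by_group = {"A": rng.uniform(size=1000), "B": rng.beta(2, 5, size=1000)}
decisions = {g: randomized_group_threshold(s, target_rate=0.3, rng=rng)
             for g, s in scores_by_group.items()}
print({g: d.mean() for g, d in decisions.items()})   # both groups close to 0.30
```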
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1880. KV-Cache Transform Coding for Compact Storage in LLM Inference
๐ฌ 26
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ transformer kv cache compression
Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20x compression, while maintaining reasoning and long-context accuracy in Llama 3.1, Mistral-NeMo, and R1-Qwen 2.5. Across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH500, KVTC consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, delivering substantially higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
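A minimal sketch of the transform-coding idea (PCA decorrelation followed by uniform quantization) on synthetic KV vectors. Calibration, bit allocation, and the entropy-coding stage are simplified assumptions, not KVTC's implementation.

```python
# Minimal sketch: PCA-decorrelate KV vectors, then quantize the coefficients uniformly.
import numpy as np

def fit_pca(calib, k):
    """Fit a k-component PCA basis on calibration KV vectors of shape (N, d)."""
    mean = calib.mean(axis=0)
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt[:k]                                   # (d,), (k, d)

def compress(kv, mean, basis, n_bits=4):
    """Project KV vectors onto the PCA basis and quantize coefficients uniformly."""
    coeffs = (kv - mean) @ basis.T                        # decorrelated coefficients (N, k)
    scale = np.abs(coeffs).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(coeffs / scale).astype(np.int8)          # symmetric uniform quantizer
    return q, scale                                       # q would then go to an entropy coder

def decompress(q, scale, mean, basis):
    return (q.astype(np.float32) * scale) @ basis + mean

rng = np.random.default_rng(0)
calib = rng.normal(size=(2048, 128)).astype(np.float32)   # calibration KV vectors
kv = rng.normal(size=(512, 128)).astype(np.float32)       # cache to be stored
mean, basis = fit_pca(calib, k=64)
q, s = compress(kv, mean, basis)
print(np.mean((kv - decompress(q, s, mean, basis)) ** 2)) # reconstruction error
```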
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1881. The Intricate Dance of Prompt Complexity, Quality, Diversity and Consistency in T2I Models
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ text-to-image models prompt complexity synthetic data
Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data. Combining advanced guidance interventions with prompt expansion results in the most appealing utility trade-offs of synthetic data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1882. A2ASecBench: A Protocol-Aware Security Benchmark for Agent-to-Agent Multi-Agent Systems
๐ฌ 15
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Agent-to-agent protocol multi-agent systems security benchmark
Multi-agent systems (MAS) built on large language models (LLMs) increasingly rely on agent-to-agent (A2A) protocols to enable capability discovery, task orchestration, and artifact exchange across heterogeneous stacks. While these protocols promise interoperability, they also introduce new vulnerabilities. In this paper, we present the first comprehensive security evaluation of A2A-MAS. We develop a taxonomy and threat model that categorize risks into supply-chain manipulations and protocol-logic weaknesses, and we detail six concrete attacks spanning all A2A stages and components with impacts on confidentiality, integrity, and availability. Building on this taxonomy, we introduce A2ASecBench, the first A2A-specific security benchmark framework capable of probing diverse and previously unexplored attack vectors. Our framework incorporates a dynamic adapter layer for deployment across heterogeneous agent stacks and downstream workloads, alongside a joint safetyโutility evaluation methodology that explicitly measures the trade-off between harmlessness and helpfulness by pairing adversarial trials with benign tasks. We empirically validate our framework using official A2A Project demos across three representative high-stakes domains (travel, healthcare, and finance), demonstrating that the identified attacks are both pervasive and highly effective, consistently bypassing default safeguards. These findings highlight the urgent need for protocol-level defenses and standardized benchmarking to secure the next generation of agentic ecosystems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1883. Fastcar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Video Generation Efficient Video Generation Auto-Regressive Video Generation
Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency.
Unlike image generation,
video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during decoding.
We first make two key observations: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there is high temporal redundancy in the MLP outputs of adjacent frames.
With these insights, we propose **FastCar** to accelerate the decode phase of AR video generation by exploiting this temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations), with detailed theoretical analysis and justification.
Furthermore, we develop a hardware accelerator on FPGA with Dynamic Resource Scheduling based on TAS to enable better resource utilization and faster inference.
Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge.
Furthermore, by combining FastCar and sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation.
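A toy sketch of the replay decision above: reuse the previous frame's cached MLP output when adjacent hidden states are similar enough. Plain cosine similarity stands in for the paper's Temporal Attention Score, and the threshold is an arbitrary assumption.

```python
# Minimal sketch of a cache-replay decision for AR video decoding.
import numpy as np

def maybe_replay(hidden, prev_hidden, cached_mlp_out, mlp_fn, tau=0.95):
    """Return the MLP output, reusing the cache if adjacent frames are similar enough."""
    score = float(hidden @ prev_hidden /
                  (np.linalg.norm(hidden) * np.linalg.norm(prev_hidden) + 1e-8))
    if cached_mlp_out is not None and score >= tau:
        return cached_mlp_out, True          # replay: skip the MLP entirely
    return mlp_fn(hidden), False             # recompute; the caller refreshes the cache
```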
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1884. Uniform Discrete Diffusion with Metric Path for Video Generation
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Text-to-Video Generation Discrete-valued Space
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform Discrete diffusion
with Metric path (UDM), a simple yet powerful framework that bridges the gap with continuous methods and enables scalable
video generation. At its core, UDM formulates video synthesis as iterative refinement over discrete spatio-temporal tokens.
It is based on two key designs: a Linearized Metric-Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable UDM to scale efficiently to high-resolution image synthesis and long-duration video generation (up to 32k tokens), while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies multiple tasks, including video interpolation and image-to-video synthesis, within a single model. Extensive experiments on challenging video and image generation benchmarks show that UDM consistently outperforms prior discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1885. 3DCS: Datasets and Benchmark for Evaluating Conformational Sensitivity in Molecular Representations
๐ฌ 20
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Molecule Benchmark AI for Science
Molecular representations (MRs) that capture 3D conformations are critical for applications such as reaction prediction, drug design, and material discovery. Yet despite the rapid development of molecular representation models, there is no comprehensive benchmark to evaluate their treatment of 3D conformational information.
We introduce 3DCS, the first benchmark for 3D Conformational Sensitivity in MRs. 3DCS evaluates whether representations within the same molecule (i) preserve geometric variation, (ii) capture chirality, and (iii) reflect the energy landscape. To enable this, we curate three large-scale datasets ($>$1M molecules, $\sim$10M conformers) spanning relaxed torsional scans, chiral drug candidates, and AIMD trajectories, and propose a unified Geometry-Chirality-Energy (GCE) evaluation framework.
Empirical analysis reveals that while modern data-driven MRs are highly geometry-sensitive, they inconsistently handle chirality and poorly align with energy, which is often overlooked. 3DCS thus provides the first rigorous benchmark for developing physically grounded, functionally reliable 3D molecular representations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1886. Preference-based Policy Optimization from Sparse-reward Offline Dataset
๐ฌ 12
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Offline Reinforcement Learning Preference-based Reinforcement Learning
Offline reinforcement learning (RL) holds the promise of training effective policies from static datasets without the need for costly online interactions. However, offline RL faces key limitations, most notably the challenge of generalizing to unseen or infrequently encountered state-action pairs. When a value function is learned from limited data in sparse-reward environments, it can become overly optimistic about parts of the space that are poorly represented, leading to unreliable value estimates and degraded policy quality. To address these challenges, we introduce a novel approach based on contrastive preference learning that bypasses direct value function estimation. Our method trains policies by contrasting successful demonstrations with failure behaviors present in the dataset, as well as synthetic behaviors generated outside the support of the dataset distribution. This contrastive formulation mitigates overestimation bias and improves robustness in offline learning. Empirical results on challenging sparse-reward offline RL benchmarks show that our method substantially outperforms existing state-of-the-art baselines in both learning efficiency and final performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1887. VeriRole: Verifiable Role-Awareness through Hint-Guided Reinforcement Learning
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ role-playing character agent RPCA LLM
Maintaining role-awareness in Role-Playing Conversational Agents (RPCAs) is a significant challenge, largely because the creative nature of role-playing makes it difficult to design verifiable reward signals for reinforcement learning (RL). To address this, we propose VeriRole, a new framework designed to enhance the role-awareness of agents through a structured, verifiable reasoning process. The core of our framework is a 'hint' mechanism, designed to first extract deterministic cues from the context, before the main response generation. Building on these hints, we introduce a Verifiable Role-Awareness Reward (VRAR) to provide a verifiable signal for role-awareness. Experimental results demonstrate the effectiveness of our approach. Our Qwen2.5-32B model, optimized with VeriRole, achieves an 18.9% and 4.55% increase in average scores on the RAIDEN and CharacterEval benchmarks, respectively. These results confirm that VeriRole can effectively quantify and improve role-awareness, leading to superior persona consistency and robustness. To ensure reproducibility, all prompts are detailed in the Appendix, and the associated training data will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1888. PathChat-SegR1: Reasoning Segmentation in Pathology via SO-GRPO
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Clinical Reasoning Reinforcement Learning Reasoning Segmentation
Segmentation in pathology images requires handling out-of-domain tissue morphologies and new pathologies beyond training distributions, where traditional closed-set segmentation approaches fail to generalize. Reasoning segmentation enables zero-shot generalization via prompting with text queries. However, existing reasoning segmentation models face three barriers when applied to pathology: (1) the vision encoder lacks pathology-specific knowledge and robustness to staining variations, (2) the large language model (LLM) backbone for reasoning fails to identify whether it has gathered sufficient semantic context to trigger the segmentation output, and (3) no reasoning segmentation benchmarks and datasets exist for pathology analysis. Consequently, we introduce PathChat-SegR1, a reasoning segmentation model built upon pathology-specific vision encoders trained with a novel stain-invariant self-distillation for robust pathology image representations. Moreover, we propose Segmentation-Optimized GRPO (SO-GRPO), a reinforcement learning method specifically for reasoning segmentation that learns to determine optimal segmentation timing based on accumulated reasoning context. Finally, we construct a pathology-specific reasoning segmentation benchmark of 118,667 triplets of pathology image, ground-truth mask, query, and reasoning chain, including both public and private pathology images. Zero-shot evaluation on pathology images with out-of-domain morphologies/pathologies shows a 61\% improvement over state-of-the-art segmentation models. Code is available at https://anonymous.4open.science/r/PathChat-Seg-3116.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1889. AdaCache: Adaptive Caching and Context Augmentation for Efficient LLM Serving
๐ฌ 4
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ LLM Inference Retrieval-Augmented Generation Caching Recomputation
Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models by integrating external knowledge sources, but at the cost of substantial computational overhead from extended input sequences.
Current RAG systems exhibit two fundamental inefficiencies: redundant processing of frequently retrieved text chunks across multiple queries, and uniform deep retrieval that over-provisions context regardless of query complexity.
We present AdaCache, an adaptive caching framework that addresses these limitations through dual optimization strategies.
First, we introduce a cache-aware partial recomputation mechanism that profiles attention patterns to construct selective cache variants, enabling flexible reuse while preserving cross-chunk dependencies.
Second, we develop adaptive context augmentation that dynamically determines optimal retrieval depth via lightweight confidence estimation, avoiding unnecessary overhead on simple queries.
Comprehensive experiments across diverse datasets and LLMs demonstrate that AdaCache delivers substantial improvements in Time-To-First-Token compared to state-of-the-art RAG caching systems, while preserving generation quality.
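A minimal sketch of the adaptive retrieval-depth idea, with hypothetical `retrieve` and `answer_with_confidence` callables passed in by the caller; the confidence estimator and depth schedule are illustrative assumptions, not AdaCache's actual mechanism.

```python
# Minimal sketch: deepen retrieval only when the model is not yet confident.
def adaptive_rag(query, retrieve, answer_with_confidence,
                 depths=(2, 5, 10), conf_threshold=0.8):
    """Grow the retrieved context step by step until the answer is confident enough."""
    for k in depths:
        context = retrieve(query, top_k=k)               # deepen retrieval
        answer, confidence = answer_with_confidence(query, context)
        if confidence >= conf_threshold:                  # simple query -> stop early
            return answer, k
    return answer, depths[-1]                             # fall back to the deepest retrieval
```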
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1890. Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models
๐ฌ 19
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial Protection Privacy Protection Multi-Modal Large Language Models Hierarchical Reasoning Geographic Inference
Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs' sophisticated multi-step reasoning processes that analyze environmental cues. We introduce **ReasonBreak**, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute **GeoPrivacy-6K**, a comprehensive dataset comprising 6,341 ultra-high-resolution images ($\geq$2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4\% improvement in tract-level protection (33.8\% vs 19.4\%) and nearly doubling block-level protection (33.5\% vs 16.8\%). This work establishes a new paradigm for privacy protection against reasoning-based threats.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1891. Universal Multi-Domain Translation via Diffusion Routers
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Models Multi-Domain Translation
Multi-domain translation (MDT) aims to learn translations between multiple domains, yet existing approaches either require fully aligned tuples or can only handle domain pairs seen in training, limiting their practicality and excluding many cross-domain mappings. We introduce universal MDT (UMDT), a generalization of MDT that seeks to translate between any pair of $K$ domains using only $K-1$ paired datasets with a central domain. To tackle this problem, we propose Diffusion Router (DR), a unified diffusion-based framework that models all central$\leftrightarrow$non-central translations with a single noise predictor conditioned on the source and target domain labels. DR enables indirect non-central translations by routing through the central domain. We further introduce a novel scalable learning strategy with a variational-bound objective and an efficient Tweedie refinement procedure to support direct non-central mappings. Through evaluation on three large-scale UMDT benchmarks, DR achieves state-of-the-art results for both indirect and direct translations, while lowering sampling cost and unlocking novel tasks such as sketch$\leftrightarrow$segmentation. These results establish DR as a scalable and versatile framework for universal translation across multiple domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1892. Quantifying Cross-Attention Interaction in Transformers for Interpreting TCR-pMHC Binding
๐ฌ 14
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ CD4+ T cell response epitope prediction explainable AI multi-modal learning transformer models deep learning
CD8+ "killer" T cells and CD4+ "helper" T cells play a central role in the adaptive immune system by recognizing antigens presented by Major Histocompatibility Complex (pMHC) molecules via T Cell Receptors (TCRs). Modeling binding between T cells and the pMHC complex is fundamental to understanding basic mechanisms of human immune response as well as to developing therapies. While transformer-based models such as TULIP have achieved impressive performance in this domain, their black-box nature precludes interpretability and thus limits a deeper mechanistic understanding of T cell response.
Most existing post-hoc explainable AI (xAI) methods are confined to encoder-only, co-attention, or model-specific architectures and cannot handle encoder-decoder transformers used in TCR-pMHC modeling. To address this gap, we propose Quantifying Cross-Attention Interaction (QCAI), a new post-hoc method designed to interpret the cross-attention mechanisms in transformer decoders. Quantitative evaluation is a challenge for xAI methods; we have compiled TCR-XAI, a benchmark consisting of 274 experimentally determined TCR-pMHC structures to serve as ground truth for binding. Using these structures, we compute physical distances between relevant amino acid residues in the TCR-pMHC interaction region and evaluate how well our method and others estimate the importance of residues in this region across the dataset. We show that QCAI achieves state-of-the-art performance on both interpretability and prediction accuracy under the TCR-XAI benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1893. STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Models Knowledge Distillation Reinforcement Learning
The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones. However, existing paradigms are often plagued by overfitting, training instability, ineffective binary rewards for multi-solution tasks, and the difficulty of synergizing techniques. We introduce STAR: Similarity-guided Teacher-Assisted Refinement, a novel holistic framework that effectively transfers LLMs' capabilities to super-tiny models. STAR consists of two core technical innovations: (1) Constrained Knowledge Distillation (CKD), a training objective that augments top-k forward KL divergence to suppress confidently incorrect predictions, ensuring training stability while preserving exploration capacity for downstream RL; and (2) Similarity-guided RL (Sim-RL), an RL mechanism that introduces a fine-grained, similarity-based reward. This provides a robust, continuous, and rich signal for better policy optimization by evaluating the similarity between generated outputs and the ground truth. STAR holistically synergizes these strategies within a cohesive training curriculum, enabling super-tiny models to achieve exceptional performance on complex function calling tasks. Extensive experiments on challenging and renowned benchmarks demonstrate the effectiveness of our method. Our STAR models establish SOTA in their size classes, significantly outperforming baselines. Remarkably, our 0.6B STAR model achieves the best performance among all open models under 1B, surpassing even several well-known open models at a larger scale. STAR demonstrates a training framework that distills capabilities of LLMs into super-tiny models, paving the way for powerful, accessible, and efficient AI agents.
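One plausible reading of the top-k forward-KL objective in (1), sketched in PyTorch. The masking scheme, renormalization of the teacher's top-k mass, and temperature are assumptions, not the paper's exact CKD loss.

```python
# Minimal sketch of a top-k forward-KL distillation loss (KL(teacher || student) restricted
# to the teacher's top-k tokens per position, with the teacher mass renormalized over them).
import torch
import torch.nn.functional as F

def topk_forward_kl(student_logits, teacher_logits, k=32, temperature=1.0):
    t = F.softmax(teacher_logits / temperature, dim=-1)
    s = F.log_softmax(student_logits / temperature, dim=-1)
    topk = t.topk(k, dim=-1)
    t_k = topk.values / topk.values.sum(dim=-1, keepdim=True)   # renormalize teacher top-k mass
    s_k = torch.gather(s, -1, topk.indices)                     # student log-probs at those tokens
    return (t_k * (torch.log(t_k + 1e-12) - s_k)).sum(-1).mean()

student = torch.randn(2, 5, 1000)   # (batch, seq, vocab), toy shapes
teacher = torch.randn(2, 5, 1000)
print(topk_forward_kl(student, teacher).item())
```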
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1894. EnvSocial-Diff: A Diffusion-Based Crowd Simulation Model with Environmental Conditioning and Individual-Group Interaction
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Crowd simulation Social physics force Diffusion model
Modeling realistic pedestrian trajectories requires accounting for both social interactions and environmental context, yet most existing approaches largely emphasize social dynamics. We propose EnvSocial-Diff: a diffusion-based crowd simulation model informed by social physics and augmented with environmental conditioning and individual-group interaction. Our structured environmental conditioning module explicitly encodes obstacles, objects of interest, and lighting levels, providing interpretable signals that capture scene constraints and attractors. In parallel, the individual-group interaction module goes beyond individual-level modeling by capturing both fine-grained interpersonal relations and group-level conformity through a graph-based design. Experiments on multiple benchmark datasets demonstrate that EnvSocial-Diff outperforms the latest state-of-the-art methods, underscoring the importance of explicit environmental conditioning and multi-level social interaction for realistic crowd simulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1895. DCFold: Efficient Protein Structure Generation with Single Forward Pass
๐ฌ 19
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ consistency model protein structure generation
AlphaFold3 introduces a diffusion-based architecture that elevates protein structure prediction to all-atom resolution with improved accuracy. This state-of-the-art performance has established AlphaFold3 as a foundation model for diverse generation and design tasks. However, its iterative design substantially increases inference time, limiting practical deployment in downstream settings such as virtual screening and protein design. We propose DCFold, a single-step generative model that attains AlphaFold3-level accuracy. Our Dual Consistency training framework, which incorporates a novel Temporal Geodesic Matching (TGM) scheduler, enables DCFold to achieve a 15x acceleration in inference while maintaining predictive fidelity. We validate its effectiveness across both structure prediction and binder design benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1896. Enhancing Sparse Event Detection in Healthcare Time-Series via Adaptive Gate of Context-Detail Interaction
๐ฌ 14
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Event detection Time series analysis Healthcare
Accurate detection of clinically meaningful events in healthcare time-series data is crucial for reliable downstream analysis and decision support. However, most existing methods struggle to jointly localize event boundaries and classify event types; even detection transformer (DETR)-based approaches show limited performance when confronted with extremely sparse events typical of clinical recordings. To address these challenges, we propose a coarse-to-fine detection framework combining a global context explorer, a local detail inspector, and an adaptive gating module (AGM) that fuses multiple label perspectives. The AGM uses transformed labels (encoding event presence and temporal position) to improve learning on sparse events. This design acts as a switch that selectively activates detailed feature extraction only when an event is likely, thereby reducing noise and improving efficiency in sparse settings. We evaluate our framework on diverse healthcare datasets, including arrhythmia detection, emotion recognition, and human-activity monitoring, and demonstrate substantial performance gains over existing DETR-based models, with particularly strong improvements in sparse event detection. With precise and robust event detection, our framework enables interpretation and actionable insights in real-world clinical applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1897. LiveWeb-IE: A Benchmark For Online Web Information Extraction
๐ฌ 24
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ web information extraction web scraping
Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications.
The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time.
However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios.
To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites.
Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks.
We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems.
In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information.
Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS.
We believe that this study lays the foundation for developing practical and robust WIE systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1898. KeepLoRA: Continual Learning with Residual Gradient Adaptation
๐ฌ 18
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning parameter-efficient fine-tuning
Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents a simple but effective approach called KeepLoRA that effectively balances these objectives. We first analyze the knowledge retention mechanism within the model parameter space and find that general knowledge is mainly encoded in the principal subspace, while task-specific knowledge is encoded in the residual subspace. Motivated by this finding, KeepLoRA learns new tasks by restricting LoRA parameter updates to the residual subspace to prevent interfering with previously learned capabilities. Specifically, we infuse knowledge for a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features. Our theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance. The source code is available in the supplementary material.
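A minimal sketch of the gradient-projection idea: remove the components of an update that fall inside the pre-trained weight's principal subspace, keeping only the residual-subspace part. The rank cutoff and the handling of previous-task feature directions are simplified assumptions relative to KeepLoRA.

```python
# Minimal sketch: project a gradient onto the orthogonal complement of W's top singular directions.
import numpy as np

def residual_projection(grad, pretrained_weight, rank=16):
    """Remove gradient components lying in the top-`rank` left-singular subspace of W."""
    u, _, _ = np.linalg.svd(pretrained_weight, full_matrices=False)
    principal = u[:, :rank]                          # principal (general-knowledge) directions
    return grad - principal @ (principal.T @ grad)   # keep only the residual-subspace part

rng = np.random.default_rng(0)
W, g = rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
g_res = residual_projection(g, W)
u, _, _ = np.linalg.svd(W, full_matrices=False)
print(np.linalg.norm(u[:, :16].T @ g_res))           # ~0: nothing left in the principal subspace
```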
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1899. Near Optimal Robust Federated Learning Against Data Poisoning Attack
๐ฌ 8
โญ 5.50
๐ learning theory
๐ท๏ธ federated learning data poisoning attack
We revisit data poisoning attacks in federated learning systems. There are $m$ worker nodes (each with $n$ training samples) cooperatively training one model for a machine-learning task, and a fraction $\alpha$ of the workers may suffer from a data poisoning attack. We focus on the challenging and practical case where $n$ is small and $m$ is large, so that no single worker has enough statistical information to identify the poisoned data by itself, while collectively the workers have enough data to learn the task once the poisoned data are detected. We therefore propose a mechanism for workers to cooperatively detect workers with poisoned data. In terms of attack loss, our mechanism achieves $\tilde{O}((\frac{1}{n})^{\frac{1}{2}}+(\frac{d}{mn})^{\frac{1}{2}})$ in the IID setting and $\tilde{O}((\frac{1}{\gamma})^{\frac{1}{2}}+(\frac{1}{n})^{\frac{1}{2}}+(\frac{d}{mn})^{\frac{1}{2}})$ in the non-IID setting, where $d$ is the VC-dimension of the learning model and $\gamma$ is a concentration parameter characterizing the non-IIDness. Beyond attack loss, our mechanism limits the adversary's free-ride gain even when it cannot be directly quantified by the attack loss. We also derive a lower bound on the attack loss, and our algorithm matches this bound as $m\rightarrow \infty$ in both the IID and non-IID settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1900. VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal LLM Reinforcement Learning Video Understanding
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct initial selection errors during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments show that our 7B model demonstrates diverse and complex reasoning patterns, yielding strong results across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1901. Towards Prompt-Robust Machine-Generated Text Detection
๐ฌ 26
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM detection Rewrite-based detection Learning distance Prompt robust
Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, making it an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8\% to 80.6\% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1902. MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs
๐ฌ 26
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Semi-structured sparsity Policy gradient learning Probabilistic Relaxation
The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through an $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the extremely large combinatorial space, we propose a novel update method by introducing a moving average tracker of loss residuals instead of the vanilla loss. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples.
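A toy sketch of the overall recipe (per-group categorical logits, Gumbel top-k sampling without replacement, REINFORCE with a moving-average baseline on the loss). The surrogate loss, scaling, and update rule are illustrative stand-ins, not MaskPro's actual objective.

```python
# Toy sketch of learning a 2:4 sparsity pattern with a score-function gradient and a
# moving-average baseline; the log-prob gradient uses an independent-draws approximation.
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 2                                  # keep N of every M weights
weights = rng.normal(size=(1000, M))         # toy "model weights", grouped in fours
logits = np.zeros((1000, M))                 # one categorical distribution per group
baseline, beta, lr = 0.0, 0.9, 0.5

def sample_mask(logits):
    """Sample N indices per group without replacement via Gumbel top-k."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    keep = np.argsort(-(logits + gumbel), axis=1)[:, :N]
    mask = np.zeros_like(logits)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask

def proxy_loss(mask):
    """Toy surrogate: energy of the pruned weights (lower is better)."""
    return float((((1 - mask) * weights) ** 2).sum())

for step in range(200):
    mask = sample_mask(logits)
    loss = proxy_loss(mask)
    baseline = beta * baseline + (1 - beta) * loss        # moving-average tracker of the loss
    advantage = loss - baseline
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # REINFORCE step (grad log-prob ~ mask - N*probs); scale down by the number of groups.
    logits -= lr * advantage * (mask - N * probs) / len(weights)

print(proxy_loss(sample_mask(logits)))       # should drop as large-magnitude weights get kept
```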
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1903. AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Question Answering Vision-Language Models
Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1904. TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment
๐ฌ 10
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multimodal Gaussian Splatting Contrastive Learning
While visual-language models have profoundly linked features between texts and images, the incorporation of 3D modality data, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition.
As challenges remain in extracting 3D modal features and bridging the gap between different modalities,
we propose TIGaussian, a framework that harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop bidirectional cross-modal alignment strategies: a multi-view feature fusion mechanism that leverages diffusion priors to resolve perspective ambiguity in image-3D alignment, while a text-3D projection module adaptively maps 3D features to text embedding space for better text-3D alignment.
Extensive experiments on various datasets demonstrate the state-of-the-art performance of TIGaussian in multiple tasks.
Our source code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1905. Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning
๐ฌ 24
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LLM pre-training data selection mask learning
A fine-grained data recipe is crucial for pre-training large language models (LLMs), as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, this high-quality and diverse subset achieves significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model after pre-training with hundreds of billions of tokens, demonstrating its effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1906. On the Interpolation Effect of Score Smoothing in Diffusion Models
๐ฌ 4
โญ 5.50
๐ learning theory
๐ท๏ธ score-based diffusion model empirical score function norm-bounded neural network barron norm function smoothing subspace recovery
Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we study the hypothesis that such creativity arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional subspace, we probe the interplay between score smoothing and the denoising dynamics with analytical solutions and numerical experiments. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate among the training data within their subspace while avoiding full memorization. Moreover, we present theoretical and empirical evidence that learning score functions with regularized neural networks can have a similar effect as score smoothing, including in simple nonlinear settings.
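For context, the empirical score that the smoothing argument starts from has the standard Gaussian-mixture form below, written for a forward process $x_t=\alpha_t x_0+\sigma_t\epsilon$ with training points $x_i$; this is textbook background rather than the paper's smoothing operator, which modifies this field.

$$
s_t^{\mathrm{emp}}(x) \;=\; \nabla_x \log\!\left(\frac{1}{n}\sum_{i=1}^{n}\mathcal{N}\!\big(x;\,\alpha_t x_i,\,\sigma_t^2 I\big)\right) \;=\; \sum_{i=1}^{n} w_i(x)\,\frac{\alpha_t x_i - x}{\sigma_t^{2}}, \qquad w_i(x) \;=\; \frac{\mathcal{N}\!\big(x;\,\alpha_t x_i,\,\sigma_t^2 I\big)}{\sum_{j=1}^{n}\mathcal{N}\!\big(x;\,\alpha_t x_j,\,\sigma_t^2 I\big)}.
$$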
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1907. On Predictability of Reinforcement Learning Dynamics for Large Language Models
๐ฌ 41
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reinforcement Learning Interpretability Low Rank
Recent advances in reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99\% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 13 LLMs and 10 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5x speedup while retaining > 96\% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1908. Controlling Repetition in Protein Language Models
๐ฌ 28
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein Language Models Reliable Protein Generation Repetition Control
Protein language models (PLMs) have enabled advances in structure prediction and de novo protein design, yet they frequently collapse into pathological repetition during generation. Unlike in text, where repetition merely reduces readability, in proteins it undermines structural confidence and functional viability. To unify this problem, we present the first systematic study of repetition in PLMs. We first propose quantitative metrics to characterize motif-level and homopolymer repetition and then demonstrate their negative impact on folding reliability. To address this challenge, we propose UCCS (Utility-Controlled Contrastive Steering), which steers protein generation with a constrained dataset.
Instead of naively contrasting high- vs. low-repetition sequences, we construct contrastive sets that maximize differences in repetition while tightly controlling for structural utility. This disentanglement yields steering vectors that specifically target repetition without degrading foldability. Injected at inference, these vectors consistently reduce repetition without retraining or heuristic decoding. Experiments with ESM-3 and ProtGPT2 on CATH, UniRef50, and SCOP show that our method outperforms decoding penalties and other baselines, substantially lowering repetition while preserving AlphaFold confidence scores. Our results establish repetition control as a central challenge for PLMs and highlight dataset-guided steering as a principled approach for reliable protein generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1909. LaVCa: LLM-assisted Visual Cortex Captioning
๐ฌ 15
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Neuroscience Computer vision Visual systems Captioning Large language model Semantics Neuroimaging Functional magnetic resonance imaging
Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that leverages large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa to image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than previous approaches. The captions generated by LaVCa quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. Furthermore, we find richer representational content within cortical regions that prior neuroimaging studies have deemed selective for simpler categories. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1910. On the Expressive Power of GNNs for Boolean Satisfiability
๐ฌ 10
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ SAT GNN Weisfeiler-Leman expressivity
Machine learning approaches to solving Boolean Satisfiability (SAT) aim to replace handcrafted heuristics with learning-based models. Graph Neural Networks have emerged as the main architecture for SAT solving, due to the natural graph representation of Boolean formulas. We analyze the expressive power of GNNs for SAT solving through the lens of the Weisfeiler-Leman (WL) test. As our main result, we prove that the full WL hierarchy cannot, in general, distinguish between satisfiable and unsatisfiable instances. We show that indistinguishability under higher-order WL carries over to practical limitations for WL-bounded solvers that set variables sequentially. We further study the expressivity required for several important families of SAT instances, including regular, random and planar instances. To quantify expressivity needs in practice, we conduct experiments on random instances from the G4SAT benchmark and industrial instances from the 2024 SAT competition. Our results suggest that while random instances are largely distinguishable, industrial instances often require more expressivity to predict a satisfying assignment.
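For readers unfamiliar with the Weisfeiler-Leman test the expressivity results are stated against, here is a minimal 1-WL color-refinement sketch on a small clause-literal graph. The graph encoding (literal and clause nodes, membership edges, complementary-literal links) and the hash-based refinement are one common convention, not necessarily the paper's construction.

```python
# Minimal 1-WL color refinement on a graph built from the formula (x1 v x2) AND (~x1 v x2).
from collections import Counter

def one_wl(adjacency, colors, rounds=10):
    """Iteratively refine node colors by hashing each node's multiset of neighbor colors."""
    for _ in range(rounds):
        new_colors = {
            v: hash((colors[v], tuple(sorted(Counter(colors[u] for u in nbrs).items()))))
            for v, nbrs in adjacency.items()
        }
        stable = len(set(new_colors.values())) == len(set(colors.values()))
        colors = new_colors
        if stable:            # partition unchanged -> refinement has converged
            break
    return colors

adjacency = {                 # literals connect to their clauses and to their negation
    "x1": ["~x1", "c1"], "~x1": ["x1", "c2"],
    "x2": ["~x2", "c1", "c2"], "~x2": ["x2"],
    "c1": ["x1", "x2"], "c2": ["~x1", "x2"],
}
colors = {v: ("clause" if v.startswith("c") else "literal") for v in adjacency}
print(Counter(one_wl(adjacency, colors).values()))   # sizes of the final color classes
```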
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1911. Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model Math Reasoning Reinforcement Learning
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each tokenโs influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPOโs learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and
exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
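As a rough illustration of the reweighting step described above, the sketch below assumes per-token THR scores are already computed and uses them to modulate a GRPO-style policy-gradient loss. The function name, the tanh-based weighting rule, and the `mode` switch are hypothetical stand-ins, not the paper's exact formulation.

```python
# Hedged sketch of THR-guided reweighting on top of a GRPO-style loss.
# THR values, the weighting rule, and all tensor names are assumptions.
import torch

def thr_reweighted_grpo_loss(logprobs, advantages, thr, alpha=0.5, mode="exploit"):
    """
    logprobs:   (B, T) per-token log-probabilities of sampled responses
    advantages: (B,)   group-relative advantages (one scalar per response)
    thr:        (B, T) token-level "hidden reward" scores (assumed given)
    mode:       "exploit" amplifies positive-THR tokens, "explore" negative ones
    """
    sign = 1.0 if mode == "exploit" else -1.0
    # Up-weight tokens whose THR sign matches the chosen mode, down-weight the rest.
    weights = 1.0 + alpha * torch.tanh(sign * thr)
    per_token = -weights * advantages.unsqueeze(1) * logprobs
    return per_token.mean()

# Toy usage with stand-in tensors.
B, T = 4, 16
logprobs = torch.rand(B, T).log()
advantages = torch.randn(B)
thr = torch.randn(B, T)
loss = thr_reweighted_grpo_loss(logprobs, advantages, thr, mode="explore")
```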
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1912. Bayesian Robust Cooperative Multi-Agent Reinforcement Learning Against Unknown Adversaries
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ multi-agent reinforcement learning adversarial attacks Bayesian games robust RL
We consider the problem of robustness against adversarial attacks in cooperative multi-agent reinforcement learning (c-MARL) at deployment time, where agents can face an adversary with an unknown objective. We address the uncertainty about the adversarial objective by proposing a Bayesian Dec-POMDP game model with a continuum of adversarial types, corresponding to distinct attack objectives. To compute a perfect Bayesian equilibrium (PBE) of the game, we introduce a novel partitioning scheme of adversarial policies based on their performance against a reference c-MARL policy. This allows us to cast the problem as finding a PBE in a finite-type Bayesian game. To compute the adversarial policies, we introduce the concept of an externally constrained reinforcement learning problem and present a provably convergent algorithm for solving it. Building on this, we propose to use a simultaneous gradient update scheme to obtain robust Bayesian c-MARL policies. Experiments on diverse benchmarks show that our approach, called BATPAL, outperforms state-of-the-art baselines under a wide variety of attack strategies, highlighting its robustness and adaptiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1913. Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Omni model Multimodal large language model Detailed captioning Audio understanding Video understanding Benchmark Evaluation
Fine-grained perception of multimodal information is critical for advancing human-AI interaction.
With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning.
However, their capacity to capture and accurately describe fine-grained details remains underexplored.
In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark.
We first identify an inherent ``co-growth'' between the level of detail and the degree of hallucination in current OLMs.
To address this, we propose \textbf{Omni-Detective}, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data.
Based on the data generated with Omni-Detective, we train two captioning models: \textbf{Audio-Captioner} for audio-only detailed perception, and \textbf{Omni-Captioner} for audio-visual detailed perception.
Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro.
On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset.
Given the absence of a dedicated benchmark for omni detailed perception, we design \textbf{Omni-Cloze}, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment.
Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority and human preference alignment of Omni-Cloze in evaluating such detailed captions.
The agentic data pipeline, models, and benchmark will all be open-sourced to facilitate further research and development for omni detailed perception.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1914. Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
๐ฌ 11
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ VLM Evaluation IRT
Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities.
Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings.
For example, in vision-language cases, we can find the correct answer without either the image or the text.
These low-quality questions unnecessarily increase the size and computational requirements of benchmarks.
We introduce a multi-modal and multidimensional item response theory framework (M$^3$-IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M$^3$-IRT estimates the cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning.
Across 24 VLMs on three benchmarks, M$^3$-IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50\% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M$^3$-IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.
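A minimal sketch of what a three-component IRT response model of this kind could look like, fitted to a toy correctness matrix. The 2PL-style link function, the component parameterization, and all tensor names are assumptions, not the paper's definitions.

```python
# Hedged sketch of a multidimensional IRT model with image, text, and
# cross-modal components for ability and difficulty.
import torch

def response_prob(theta, b, a):
    """
    theta: (M, 3) model abilities   [image, text, cross-modal]
    b:     (Q, 3) item difficulties [image, text, cross-modal]
    a:     (Q, 3) item discriminations per component
    returns (M, Q) probability that model m answers item q correctly
    """
    # Sum the discriminative evidence over the three components.
    logits = torch.einsum("qc,mqc->mq", a, theta.unsqueeze(1) - b.unsqueeze(0))
    return torch.sigmoid(logits)

M, Q = 24, 100                                    # 24 VLMs, 100 benchmark items
theta = torch.randn(M, 3, requires_grad=True)
b = torch.randn(Q, 3, requires_grad=True)
a = torch.rand(Q, 3, requires_grad=True)
responses = torch.randint(0, 2, (M, Q)).float()   # toy correctness matrix

# Fit by maximizing the Bernoulli likelihood of the observed correctness.
opt = torch.optim.Adam([theta, b, a], lr=0.05)
for _ in range(200):
    p = response_prob(theta, b, a).clamp(1e-6, 1 - 1e-6)
    loss = torch.nn.functional.binary_cross_entropy(p, responses)
    opt.zero_grad(); loss.backward(); opt.step()

# Under this toy parameterization, items with large cross-modal difficulty and
# discrimination (component index 2) would be the genuinely cross-modal ones.
```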
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1915. Unlearning Evaluation through Subset Statistical Independence
๐ฌ 14
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Machine Unlearning
Evaluating machine unlearning remains challenging, as existing methods typically require retraining reference models or performing membership inference attacks, both of which rely on prior access to the training configuration or supervision labels, making them impractical in realistic scenarios. Motivated by the fact that most unlearning algorithms remove a small, random subset of the training data, we propose a subset-level evaluation framework based on statistical independence. Specifically, we design a tailored use of the Hilbert-Schmidt Independence Criterion to assess whether the model outputs on a given subset exhibit statistical dependence, without requiring model retraining or auxiliary classifiers. Our method provides a simple, standalone evaluation procedure that aligns with unlearning workflows. Extensive experiments demonstrate that our approach reliably distinguishes in-training from out-of-training subsets and clearly differentiates unlearning effectiveness, even when existing evaluations fall short.
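For concreteness, the sketch below shows a standard biased HSIC estimator with Gaussian kernels applied to model outputs and subset-membership indicators. The kernel choice, the median-heuristic bandwidth, and the toy data are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch: HSIC between model outputs and subset membership.
import torch

def rbf_kernel(x, sigma=None):
    d2 = torch.cdist(x, x).pow(2)
    if sigma is None:                            # median heuristic bandwidth
        sigma = d2[d2 > 0].median().sqrt()
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    n = x.shape[0]
    K, L = rbf_kernel(x), rbf_kernel(y)
    H = torch.eye(n) - torch.ones(n, n) / n      # centering matrix
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# Toy check: outputs correlated with membership yield a larger HSIC value
# than outputs that are independent of it.
member = torch.randint(0, 2, (200, 1)).float()
dependent_out = member * 2.0 + 0.1 * torch.randn(200, 1)
independent_out = torch.randn(200, 1)
print(hsic(dependent_out, member), hsic(independent_out, member))
```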
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1916. MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs
๐ฌ 5
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multi-modal Large Language Model; Jailbreak Attack; Cross-Image Reasoning; Reasoning
Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that can induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, can make MLLMs more susceptible to being misled into generating malicious content.
However, these methods rely on single-image masking or isolated visual cues, which only modestly extend reasoning paths and thus achieve limited effectiveness, particularly against strongly aligned commercial closed-source models. To address this problem, in this paper, we propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues,
and leverages cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. The proposed MIDAS enforces longer and more structured multi-image chained reasoning, substantially increasing the model's reliance on visual cues while delaying the exposure of malicious semantics and reducing the model's security attention, thereby improving jailbreak performance against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that the proposed MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46\% across 4 closed-source MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1917. How hard is learning to cut? Trade-offs and sample complexity
๐ฌ 11
โญ 5.50
๐ optimization
๐ท๏ธ Integer programming branch-and-cut branch-and-bound sample complexity learning theory
In recent years, branch-and-cut algorithms have been the target of data-driven approaches designed to enhance the decision making in different phases of the algorithm, such as branching or the choice of cutting planes (cuts). In particular, for cutting plane selection two score functions have been proposed in the literature to evaluate the quality of a cut: branch-and-cut tree size and gap closed. In this paper, we present new sample complexity lower bounds, valid for both scores. We show that for a wide family of classes $\mathcal{F}$ that map an instance to a cut, learning over an unknown distribution of instances to minimize those scores requires at least (up to multiplicative constants) as many samples as learning any generic target function from the same function class $\mathcal{F}$ (using square loss). Our results also extend to the case of learning from a restricted set of cuts, namely those from the Simplex tableau. To the best of our knowledge, these constitute the first lower bounds for the learning-to-cut framework. We compare our bounds to known upper bounds in the case of neural networks and show they are nearly tight, suggesting that both scores (gap closed and tree size) are of comparable difficulty from a learning standpoint. Guided by this insight, we provide empirical evidence -- by using a graph neural network cut selection evaluated on various integer programming problems -- that gap closed is a practical and effective proxy for minimizing the tree size. Although the gap closed score has been extensively used in the integer programming literature, this is the first principled analysis to discuss both scores simultaneously, theoretically and computationally.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1918. Automatic Dialectic Jailbreak: A Framework for Generating Effective Jailbreak Strategies
๐ฌ 6
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreak Attacks Large Language Models Multi-Objective Game White-box Jailbreak Black-box Jailbreak
Large language models (LLMs) can be jailbroken to produce malicious or unethical content with embedded jailbreaking prompts. Unfortunately, current jailbreak attack techniques suffer from limited adaptability due to their reliance on fixed evaluation models, and from an inability to withstand a wide range of defense mechanisms. In this work, we propose to model the jailbreak attack problem as a Stackelberg multi-objective game between two LLMs engaged in a Hegelian-Dialectic-style debate, enabling the automatic generation of jailbreak strategies (ADJ). In the ADJ, iterative thesis-antithesis-synthesis cycles of Hegelian dialectical reasoning are executed to guarantee that both attacker and defender can maximize their own utility while minimizing that of their opponent. We propose to map the optimization problem from the original parameter space into a Hilbert space via the Haar wavelet transformation, for efficiently extracting localized and structurally significant information. In this functional space, we solve a convex multi-objective optimization problem to construct a common descent direction that better aligns with the objectives in the ADJ. To ensure sufficient descent for each objective in ADJ, we construct a subset of descent components and directly integrate them into the optimization objective. We theoretically validate the existence of a Pareto-Nash equilibrium achieved by our Automatic Dialectic Jailbreak method and demonstrate that our algorithm converges to this Pareto-Nash equilibrium.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1919. C-Voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
๐ฌ 11
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ reasoning test-time scaling voting recurrent models
Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks.
A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training.
Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks.
In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories.
Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model's confidence.
Additionally, it yields $4.9\%$ higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions.
An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function.
Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme ($95.2\%$ vs. $55.0\%$) and Maze ($78.6\%$ vs. $74.5\%$) tasks.
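The selection rule itself is simple enough to sketch: run several randomly initialized candidate trajectories and keep the one whose predictions have the highest average top-1 probability. The recurrent model is omitted here, and the tensor shapes are illustrative assumptions.

```python
# Hedged sketch of confidence-based voting (C-voting) over candidate trajectories.
import torch

def c_vote(logits_per_candidate):
    """
    logits_per_candidate: (C, N, V) logits for C candidates over N cells
    returns the index of the most confident candidate and its predictions
    """
    probs = logits_per_candidate.softmax(dim=-1)
    confidence = probs.max(dim=-1).values.mean(dim=-1)   # (C,) mean top-1 prob
    best = confidence.argmax()
    return best, logits_per_candidate[best].argmax(dim=-1)

# Toy usage: 8 candidate trajectories over an 81-cell grid with 9 classes.
logits = torch.randn(8, 81, 9)
best_idx, prediction = c_vote(logits)
```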
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1920. Discrete Latent Features Ablate Adversarial Attack: A Robust Prompt Tuning Framework for VLMs
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Prompt Learning Adversarial Robustness Vision-Language Models
While adversarial fine-tuning can enhance the robustness of vision-language models (VLMs), such approaches are computationally expensive. Adversarial prompt tuning has emerged as a practical alternative. However, existing methods are limited by their reliance on vulnerable continuous image features. To mitigate the vulnerability in the feature representation, we propose **DEFEAT** (**D**iscrete Lat**E**nt **F**eatur**E** based **A**dversarial **T**raining), a robust prompt tuning framework for VLMs.
Specifically, the DEFEAT method introduces a perturbation discrete shield module that reconstructs discrete latent features and designs a logits fusion strategy, substantially reducing the discrepancy between clean and adversarial image representations.
Moreover, the DEFEAT method integrates prompt tuning with adversarial training while applying regularization from learnable prompts to hand-crafted prompts, further enhancing the adversarial robustness.
Extensive experiments across 15 datasets validate the effectiveness of the proposed DEFEAT method over existing adversarial prompt tuning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1921. HWC-Loco: A Hierarchical Whole-Body Control Approach to Robust Humanoid Locomotion
๐ฌ 17
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Humanoid Reinforcement Learning Whole-body Control
Humanoid robots, capable of assuming human roles in various workplaces, have become essential to the advancement of embodied intelligence. However, as robots with complex physical structures, learning a control model that can operate robustly across diverse environments remains inherently challenging, particularly under the discrepancies between training and deployment environments. In this study, we propose HWC-Loco, a robust whole-body control algorithm tailored for humanoid locomotion tasks. By reformulating policy learning as a robust optimization problem, HWC-Loco explicitly learns to recover from safety-critical scenarios. While prioritizing safety guarantees, overly conservative behavior can compromise the robot's ability to complete the given tasks. To tackle this challenge, HWC-Loco leverages a hierarchical policy for robust control. This policy can dynamically resolve the trade-off between goal-tracking and safety recovery, guided by human behavior norms and dynamic constraints. To evaluate the performance of HWC-Loco, we conduct extensive comparisons against state-of-the-art humanoid control models, demonstrating HWC-Loco's superior performance across diverse terrains, robot structures, and locomotion tasks under both simulated and real-world environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1922. Enhancing Multi-Image Understanding through Delimiter Token Scaling
๐ฌ 18
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LVLM Multi-Image Understanding Training-free
Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input.
One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images.
Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage.
To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens.
This enhances the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions.
Consequently, the model is better able to distinguish between images and reason over them more accurately.
Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB and QBench2.
We further evaluate our method on text-only tasks that require clear distinction.
The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews and WCEP-10.
Notably, our method requires no additional training or inference cost.
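A minimal, training-free sketch of the core operation, assuming access to layer activations and a mask of delimiter positions. The scale value, the hook placement, and the toy positions are assumptions for illustration.

```python
# Hedged sketch: scale hidden states at image delimiter positions.
import torch

def scale_delimiter_states(hidden_states, delimiter_mask, scale=1.5):
    """
    hidden_states:  (B, T, D) layer activations
    delimiter_mask: (B, T) boolean mask marking image start/end delimiter tokens
    """
    scaled = hidden_states.clone()
    scaled[delimiter_mask] = scaled[delimiter_mask] * scale
    return scaled

# Typical use would be as a forward hook on selected decoder layers so the
# scaled states strengthen intra-image boundaries during attention.
B, T, D = 2, 128, 64
h = torch.randn(B, T, D)
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, [10, 45, 50, 90]] = True          # toy delimiter positions
h = scale_delimiter_states(h, mask)
```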
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1923. BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Behavioural Change
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Ambivalence hesitancy affective computing emotion recognition in videos multimodal eHealth behavioral change
This paper introduces the Behavioral Ambivalence/Hesitancy (BAH) dataset collected for the Ambivalence/Hesitancy (A/H) recognition task in videos. In particular, this task involves recognizing conflicting emotions linked to A/H from question-and-answer videos captured for behavior analysis. The dataset contains videos from 224 subjects of different ages and ethnicities, collected across 9 Canadian provinces via webcam through our developed web platform. Each participant answers 7 questions that we designed to induce Ambivalence/Hesitancy. Each video captures the response to one question in the subject's environment, totaling 1,118 videos with a total duration of 8.26 hours, of which 1.5 hours contain A/H. BAH is a first and unique dataset for Ambivalence/Hesitancy recognition. Our behavioral team annotated the timestamped segments where A/H occurs, providing frame- and video-level annotations, along with the cues used for annotation, such as face, audio, body, and language. Video transcripts and their timestamps, as well as per-frame cropped and aligned faces, are also included. This work offers initial baselines for A/H recognition in videos at the frame and video levels, with analyses of single- and multimodal setups. The data, code, and pretrained weights are publicly accessible.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1924. Hilbert: Recursively Building Formal Proofs with Informal Reasoning
๐ฌ 11
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Formal Mathematics Automated Theorem Proving Mathematical Reasoning Lean 4 LLMs for Math Agents
Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2\% on miniF2F, 6.6 percentage points above the best publicly available method. Hilbert achieves the **best known result** on PutnamBench. It solves 462/660 problems (70.0\%), outperforming proprietary approaches like SeedProver (50.4\%) and achieving a 422\% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1925. AgentFold: Long-Horizon Web Agents with Proactive Context Folding
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Web Agent Context Management AI Agent
LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that rigidly summarize the full history at each step risk the irreversible loss of critical details. To address these issues, we introduce AgentFold, a novel agent paradigm inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a folding operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the GLM-4.5-355B-A32B and the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini. The model will be open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1926. Verifier-free Test-Time Sampling for Vision Language Action Models
๐ฌ 22
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-Language-Action Models Test-Time Scaling Robotic Manipulation
Vision-Language-Action models (VLAs) have demonstrated remarkable performance in robot control. However, they remain fundamentally limited in tasks that require high precision due to their single-inference paradigm. While test-time scaling approaches using external verifiers have shown promise, they require additional training and fail to generalize to unseen conditions. We propose Masking
Distribution Guided Selection (MG-Select), a novel test-time scaling framework for VLAs that leverages the model's internal properties without requiring additional training or external modules. Our approach utilizes KL divergence from a reference action token distribution as a confidence metric for selecting the optimal action from multiple candidates. We introduce a reference distribution generated by the same VLA but with randomly masked states and language conditions as inputs, providing action uncertainty while remaining aligned with the target task distribution. Additionally, we propose a joint training strategy that enables the model to learn both conditional and unconditional distributions by applying dropout to state and language conditions, thereby further improving the quality of the reference distribution. Our experiments demonstrate that MG-Select provides a reliable reference for action selection through task-relevant condition masking
and consistently improves base models across diverse simulation and real-world benchmarks.
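A minimal sketch of the selection step under stated assumptions: each candidate is scored by the mean per-token KL divergence between the fully conditioned action-token distribution and the masked-condition reference, and the highest-scoring candidate is kept. Treating larger KL as higher confidence, along with the tensor shapes, is an assumption rather than the paper's exact rule.

```python
# Hedged sketch of KL-based candidate selection against a masked-condition reference.
import torch
import torch.nn.functional as F

def mg_select(cond_logits, ref_logits):
    """
    cond_logits: (C, T, V) action-token logits for C candidates, full conditions
    ref_logits:  (C, T, V) logits for the same candidates with masked conditions
    """
    logp = F.log_softmax(cond_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - logq)).sum(-1).mean(-1)   # (C,) mean per-token KL
    return kl.argmax()

# Toy usage: 5 candidates, 8 action tokens, 256-way token vocabulary.
best = mg_select(torch.randn(5, 8, 256), torch.randn(5, 8, 256))
```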
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1927. Rethinking Pareto Frontier: On the Optimal Trade-offs in Fair Classification
๐ฌ 10
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ fairness-accuracy tradeoff
Fairness has become a rising concern in machine learning given its prevalence in decision-making processes, and the trade-offs between various fairness notions and between fairness and accuracy have been empirically observed. However, the inherent nature of such trade-offs, as well as the quantification of the best achievable trade-offs, i.e., the Pareto optimal trade-offs, under varied constraints on fairness notions has rarely been properly discussed. Owing to the sub-optimality of fairness interventions, existing work fails to provide an informative characterization of these trade-offs. In light of existing limitations, in this work, we propose a reformulation of the model-specific (MS) Pareto optimal trade-off, where we frame it as convex optimization problems involving fairness notions and accuracy w.r.t. the confusion vector. Our formulation provides an efficient approximation of the best achievable accuracy under dynamic fairness constraints, and yields a systematic analysis of the fairness-accuracy trade-off. Going beyond the discussion on fairness-accuracy trade-offs, we extend the discussion to the trade-off between fairness notions, which characterizes the impact of accuracy on the compatibility between fairness notions. Inspired by our reformulation, we propose a last-layer retraining framework with group-dependent bias, and we prove theoretically the superiority of our method over existing baselines. Experimental results demonstrate the effectiveness of our method in achieving a better fairness-accuracy trade-off, and that our MS Pareto frontiers sufficiently quantify the two trade-offs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1928. BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
๐ฌ 18
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Model Video Generation Cache
Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
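A minimal sketch of the caching idea, assuming a relative-change similarity indicator and a fixed threshold; the actual method's indicator, granularity, and cache policy may differ.

```python
# Hedged sketch of block-wise caching across denoising timesteps: reuse a
# block's cached output when its input barely changed since the last step.
import torch

class BlockCache:
    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None

    def __call__(self, block, x):
        if self.prev_input is not None:
            change = (x - self.prev_input).abs().mean() / (self.prev_input.abs().mean() + 1e-8)
            if change < self.threshold:            # feature barely moved: reuse cache
                self.prev_input = x
                return self.prev_output
        out = block(x)                              # otherwise recompute and cache
        self.prev_input, self.prev_output = x, out
        return out

# Toy usage with a linear layer standing in for a DiT block.
block, cache = torch.nn.Linear(64, 64), BlockCache()
x = torch.randn(1, 16, 64)
for t in range(10):                                 # denoising timesteps
    x = cache(block, x + 0.001 * torch.randn_like(x))
```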
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1929. Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Stochastic Interpolants Diffusion Models Deterministic Sampler
Stochastic interpolants offer a robust framework for continuously transforming samples between arbitrary data distributions via ordinary or stochastic differential equations (ODEs/SDEs), holding significant promise for generative modeling. While previous studies have analyzed the finite-time convergence rate of discrete-time implementations for SDEs, the ODE counterpart remains largely unexplored. In this work, we bridge this gap by presenting a rigorous finite-time convergence analysis of numerical implementations for ODEs in the framework of stochastic interpolants. We establish novel discrete-time total variation error bounds for two widely used numerical solvers: the first-order forward Euler method and the second-order Heun's method. Our analysis also yields optimized iteration complexity results and step size schedules that enhance computational efficiency. Notably, when specialized to the diffusion model setting, our theoretical guarantees for the second-order method improve upon prior results in terms of both smoothness requirements and dimensional dependence. Our theoretical findings are corroborated by numerical and image generation experiments, which validate the derived error bounds and complexity analyses.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1930. Taming Hierarchical Image Coding Optimization: A Spectral Regularization Perspective
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Learned Image Compression Hierarchical Variational Autoencoders Spectral Regularization
Hierarchical coding offers distinct advantages for learned image compression by capturing multi-scale representations to support scale-wise modeling and enable flexible quality scalability, making it a promising alternative to single-scale models. However, its practical performance remains limited. Through spectral analysis of training dynamics, we reveal that existing hierarchical image coding approaches suffer from cross-scale energy dispersion and spectral aliasing, resulting in optimization inefficiency and performance bottlenecks. To address this, we propose explicit spectral regularization schemes for hierarchical image coding, consisting of (i) intra-scale frequency regularization, which encourages a smooth low-to-high frequency buildup as scales increase, and (ii) inter-scale similarity regularization, which suppresses spectral aliasing across scales. Both regularizers are applied only during training and impose no overhead at inference. Extensive experiments demonstrate that our method accelerates the training of the vanilla model by 2.3$\times$, delivers an average 20.65\% rate-distortion gain over the latest VTM-22.0 on public datasets, and outperforms existing single-scale approaches, thereby setting a new state of the art in learned image compression.
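An illustrative sketch of what such training-time regularizers could look like, assuming FFT magnitude spectra per scale, a radial high-frequency penalty for coarse scales, and a cosine-similarity term between adjacent-scale spectra. The cutoffs, weights, and exact functional forms are assumptions.

```python
# Hedged sketch of intra-scale frequency and inter-scale similarity regularizers.
import torch

def radial_highpass_mask(h, w, cutoff):
    fy = torch.fft.fftfreq(h).abs().unsqueeze(1)
    fx = torch.fft.fftfreq(w).abs().unsqueeze(0)
    return ((fy ** 2 + fx ** 2).sqrt() > cutoff).float()

def spectral_regularizers(latents, cutoffs):
    """
    latents: list of (B, C, H, W) feature maps, coarse to fine
    cutoffs: per-scale normalized frequency cutoffs, increasing with scale
    """
    intra, spectra = 0.0, []
    for z, c in zip(latents, cutoffs):
        mag = torch.fft.fft2(z, norm="ortho").abs()
        spectra.append(mag.mean(dim=(0, 1)))                  # (H, W) average spectrum
        mask = radial_highpass_mask(z.shape[-2], z.shape[-1], c)
        intra = intra + (mag * mask).pow(2).mean()            # discourage premature high freqs
    inter = 0.0
    for a, b in zip(spectra[:-1], spectra[1:]):
        b_coarse = torch.nn.functional.adaptive_avg_pool2d(b[None, None], a.shape)[0, 0]
        inter = inter + torch.nn.functional.cosine_similarity(
            a.flatten(), b_coarse.flatten(), dim=0)           # penalize spectral overlap
    return intra, inter

coarse, fine = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 32, 32)
intra_loss, inter_loss = spectral_regularizers([coarse, fine], cutoffs=[0.15, 0.35])
```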
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1931. VMoBA: Mixture-of-Block Attention for Video Diffusion Models
๐ฌ 4
โญ 5.50
๐ generative models
๐ท๏ธ Video Generation Sparse Attention Training Acceleration MoBA
The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92$\times$ FLOPs and 1.48$\times$ latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40$\times$ FLOPs and 1.35$\times$ latency speedup for high-res video generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1932. Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
๐ฌ 32
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model reasoning self-supervised RL
Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), the scaling up dilemma remains due to the reliance on human-annotated labels especially for complex tasks.
Recent self-rewarding methods provide a label-free alternative that exhibits the eliciting potential of LLM reasoning, but they often suffer from the non-negligible training collapse issue, as the single-view supervision signal easily forms a self-consistent illusion, yielding reward hacking. Inspired by the success of self-supervised learning, we propose \textit{Co-rewarding}, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) \textit{Co-rewarding-I} is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) \textit{Co-rewarding-II} is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, such instantiations introduce different levels of discrepancy to increase the difficulty of training collapse on trivial reasoning solutions. We also explore their orthogonally combined version to further boost the performance. Empirically, Co-rewarding exhibits stable training across various setups, and outperforms other self-rewarding baselines by $+3.31\%$ on average on multiple mathematical reasoning benchmarks, especially by $+7.49\%$ on Llama-3.2-3B-Instruct. Notably, Co-rewarding reaches or even surpasses ground-truth (GT) labeled reward in several cases of RLVR, such as achieving a Pass@$1$ of $94.01\%$ on GSM8K with Qwen3-8B-Base.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1933. MindPilot: Closed-loop Visual Stimulation Optimization for Brain Modulation with EEG-guided Diffusion
๐ฌ 23
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Neuroscience Brain Modulation EEG Closed-loop Brain Coding BCI Generative Model Black-box Guidance Encoding Model
Whereas most brain-computer interface research has focused on decoding neural signals into behavior or intent, the reverse challenge of using controlled stimuli to steer brain activity remains far less understood, particularly in the visual domain.
However, designing images that consistently elicit desired neural responses is difficult: subjective states lack clear quantitative measures, and EEG feedback is both noisy and non-differentiable.
We introduce MindPilot, the first closed-loop framework that uses EEG signals as optimization feedback to guide naturalistic image generation. Unlike prior work limited to invasive settings or low-level flicker stimuli, MindPilot leverages non-invasive EEG with natural images, treating the brain as a black-box function and employing a pseudo-model guidance mechanism to iteratively refine images without requiring explicit rewards or gradients. We validate MindPilot in both simulation and human experiments, demonstrating (i) efficient retrieval of semantic targets, (ii) closed-loop optimization of EEG spectral features, and (iii) human-subject validations in mental matching and emotion regulation tasks. Our results establish the feasibility of EEG-guided image synthesis and open new avenues for non-invasive closed-loop brain modulation, bidirectional brain-computer interfaces, and neural signal-guided generative modeling. Our code is available at \url{https://anonymous.4open.science/r/MindPilot-0924}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1934. $\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ LLM Reasoning Reinforcement learning Re-solving Mechanism Test-time Scaling
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute.
However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality.
We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized.
To this end, we introduce $\textit{\textbf{Re}inforcement Learning with \textbf{Re}-solving}$ (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer.
Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5\% to over 30\%.
This leads to substantial performance gains over standard RLVR under the same training compute budget, and also demonstrates notable improvements in test-time performance as the number of samples increases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1935. Neuron-Level Analysis of Cultural Understanding in Large Language Models
๐ฌ 16
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ cultural understanding neuron LLM interpretability
As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is important.
However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, while the mechanisms underlying their cultural understanding remain underexplored.
To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement.
We identify both culture-general neurons, which contribute to cultural understanding regardless of the specific culture, and culture-specific neurons tied to an individual culture.
These neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers.
We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected.
Moreover, we show that culture-specific neurons support knowledge of not only the target culture, but also related cultures.
Finally, we demonstrate that training on NLU benchmarks can diminish models' cultural understanding when we update modules containing many culture-general neurons.
These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering.
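A toy sketch of gradient-based neuron scoring and suppression on a stand-in MLP, assuming an activation-times-gradient attribution and zeroing of the top-scoring units' outgoing weights; the paper's scoring and filtering details may differ.

```python
# Hedged sketch: score hidden units by |activation * gradient| on culture-related
# inputs, then suppress the top-scoring units by zeroing their outgoing weights.
import torch

torch.manual_seed(0)
mlp = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.GELU(), torch.nn.Linear(64, 16))

acts = {}
def hook(_, __, out):
    out.retain_grad()                            # keep gradients on the hidden activations
    acts["hidden"] = out
mlp[0].register_forward_hook(hook)

x = torch.randn(8, 16)                           # toy "culture-related" inputs
loss = mlp(x).pow(2).mean()                      # stand-in answer loss
loss.backward()

# Per-neuron attribution: |activation * gradient|, summed over the batch.
score = (acts["hidden"] * acts["hidden"].grad).sum(0).abs()
top = score.topk(3).indices                      # candidate culture neurons

# Suppression check: zero those hidden units' outgoing weights.
with torch.no_grad():
    mlp[2].weight[:, top] = 0.0
```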
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1936. Object Fidelity Diffusion for Remote Sensing Image Generation
๐ฌ 8
โญ 5.50
๐ generative models
๐ท๏ธ Image generation remote sensing
High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity objects due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a self-distillation diffusion model with consistency distillation loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1937. Towards True Speech-to-Speech Models Without Text Guidance
๐ฌ 23
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ multimodal large language model large language model speech language model
Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction. We will release our code and models to support further research in true speech-to-speech foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1938. Weight Decay may matter more than µP for Learning Rate Transfer in Practice
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ maximal update parametrization llm pretraining hyperparameter transfer learning dynamics adamw mup weight decay hyperparameter tuning scaling law transformer
Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (µP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of µP rely on strong assumptions, particularly about the geometric alignment of a layer's inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than µP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests µP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical practice such as why µP requires the independent weight decay variant for successful transfer.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1939. ETGS: Explicit Thermodynamics Gaussian Splatting for Dynamic Thermal Reconstruction
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Reconstruction; Thermal Reconstruction; Explicit Thermodynamics
We propose ETGS, a method for reconstructing dynamic thermal scenes by embedding explicit thermodynamic modeling into 3D Gaussian Splatting. Each Gaussian is equipped with physically interpretable thermal parameters, and its thermodynamic evolution is described by a first-order heat-transfer ODE with an analytical closed-form solution. This formulation avoids numerical integration, enables efficient rendering at arbitrary timestamps, and naturally handles irregular sampling and out-of-order observations. We also introduce the Rapid Heat Dynamics (RHD) dataset, which provides millisecond-aligned RGB-IR image pairs covering typical thermal processes such as cooling, warming, heating, and heat transfer. Experiments on RHD show that ETGS captures rapid thermal dynamics more accurately than existing static and dynamic baselines, while maintaining training and rendering efficiency close to that of static 3DGS. Code and dataset will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1940. Diffusion Models as Dataset Distillation Priors
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ diffusion models dataset distillation diffusion priors kernel method
Dataset distillation aims to synthesize compact yet informative datasets from large ones. A significant challenge in this field is achieving a trifecta of diversity, generalization, and representativeness in a single distilled dataset. Although recent generative dataset distillation methods adopt powerful diffusion models as their foundation models, the inherent representativeness prior in diffusion models is overlooked. Consequently, these approaches often necessitate the integration of external constraints to enhance data quality. To address this, we propose Diffusion As Priors (DAP), which formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel. We then introduce this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. Extensive experiments on large-scale datasets, such as ImageNet-1K and its subsets, demonstrate that DAP outperforms state-of-the-art methods in generating high-fidelity datasets while achieving superior cross-architecture generalization. Our work not only establishes a theoretical connection between diffusion priors and the objectives of dataset distillation but also provides a practical, training-free framework for improving the quality of the distilled dataset.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1941. Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models ; Retrieval-Augmented Generation ; Reinforcement Learning
Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors. We propose Knowledgeable-R1, a reinforcement-learning framework that explicitly trains large language models to use parametric knowledge (PK) to resist contextual interference while still exploiting external context when it is reliably helpful. Knowledgeable-R1 introduces a joint sampling scheme that generates paired responses with and without retrieval, and learns both local advantages (within each decoding regime) and global advantages under the same input to quantify when to ignore misleading context versus adopt it. We employ an asymmetric advantage transformation that amplifies exploratory behaviors toward parametric knowledge. Experiments show that Knowledgeable-R1 significantly improves robustness and reasoning accuracy in knowledge conflict scenarios and general RAG scenarios, outperforming SOTA baselines by 23% in counterfactual scenarios, and without degradation when the retrieved context is fully accurate.
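A minimal sketch of the joint-sampling advantage computation under stated assumptions: rewards from retrieval-conditioned and parametric-knowledge-only responses are normalized within each regime (local) and over the pooled group (global), and positive parametric-knowledge advantages are amplified asymmetrically. The combination rule and the asymmetry function are assumptions, not the paper's exact objective.

```python
# Hedged sketch of local/global advantages with an asymmetric transform.
import torch

def co_advantages(rewards_rag, rewards_pk, beta=2.0):
    """
    rewards_rag: (G,) rewards of responses generated with retrieval
    rewards_pk:  (G,) rewards of responses generated from parametric knowledge only
    """
    def norm(r):
        return (r - r.mean()) / (r.std() + 1e-6)
    local = torch.cat([norm(rewards_rag), norm(rewards_pk)])   # within-regime normalization
    global_adv = norm(torch.cat([rewards_rag, rewards_pk]))    # pooled-group normalization
    adv = 0.5 * (local + global_adv)
    # Asymmetric transform: amplify positive advantages of parametric-knowledge
    # responses to encourage resisting misleading retrieved context.
    adv_pk = adv[len(rewards_rag):]
    adv[len(rewards_rag):] = torch.where(adv_pk > 0, beta * adv_pk, adv_pk)
    return adv

adv = co_advantages(torch.tensor([1., 0., 1., 0.]), torch.tensor([1., 1., 0., 0.]))
```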
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1942. Dual-Path Condition Alignment for Diffusion Transformers
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Transformer Self-Supervised Learning Representation Learning.
Denoising-based generative models have been significantly advanced by representation-alignment (REPA) loss, which leverages pre-trained visual encoders to guide intermediate network features. However, REPA's reliance on external visual encoders introduces two critical challenges: potential \textit{distribution mismatches} between the encoder's training data and the generation target, and the high \textit{computational costs} of pre-training. Inspired by the observation that REPA primarily aids early layers in capturing robust semantics, we propose an unsupervised alternative that avoids external visual encoder and the assumption of consistent data distribution. We introduce \textit{\textbf{DU}al-\textbf{P}ath condition \textbf{A}lignment} (\textbf{DUPA}), a novel self-alignment framework, which independently noises an image multiple times and processes these noisy latents through decoupled diffusion transformer, then aligns the derived conditions\textemdash low-frequency semantic features extracted from each path. Experiments demonstrate that DUPA achieves FID$=$1.46 on ImageNet 256$\times$256 with only 400 training epochs, outperforming all methods that do not rely on external supervision. Critically, DUPA accelerates training of its base model by 5$\times$ and inference by 10$\times$. DUPA is also model-agnostic and can be readily applied to any denoising-based generative model, showcasing its excellent scalability and generalizability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1943. Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Few-step diffusion Diffusion models Flow-based models generative models diffusion transformer
Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1-4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256$\times$256, our model attains a 1-step FID of 2.16, surpassing prior art by a large margin, and achieves a 4-step FID of 1.51, matching the performance of standard flow models while delivering over 125$\times$ faster inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1944. QKV Projections Require a Fraction of Their Memory
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Memory Efficient Training Pre-training Finetuning Approximate Matrix Multiplication Compressed Activations
The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training.
While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the $Q$, $K$, and $V$ tensors from the input $x$ is often overlooked.
To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that reduces memory consumption of the $Q,K,V$ projections in attention layers by a factor of up to $512\times$, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1945. Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
๐ฌ 30
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ large language models chain-of-thought reasoning hop generalization mechanistic interpretability
Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems.
However, recent studies reveal a sharp performance drop in *reasoning hop generalization* scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged.
The internal mechanisms driving this failure remain poorly understood.
In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed.
Closer inspection reveals that these token-level erroneous predictions stem from internal *competition mechanisms*: certain attention heads, termed *erroneous processing heads* (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones.
Notably, removing individual ep heads during inference can often restore the correct predictions.
Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process.
Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
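As a rough illustration of this kind of intervention (not the authors' code; the model, layer, and head indices below are placeholders), a suspected head's contribution can be zeroed out at inference time with a forward pre-hook on the attention output projection of a GPT-2-style transformer:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")      # small stand-in; the paper studies larger LLMs
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER, HEAD = 5, 3                                   # hypothetical "erroneous processing" head
HEAD_DIM = model.config.n_embd // model.config.n_head

def drop_head(module, args):
    # The input to c_proj is the concatenation of per-head outputs: (B, seq, n_embd).
    hidden = args[0].clone()
    hidden[..., HEAD * HEAD_DIM:(HEAD + 1) * HEAD_DIM] = 0.0
    return (hidden,)

handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(drop_head)
with torch.no_grad():
    ids = tok("Alice is taller than Bob, Bob is taller than Carol, so", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=12, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

In the paper's setting the ep heads are identified dynamically per input rather than fixed in advance; the hook above only shows the mechanical side of deactivating one head.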
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1946. BOLT: Decision-Aligned Distillation and Budget-Aware Routing for Constrained Multimodal QA on Robots
๐ฌ 14
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ multimodal question answering vision-language models robotics knowledge distillation resource-constrained AI
Robotic systems can require multimodal reasoning under stringent constraints of latency, memory, and energy. Standard instruction tuning and token-level distillation fail to deliver decision quality, reliability, and interpretability under these constraints. We introduce BOLT, a decision-aligned distillation and budget-aware routing framework that treats multi-choice prediction as a decision surface to be aligned during training and selectively refined at inference. During training, BOLT introduces Option-level Decision Distillation to align student models directly on the decision surface of multi-choice answers, thereby eliminating prompt artifacts, improving calibration, and optimizing the exact output space. At inference, BOLT activates Budget-aware Test-time Augmentation, a calibrated router that uses low-cost signals such as confidence, margin, entropy, retrieval affinity, and agreement across short question decompositions to trigger high-resolution reevaluation, type-matched retrieval exemplars, or question decomposition only when their expected benefit outweighs cost. On Robo2VLM-1, a 2B BOLT student distilled from LLaVA-1.5-13B improves accuracy from 28.66 in zero-shot to 42.89 with decision distillation and to 50.50 with budgeted routing, surpassing the 13B teacher at 36.74. It lowers expected calibration error, strengthens the risk-coverage frontier, and slashes GPU memory from 26,878 MB for the teacher to 3,035 MB for the distilled student, and 3,817 MB with all augmentations enabled. By constraining outputs to valid options while exposing retrieved evidence and decomposition traces, BOLT reduces hallucination and provides transparent decision-making, enabling large-model quality on edge robots.
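A small sketch of the budget-aware routing idea (thresholds and signals are illustrative assumptions; the paper calibrates its router and also uses signals such as retrieval affinity and decomposition agreement):

```python
import numpy as np

def option_signals(logits):
    """Cheap uncertainty signals computed from multiple-choice option logits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    top2 = np.sort(p)[-2:]
    return {
        "confidence": float(top2[-1]),
        "margin": float(top2[-1] - top2[-2]),
        "entropy": float(-(p * np.log(p + 1e-12)).sum()),
    }

def route(logits, cheap_answer, expensive_fn, margin_thresh=0.15, entropy_thresh=1.0):
    """Pay for an expensive re-evaluation only when the cheap decision looks unreliable."""
    s = option_signals(logits)
    if s["margin"] < margin_thresh or s["entropy"] > entropy_thresh:
        return expensive_fn()   # e.g. high-resolution re-look, retrieval exemplars, decomposition
    return cheap_answer

# toy usage with made-up numbers: the top two options are nearly tied, so the router escalates
logits = np.array([2.0, 1.9, 0.1, -0.5])
print(route(logits, "A", lambda: "B (after re-evaluation)"))
```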
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1947. xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ xLSTM Transformers Scaling Laws Sequence Modeling TFLA Linear Attention Inference
Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training.
While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime.
We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment.
First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T).
Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work.
Finally, we analyze inference-time scaling characteristics.
Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers.
Notably, xLSTM models consistently Pareto-dominate Transformer models, delivering lower cross-entropy loss for the same compute budget.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1948. pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-Modal LLMs Spatial Reasoning 3D Vision
Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach. Our project website will be available at https://pySpatial.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1949. Bridging Input Feature Spaces Towards Graph Foundation Models
๐ฌ 19
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Neural Networks Graph Foundation Models
Unlike vision and language domains, graph learning lacks a shared input space, as input features differ across graph datasets not only in semantics, but also in value ranges and dimensionality. This misalignment prevents graph models from generalizing across datasets, limiting their use as foundation models.
In this work, we propose ALL-IN, a simple and theoretically grounded method that enables transferability across datasets with different input features. Our approach projects node features into a shared random space and constructs representations via covariance-based statistics, thus eliminating dependence on the original feature space.
We show that the computed node-covariance operators and the resulting node representations are invariant in distribution to permutations of the input features. We further demonstrate that the expected operator exhibits invariance to general orthogonal transformations of the input features.
Empirically, ALL-IN achieves strong performance across diverse node- and graph-level tasks on unseen datasets with new input features, without requiring architecture changes or retraining. These results point to a promising direction for input-agnostic, transferable graph models.
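One possible reading of the random-projection-plus-covariance construction, sketched below for intuition only (the paper's exact node-covariance operator may be defined differently):

```python
import torch

def random_projection(x, out_dim=64, seed=0):
    """Map dataset-specific node features (any input dimension) into a shared random space."""
    g = torch.Generator().manual_seed(seed)
    w = torch.randn(x.shape[1], out_dim, generator=g) / x.shape[1] ** 0.5
    return x @ w

def node_covariance_features(x_proj, adj):
    """One interpretation of covariance-based node statistics: for each node, the covariance
    of its neighbours' projected features, flattened into a vector."""
    n, d = x_proj.shape
    reps = []
    for i in range(n):
        nbrs = x_proj[adj[i].bool()]
        if nbrs.numel() == 0:
            reps.append(torch.zeros(d * d))
            continue
        centered = nbrs - nbrs.mean(0, keepdim=True)
        cov = centered.T @ centered / max(len(nbrs), 1)
        reps.append(cov.flatten())
    return torch.stack(reps)

# two toy datasets with different feature dimensionalities share the same 64-dim random space
x_a = torch.randn(5, 100)     # dataset A: 100-dim node features
x_b = torch.randn(5, 17)      # dataset B: 17-dim node features
adj = torch.ones(5, 5) - torch.eye(5)
print(node_covariance_features(random_projection(x_a), adj).shape)  # torch.Size([5, 4096])
print(node_covariance_features(random_projection(x_b), adj).shape)  # same shape despite d_in=17
```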
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1950. Confident Block Diagonal Structure-Aware Invariable Graph Completion for Incomplete Multi-view Clustering
๐ฌ 11
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Incomplete multi-view clustering Invariable graph completion Confident block diagonal structure learning
Multi-view clustering (MVC) adopts complementary information from multiple views to reveal the underlying structure of the data. However, conventional MVC methods face a crucial challenge on incomplete multi-view clustering (IMVC) tasks, where some views of the multi-view data are missing. In particular, current IMVC methods suffer from two main limitations: 1) they focus on recovering the missing data, yet often overlook the potential inaccuracies in imputed values caused by the absence of true label information; 2) the recovered features are learned from the complete data, neglecting the distributional discrepancy between the complete and incomplete instances. To tackle these issues, we propose a confident block diagonal structure-aware invariable graph completion-based incomplete multi-view clustering method (CBDS_IMVC). Specifically, we first design a confidence-aware missing-view inference strategy, in which confident block diagonal structures (CBDS) are learned to guarantee that the recovered instances of all views share the same strict invariable local structure under the CBDS constraint. We then propose an invariable graph completion strategy to learn the intrinsic structure across all views. All components are trained jointly, complementing and reinforcing one another to reach the optimum together. Compared with other state-of-the-art methods, the proposed CBDS_IMVC demonstrates superior performance across multiple benchmark datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1951. GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes
๐ฌ 15
โญ 5.50
๐ causal reasoning
๐ท๏ธ causal machine learning orthogonal learning deep generative models potential outcomes estimation
Various deep generative models have been proposed to estimate potential outcomes distributions from observational data. However, none of them have the favorable theoretical property of general Neyman-orthogonality and, associated with it, quasi-oracle efficiency and double robustness. In this paper, we introduce a general suite of generative Neyman-orthogonal (doubly-robust) learners that estimate the conditional distributions of potential outcomes. Our proposed generative doubly-robust learners (GDR-learners) are flexible and can be instantiated with many state-of-the-art deep generative models. In particular, we develop GDR-learners based on (a) conditional normalizing flows (which we call GDR-CNFs), (b) conditional generative adversarial networks (GDR-CGANs), (c) conditional variational autoencoders (GDR-CVAEs), and (d) conditional diffusion models (GDR-CDMs). Unlike the existing methods, our GDR-learners possess the properties of quasi-oracle efficiency and rate double robustness, and are thus asymptotically optimal. In a series of (semi-)synthetic experiments, we demonstrate that our GDR-learners are very effective and outperform the existing methods in estimating the conditional distributions of potential outcomes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1952. Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning
๐ฌ 17
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Unlearning AI Safety Explainability
Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks.
Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface.
In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it.
To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge.
Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark.
Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1953. Debugging Concept Bottleneck Models through Removal and Retraining
๐ฌ 19
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ concept bottleneck prototypical part network interpretability human debugging
Concept Bottleneck Models (CBMs) use a set of human-interpretable concepts to predict the final task label, enabling domain experts to not only validate the CBM's predictions, but also intervene on incorrect concepts at test time. However, these interventions fail to address systemic misalignment between the CBM and the expert's reasoning, such as when the model learns shortcuts from biased data. To address this, we present a general interpretable debugging framework for CBMs that follows a two-step process of *Removal* and *Retraining*. In the *Removal* step, experts use concept explanations to identify and remove any undesired concepts. In the *Retraining* step, we introduce **CBDebug**, a novel method that leverages the interpretability of CBMs as a bridge for converting concept-level user feedback into sample-level auxiliary labels. These labels are then used to apply supervised bias mitigation and targeted augmentation, reducing the modelโs reliance on undesired concepts. We evaluate our framework with both real and automated expert feedback, and find that **CBDebug** significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1954. LearnIR: Learnable Posterior Sampling for Real-World Image Restoration
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Image restoration diffusion model residual
Image restoration in real-world conditions is highly challenging due to heterogeneous degradations such as haze, noise, shadows, and blur. Existing diffusion-based methods remain limited: conditional generation struggles to balance fidelity and realism, inversion-based approaches accumulate errors, and posterior sampling requires a known forward operator that is rarely available. We introduce **LearnIR**, a learnable diffusion posterior sampling framework that eliminates this dependency by training a lightweight model to directly predict gradient correction distributions, enabling *Diffusion Posterior Sampling Correction (DPSC)* that maintains consistency with the true image distribution during sampling. In addition, a *Dynamic Resolution Module (DRM)* dynamically adjusts resolution to preserve global structures in early stages and refine fine textures later, while avoiding the need for a pretrained VAE. Experiments on ISTD, O-HAZE, HazyDet, REVIDE, and our newly constructed FaceShadow dataset show that LearnIR achieves state-of-the-art performance in PSNR, SSIM, and LPIPS.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1955. ArtUV: Artist-style UV Unwrapping
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ UV unwrapping Artist-style Auto-Encoder
UV unwrapping is an essential task in computer graphics, enabling various visual editing operations in rendering pipelines. However, existing UV unwrapping methods suffer from long processing times, fragmentation, a lack of semantics, and irregular UV islands, limiting their practical use. An artist-style UV map must not only satisfy fundamental criteria, such as overlap-free mapping and minimal distortion, but also uphold higher-level standards, including clean boundaries, efficient space utilization, and semantic coherence.
We introduce ArtUV, a fully automated, end-to-end method for artist-style UV unwrapping. We simulate the professional UV mapping process by dividing it into two stages: surface seam prediction and artist-style UV parameterization. In the seam prediction stage, SeamGPT is used to generate semantically meaningful cutting seams. Then, in the parameterization stage, a rough UV obtained from an optimization-based method, along with the mesh, is fed into an Auto-Encoder, which refines it into an artist-style UV map. Our method ensures semantic consistency and preserves topological structure, making the UV map ready for 2D editing. We evaluate ArtUV across multiple benchmarks and show that it serves as a versatile solution, functioning seamlessly as either a plug-in for professional rendering tools or as a standalone system for rapid, high-quality UV generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1956. From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
๐ฌ 15
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Multi-Agent System Large Language Model Evidence-Based Reasoning
Chemical reaction condition recommendation is the task of selecting proper reaction condition parameters for chemical reactions, which is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35\% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15\% in Top-1 similarity, while offering falsifiable, human-trustable rationales, establishing a new paradigm for explainable AI in scientific discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1957. How Catastrophic is Your LLM? Certifying Risk in Conversation
๐ฌ 13
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Catastrophic risks Multi-turn Attack Certification Safety
Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose C$^3$LLM, a novel, principled Certification framework for Catastrophic risks in multi-turn Conversation for LLMs that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70\% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1958. Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Natural Language Processing Code
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into operational code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1959. Reasoning Scaffolding: Distilling the Flow of Thought from LLMs
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Distillation Large Reasoning Model Reasoning Scaffolding Semantic Signals
The prevailing approach to distilling reasoning from Large Language Models (LLMs), behavioral cloning from textual rationales, is fundamentally limited. It teaches Small Language Models (SLMs) to mimic surface-level patterns rather than the underlying algorithmic structure of thought, resulting in a critical lack of logical robustness. We argue that instead of cloning text, distillation should transfer this algorithmic structure directly. We introduce Reasoning Scaffolding, a framework that reframes reasoning as a structured generation process. Our method first abstracts the teacher's thought process into a sequence of discrete, interpretable semantic signals (e.g., Contrast, Addition) that act as a scaffold. The student model is then trained via a multi-task objective to both (1) predict the next semantic signal, anticipating the reasoning flow, and (2) generate the corresponding step, conditioned on that signal. This multi-task scheme acts as a powerful regularizer, compelling the student to internalize the computational patterns of coherent reasoning. On a suite of challenging reasoning benchmarks, our method significantly outperforms state-of-the-art distillation in both accuracy and logical consistency, providing a path towards creating smaller models that are genuine reasoners, not just fluent mimics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1960. GEOMETRY OF UNCERTAINTY: LEARNING METRIC SPACES FOR MULTIMODAL STATE ESTIMATION IN RL
๐ฌ 14
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Multimodal RL
Estimating the state of an environment from high-dimensional, noisy observations is a fundamental challenge in reinforcement learning (RL). Traditional approaches rely on probabilistic models to account for the uncertainty, but often require explicit noise assumptions, in turn limiting generalization. In this work, we propose a novel method to learn a structured latent representation, in which distances between states directly correlate with the minimum number of actions required to transition between them. The proposed metric space formulation provides a geometric interpretation of uncertainty without the need for explicit probabilistic modeling. To achieve this, we introduce a multimodal latent transition model and a sensor fusion mechanism based on inverse distance weighting, allowing for the adaptive integration of multiple sensor modalities without prior knowledge of noise distributions. We empirically validate the approach on a range of RL tasks, demonstrating improved robustness to sensor noise and superior state estimation compared to baseline methods. Our experiments show enhanced performance of an RL agent via the learned representation, eliminating the need of explicit noise augmentation. The presented results suggest that leveraging transition-aware metric spaces provides a principled and scalable solution for robust state estimation in sequential decision-making.
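The inverse-distance-weighting step can be illustrated in a few lines (a simplified sketch; the choice of the reference point and the latent metric are assumptions here, not the paper's exact formulation):

```python
import torch

def fuse_modalities(latents, anchor, eps=1e-6):
    """Inverse-distance-weighted fusion: modalities whose latent estimates lie closer to a
    reference point (e.g. the model's predicted next state) receive larger weights."""
    # latents: (M, d), one latent state estimate per sensor modality; anchor: (d,)
    dists = torch.norm(latents - anchor, dim=-1)            # (M,)
    weights = 1.0 / (dists + eps)
    weights = weights / weights.sum()
    return (weights.unsqueeze(-1) * latents).sum(0), weights

latents = torch.tensor([[1.0, 0.0], [0.9, 0.1], [3.0, 3.0]])   # third sensor is very noisy
anchor = torch.tensor([0.95, 0.05])
fused, w = fuse_modalities(latents, anchor)
print(fused, w)   # the noisy third modality receives a near-zero weight
```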
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1961. Finite-Time Analysis of Actor-Critic Methods with Deep Neural Network Approximation
๐ฌ 15
โญ 5.50
๐ learning theory
๐ท๏ธ finite-time analysis actor-critic deep neural network
Actor-critic (AC) algorithms underpin many of today's most successful reinforcement learning (RL) applications, yet their finite-time convergence in realistic settings remains largely underexplored. Existing analyses often rely on oversimplified formulations and are largely confined to linear function approximation. In practice, however, nonlinear approximations with deep neural networks dominate AC implementations, leaving a substantial gap between theory and practice. In this work, we provide the first finite-time analysis of single-timescale AC with deep neural network approximation in continuous state-action spaces. In particular, we consider the challenging time-average reward setting, where one needs to simultaneously control three highly-coupled error terms including the reward error, the critic error, and the actor error. Our novel analysis is able to establish convergence to a stationary point at a rate $\widetilde{\mathcal{O}}(T^{-1/2})$, where $T$ denotes the total number of iterations, thereby providing theoretical grounding for widely used deep AC methods. We substantiate these theoretical guarantees with experiments that confirm the proven convergence rate and further demonstrate strong performance on MuJoCo benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1962. Play to Generalize: Learning to Reason Through Game Play
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ reinforcement learning; large language models
Developing reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by literature suggesting that gameplay promotes transferable reasoning skills, we propose a novel post-training method, Visual Game Learning (ViGaL), where MLLMs develop generalizable reasoning skills through playing arcade-like games. Specifically, we show that training a 7B-parameter MLLM via reinforcement learning (RL) on simple games like Snake significantly enhances the downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL.
Remarkably, our model outperforms specialist models post-trained on benchmark-oriented multimodal reasoning data, while preserving the modelโs performance on general visual benchmarks, a challenge where specialist models often fall short.
Our findings suggest that multimodal reasoning can emerge from gameplay, pointing to a promising strategy of designing surrogate tasks for RL post-training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1963. Improving Autoregressive Video Modeling with History Understanding
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ autoregressive video generation diffusion models
Video autoregressive generation (VideoAR) sequentially predicts future frames conditioned on history frames. Despite the advance of recent diffusion-based VideoAR, the role of the conditioning signal, i.e., the internal representations of history frames, remains underexplored. Inspired by the success of strong condition representations in text-conditioned generation, we investigate: \textit{Can better internal representations of history frames improve VideoAR performance?} Through systematic analysis, we show that history representation quality positively correlates with VideoAR performance, and that enhancing these representations provides gains that cannot be achieved by refining future-frame representations alone. Based on these insights, we propose \textbf{MiMo} (Masked History Modeling), a novel framework that seamlessly integrates representation learning into diffusion-based VideoAR. MiMo applies masks to history frame tokens and trains the model to predict masked tokens of current and future frames alongside the diffusion objective, yielding predictive and robust history representations without relying on vision foundation models (VFMs) or heavy architectural changes. Extensive experiments demonstrate that MiMo achieves competitive performance in video prediction and generation tasks while substantially improving training efficiency. Our work underscores the importance of history representations in VideoAR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1964. Point-MoE: Large-Scale Multi-Dataset Training with Mixture-of-Experts for 3D Semantic Segmentation
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Semantic Segmentation Mixture of Expert Point Cloud Understanding
While massively scaling both data and models has become central in NLP and 2D vision, the benefits of such scaling for 3D point cloud understanding remain limited.
We study the initial step of 3D point cloud scaling under a realistic regime: large-scale multi-dataset joint training for 3D semantic segmentation, with no dataset labels available at inference time.
Point clouds arise from a wide range of sensors (e.g., depth cameras, LiDAR) and scenes (e.g., indoor, outdoor), yielding heterogeneous scanning patterns, sampling densities, and semantic biases; naively mixing such datasets degrades standard backbones.
We introduce **Point-MoE**, a Mixture-of-Experts design that expands capacity through sparsely activated expert MLPs and a lightweight top-$k$ router, allowing tokens to select specialized experts without requiring dataset supervision.
Trained jointly on a diverse mix of indoor and outdoor datasets and evaluated on seen datasets and in zero-shot settings, Point-MoE outperforms prior methods without using dataset labels for either training or inference.
This outlines a scalable path for 3D perception: letting the model discover structure in heterogeneous 3D data rather than imposing it via manual curation or dataset-specific heuristics.
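A compact sketch of a sparsely activated expert layer with a top-k router (a generic MoE layer for intuition; Point-MoE's placement inside the point-cloud backbone and its load-balancing details are not shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparsely activated expert MLPs with a lightweight top-k router (no dataset labels needed)."""
    def __init__(self, dim, n_experts=8, k=2, hidden=256):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                     # x: (N_tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)             # (N, E)
        topv, topi = gates.topk(self.k, dim=-1)               # (N, k)
        topv = topv / topv.sum(dim=-1, keepdim=True)          # renormalise the selected gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(dim=96)
tokens = torch.randn(1024, 96)        # hypothetical point/voxel tokens
print(moe(tokens).shape)              # torch.Size([1024, 96])
```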
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1965. TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Memory Efficient Fine Tuning
Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior arts offer various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1966. DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Video Editing Video Customization Video Inpainting Diffusion Transformer Computer Vision
With the rapid progress of video generation, demand for customized video editing is surging, where subject swapping constitutes a key component yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains, such as human-body animation or hand-object interaction, or rely on indirect editing paradigms or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video for customization with a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject and its surrounding context. Through our elaborate two-phase dataset construction and training scheme, our DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and our newly introduced DreamSwapV-Benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1967. TriQDef: Disrupting Semantic and Gradient Alignment to Prevent Adversarial Patch Transferability in Quantized Neural Networks
๐ฌ 13
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Patch-based attacks adversarial transferability model quantization
Quantized Neural Networks (QNNs) are widely deployed in edge and resource-constrained environments for their efficiency in computation and memory. While quantization distorts gradient landscapes and weakens pixel-level attacks, it offers limited robustness against patch-based adversarial attacks: localized, high-saliency perturbations that remain highly transferable across bit-widths. Existing defenses either overfit to specific quantization settings or fail to address this cross-bit vulnerability.
We propose \textbf{TriQDef}, a tri-level quantization-aware defense framework that disrupts the transferability of patch-based attacks across QNNs. TriQDef integrates: (1) a \emph{Feature Disalignment Penalty (FDP)} that enforces semantic inconsistency by penalizing perceptual similarity in intermediate features; (2) a \emph{Gradient Perceptual Dissonance Penalty (GPDP)} that misaligns input gradients across quantization levels using structural metrics such as Edge IoU and HOG Cosine; and (3) a \emph{Joint Quantization-Aware Training Protocol} that applies these penalties within a \emph{shared backbone} jointly optimized across multiple quantizers.
Extensive experiments on CIFAR-10 and ImageNet show that TriQDef lowers Attack Success Rates (ASR) by over 40\% on unseen patch and quantization combinations while preserving high clean accuracy. These results highlight the importance of disrupting both semantic and perceptual gradient alignment to mitigate patch transferability in QNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1968. A State-Transition Framework for Efficient LLM Reasoning
๐ฌ 17
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models reasoning efficient reasoning
While Long Chain-of-Thought (CoT) reasoning significantly improves Large Language Models (LLMs) performance on complex reasoning tasks, the substantial computational and memory costs of generating long CoT sequences limit their efficiency and practicality.
Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences.
However, this approach conflicts with test-time scaling, limiting the reasoning capacity of LLMs.
In this paper, we propose an efficient reasoning framework that models the reasoning process of LLMs as a stateโtransition process.
Specifically, we first apply a linear attention mechanism to estimate the LLM's reasoning state, which records the historical reasoning information from previous reasoning steps.
Then, based on the query prompt and the reasoning state, the LLM can efficiently perform the current reasoning step and update the state.
With the linear attention, each token in the current reasoning step can directly retrieve relevant historical reasoning information from the reasoning state, without explicitly attending to tokens in previous reasoning steps.
In this way, the computational complexity of attention is reduced from quadratic to linear, significantly improving the reasoning efficiency of LLMs.
In addition, we propose a state-based reasoning strategy to mitigate the over-thinking issue caused by noisy reasoning steps.
Extensive experiments across multiple datasets and model sizes demonstrate that our framework not only improves the reasoning efficiency of LLMs but also enhances their reasoning performance.
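For intuition, below is a generic linear-attention recurrence of the kind described above, where a running state summarizes all previous reasoning steps (the paper's exact state parameterization and update rule may differ):

```python
import torch

def linear_attention_step(state, norm, q, k, v):
    """One reasoning step with a recurrent linear-attention state.
    state: (d_k, d_v) running sum of k (outer) v; norm: (d_k,) running sum of keys."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0     # positive feature map
    q, k = phi(q), phi(k)
    state = state + torch.outer(k, v)                    # fold the new step into the state
    norm = norm + k
    out = (q @ state) / (q @ norm + 1e-6)                # retrieve history relevant to the query
    return out, state, norm

d_k, d_v = 16, 32
state, norm = torch.zeros(d_k, d_v), torch.zeros(d_k)
for _ in range(5):                                       # five "reasoning steps"
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    out, state, norm = linear_attention_step(state, norm, q, k, v)
print(out.shape)   # torch.Size([32]); cost per step is constant in the number of previous steps
```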
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1969. Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
๐ฌ 31
โญ 5.50
๐ causal reasoning
๐ท๏ธ Erasable Reinforcement Learning Multi-hop Reasoning LLM-based Agents
While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1970. Automated Formalization via Conceptual Retrieval-Augmented LLMs
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Autoformalization Retrieval-augmented Generation
Interactive theorem provers (ITPs) require manual formalization, which is labor-intensive and demands expert knowledge. While automated formalization offers a potential solution, it faces two major challenges: model hallucination (e.g., undefined predicates, symbol misuse, and version incompatibility) and the semantic gap caused by ambiguous or missing premises in natural language descriptions. To address these issues, we propose CRAMF, a Concept-driven Retrieval-Augmented Mathematical Formalization framework. CRAMF enhances LLM-based autoformalization by retrieving formal definitions of core mathematical concepts, providing contextual grounding during code generation. However, applying retrieval-augmented generation (RAG) in this setting is non-trivial due to the lack of structured knowledge bases, the polymorphic nature of mathematical concepts, and the high precision required in formal retrieval. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4, the standard mathematical library for the Lean 4 theorem prover, indexing over 26,000 formal definitions and 1,000+ core mathematical concepts. To address conceptual polymorphism, we propose contextual query augmentation with domain- and application-level signals. In addition, we design a dual-channel hybrid retrieval strategy with reranking to ensure accurate and relevant definition retrieval. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers, yielding consistent improvements in translation accuracy, achieving up to 62.1% and an average of 29.9% relative improvement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1971. IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation
๐ฌ 18
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ test time adaptation continual learning
Test-time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution.
However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored.
In this paper, we propose a novel approach, Intrinsic Mixture of Spectral Experts (IMSE), that leverages the spectral experts inherently embedded in Vision Transformers.
We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, referring to each decomposed rank-1 component as a spectral expert while keeping the singular vectors fixed.
We further identify a key limitation of entropy minimization in TTA: it often reduces feature variance, causing the model to rely on domain-specific cues rather than class-discriminative features.
To address this, we propose a diversity maximization loss based on singular vector-input alignment, which maximizes the diversity of response patterns.
In the continual test-time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We introduce Domain-Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts, and retrieves adapted singular values for rapid adaptation.
Extensive experiments show that our method achieves state-of-the-art performance on ImageNet-C/R/A under single-domain TTA. In CTTA, it improves accuracy by 3.4pp with 2,000$\times$ fewer trainable parameters.
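A minimal sketch of adapting only the singular values of a pretrained linear layer (the diversity loss, the domain-aware retrieval, and the exact expert definition from the paper are omitted; this only illustrates the SVD-based parameterization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralExpertLinear(nn.Module):
    """Freeze the singular vectors of a pretrained weight and adapt only the singular values;
    each rank-1 component U[:, i] * s_i * Vh[i, :] acts as one 'spectral expert'."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        W = pretrained.weight.data                         # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U)                       # frozen
        self.register_buffer("Vh", Vh)                     # frozen
        self.s = nn.Parameter(S.clone())                   # the only weights adapted at test time
        self.register_buffer("bias", pretrained.bias.data.clone())

    def forward(self, x):
        W = self.U @ torch.diag(self.s) @ self.Vh          # reassemble the weight on the fly
        return F.linear(x, W, self.bias)

layer = SpectralExpertLinear(nn.Linear(128, 64))
print(layer(torch.randn(4, 128)).shape)                               # torch.Size([4, 64])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 64 trainable singular values
```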
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1972. SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator
๐ฌ 24
โญ 5.50
๐ generative models
๐ท๏ธ Generative adversarial network conditional generation generative models
Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and conditional alignment of input samples within their conditional discriminators. To address this, we propose a novel discriminator design that integrates three key capabilities: unconditional discrimination, matching-aware supervision to enhance alignment sensitivity, and adaptive weighting to dynamically balance all objectives. Specifically, we introduce Sum of Naturalness and Alignment (SONA), which employs separate projections for naturalness (authenticity) and alignment in the final layer with an inductive bias, supported by dedicated objective functions and an adaptive weighting mechanism. Extensive experiments on class-conditional generation tasks show that SONA achieves superior sample quality and conditional alignment compared to state-of-the-art methods. Furthermore, we demonstrate its effectiveness in text-to-image generation, confirming the versatility and robustness of our approach.
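A rough sketch of a discriminator head with separate naturalness and alignment projections (the matching-aware supervision and adaptive weighting described in the abstract are omitted; names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class SONAHead(nn.Module):
    """Discriminator head that scores realism and conditional alignment separately, then sums them."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.naturalness = nn.Linear(feat_dim, 1)             # unconditional realism score
        self.class_embed = nn.Embedding(n_classes, feat_dim)  # projection-style alignment score

    def forward(self, feats, labels):                         # feats: (B, feat_dim)
        nat = self.naturalness(feats).squeeze(-1)             # (B,)
        align = (feats * self.class_embed(labels)).sum(-1)    # (B,) inner-product alignment
        return nat + align, nat, align                        # total score plus the two terms

head = SONAHead(feat_dim=256, n_classes=10)
feats, labels = torch.randn(8, 256), torch.randint(0, 10, (8,))
score, nat, align = head(feats, labels)
print(score.shape)   # torch.Size([8])
```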
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1973. Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting
๐ฌ 11
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ VLAs Embodied Reasoning Action Representation
Fine-tuning vision-language models (VLMs) on robot teleoperation data to create vision-language-action (VLA) models is a promising paradigm for training generalist policies, but it suffers from a fundamental tradeoff: learning to produce actions often diminishes the VLM's foundational reasoning and multimodal understanding, hindering generalization to novel scenarios, instruction following, and semantic understanding. We argue that this catastrophic forgetting is due to a distribution mismatch between the VLM's internet-scale pretraining corpus and the robotics fine-tuning data. Inspired by this observation, we introduce VLM2VLA: a VLA training paradigm that first resolves this mismatch at the data level by representing low-level actions with natural language. This alignment makes it possible to train VLAs solely with Low-Rank Adaptation (LoRA), thereby minimally modifying the VLM backbone and averting catastrophic forgetting. As a result, the VLM can be fine-tuned on robot teleoperation data without fundamentally altering the underlying architecture and without expensive co-training on internet-scale VLM datasets. Through extensive Visual Question Answering (VQA) studies and over 800 real-world robotics experiments, we demonstrate that VLM2VLA preserves the VLM's core capabilities, enabling zero-shot generalization to novel tasks that require open-world semantic reasoning and multilingual instruction following.
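The data-level alignment can be pictured as serializing each low-level action into plain language before fine-tuning; the template below is a hypothetical example, not the paper's actual format:

```python
def action_to_text(delta_xyz, delta_rpy, gripper_open):
    """Hypothetical action-as-language encoding for a 7-DoF end-effector command."""
    dx, dy, dz = delta_xyz
    dr, dp, dyaw = delta_rpy
    return (f"move the gripper x {dx:+.3f} m, y {dy:+.3f} m, z {dz:+.3f} m; "
            f"rotate roll {dr:+.2f} rad, pitch {dp:+.2f} rad, yaw {dyaw:+.2f} rad; "
            f"{'open' if gripper_open else 'close'} the gripper")

# such strings become ordinary text targets for LoRA fine-tuning of the VLM
print(action_to_text((0.01, -0.02, 0.00), (0.0, 0.0, 0.10), gripper_open=False))
```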
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1974. On the Alignment Between Supervised and Self-Supervised Contrastive Learning
๐ฌ 18
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Representation alignment self-supervised learning contrastive learning
Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL), as the number of classes grows. Yet this loss-level similarity leaves an open question: {\em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?}
We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time.
Finally, we validate these predictions empirically, showing that CLโNSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1975. Dual Goal Representations
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning goal-conditioned rl offline rl
In this work, we introduce dual goal representations for goal-conditioned reinforcement learning (GCRL). A dual goal representation characterizes a state by "the set of temporal distances from all other states"; in other words, it encodes a state through its relations to every other state, measured by temporal distance. This representation provides several appealing theoretical properties. First, it depends only on the intrinsic dynamics of the environment and is invariant to the original state representation. Second, it contains provably sufficient information to recover an optimal goal-reaching policy, while being able to filter out exogenous noise. Based on this concept, we develop a practical goal representation learning method that can be combined with any existing GCRL algorithm. Through diverse experiments on the OGBench task suite, we empirically show that dual goal representations consistently improve offline goal-reaching performance across 20 state- and pixel-based tasks.
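On a tiny deterministic MDP the construction can be made explicit: a goal is represented by the vector of temporal distances from every state to that goal (a toy sketch; the practical method learns these quantities rather than computing them exactly):

```python
import numpy as np

def temporal_distances(adj):
    """All-pairs minimum number of actions between states (Floyd-Warshall on a small MDP)."""
    n = adj.shape[0]
    d = np.full((n, n), np.inf)
    np.fill_diagonal(d, 0.0)
    d[adj > 0] = 1.0
    for k in range(n):
        d = np.minimum(d, d[:, k, None] + d[None, k, :])
    return d

# toy 4-state chain with a reset action from the last state back to the first
adj = np.zeros((4, 4))
adj[0, 1] = adj[1, 2] = adj[2, 3] = adj[3, 0] = 1
D = temporal_distances(adj)
dual_repr_of_goal_2 = D[:, 2]       # temporal distance from each state to goal state 2
print(dual_repr_of_goal_2)          # [2. 1. 0. 3.]
```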
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1976. VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLMs AVR Fine-grained Perception
Recent strides in multimodal large language models (MLLMs) have demonstrated significant progress in many reasoning tasks, but they still fail in Abstract Visual Reasoning (AVR) tasks. Our experimental findings indicate that the core bottleneck lies not only in the reasoning capabilities of MLLMs but more critically in their absence of fine-grained perception. To address this issue, we present VisuRiddles, a dedicated resource for AVR research. It consists of (i) a benchmark, collected from real-world data, for the systematic evaluation of MLLMs' AVR capabilities, and (ii) a synthesizer, which automatically generates AVR instances enriched with perceptual descriptions and reasoning chains, enabling supervised training and deeper investigation. Building on VisuRiddles, we propose a two-stage training paradigm that progressively enhances perceptual ability and strengthens reasoning, producing the Perception-Augmented Visual Reasoner (PAVR). Experiments demonstrate that PAVR unifies perception and reasoning, substantially outperforming both open-source and commercial MLLMs, thereby underscoring fine-grained perception as the primary bottleneck in AVR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1977. Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity
๐ฌ 25
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models LLMs Post-Training Reasoning theorem proving Lean f-divergences Amari $\alpha$-divergences Distributional Matching diversity
Reinforcement Learning (RL) has become the _de facto_ standard for tuning LLMs to solve tasks involving reasoning.
However, growing evidence shows that such models often suffer from a significant loss in diversity.
We argue that this arises because RL implicitly optimizes the Reverse KL to a target distribution, which concentrates on certain high-probability regions of the target while neglecting others.
In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones.
Starting from a pre-trained LLM, we approximate this target distribution using Amari's $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences.
On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, unmatched by other methods along the coverage axis.
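For reference, one common parameterization of the Amari $\alpha$-divergence family is (constants and sign conventions vary across references, so the paper's exact form may differ):

$$D_\alpha(p \,\|\, q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{\frac{1+\alpha}{2}}\, q(x)^{\frac{1-\alpha}{2}}\, dx\right), \qquad \alpha \in (-1, 1),$$

with the two KL divergences recovered in the limits $\alpha \to \pm 1$. Sweeping $\alpha$ between these endpoints is what trades mode-seeking (precision-oriented) behavior against mass-covering (diversity-oriented) behavior.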
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1978. LeSTD: LLM Compression via Learning-based Sparse Tensor Decomposition
๐ฌ 18
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Compression Post-training Compression Tucker Decomposition Sparsity
Large Language Models (LLMs) achieve remarkable success, but their massive parameter counts present significant deployment challenges. Post-training tensor decomposition offers a promising, data-free compression strategy by exploiting structural redundancies within the model weights. However, existing tensor methods face a critical limitation: the dense core tensor bottleneck. While these methods find a shared low-rank basis, the resulting dense core tensor grows polynomially with the chosen ranks, becoming a new storage bottleneck and capping the maximum achievable compression. To overcome this fundamental barrier, we introduce LeSTD (\textbf{Le}arning-based \textbf{S}parse \textbf{T}ensor \textbf{D}ecomposition), a novel two-stage framework for the high-ratio compression of Multi-Head Attention (MHA) blocks. LeSTD first employs an iterative algorithm to identify a high-quality, and shared orthogonal basis that jointly represents all attention heads. Subsequently, it introduces a principled, importance-based pruning algorithm that learns an ultra-sparse core tensor by systematically removing the least salient elements and refitting the remaining ones to preserve model fidelity. By decoupling basis optimization from core sparsification, LeSTD breaks the compression ceiling imposed by the dense core, enabling significantly higher compression ratios than prior methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1979. Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
๐ฌ 18
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language models (LLM) Pre-training Mixture-of-Experts (MoE)
Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and arrive at an optimal model design that maximizes performance. Based on this, we subsequently find that an MoE model with an activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute, and data resources. More importantly, this optimal region remains consistent across different model sizes. Although the enhanced performance comes with the trade-off of requiring additional data, we show that this can be resolved by reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All code and models will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1980. Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
๐ฌ 12
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ reward models LLM-based evaluators biases
Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. However, the connection between training data artifacts and the miscalibrated preferences exhibited by models remains poorly understood.
In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with artificially magnified biases (\textit{skew}), finding this preference occurs in $>60$\% of instances, and model preferences show high \textit{miscalibration} ($\approx 40$\%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean $r_{\mathrm{human}} = -0.12$) but show moderately strong positive correlations with labels from a strong reward model (mean $r_{\mathrm{model}} = +0.36$), suggesting that models may overrely on spurious cues.
To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Fine-tuning models with CDA reduces average miscalibration from 39.4\% to 32.5\% and average absolute skew difference from 20.5\% to 10.0\%, while maintaining overall RewardBench performance, indicating that targeted debiasing can strengthen the reliability of preference models within standard alignment pipelines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1981. First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation
๐ฌ 21
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Influence function Data valuation Model Debugging Detrimental Sample Detection
Identifying how training samples influence/impact Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, owing to the large model sizes of today consisting of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we propose theoretical and empirical evidence demonstrating how the cancellation effect is unreliable, and that middle attention layers are better estimators for influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without undertaking model retraining, and propose a new metric known as the Noise Detection Rate (NDR) that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1982. TGM: A Modular and Efficient Library for Machine Learning on Temporal Graphs
๐ฌ 23
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Temporal Graph Learning Dynamic Graphs Deep Learning Programming Framework Software Libraries
Well-designed open-source software drives progress in Machine Learning (ML) research. While static graph ML enjoys mature frameworks like PyTorch Geometric and DGL, ML for temporal graphs (TG), networks that evolve over time, lacks comparable infrastructure. Existing TG libraries are often tailored to specific architectures, hindering support for diverse models in this rapidly evolving field. Additionally, the divide between continuous- and discrete-time dynamic graph methods (CTDG and DTDG) limits direct comparisons and idea transfer. To address these gaps, we introduce Temporal Graph Modelling (TGM), a research-oriented library for ML on temporal graphs, the first to unify CTDG and DTDG approaches. TGM offers first-class support for dynamic node features, time-granularity conversions, and native handling of link-, node-, and graph-level tasks. Empirically, TGM achieves an average 7.8× speedup across multiple models, datasets, and tasks compared to the widely used DyGLib, and an average 175× speedup on graph discretization relative to available implementations. Beyond efficiency, we show in our experiments how TGM unlocks entirely new research possibilities by enabling dynamic graph property prediction and time-driven training paradigms, opening the door to questions previously impractical to study.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1983. ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion
๐ฌ 19
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion
Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation.
In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules.
To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse.
Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan 2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1984. Hyden: A Hybrid Dual-Path Encoder for Monocular Geometry of High-resolution Images
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D vision Monocular Depth Estimation Monocular Surface Normal Estimation
We present a hybrid dual-path vision encoder (Hyden) for high-resolution monocular depth, point map and surface normal estimation, surpassing state-of-the-art accuracy with a fraction of the inference cost. The architecture pairs a low-resolution Vision Transformer branch for global context with a full-resolution CNN branch for fine details, fusing features via a lightweight MLP before decoding. By exploiting the linear scaling of CNNs and constraining transformer computation to a fixed resolution, the model delivers fast inference even on multi-megapixel inputs. To overcome the scarcity of high-quality high-resolution supervision, we introduce a self-distillation framework that generates pseudo-labels from existing models on both lower-resolution full images and high-resolution crops: global labels preserve geometric accuracy, while local labels capture sharper details. To demonstrate the flexibility of our approach, we integrate Hyden and our self-distillation method into DepthAnything-v2 for depth estimation and MoGe2 for surface normal and metric point map prediction, achieving state-of-the-art results on high-resolution benchmarks with the lowest inference latency among competing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1985. Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences
๐ฌ 14
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Alignment Direct Preference Optimization Reinforcement Learning f-divergences
DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by an $f$-divergence with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. We finally focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1986. Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization
๐ฌ 18
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ label noise domain generalization noise-robust generalization
Methods addressing Learning with Noisy Labels (LNL) and multi-source Domain Generalization (DG) use training techniques to improve downstream task performance in the presence of label noise or domain shifts, respectively. Prior work often explores these tasks in isolation, with only limited work evaluating how label noise affects existing DG methods without also exploring methods to reduce its effect. However, many applications require methods that are robust to both label noise and distribution shifts, which we refer to as Noise-Aware Generalization (NAG), and when these problems are considered together new challenges emerge. E.g., most LNL methods identify noise by detecting distribution shifts in a class's samples, i.e., they assume that distribution shifts often correspond to label noise. In NAG, distribution shifts can be due to label noise or domain shifts, breaking the assumptions used by LNL methods. DG methods often overlook the effect of label noise entirely, which can confuse a model during training, reducing performance. A naive solution is to make an assumption similar to that of many DG methods, where we presume to have domain labels during training, enabling us to isolate the two types of shifts. However, this ignores valuable cross-domain information. Instead, our proposed DL4ND approach improves noise detection by taking advantage of the observation that noisy samples that may appear indistinguishable within a single domain often show greater variation when compared across domains. Experiments show DL4ND significantly improves performance across seven diverse datasets, offering a promising direction for tackling NAG.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1987. JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multi-Agent Diffusion Controllable Trajectory
Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce $\textbf{JointDiff}$, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: $\textit{weak-possessor-guidance}$, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and $\textit{text-guidance}$, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce $\textbf{CrossGuid}$, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1988. On the Tension Between Optimality and Adversarial Robustness in Policy Optimization
๐ฌ 21
โญ 5.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning adversarial robustness policy optimization theory-practice gap bilevel optimization
Achieving optimality and adversarial robustness in deep reinforcement learning has long been regarded as conflicting goals. Nonetheless, recent theoretical insights presented in CAR suggest a potential alignment, raising the important question of how to realize this in practice.
This paper first identifies a key gap between theory and practice by comparing standard policy optimization (SPO) and adversarially robust policy optimization (ARPO). Although they share theoretical consistency, *a fundamental tension between robustness and optimality arises in practical policy gradient methods*. SPO tends toward convergence to vulnerable first-order stationary policies (FOSPs) with strong natural performance, whereas ARPO typically favors more robust FOSPs at the expense of reduced returns. Furthermore, we attribute this tradeoff to the *reshaping effect of the strongest adversaries* in ARPO, which significantly complicates the global landscape by inducing *deceptive sticky FOSPs*. This improves robustness but makes navigation more challenging. To alleviate this, we develop *BARPO*, a bilevel framework unifying SPO and ARPO by modulating adversary strength, thereby facilitating navigability while preserving global optima. Extensive empirical results demonstrate that BARPO consistently outperforms vanilla ARPO, providing a practical approach to reconcile theoretical and empirical performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1989. Heterogeneous Front-Door Effects: Debiased Estimation with Quasi-Oracle Guarantees
๐ฌ 11
โญ 5.50
๐ causal reasoning
๐ท๏ธ front-door heterogeneous-treatment-effect orthogonal-statistical-learning
In many applications, treatment and outcome are confounded by unobservables, yet mediators remain unconfounded. The front-door (FD) adjustment identifies causal effects through mediators even with unmeasured confounding. However, most estimators focus on *average treatment effects*, and work on *heterogeneous treatment effect* (HTE) estimation remains scarce. We address this gap with two *debiased* learners for heterogeneous FD effects: *FD-DR-Learner* and *FD-R-Learner*. Both attain fast, quasi-oracle rates (i.e., performance comparable to an oracle that knows the nuisances) even when nuisance functions converge as slowly as $n^{-1/4}$. Beyond theory, we demonstrate fast convergence and debiasedness in synthetic and real-world evaluations. Our results show that the proposed learners deliver robust and debiased HTE estimates under the FD scenario.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1990. SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Models Sports Understanding Benchmark
Artificial Intelligence brings powerful new tools to sports, from automated officiating to tactical analysis, but these applications all depend on a core reasoning capability.
Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning, a challenge that pushes the limits of current multimodal models.
To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence.
Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context.
To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence.
Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths, from simple infraction identification to complex penalty prediction.
For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain-of-Thought (CoT) annotations.
In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to directly test visual grounding for the image modality. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning. The dataset is available at https://anonymous.4open.science/r/ICLR2026-E071/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1991. Low-Rank Few-Shot Node Classification by Node-Level Graph Diffusion
๐ฌ 16
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Few-Shot Node Classification Low-Rank Few-Shot Graph Diffusion Model Low-Rank Learning
In this paper, we propose a novel node-level graph diffusion method with low-rank feature learning for few-shot node classification (FSNC), termed Low-Rank Few-Shot Graph Diffusion Model or LR-FGDM.
LR-FGDM first employs a novel Few-Shot Graph Diffusion Model (FGDM) as a node-level graph generative method to generate an augmented graph with an enlarged support set, then performs low-rank transductive classification to obtain the few-shot node classification results.
Our graph diffusion model, FGDM, comprises two components, the Hierarchical Graph Autoencoder (HGAE) with an efficient hierarchical edge reconstruction method and a new prototypical regularization, and the Latent Diffusion Model (LDM).
The low-rank regularization is robust to the noise inherently introduced by the diffusion model and is empirically motivated by the Low Frequency Property. We also provide a strong theoretical guarantee justifying the low-rank regularization for the transductive classification in few-shot learning. Extensive experimental results demonstrate the effectiveness of LR-FGDM for few-shot node classification, which outperforms the current state-of-the-art. The code of LR-FGDM is available at \url{https://anonymous.4open.science/r/LR-FGDM/}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1992. Generalized Parallel Scaling with Interdependent Generations
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model inference scaling reasoning reinforcement learning post-training attention
Parallel LLM inference scaling involves sampling a set of $N>1$ responses for a single input prompt. However, these $N$ parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8\%-5.1\%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 39\% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1993. Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ generalization continual learning fine-tuning memorization
Rote learning is a memorization technique based on repetition. Many researchers argue that rote learning hinders generalization because it encourages verbatim memorization rather than deeper understanding. This concern extends even to factual knowledge, which inevitably requires a certain degree of memorization.
In this work, we challenge this view and demonstrate that large language models (LLMs) can, in fact, generalize over rote memorized data. We introduce a two-phase "memorize-then-generalize" framework, where the model first rote memorizes factual subject-object associations using a synthetic, semantically meaningless key token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the key token and the semantically meaningful prompts.
This surprising finding opens the door to both effective and efficient knowledge injection as well as possible risks of repurposing the memorized data for malicious usage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1994. On the Predictive Power of Representation Dispersion in Language Models
๐ฌ 21
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Embedding geometry Unsupervised evaluation Mechanistic interpretability Large Language Models Label-free metrics
We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion (the average pairwise cosine distance among hidden vectors) strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple "push-away" objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
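The dispersion quantity described above is simple enough to state in a few lines. Below is a minimal sketch, assuming `H` is an (n, d) matrix of hidden states already extracted from some layer of a model; how those states are obtained is model-specific and omitted.

```python
# Minimal sketch of "representation dispersion" as described above: the average
# pairwise cosine distance among a set of contextual hidden vectors.
import torch

def representation_dispersion(H: torch.Tensor) -> float:
    """Average pairwise cosine distance (1 - cosine similarity) over rows of H."""
    Hn = torch.nn.functional.normalize(H, dim=-1)
    sim = Hn @ Hn.T                       # (n, n) cosine similarities
    n = H.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()
    mean_sim = off_diag / (n * (n - 1))
    return (1.0 - mean_sim).item()
```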
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1995. Point-UQ: An Uncertainty-Quantification Paradigm for Point Cloud Few-Shot Class Incremental Learning
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D point cloud processing few-shot learning class-incremental learning
3D few-shot class-incremental learning (3D FSCIL) requires effectively integrating novel classes from limited samples while preserving base-class knowledge, without catastrophically forgetting previously learned knowledge or overfitting to the novel classes.
Current 3D FSCIL approaches predominantly focus on fine-tuning feature representations yet retain static decision boundaries.
This leads to a critical trade-off: excessive adaptation to new samples tends to erase previously learned knowledge, while insufficient adaptation hinders novel-class recognition.
We argue that the key to effective incremental learning lies not only in feature enhancement but also in adaptive decision-making.
To this end, we introduce **Point-UQ**, an incremental training-free paradigm for 3D **point** clouds based on **u**ncertainty **q**uantification, which shifts the focus from feature tuning to dynamic decision optimization.
Point-UQ comprises two co-designed modules: *Attention-driven Adaptive Enhancement (AAE)* and *Uncertainty-quantification Decision Decoupling (UDD)*.
The former module fuses multi-scale features into calibrated representations, where prediction entropy serves as a reliable measure of per-sample epistemic uncertainty while preserving original feature semantics. Building on AAE-derived calibrated entropy, the UDD module dynamically arbitrates between semantic classifiers and geometric prototypes, enabling robust base-class knowledge retention and accurate novel-class recognition in 3D FSCIL without retraining.
Extensive experiments on ModelNet, ShapeNet, ScanObjectNN, and CO3D demonstrate that our approach outperforms state-of-the-art methods by 4% in average accuracy, setting a new standard for robust 3D incremental learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1996. ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Active Data Selection Direct Preference Optimization Human Feedback LLM Alignment
The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions, such as a linear latent reward function. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of the LLM on data selection, unlike methods that select data without considering the LLM being aligned, thereby leading to more effective and efficient data collection. Our extensive experiments demonstrate that ActiveDPO outperforms existing methods across various models and real-life preference datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1997. A Benchmark for Deep Information Synthesis
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmark Deep Information Synthesis LLM agents Deep Research AI agents
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 42 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 9 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1998. UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
๐ฌ 28
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Simulation Real-to-Sim Sim-to-Real Digital Twin Robot Navigation Reinforcement Learning
Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, accomplishing a 300 m real-world mission with only two interventions. We invite readers to explore our anonymous [Demo Page](https://anonymoususeruseanonymousname.github.io/) and [Documentation Page](https://anonymousrepohasanonymousname.github.io/index.html).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
1999. The Softmax Bottleneck Does Not Limit the Probabilities of the Most Likely Tokens
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Softmax Bottleneck Transformer Output Projection Matrix Large Language Models
In many popular transformer architectures, an output projection matrix linearly maps lower-dimensional embeddings into a higher-dimensional space of logits.
It has been shown that this leads to a {\em softmax bottleneck} that prevents the production of arbitrary probability distributions. It has been argued that this limits large language models (LLMs) in their ability to express next token probabilities that perfectly align with the statistics of natural language. We focus on the ability of such models to produce accurate probabilities for just the top-$m$ tokens. We provide theoretical bounds that show that even a randomly initialized projection matrix can successfully do this for rather large values of $m$, supported by empirical results on random and trained matrices. This suggests that the softmax bottleneck does not significantly limit the capabilities of LLMs. We also derive bounds on the maximal value of $m$ for which this is possible, given an embedding dimension, bounding the possible performance of any trained matrix.
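A rough way to probe the claim above numerically is to fix a random projection matrix, pick a target distribution, and least-squares fit a single embedding to the target logits of the top-$m$ tokens. The sketch below does exactly that; the sizes, the target distribution, and the fitting procedure are my own assumptions for illustration, not the paper's construction or bounds.

```python
# Rough numerical check in the spirit of the claim above (an illustrative
# sketch, not the paper's construction): given a random output projection W
# (vocab V x embedding dim d) and a target next-token distribution p, ask how
# closely a single d-dimensional embedding can reproduce p on its top-m tokens.
import numpy as np

rng = np.random.default_rng(0)
V, d, m = 10_000, 256, 50

W = rng.standard_normal((V, d)) / np.sqrt(d)    # random output projection
target_logits = rng.standard_normal(V) * 3.0    # an arbitrary target distribution
p = np.exp(target_logits - target_logits.max())
p /= p.sum()

top = np.argsort(p)[-m:]                        # indices of the top-m tokens
# Least-squares fit of a single embedding x so that (W @ x) matches the
# target logits on the top-m tokens.
x, *_ = np.linalg.lstsq(W[top], target_logits[top], rcond=None)

q_logits = W @ x
q = np.exp(q_logits - q_logits.max())
q /= q.sum()
print("max abs top-m probability error:", float(np.abs(q[top] - p[top]).max()))
```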
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2000. Interference-Isolated Elastic Weight Consolidation and Knowledge Calibration for Incremental Object Detection
๐ฌ 4
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning object detection
Incremental Object Detection (IOD) enables AI systems to continuously learn new object classes over time while retaining knowledge of previously learned categories. This capability is essential for adapting to dynamic environments without forgetting prior information. Although existing IOD methods have made progress in mitigating catastrophic forgetting, they usually lack explicit and quantitative modeling of information conflicts during knowledge preservation, making task boundaries ambiguous. Such conflicts often stem from the fact that a single image can contain objects belonging to previous, present, and future tasks, where unlabeled past and future objects are often mistakenly treated as background. In this paper, we propose a novel approach grounded in Elastic Weight Consolidation (EWC) to alleviate conflicts in knowledge preservation caused by task interference. Specifically, we introduce the Interference Knowledge Isolated Elastic Weight Consolidation (IKI-EWC) framework for IOD, which leverages the mispredictions of the old detector on new task data to estimate task conflicts and suppresses them at the parameter level. By reformulating the Bayesian posterior of model parameters, we derive a mathematical relationship between previously learned knowledge and interference knowledge, enabling targeted elimination of conflicts during model weight updates. In addition, we propose a prototype-based knowledge calibration (PKC) mechanism to further preserve old knowledge during the training of the detector's classification head. This method employs a learnable projection layer to compensate for semantic drift in old class prototypes, and then jointly trains the classification head using both calibrated prototypes and current task features, thereby mitigating forgetting caused by classifier updates. Extensive experiments on PASCAL VOC and MS-COCO benchmarks demonstrate the effectiveness of the proposed method, outperforming state-of-the-art approaches across various settings.
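For readers unfamiliar with the EWC foundation this method builds on, the following is a reference sketch of the standard EWC penalty only; the paper's interference isolation (IKI-EWC) and prototype calibration (PKC) components are not reproduced here.

```python
# Reference sketch of the standard Elastic Weight Consolidation (EWC) penalty.
# After finishing a task, EWC stores each parameter value and a Fisher-
# information estimate of its importance; on the next task the loss is
# augmented with a quadratic penalty that anchors important parameters near
# their old values.
import torch
import torch.nn as nn

def ewc_penalty(model: nn.Module,
                old_params: dict[str, torch.Tensor],
                fisher: dict[str, torch.Tensor],
                lam: float = 1.0) -> torch.Tensor:
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage inside a training step on the new task (assuming `task_loss` exists):
#   loss = task_loss + ewc_penalty(model, old_params, fisher)
#   loss.backward()
```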
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2001. Softmax Transformers are Turing-Complete
๐ฌ 15
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ soft attention FLaNN recursively enumerable Turing-complete formal languages
Hard attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, whether softmax-attention CoT transformers are Turing-complete has remained an open problem. In this paper, we prove a stronger result: length-generalizable softmax CoT transformers are Turing-complete.
More precisely, our Turing-completeness proof goes via the CoT extension of the Counting RASP (C-RASP), which corresponds to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (more generally, for letter-bounded languages). While we show that this variant is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theoretical results by training transformers on various languages that require complex (non-linear) arithmetic reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2002. Are EEG Foundation Models Worth It? Comparative Evaluation with Traditional Decoders in Diverse BCI Tasks
๐ฌ 6
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Foundation Model Brain-Computer Interface EEG Benchmark
Foundation models have recently emerged as a promising approach for learning generalizable EEG representations for brain-computer interfaces (BCIs). Yet, their true advantages over traditional methods, particularly classical non-neural approaches, remain unclear. In this work, we present a comprehensive benchmark of state-of-the-art EEG foundation models, evaluated across diverse datasets, decoding tasks, and six evaluation protocols, with rigorous statistical testing. We introduce spatiotemporal EEGFormer (ST-EEGFormer), a simple yet effective Vision Transformer (ViT)-based baseline, pre-trained solely with masked autoencoding (MAE) on over 8M EEG segments. Our results show that while fine-tuned foundation models perform well in data-rich, population-level settings, they often fail to significantly outperform compact neural networks or even classical non-neural decoders in data-scarce scenarios. Furthermore, linear probing remains consistently weak, and performance varies greatly across downstream tasks, with no clear scaling law observed among neural network decoders. These findings expose a substantial gap between pre-training and downstream fine-tuning, often diminishing the benefits of complex pre-training tasks. We further identify hidden architectural factors that affect performance and emphasize the need for transparent, statistically rigorous evaluation. Overall, this study calls for community-wide efforts to construct large-scale EEG datasets and for fair, reproducible benchmarks to advance EEG foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2003. MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Human Video Generation Coherent Video Generation Human Video Dataset
Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence. To conquer these challenges, we propose MoSA, which decouples the process of human video generation into two components, i.e., structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human motion sequence from the text prompt. The remaining video appearance is then synthesized under the guidance of this structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic Control modules with a dense tracking constraint during training. The modeling of human-environment interactions is improved through the proposed contact constraint. Together, these two components ensure structural and appearance fidelity across the generated videos. This paper also contributes a large-scale human video dataset, which features more complex and diverse motions than existing human video datasets. We conduct comprehensive comparisons between MoSA and a variety of approaches, including general video generation models, human video generation models, and human animation models. Experiments demonstrate that MoSA substantially outperforms existing approaches across the majority of evaluation metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2004. Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding
๐ฌ 20
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ token_level hallucination control self checking
Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present \textbf{Token-Guard}, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, lightweight solution for reliable LLM outputs. Our code is publicly available\footnote{Anonymous Github Link: \url{https://anonymous.4open.science/r/Token_Guard-00C3}}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2005. SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
๐ฌ 18
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ LLMs interpretability multilingualism
Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses.
However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness.
In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose $\textbf{S}$parse $\textbf{A}$utoencoder-guided $\textbf{S}$upervised $\textbf{F}$ine$\textbf{t}$uning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50\% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models' performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities. The code and data are available at https://anonymous.4open.science/r/SASFT-71CC.
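A hedged sketch of the core training signal described above: an auxiliary term that discourages excessive pre-activation of selected SAE language features during fine-tuning. The SAE weight layout, the chosen feature indices, and the hinge form of the penalty are assumptions for illustration; the paper's exact loss may differ.

```python
# Hedged sketch: keep the pre-activation values of selected sparse-autoencoder
# (SAE) language features below a target level during supervised fine-tuning.
# `sae_encoder_weights` (n_features, d_model), `feature_ids`, and `tau` are
# illustrative assumptions.
import torch

def language_feature_penalty(hidden: torch.Tensor,              # (batch, seq, d_model)
                             sae_encoder_weights: torch.Tensor, # (n_features, d_model)
                             sae_encoder_bias: torch.Tensor,    # (n_features,)
                             feature_ids: torch.Tensor,
                             tau: float = 0.0) -> torch.Tensor:
    # SAE pre-activations for the selected language features.
    pre_acts = hidden @ sae_encoder_weights[feature_ids].T + sae_encoder_bias[feature_ids]
    # Penalize only the excess above the target pre-activation level.
    return torch.relu(pre_acts - tau).mean()

# total_loss = sft_loss + beta * language_feature_penalty(hidden, W_enc, b_enc, ids)
```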
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2006. K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Medical Image Image Segmentation Universal Model Prompt Integration Knowledge-Guided
Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present $\textbf{K-Prism}$, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) $\textit{semantic priors}$ learned from annotated datasets, (ii) $\textit{in-context knowledge}$ from few-shot reference examples, and (iii) $\textit{interactive feedback}$ from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining $\textit{what}$ to segment and 2-D dense prompts indicating $\textit{where}$ to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings. Code will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2007. DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
๐ฌ 9
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Reasoning Model Efficient Reasoning
Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long and redundant reasoning even for simple questions, which substantially increases computational cost and response latency. While existing methods incorporate length rewards into GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO's group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning. To overcome this, we propose Decoupled Reward Policy Optimization (DRPO), a novel framework that decouples the length-based learning signal of correct rollouts from incorrect ones. DRPO ensures that reward signals for correct rollouts are normalized solely within the positive group, shielding them from interference by negative samples. The DRPO objective is grounded in integrating an optimized positive data distribution, which maximizes length-based rewards under a KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient computation of the objective and its gradients using only on-policy data and importance weighting. Of independent interest, this formulation is general and can incorporate other preference rewards on positive data beyond length. Experiments on mathematical reasoning tasks demonstrate DRPO's significant superiority over six efficient reasoning baselines. Notably, with a 1.5B model, our method achieves a 77\% length reduction with only 1.1\% performance loss on simple questions such as those in the GSM8K dataset, while the runner-up baseline sacrifices 4.3\% for a 68\% length reduction.
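The decoupling idea can be illustrated with a small advantage-computation sketch: length rewards of correct rollouts are normalized only against other correct rollouts, so a correct but long answer never receives a negative advantage. This is a simplified stand-in, not the paper's closed-form objective.

```python
# Illustrative sketch of the decoupling idea (not the paper's exact objective):
# in a GRPO-style group of rollouts for one prompt, normalize the length-based
# reward of correct rollouts only within the correct group.
import numpy as np

def decoupled_advantages(correct: np.ndarray,       # bool, shape (G,)
                         length_reward: np.ndarray  # higher = shorter, shape (G,)
                         ) -> np.ndarray:
    adv = np.empty_like(length_reward, dtype=float)
    pos, neg = correct, ~correct
    if pos.any():
        r = length_reward[pos]
        # Normalize length rewards within the positive group only,
        # then shift so every correct rollout keeps a non-negative advantage.
        z = (r - r.mean()) / (r.std() + 1e-8)
        adv[pos] = z - z.min()
    if neg.any():
        adv[neg] = -1.0   # incorrect rollouts get a fixed negative signal
    return adv
```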
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2008. On The Expressive Power of GNN Derivatives
๐ฌ 19
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Neural Networks GNNs Expressivity Message Passing Geometric deep learning Differential geometry Symmetry
Despite significant advances in Graph Neural Networks (GNNs), their limited expressivity remains a fundamental challenge. Research on GNN expressivity has produced many expressive architectures, leading to architecture hierarchies with models of increasing expressive power. Separately, derivatives of GNNs with respect to node features have been widely studied in the context of the oversquashing and over-smoothing phenomena, GNN explainability, and more. To date, these derivatives remain unexplored as a means to enhance GNN expressivity. In this paper, we show that these derivatives provide a natural way to enhance the expressivity of GNNs. We introduce High-Order Derivative GNN (HOD-GNN), a novel method that enhances the expressivity of Message Passing Neural Networks (MPNNs) by leveraging high-order node derivatives of the base model. These derivatives generate expressive structure-aware node embeddings processed by a second GNN in an end-to-end trainable architecture. Theoretically, we show that the resulting architecture family's expressive power aligns with the WL hierarchy. We also draw deep connections between HOD-GNN, Subgraph GNNs, and popular structural encoding schemes. For computational efficiency, we develop a message-passing algorithm for computing high-order derivatives of MPNNs that exploits graph sparsity and parallelism. Evaluations on popular graph learning benchmarks demonstrate HOD-GNN's strong performance across a range of graph learning tasks.
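The raw ingredient here, derivatives of a GNN's node outputs with respect to its input node features, can be obtained naively with autograd as sketched below; the paper's contribution includes a much more efficient message-passing computation that exploits sparsity, which this reference version does not attempt. The `model(x, edge_index)` interface is an assumption.

```python
# Naive reference sketch: Jacobian of a GNN's node outputs with respect to the
# input node features, computed with plain autograd (no sparsity exploited).
import torch

def node_feature_jacobian(model, x: torch.Tensor, edge_index: torch.Tensor):
    """Returns d model(x)[i, :] / d x[j, :] with shape (N, out_dim, N, in_dim)."""
    def f(feats):
        return model(feats, edge_index)
    return torch.autograd.functional.jacobian(f, x)
```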
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2009. Membership Inference Attacks Against Fine-tuned Diffusion Language Models
๐ฌ 24
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Membership Inference Attack LLM AI Privacy
Diffusion Language Models (DLMs) represent a promising alternative to autoregressive language models, using bidirectional masked token prediction. Yet their susceptibility to privacy leakage via Membership Inference Attacks (MIA) remains critically underexplored. This paper presents the first systematic investigation of MIA vulnerabilities in DLMs. Unlike the autoregressive models' single fixed prediction pattern, DLMs' multiple maskable configurations exponentially increase attack opportunities. This ability to probe many independent masks dramatically improves detection chances. To exploit this, we introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation. SAMA samples masked subsets across progressive densities and applies sign-based statistics that remain effective despite heavy-tailed noise. Through inverse-weighted aggregation prioritizing sparse masks' cleaner signals, SAMA transforms sparse memorization detection into a robust voting mechanism. Experiments on nine datasets show SAMA achieves 30\% relative AUC improvement over the best baseline, with up to 8$\times$ improvement at low false positive rates. These findings reveal significant, previously unknown vulnerabilities in DLMs, necessitating the development of tailored privacy defenses.
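A hedged sketch of the general attack recipe described above: sample masked subsets at several densities, score each subset with the model, and aggregate sign-based votes with weights favoring sparser masks. The scoring callable, calibration thresholds, and weighting are illustrative; SAMA's exact statistics may differ.

```python
# Hedged sketch of a subset-aggregated membership score. `masked_loss` is a
# generic callable that scores the candidate text with some positions masked;
# `thresholds` maps each mask density to a calibration loss (both assumed).
import random
from typing import Callable, List

def sama_style_score(tokens: List[int],
                     masked_loss: Callable[[List[int], List[int]], float],
                     thresholds: dict,                 # density -> calibration loss
                     densities=(0.1, 0.3, 0.5),
                     samples_per_density: int = 32,
                     seed: int = 0) -> float:
    rng = random.Random(seed)
    score = 0.0
    for rho in densities:
        weight = 1.0 / rho                # inverse weighting: sparse masks count more
        k = max(1, int(rho * len(tokens)))
        for _ in range(samples_per_density):
            mask_positions = rng.sample(range(len(tokens)), k)
            loss = masked_loss(tokens, mask_positions)
            # Sign-based vote: members tend to have lower loss than calibration.
            vote = 1.0 if loss < thresholds[rho] else -1.0
            score += weight * vote
    return score   # higher -> more likely a training member
```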
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2010. Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multilingual Enhancement Large Language Models
The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability.
In this work, we propose a resource-efficient method for improving multilingual safety alignment.
We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines.
By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.
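One way to picture the plug-and-play consistency term described above is a loss that pulls the representations of a prompt's multilingual variants toward a shared direction, as in the sketch below; the exact formulation in the paper may differ, and the tensor layout is assumed.

```python
# Hedged sketch of a multilingual consistency term: encourage the hidden
# representations of the same prompt in several languages to be collinear by
# pulling each toward their shared mean direction.
import torch
import torch.nn.functional as F

def multilingual_consistency_loss(reps: torch.Tensor) -> torch.Tensor:
    """reps: (num_languages, hidden_dim) representations of one prompt."""
    unit = F.normalize(reps, dim=-1)
    mean_dir = F.normalize(unit.mean(dim=0), dim=-1)
    cos = unit @ mean_dir                 # (num_languages,)
    return (1.0 - cos).mean()             # 0 when all vectors share one direction

# total_loss = alignment_loss + lambda_mlc * multilingual_consistency_loss(reps)
```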
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2011. RF-MatID: Dataset and Benchmark for Radio Frequency Material Identification
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Material identification Radio frequency (RF) sensing Dataset Benchmarking
Accurate material identification plays a crucial role in embodied AI systems, enabling a wide range of applications. However, current vision-based solutions are limited by the inherent constraints of optical sensors, while radio-frequency (RF) approaches, which can reveal intrinsic material properties, have received growing attention. Despite this progress, RF-based material identification remains hindered by the lack of large-scale public datasets and the limited benchmarking of learning-based approaches. In this work, we present RF-MatID, the first open-source, large-scale, wide-band, and geometry-diverse RF dataset for fine-grained material identification. RF-MatID includes 16 fine-grained categories grouped into 5 superclasses, spanning a broad frequency range from 4 to 43.5 GHz, and comprises 142k samples in both frequency- and time-domain representations. The dataset systematically incorporates controlled geometry perturbations, including variations in incidence angle and stand-off distance. We further establish a multi-setting, multi-protocol benchmark by evaluating state-of-the-art deep learning models, assessing both in-distribution performance and out-of-distribution robustness under cross-angle and cross-distance shifts. The 5 frequency-allocation protocols enable systematic frequency- and region-level analysis, thereby facilitating real-world deployment. RF-MatID aims to enable reproducible research, accelerate algorithmic advancement, foster cross-domain robustness, and support the development of real-world application in RF-based material identification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2012. Pi-CCA: Prompt-Invariant CCA Certificates for Replay-Free VisionโLanguage Continual Learning
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ visionโlanguage learning VQA replay-free
When deployed on non-stationary data streams, foundation vision-language models require continual updates without access to past data. However, naive fine-tuning undermines their zero-shot recognition capabilities and prompt robustness. We seek a replay-free principle that preserves pre-trained cross-modal generalization under domain/prompt shifts. We introduce Prompt-Invariant CCA Certificates (Pi-CCA), a geometry-first approach that summarizes image--text alignment with a compact certificate capturing the top-k canonical spectrum and subspace. During adaptation, we match this summary using only mini-batch statistics and induce prompt robustness via averaging over perturbations. Across MTIL, X-TAIL, VLCL, and ConStruct-VL, Pi-CCA achieves state-of-the-art performance among replay-free methods.
By optimizing alignment invariants rather than proxy signals, Pi-CCA provides a simple, generator-free, constant-memory path to continual adaptation with strong zero-shot retention and resilience to prompt/style shifts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2013. Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Teacher Forcing Multi-Token Prediction Pretraining Large Language Models
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example a bag-of-words summary of the sequence's future, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
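The handcrafted-summary variant suggests a simple auxiliary head: predict a multi-hot bag-of-words over the next few tokens and train it with a BCE loss alongside standard next-token prediction. The sketch below is one plausible instantiation; the horizon, shapes, and loss weighting are assumptions rather than the paper's exact setup.

```python
# Plausible sketch of a bag-of-words future-summary head: at each position,
# predict a multi-hot vector over the vocabulary marking which tokens appear
# in the next `horizon` positions, trained with BCE next to the NTP loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureBagOfWordsHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, hidden, input_ids, horizon: int = 64):
        """hidden: (B, T, d_model); input_ids: (B, T). Returns the auxiliary loss."""
        B, T = input_ids.shape
        logits = self.proj(hidden)                          # (B, T, V)
        targets = torch.zeros_like(logits)
        for t in range(T - 1):
            future = input_ids[:, t + 1 : t + 1 + horizon]  # (B, <=horizon)
            targets[:, t].scatter_(1, future, 1.0)          # multi-hot future summary
        return F.binary_cross_entropy_with_logits(logits[:, :-1], targets[:, :-1])

# total_loss = ntp_loss + alpha * fsp_head(hidden, input_ids)
```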
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2014. Learning to Reason over Continuous Tokens with Reinforcement Learning
๐ฌ 8
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Hybrid Reasoning Reinforcement Learning
Large Language Models (LLMs) have shown strong performance in complex reasoning tasks, especially when guided by Chain-of-Thought (CoT) prompting. However, conventional CoT reasoning in the discrete token space suffers from high computational and memory costs due to verbose intermediate steps. Recent work has explored latent reasoning in the embedding space to improve efficiency, but often at the cost of clarity and performance. In this work, we propose $\underline{Hy}$brid $\underline{Rea}$soning ($\texttt{HyRea}$), a unified framework that enables LLMs to dynamically switch between explicit (token-based) and latent (embedding-based) reasoning during inference. To train the model to make these decisions effectively, we introduce a two-stage training pipeline: (1) a supervised cold-start phase that introduces latent reasoning by replacing low-entropy CoT steps with embeddings, and (2) a reinforcement learning phase using Group Relative Policy Optimization (GRPO) to fine-tune the model's reasoning strategy based on task-specific rewards.
Experiments on mathematical reasoning benchmarks show that \texttt{HyRea} achieves significant reductions in token usage while maintaining or improving accuracy, offering an effective and scalable solution for efficient multi-step reasoning in LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2015. Intrinsic Entropy of Context Length Scaling in LLMs
๐ฌ 20
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ context length intrinsic entropy
There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while some experimentally summarize loss reduction by relevant long context as Scaling Laws. This calls for a more thorough understanding on how long context impacts Language Modeling. In this work, we (1) propose to use Intrinsic Entropy for explaining the impact of context length on language modeling; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2016. ReFocusEraser: Refocusing for Small Object Removal with Robust Context-Shadow Repair
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion-based Object Removal Image Inpainting
Existing diffusion-based object removal and inpainting methods often fail to recover the fine structural and textural details of small objects. This is primarily due to the VAE encoder's downsampling, which inevitably compresses small masked regions and causes significant detail loss, while the decoder's upsampling alone cannot fully restore the lost fine details.
However, the adverse effects of this fixed compression can be mitigated by magnifying these regions before inpainting.
To this end, we propose ReFocusEraser, a two-stage framework for small object removal that combines camera-adaptive zoom-in inpainting with robust context- and shadow-aware repair. In Stage I, a camera-adaptive refocus mechanism magnifies masked regions, and a LoRA-tuned diffusion model ensures precise semantic alignment for accurate reconstruction. However, reintegrating these magnified inpainted regions into the original image introduces challenges due to VAE asymmetry, such as color shifts and seams. Stage II addresses these issues by fine-tuning an additional decoder to create a seam- and shadow-aware module that eliminates residual artifacts while preserving background consistency.
Extensive experiments demonstrate that our proposed ReFocusEraser achieves state-of-the-art performance, outperforming existing methods across benchmark datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2017. OFMU: Optimization-Driven Framework for Machine Unlearning
๐ฌ 9
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ machine unlearning large language models privacy bi-level optimization convergence analysis Trustworthy Machine Learning Gradient-Based Methods
Large language models deployed in sensitive applications increasingly require the ability to unlearn specific knowledge, such as user requests, copyrighted materials, or outdated information, without retraining from scratch to ensure regulatory compliance, user privacy, and safety. This task, known as machine unlearning, aims to remove the influence of targeted data (forgetting) while maintaining performance on the remaining data (retention). A common approach is to formulate this as a multi-objective problem and reduce it to a single-objective problem via scalarization, where forgetting and retention losses are combined using a weighted sum. However, this often results in unstable training dynamics and degraded model utility due to conflicting gradient directions. To address these challenges, we propose OFMU, a penalty-based bi-level optimization framework that explicitly prioritizes forgetting while preserving retention through a hierarchical structure. Our method enforces forgetting via an inner maximization step that incorporates a similarity-aware penalty to decorrelate the gradients of the forget and retention objectives, and restores utility through an outer minimization step. To ensure scalability, we develop a two-loop algorithm with provable convergence guarantees under both convex and non-convex regimes. We further provide a rigorous theoretical analysis of convergence rates and show that our approach achieves better trade-offs between forgetting efficacy and model utility compared to prior methods. Extensive experiments across vision and language benchmarks demonstrate that OFMU consistently outperforms existing unlearning methods in both forgetting efficacy and retained utility.
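A minimal sketch of the hierarchical structure described above, assuming a cosine-similarity penalty and plain SGD for both loops; the function names, penalty form, and hyperparameters are illustrative and not the paper's exact algorithm.

```python
import torch

def grad_cosine(g1, g2, eps=1e-12):
    # Cosine similarity between two lists of gradients (flattened dot product).
    dot = sum((a * b).sum() for a, b in zip(g1, g2))
    n1 = torch.sqrt(sum((a ** 2).sum() for a in g1))
    n2 = torch.sqrt(sum((b ** 2).sum() for b in g2))
    return dot / (n1 * n2 + eps)

def ofmu_style_step(model, loss_fn, forget_batch, retain_batch,
                    inner_steps=3, inner_lr=1e-4, outer_lr=1e-5, lam=0.1):
    params = [p for p in model.parameters() if p.requires_grad]
    inner_opt = torch.optim.SGD(params, lr=inner_lr)
    # Inner maximization: ascend on the forget loss while penalizing
    # alignment between the forget and retain gradients.
    for _ in range(inner_steps):
        l_f = loss_fn(model, forget_batch)
        l_r = loss_fn(model, retain_batch)
        g_f = torch.autograd.grad(l_f, params, create_graph=True)
        g_r = torch.autograd.grad(l_r, params, create_graph=True)
        inner_obj = -l_f + lam * grad_cosine(g_f, g_r)
        inner_opt.zero_grad()
        inner_obj.backward()
        inner_opt.step()
    # Outer minimization: a step on the retain loss to restore utility.
    outer_opt = torch.optim.SGD(params, lr=outer_lr)
    outer_opt.zero_grad()
    loss_fn(model, retain_batch).backward()
    outer_opt.step()
```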
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2018. Bridging Piano Transcription and Rendering via Disentangled Score Content and Style
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ piano transcription expressive performance rendering disentangled representation learning
Expressive performance rendering (EPR) and automatic piano transcription (APT) are fundamental yet inverse tasks in music information retrieval: EPR generates expressive performances from symbolic scores, while APT recovers scores from performances. Despite their dual nature, prior work has addressed them independently. In this paper, we propose a unified framework that jointly models EPR and APT by disentangling note-level score content and global performance style representations from both paired and unpaired data. Our framework is built on a transformer-based sequence-to-sequence (Seq2Seq) architecture and is trained using only sequence-aligned data, without requiring fine-grained note-level alignment. To automate the rendering process while ensuring stylistic compatibility with the score, we introduce an independent diffusion-based performance style recommendation (PSR) module that generates style embeddings directly from score content. This modular component supports both style transfer and flexible rendering across a range of expressive styles. Experimental results from both objective and subjective evaluations demonstrate that our framework achieves competitive performance on EPR and APT tasks, while enabling effective content-style disentanglement, reliable style transfer, and stylistically appropriate rendering. Demos are available at https://jointpianist.github.io/epr-apt/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2019. SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation
๐ฌ 23
โญ 5.50
๐ generative models
๐ท๏ธ text-to-image generation diffusion distillation distribution matching distillation
Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX. In this paper, we first analyze the issues when applying vanilla DMD to large-scale models. Then, to overcome the scalability challenge, we propose implicit distribution alignment (IDA) to constrain the divergence between the generator and the fake distribution. Furthermore, we propose intra-segment guidance (ISG) to relocate the timestep denoising importance from the teacher model. With IDA alone, DMD converges for SD 3.5; employing both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Together with a scaled VFM-based discriminator, our final model, dubbed **SenseFlow**, achieves superior performance in distillation for both diffusion-based text-to-image models, such as SDXL, and flow-matching models, such as SD 3.5 Large and FLUX.1 dev. The source code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2020. From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning
๐ฌ 18
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Urban Navigation Foundation Models Reinforcement Learning
Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations:
1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and
2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the modelโs pretrained knowledge.
Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2021. DiaBlo: Diagonal Blocks Are Sufficient For Finetuning
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Finetuning Parameter-Efficient LLM Diagonal Block
Fine-tuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present *DiaBlo*, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining memory efficiency and training speed comparable to LoRA. Moreover, we provide theoretical guarantees showing that, under mild low-rank conditions, DiaBlo is more expressive than LoRA in the linear setting and converges to a stationary point of the general nonlinear full fine-tuning objective. Through extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, we show that fine-tuning only diagonal blocks is sufficient for strong and consistent performance. DiaBlo not only achieves competitive accuracy but also preserves high memory efficiency and fast fine-tuning speed.
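The core idea is simple enough to sketch: freeze the pretrained weight and train only an additive update whose nonzero entries lie in square blocks along the diagonal. The module below is an illustrative sketch assuming a square weight matrix; it is not the authors' released code.

```python
import torch
import torch.nn as nn

class DiagonalBlockAdapter(nn.Module):
    def __init__(self, base: nn.Linear, block_size: int = 64):
        super().__init__()
        assert base.in_features == base.out_features, "sketch assumes a square weight"
        assert base.in_features % block_size == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pretrained weight stays frozen
        self.block_size = block_size
        n_blocks = base.in_features // block_size
        # Only these diagonal blocks of the additive update are trained.
        self.blocks = nn.Parameter(torch.zeros(n_blocks, block_size, block_size))

    def forward(self, x):
        y = self.base(x)
        xb = x.reshape(*x.shape[:-1], -1, self.block_size)   # (..., n_blocks, b)
        # Each diagonal block multiplies its own slice of the input.
        delta = torch.einsum('...nb,nab->...na', xb, self.blocks)
        return y + delta.reshape(y.shape)
```

For a $d \times d$ weight with block size $b$, this trains $d \cdot b$ parameters per matrix, roughly the budget of a rank-$b/2$ LoRA on the same matrix, but without any low-rank product.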
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2022. Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction
๐ฌ 21
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ multimodal time series; anomaly detection
Time series anomaly detection plays a critical role in many dynamic systems. However, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2023. CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives
๐ฌ 17
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ dilemma value-based decision-making high-stakes steerability LLM
Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06 and 51.01 accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2024. MARS - A Foundational Map Auto-Regressor
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computer Vision Remote Sensing Geospatial AI Human-in-the-loop
Map generation tasks, characterized by extensive non-structural vectorized data (e.g., points, polylines, and polygons), pose significant challenges to common pixel-wise generative models. Past works, which segment and then perform various vectorized post-processing steps, usually sacrifice accuracy. Motivated by the recent success of auto-regressive visual-language modeling, we propose the first map foundational model, Map Auto-Regressor (MARS), which is capable of generating both multi-polyline road networks and polygon buildings in a unified manner. We collected by far the largest multi-class map dataset, MAP-3M, to support robust training. Extensive benchmarks highlight the superiority of MARS over prior work. Meanwhile, benefiting from auto-regressive, teacher-forcing-based training, we develop a "Chat with MARS" capability that enables interactive human-in-the-loop map generation and correction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2025. On the Interaction of Compressibility and Adversarial Robustness
๐ฌ 18
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ compressibility compression adversarial robustness generalization safety
Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of structured compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a robustness bound that reveals how neuron and spectral compressibility impact $\ell_\infty$ and $\ell_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compressibility is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness and highlight new pathways for designing models that are both efficient and safe.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2026. DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging
๐ฌ 35
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Model Merging Model Editing Task Vector
Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years.
However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored.
In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them.
Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address these issues, we propose **DisTaC** (**Dis**tillation for **Ta**sk vector **C**onditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector's norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models that exhibit these harmful traits, where they would otherwise fail, and achieve significant performance gains.
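As a small illustration of the first harmful factor, task-vector norm disparity, the snippet below computes each source model's task vector (fine-tuned minus pretrained parameters) and its norm; DisTaC itself conditions these vectors via distillation, which this sketch does not attempt to reproduce.

```python
import torch

def task_vector_norms(pretrained_state, finetuned_states):
    # pretrained_state: state_dict of the shared pretrained checkpoint.
    # finetuned_states: {task_name: state_dict} of the source models to be merged.
    norms = {}
    for task, ft_state in finetuned_states.items():
        diffs = [(ft_state[k] - pretrained_state[k]).flatten().float()
                 for k in pretrained_state if k in ft_state]
        norms[task] = torch.cat(diffs).norm().item()
    return norms  # large disparities here are the regime where merging degrades
```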
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2027. Enabling Fine-Tuning of Direct Feedback Alignment via Feedback-Weight Matching
๐ฌ 13
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ direct feedback alignment deep learning fine tuning
Although Direct Feedback Alignment (DFA) has demonstrated potential by enabling efficient and parallel updates of weight parameters through direct propagation of the network's output error, its usage has been primarily restricted to training networks from scratch. In this paper, we introduce feedback-weight matching, the first method that enables reliable fine-tuning of fully connected neural networks using DFA. We provide an analysis showing that existing standard DFA struggles to fine-tune networks that are pre-trained via back-propagation. Through a thorough analysis of weight alignment (WA) and gradient alignment (GA), we demonstrate that the proposed feedback-weight matching enhances DFA's ability and stability in fine-tuning, which provides useful insights into DFA's behavior and characteristics when applied to fine-tuning. In addition, we prove that feedback-weight matching, when combined with weight decay, not only mitigates over-fitting but also further reduces the network output error, leading to improved learning performance during DFA-based fine-tuning. Experimental results show that feedback-weight matching, for the first time, enables reliable fine-tuning across various fine-tuning tasks, compared to existing standard DFA, e.g., achieving a 7.97% accuracy improvement on image classification tasks (82.67% vs. 74.70%) and a 0.66 higher correlation score on NLP tasks (0.76 vs. 0.10). The code is available in an anonymous GitHub repository.
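For readers unfamiliar with DFA, the sketch below shows one update for a small tanh MLP: hidden layers receive the output error through fixed feedback matrices rather than backpropagated gradients. Feedback-weight matching, as described above, would derive those matrices from the pretrained forward weights instead of drawing them at random; the code is an illustrative sketch, not the paper's implementation.

```python
import torch

def dfa_step(weights, feedbacks, x, target, lr=1e-3):
    # weights[i]: (d_i, d_{i+1}) forward matrices; the last layer is a linear output.
    # feedbacks[i]: (d_out, d_{i+1}) fixed matrix routing the output error to layer i.
    acts = [x]
    for W in weights[:-1]:
        acts.append(torch.tanh(acts[-1] @ W))
    out = acts[-1] @ weights[-1]
    err = out - target                         # gradient of 0.5 * squared error

    weights[-1] -= lr * acts[-1].T @ err       # output layer uses its true local gradient
    for i, B in enumerate(feedbacks):
        # Hidden layers: project the output error directly, no backpropagation.
        delta = (err @ B) * (1 - acts[i + 1] ** 2)   # tanh derivative
        weights[i] -= lr * acts[i].T @ delta
    return weights
```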
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2028. LiveResearchBench: Benchmarking Single- and Multi-Agent Systems for Citation-Grounded Deep Research
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmark Deep Research LLM Agents Multi Agent Systems
Deep research---producing comprehensive, citation-backed reports by searching across hundreds of live websites---marks an important frontier for agentic systems. To rigorously evaluate this ability, three principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, and (3) unambiguous, ensuring consistent interpretation across users. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we present DeepEval, a comprehensive suite covering both content- and report-level quality: checklists for coverage and presentation, rubric-tree assessments of citation accuracy and traceability, and metrics for consistency and depth of analysis. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2029. Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Generative Models Diffusion Models Fast Sampling ODE Solver
Diffusion models deliver state-of-the-art image quality. However, sampling is costly at inference time because it requires many model evaluations (number of function evaluations, NFEs).
To reduce NFEs, classical ODE multistep methods have been adopted. Yet differences in the choice of prediction type (noise/data/velocity) and integration domain (half log-SNR/noise-to-signal ratio) lead to different outcomes.
We introduce Dual-Solver, which generalizes multistep samplers by introducing learnable parameters that continuously (i) interpolate among prediction types, (ii) select the integration domain, and (iii) adjust the residual terms.
It maintains the traditional predictor-corrector structure and guarantees second-order local accuracy.
These parameters are learned with a classification-based objective using a frozen pretrained classifier (e.g., ViT or CLIP).
On ImageNet class-conditional generation (DiT, GM-DiT) and text-to-image (SANA, PixArt-$\alpha$), Dual-Solver improves FID and CLIP scores in the low-NFE regime ($3\le$ NFE $\le 9$) across backbones.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2030. OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
๐ฌ 13
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-Language-Action Models Embodied Reasoning
General-purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities.
However, recent dual-system approaches, which separate high-level reasoning from low-level acting, often suffer from challenges such as limited mutual understanding of capabilities between systems and latency issues.
This paper introduces OneTwoVLA, a single unified vision-language-action model that can perform both acting (System One) and reasoning (System Two). Crucially, OneTwoVLA adaptively switches between two modes: explicitly reasoning at critical moments during task execution, and generating actions based on the most recent reasoning at other times.
To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, used for co-training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding, enabling the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2031. Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ KV cache eviction large language models llm long-context generation
Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each tokenโs intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), and conversational long-memory benchmarks (LongMemEval), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
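The eviction rule itself is easy to illustrate. The sketch below keeps the tokens whose decayed retention scores are highest whenever the cache exceeds its budget; the exponential decay, names, and shapes are assumptions for illustration, and the learned gate that produces the scores is not shown.

```python
import torch

def evict_by_retention(keys, values, scores, created_at, step, budget, decay=0.01):
    # keys/values: (num_tokens, d); scores: per-token retention scores from the gate;
    # created_at: the step at which each token entered the cache.
    if keys.shape[0] <= budget:
        return keys, values, scores, created_at
    age = (step - created_at).float()
    effective = scores * torch.exp(-decay * age)                   # decayed retention
    keep = torch.topk(effective, k=budget).indices.sort().values   # preserve order
    return keys[keep], values[keep], scores[keep], created_at[keep]
```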
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2032. Bayesian Influence Functions for Hessian-Free Data Attribution
๐ฌ 9
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Training Data Attribution Interpretability SGMCMC Influence Functions MCMC Loss Landscape Geometry Singular Learning Theory Robustness
Classical influence functions face significant challenges when applied to deep neural networks, primarily due to non-invertible Hessians and high-dimensional parameter spaces. We propose the local Bayesian influence function (BIF), an extension of classical influence functions that replaces Hessian inversion with loss landscape statistics that can be estimated via stochastic-gradient MCMC sampling. This Hessian-free approach captures higher-order interactions among parameters and scales efficiently to neural networks with billions of parameters. We demonstrate state-of-the-art results on predicting retraining experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2033. BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ sparse attention; video generation; step distillation
Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges---training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose $\textit{BLADE}$, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm, built upon Trajectory Distribution Matching (TDM), which directly incorporates sparsity into the distillation process rather than treating it as a separate compression step and features fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B, and our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10$\times$ end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89$\times$ speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations.
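To make the block-sparse side concrete, the toy sketch below mean-pools queries and keys into blocks and lets each query block attend only to its top-scoring key blocks; ASA learns content-aware masks rather than using this fixed heuristic, so treat the snippet purely as an illustration of the mask structure.

```python
import torch

def block_sparse_mask(q, k, block=64, keep=4):
    # q, k: (seq_len, d); seq_len is assumed to be a multiple of `block`.
    n = q.shape[0] // block
    qb = q.reshape(n, block, -1).mean(dim=1)        # (n, d) pooled query blocks
    kb = k.reshape(n, block, -1).mean(dim=1)        # (n, d) pooled key blocks
    affinity = qb @ kb.T                            # block-to-block relevance scores
    keep_idx = affinity.topk(min(keep, n), dim=-1).indices
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask.scatter_(1, keep_idx, True)                # True = this block pair is attended
    return mask                                     # expand per token before use
```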
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2034. MaRS: Memory-Adaptive Routing for Reliable Capacity Expansion and Knowledge Retention
๐ฌ 21
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Large Pre-trained Models Continual Learning Slot Expansion Knowledge Distillation
Large pre-trained models (LPMs) serve as universal backbones for vision and language tasks, but continual learning (CL) with frozen LPMs remains challenging, since shallow adaptation modules face the stabilityโplasticity dilemma and are prone to catastrophic forgetting. To address this problem, we propose MaRS (Memory-adaptive Router with Statistical control), a modular framework that decouples stable representation from adaptive capacity through three components: a frozen encoder, a slot-based memory router, and a lightweight classifier. On this basis, we design two mechanisms: (i) Statistically-Grounded Slot Expansion (SGSE) formulates expansion as a statistical decision problem, ensuring controlled growth with guarantees on false alarms and detection delay; (ii) Dual-Stage ContrastiveโDistillation Adaptation (DCDA) integrates new slots through supervised contrastive learning and knowledge distillation, preserving prior knowledge without raw replay. Experiments on diverse benchmarks show that MaRS achieves state-of-the-art performance in continual learning with frozen LPMs, combining adaptability, efficiency, and retention.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2035. ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation
๐ฌ 13
โญ 5.50
๐ generative models
๐ท๏ธ diffusion dance generation motion generation
Reactive dance generation (RDG), the task of generating a dance conditioned on a lead dancer's motion, holds significant promise for enhancing human-robot interaction and immersive digital entertainment. Despite progress in duet synchronization and motion-music alignment, two key challenges remain: generating fine-grained spatial interactions and ensuring long-term temporal coherence. In this work, we introduce $\textbf{ReactDance}$, a diffusion framework that operates on a novel hierarchical latent space to address these spatiotemporal challenges in RDG. First, for fine-grained spatial control and artistic expression, we propose Hierarchical Finite Scalar Quantization ($\textbf{HFSQ}$). This multi-scale motion representation effectively disentangles coarse body posture from high-frequency dynamics, enabling independent and detailed control over both aspects through a layered guidance mechanism. Second, to efficiently generate long sequences with high temporal coherence, we propose Blockwise Local Context ($\textbf{BLC}$), a non-autoregressive sampling strategy. Departing from slow, frame-by-frame generation, BLC partitions the sequence into blocks and synthesizes them in parallel via periodic causal masking and positional encodings. Coherence across these blocks is ensured by a dense sliding-window training approach that enriches the representation with local temporal context. Extensive experiments show that ReactDance substantially outperforms state-of-the-art methods in motion quality, long-term coherence, and sampling efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2036. Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Masked diffusion models Variational autoencoders Latent variable models
Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2037. Prompt and Parameter Co-Optimization for Large Language Models
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ prompt-parameter co-optimization shared-private parameterization supervised regularization
Prompt optimization and fine-tuning are two major approaches to improve the performance of Large Language Models (LLMs).
They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates.
However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, in this paper, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training.
Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing.
Guided by the final supervised signals, our framework is optimized to discover the optimal combinations of prompts and parameters.
Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively.
Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines. To benefit the research community, we have released our project at https://anonymous.4open.science/r/metatuner.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2038. Improved Quality, Synchrony, and Preference Alignment for Joint Audio-Video Generation
๐ฌ 22
โญ 5.50
๐ generative models
๐ท๏ธ AIGC Diffusion Model Sounding Video Generation
Recent AIGC advances have rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task, enabling synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced proprietary systems such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. This paper presents a concise yet powerful framework for efficient and effective JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal communication while enhancing single-modality generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Besides, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with merely around 1M training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies have been conducted to validate the effectiveness of our proposed modules.
We hope this work can set a milestone for the field of native JAVG and bring new inspiration to the community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2039. Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection
๐ฌ 25
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ 3D Object Detection Mamba Foreground Representation
Linear modeling methods like Mamba have emerged as effective backbones for the 3D object detection task. However, previous Mamba-based methods apply bidirectional encoding to the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation of linear modeling over foreground-only sequences. To address this problem, we propose a novel backbone, termed Fore-Mamba3D, that focuses on foreground enhancement by modifying the Mamba-based encoder. The foreground voxels are first sampled according to the predicted scores. Considering the response attenuation in the interaction of foreground voxels across different instances, we design a regional-to-global sliding window (RGSW) to propagate information from regional splits to the entire sequence. Furthermore, a semantic-assisted and state spatial fusion module (SASFMamba) is proposed to enrich contextual representation by enhancing semantic and geometric awareness within the Mamba model. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregressive model. The superior performance across various benchmarks demonstrates the effectiveness of Fore-Mamba3D in the 3D object detection task.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2040. Counterfactual LLM-based Framework for Measuring Rhetorical Style
๐ฌ 31
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI for Metascience Preference Models LLM-as-Judge Computational Social Science LLM Personas Rhetorical Style Measurement
The rise of AI has fueled growing concerns about ``hype'' in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because strong empirical results can justify stronger claims, it is often unclear whether bold language reflects genuine evidence or merely rhetorical style. We introduce a counterfactual, LLM-based framework to disentangle rhetorical style from substantive content: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley--Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. Visionary framing significantly predicts downstream attention, including citations and media coverage, even after controlling for peer-review evaluations. We also observe a sharp rise in rhetorical strength after 2023, and provide evidence showing that this increase is strongly correlated with the adoption of LLM writing assistance. The reliability of our framework is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations. Our work demonstrates that LLMs can serve as instruments for improving how ML research is evaluated.
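The aggregation step is the standard Bradley-Terry model; a generic minorization-maximization fit from a pairwise win-count matrix looks roughly like the sketch below (a textbook routine, not the authors' pipeline).

```python
import numpy as np

def bradley_terry(wins, iters=200, eps=1e-12):
    # wins[i, j]: number of pairwise judgments in which item i beat item j.
    n = wins.shape[0]
    strength = np.ones(n)
    total = wins + wins.T                     # comparisons played between each pair
    for _ in range(iters):
        denom = total / (strength[:, None] + strength[None, :] + eps)
        np.fill_diagonal(denom, 0.0)          # no self-comparisons
        strength = wins.sum(axis=1) / np.maximum(denom.sum(axis=1), eps)
        strength /= strength.sum()            # fix the arbitrary scale
    return strength
```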
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2041. All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Generative Models Synthetic Image Detection AIGC Detection
The rapid proliferation of AI-generated images (AIGIs) highlights the pressing demand for generalizable detection methods. In this paper, we establish two key principles for the AIGI detection task through systematic analysis:
**(1) All Patches Matter**, since the uniform generation process ensures that each patch inherently contains synthetic artifacts, making every patch a valuable detection source; and
**(2) More Patches Better**, as leveraging distributed artifacts across more patches improves robustness by reducing over-reliance on specific regions.
However, counterfactual analysis uncovers a critical weakness: naively trained detectors display **Few-Patch Bias**, relying disproportionately on *minority patches*.
We attribute this bias to a **Lazy Learner** effect, where detectors overfit to limited patch artifacts while neglecting distributed cues.
To address this, we propose the **Panoptic Patch Learning** (PPL) framework, which integrates:
(1) *Randomized Patch Reconstruction*, injecting synthetic cues into randomly selected patches to diversify artifact recognition;
(2) *Patch-wise Contrastive Learning*, enforcing consistent discriminative capability across patches to ensure their uniform utilization.
Extensive experiments demonstrate that PPL enhances generalization and robustness across datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2042. Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
๐ฌ 24
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Vision Language Models Abstract Visual Reasoning Bongard Problems
Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, although the represented concepts are identifiable from high-level image features, reducing the task complexity. In contrast, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2043. GTM: A General Time-series Model for Enhanced Representation Learning of Time-Series Data
๐ฌ 13
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time series; Foundation Model; Representation learning; Pre-training strategy
Despite recent progress in time-series foundation models, challenges persist in improving representation learning and adapting to diverse downstream tasks. We introduce a General Time-series Model (GTM), which advances representation learning via a novel frequency-domain attention mechanism that captures time-granularity-aware features, an aspect overlooked in prior research.
We further propose a novel pre-training strategy that unifies reconstruction and autoregressive objectives through a hybrid masking mechanism. Our pre-training strategy, combined with 2D positional encoding and span shuffling, enhances the robustness and generalization of representations. GTM is established as the first generative-task-agnostic model for time-series analysis, enabling seamless adaptation to various generative tasks without any task-specific modifications.
Extensive experiments demonstrate that GTM consistently outperforms SOTA models on various generative tasks and achieves strong classification results with minimal adaptation. Furthermore, GTM exhibits clear scaling behavior, with accuracy improving as model size and pre-training data increase.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2044. Market Games for Generative Models: Equilibria, Welfare, and Strategic Entry
๐ฌ 15
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Generative model competition Nash equilibrium Welfare analysis Best-response training
Generative model ecosystems increasingly operate as competitive multi-platform markets, where platforms strategically select models from a shared pool and users with heterogeneous preferences choose among them. Understanding how platforms interact, when market equilibria exist, how outcomes are shaped by model providers, platforms, and user behavior, and how social welfare is affected is critical for fostering a beneficial market environment. In this paper, we formalize a three-layer *model-platform-user* market game and identify conditions for the existence of a pure Nash equilibrium. Our analysis shows that market structure, whether platforms converge on similar models or differentiate by selecting distinct ones, depends not only on models' global average performance but also on their localized attraction to user groups. We further examine welfare outcomes and show that expanding the model pool does not necessarily increase user welfare or market diversity. Finally, we design and evaluate best-response training schemes that allow model providers to strategically introduce new models into competitive markets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2045. Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
๐ฌ 11
โญ 5.50
๐ learning theory
๐ท๏ธ kolmogorov complexity minimum description length principle compression variational inference quantization
The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2046. Fantastic Pretraining Optimizers and Where to Find Them
๐ฌ 19
โญ 5.50
๐ optimization
๐ท๏ธ optimizer benchmarking pretrain
AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2$\times$ speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups.
To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1Bโ1.2B parameters) and data-to-model ratios (1--8$\times$ the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training.
First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair.
Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1$\times$ for 1.2B parameter models. Third, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay.
Through our thorough investigation, we find that all the fastest optimizers such as Muon and Soap, use matrices as preconditioners --- multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4$\times$ over AdamW for 0.1B parameter models to merely 1.1$\times$ for 1.2B parameter models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2047. ConfHit: Conformal Generative Design via Nested Testing
๐ฌ 18
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ conformal prediction generative modeling risk control molecule generation applications to drug discovery
The success of deep generative models in scientific discovery requires not only the ability to generate novel candidates but also reliable guarantees that these candidates indeed satisfy desired properties. Recent conformal-prediction methods offer a path to such guarantees, but their application to generative modeling in drug discovery is limited by budget constraints, lack of oracle access, and distribution shift. To this end, we introduce ConfHit, a distribution-free framework that provides validity guarantees under these conditions. ConfHit formalizes two central questions: (i) Certification: whether a generated batch can be guaranteed to contain at least one hit with a user-specified confidence level, and (ii) Design: whether the generation can be refined to a compact set without weakening this guarantee. ConfHit leverages weighted exchangeability between historical and generated samples to eliminate the need for an experimental oracle, constructs multiple-sample, density-ratio-weighted conformal p-values to quantify statistical confidence in hits, and proposes a nested testing procedure to certify and refine candidate sets of multiple generated samples while maintaining statistical guarantees. Across representative generative molecule design tasks and a broad range of methods, ConfHit consistently delivers valid coverage guarantees at multiple confidence levels while maintaining compact certified sets, establishing a principled and reliable framework for generative modeling.
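For orientation, the snippet below shows a plain (unweighted) conformal p-value for a single generated candidate; ConfHit's actual construction uses density-ratio weights to handle the shift between historical and generated samples, plus a nested test over candidate sets, both of which this simplification omits.

```python
import numpy as np

def conformal_pvalue(calibration_scores, candidate_score):
    # Nonconformity scores: higher means the candidate looks less "hit-like"
    # relative to the calibration (historical) samples.
    cal = np.asarray(calibration_scores)
    return (1 + np.sum(cal >= candidate_score)) / (len(cal) + 1)
```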
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2048. PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Physics Reasoning Process-Level Evaluation Symbolic Equivalence Scientific Problem Solving
Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, which fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity.
We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring.
We prove the optimality of the DAG representation and the corresponding scoring policy. Combining this with a fully rule-based method for symbolic formula equivalence matching that we developed, we ensure consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more closely aligned with human experts' scoring.
Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2049. Exploring Synthesizable Chemical Space with Iterative Pathway Refinements
๐ฌ 14
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ drug discovery molecule generation synthesizable molecule design
A well-known pitfall of molecular generative models is that they are not guaranteed to generate synthesizable molecules. Existing solutions for this problem often struggle to effectively navigate exponentially large combinatorial space of synthesizable molecules and suffer from poor coverage. To address this problem, we introduce ReaSyn, an iterative generative pathway refinement framework that obtains synthesizable analogs to input molecules by projecting them onto synthesizable space. Specifically, we propose a simple synthetic pathway representation that allows for generating pathways in both bottom-up and top-down traversal of synthetic trees. We design ReaSyn so that both bottom-up and top-down pathways can be sampled with a single unified autoregressive model. ReaSyn can thus iteratively refine subtrees of generated synthetic trees in a bidirectional manner. Further, we introduce a discrete flow model that refines the generated pathway at the entire pathway level with edit operations: insertion, deletion, and substitution. The iterative refinement cycle of (1) bottom-up decoding, (2) top-down decoding, and (3) holistic editing constitutes a powerful pathway reasoning strategy, allowing the model to explore the vast space of synthesizable molecules. Experimentally, ReaSyn achieves the highest reconstruction rate and pathway diversity in synthesizable molecule reconstruction and the highest optimization performance in synthesizable goal-directed molecular optimization, and significantly outperforms previous synthesizable projection methods in synthesizable hit expansion. These results highlight ReaSyn's superior ability to navigate combinatorially-large synthesizable chemical space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2050. Test-Time Iterative Error Correction for Efficient Diffusion Models
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Test-time; diffusion
With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in deployment environments. Through an analysis of error propagation across diffusion timesteps, we reveal that these approximation errors can accumulate exponentially, severely impairing output quality. Motivated by this insight, we propose Iterative Error Correction (IEC), a novel test-time method that mitigates inference-time errors by iteratively refining the modelโs output. IEC is theoretically proven to reduce error propagation from exponential to linear growth, without requiring any retraining or architectural changes. IEC can seamlessly integrate into the inference process of existing diffusion models, enabling a flexible trade-off between performance and efficiency. Extensive experiments show that IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures, establishing it as a practical and generalizable solution for test-time enhancement of efficient diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2051. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
๐ฌ 20
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ LLM inference KV cache
Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications.
However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length.
While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks.
We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy.
On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy.
On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency, enabling effective overlap with computation, full latency hiding, and practical speedups from speculative recall.
Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13$\times$ speedup compared to SOTA KV retrieval methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2052. Differentiable JPEG-based Input Perturbation for Knowledge Distillation Amplification via Conditional Mutual Information Maximization
๐ฌ 25
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Knowledge Distillation JPEG Conditional Mutual Information
Maximizing conditional mutual information (CMI) has recently been shown to enhance the effectiveness of teacher networks in knowledge distillation (KD). Prior work achieves this by fine-tuning a pretrained teacher to maximize a proxy of its CMI. However, fine-tuning large-scale teachers is often impractical, and proxy-based optimization introduces inaccuracies.
To overcome these limitations, we propose Differentiable JPEG-based Input Perturbation (DJIP), a plug-and-play framework that improves teacher-student knowledge transfer without modifying the teacher. DJIP employs a trainable differentiable JPEG layer inserted before the teacher to perturb teacher inputs in a way that directly increases CMI. We further introduce a novel alternating optimization algorithm to efficiently learn the coding parameters of the JPEG layer to maximize the perturbed CMI. Extensive experiments on CIFAR-100 and ImageNet, across diverse distillers and architectures, demonstrate that DJIP consistently improves student accuracy, achieving up to 4.11% gains, while remaining computationally lightweight and fully compatible with standard KD pipelines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2053. Planner Aware Path Learning in Diffusion Language Models Training
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Language Models Discrete Diffusion Diffusion Models code generation protein generation text generation
Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through more flexible and parallel generation paths. This flexibility of sampling is unlocked by new engineered sampling strategies, or *planners*, that select more favorable generation paths by iteratively planning---versus uniformly at random---where to denoise along the sequence. However, by modifying the reverse paths via planning, planners create an irrevocable mismatch between the uniformly random denoising paths during training and planning-based inference. In this paper, we systematically investigate the mismatch of discrete diffusion training and inference under planning and theoretically prove that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser that uses a non-uniform planner. To address this gap, we derive a new planned evidence lower bound (P-ELBO) that incorporates planner-based reverse dynamics directly into the training objective.
Using the P-ELBO, we introduce *Planner Aware Path Learning* (PAPL), a novel training scheme that aligns training and inference under a planned denoiser.
PAPL is implemented as a simple yet effective modification to the standard masked discrete diffusion loss, making it widely applicable and easy to adopt.
Empirically, we show PAPL delivers consistent gains across domains, including a 40\% relative improvement on protein sequence generation, up to a $4\times$ relative MAUVE gain on text generation, and a 23\% relative improvement on HumanEval pass@10 for code generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2054. GEM: A Gym for Generalist LLMs
๐ฌ 9
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ environment gym llm multi-turn reinforcement learning
The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition, we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating how to use GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which---unlike GRPO---is compatible with the full RL setting of dense per-turn rewards and arbitrary discount factors. We further conduct apples-to-apples benchmarking of PPO, GRPO, and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. GEM also functions as a convenient evaluation toolkit besides a training environment. We hope this framework can help accelerate future agentic LLM research.
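Since GEM is described as Gym-like, a minimal interaction loop in that style is sketched below; the `EchoEnv` class and the `agent` function are hypothetical stand-ins, not the actual GEM API.

```python
# Hypothetical sketch of a Gym-style interaction loop in the spirit of GEM.
class EchoEnv:
    """Toy text environment: reward 1.0 if the agent repeats the requested word."""

    def reset(self):
        self.prompt = "say: hello"
        return self.prompt, {}

    def step(self, action):
        reward = 1.0 if action.strip() == "hello" else 0.0
        # obs, reward, terminated, truncated, info (Gym-style tuple)
        return self.prompt, reward, True, False, {}

def agent(observation):
    # placeholder policy; in practice this would be an LLM call
    return observation.split(":")[-1]

env = EchoEnv()
obs, info = env.reset()
done, total = False, 0.0
while not done:
    action = agent(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    total += reward
    done = terminated or truncated
print("return:", total)
```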
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2055. SAGA: Structural Aggregation Guided Alignment with Dynamic View and Neighborhood Order Selection for Multiview Graph Domain Adaptation
๐ฌ 12
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Graph Domain Adaptation
Graph domain adaptation (GDA) transfers knowledge from a labeled source graph to an unlabeled target graph to alleviate label scarcity. In multi-view graphs, the challenge of mitigating domain shift is constrained by structural information across various views. Moreover, within each view, structures at different hops capture distinct neighborhood levels, which can lead to varying structural discrepancies. However, existing methods typically assume only a single-view graph structure, which cannot effectively capture the rich structural information in multi-relational graphs and hampers adaptation performance. In this paper, we tackle the challenging Multi-view Graph Domain Adaptation (MGDA) problem by proposing Structural Aggregation Guided Alignment (SAGA), which aligns multi-view graph data via dynamic view and neighborhood order selection. Specifically, we propose the notion of Structural Aggregation Distance (SAD) as a dynamic discrepancy metric that jointly considers view and neighborhood order, allowing the dominant view-order pair to vary during training. Through empirical analysis, we justify the validity of SAD and show that domain discrepancy in MGDA is largely governed by the dominant view-order pair, which evolves throughout training. Motivated by this observation, we design SAGA, which leverages SAD to dynamically identify the principal view-order pair that guides alignment, thereby effectively characterizing and mitigating both view- and hop-level structural discrepancies between multi-view graphs. Experimental results on various multi-relational graph benchmarks verify the effectiveness of our method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2056. Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time
๐ฌ 10
โญ 5.50
๐ learning theory
๐ท๏ธ Deep Learning scaling laws in-context learning transformers attention
We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on various computational and statistical resources (width, depth, number of training steps, batch size, and data per context). In a joint limit where data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) covariances that are randomly rotated and structured (RRS). For the ISO and FS settings, we find that depth only aids ICL performance if context length is limited. Alternatively, in the RRS setting where covariances change across contexts, increasing the depth leads to significant improvements in ICL, even at infinite context length. This provides a new solvable toy model of neural scaling laws that depends on both the width and depth of a transformer and predicts optimal transformer shapes as a function of compute.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2057. LightCtrl: Training-free Controllable Video Relighting
๐ฌ 11
โญ 5.50
๐ generative models
๐ท๏ธ video relighting; controllable video editing
Recent diffusion models have achieved remarkable success in image relighting, and this success has quickly been reproduced in video relighting. Although these methods can relight videos under various conditions, their ability to explicitly control the illumination in the relighted video remains limited. Therefore, we present LightCtrl, the first controllable video relighting method that offers explicit control over the video illumination through a user-supplied light trajectory in a training-free manner. This is essentially achieved by leveraging a hybrid approach that combines pre-trained diffusion models: a pre-trained image relighting diffusion model is used to relight each frame individually, followed by a video diffusion prior that enhances the temporal consistency of the relighted sequence. In particular, to enable explicit control over dynamically varying lighting in the relighted video, we introduce two key components.
First, the Light Map Injection module samples light trajectory-specific noise and injects it into the latent representation of the source video, significantly enhancing illumination coherence with respect to the conditional light trajectory.
Second, the Geometry-Aware Relighting module dynamically combines RGB and normal map latents in the frequency domain to suppress the influence of the original lighting in the input video, thereby further improving the relighted video's adherence to the input light trajectory.
Our experiments demonstrate that LightCtrl can generate high-quality video results with diverse illumination changes closely following the light trajectory condition, indicating improved controllability over baseline methods.
The code will be released to facilitate future studies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2058. PepTri: Tri-Guided All-Atom Diffusion for Peptide Design via Physics, Evolution, and Mutual Information
๐ฌ 33
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ sequence-structure peptide design all-atom guided latent diffusion
Peptides, short chains of amino acids capable of high-specificity protein binding, represent a powerful class of therapeutics. While deep generative models have shown promise for peptide design, existing approaches are often structure-centric and therefore generate sequences and structures in a decoupled manner, failing to ensure that designs are simultaneously physically stable, evolutionarily plausible, and internally coherent. To overcome this limitation, we introduce PepTri, a novel diffusion framework that addresses this by jointly generating peptide sequences and 3D structures within a unified, SE(3)-equivariant latent space. Our proposed model integrates three complementary guidance signals during the generative process: (i) physics-informed guidance via differentiable molecular mechanics to ensure structural stability and realism; (ii) evolutionary guidance to bias sequences toward conserved, functional motifs; and (iii) mutual information guidance to explicitly maximize sequence-structure coherence. This tri-guided approach ensures the generative process is steered by biophysical laws, biological priors, and information-theoretic alignment in tandem. Extensive evaluations on challenging peptide-protein design benchmarks, cross-domain (PepBench, LNR) and in-domain (PepBDB), demonstrate that PepTri substantially outperforms strong baselines, achieving state-of-the-art results in binding affinity, structural accuracy, and design diversity. Our results establish that integrating these complementary signals directly into the denoising process is crucial for generating viable, high-quality peptide medicines.
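A minimal sketch of how three guidance signals could be folded into a single denoising update, assuming each signal is a differentiable scalar objective; the combination rule, weights, step size, and the quadratic stand-ins below are illustrative and not the paper's guidance schedule.

```python
import torch

def tri_guided_update(z_t, score_fn, guidance_fns, weights, step_size=0.1):
    """One denoising update with three added guidance gradients
    (physics, evolution, mutual information)."""
    z_req = z_t.detach().requires_grad_(True)
    total = sum(w * g(z_req) for w, g in zip(weights, guidance_fns))
    grad = torch.autograd.grad(total, z_req)[0]
    return z_t.detach() + step_size * (score_fn(z_t.detach()) + grad)

# toy usage with quadratic stand-ins for the three guidance objectives
z = torch.randn(8)
score_fn  = lambda z: -z                        # stand-in for the model score
physics   = lambda z: -(z ** 2).sum()           # "molecular mechanics" energy term
evolution = lambda z: -((z - 0.5) ** 2).sum()   # "conserved motif" prior term
mutual    = lambda z: -((z.mean() - 0.1) ** 2)  # "sequence-structure coherence" term
z = tri_guided_update(z, score_fn, [physics, evolution, mutual], weights=[1.0, 0.5, 0.5])
print(z.shape)
```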
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2059. Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs
๐ฌ 20
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Knowledge Graph Abductive Reasoning
Abductive reasoning in knowledge graphs aims to generate plausible logical hypotheses from observed entities, with broad applications in areas such as clinical diagnosis and scientific discovery. However, due to a lack of controllability, a single observation may yield numerous plausible but redundant or irrelevant hypotheses on large-scale knowledge graphs. To address this limitation, we introduce the task of controllable hypothesis generation to improve the practical utility of abductive reasoning. This task faces two key challenges when controlling for generating long and complex logical hypotheses: hypothesis space collapse and hypothesis reward oversensitivity.
To address these challenges, we propose **CtrlHGen**, a **C**on**tr**ollable **l**ogical **H**ypothesis **Gen**eration framework for abductive reasoning over knowledge graphs, trained in a two-stage paradigm including supervised learning and subsequent reinforcement learning.
To mitigate hypothesis space collapse, we design a dataset augmentation strategy based on sub-logical decomposition, enabling the model to learn complex logical structures by leveraging semantic patterns in simpler components.
To address hypothesis reward oversensitivity, we incorporate smoothed semantic rewards including Dice and Overlap scores, and introduce a condition-adherence reward to guide the generation toward user-specified control constraints.
Extensive experiments on three benchmark datasets demonstrate that our model not only better adheres to control conditions but also achieves superior semantic similarity performance compared to baselines.
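The smoothed semantic rewards mentioned above (Dice and Overlap) have standard set-based forms; the sketch below combines them with a condition-adherence term under assumed weights, which are placeholders rather than the paper's values.

```python
def dice(pred: set, gold: set) -> float:
    if not pred and not gold:
        return 1.0
    return 2 * len(pred & gold) / (len(pred) + len(gold))

def overlap(pred: set, gold: set) -> float:
    if not pred or not gold:
        return 0.0
    return len(pred & gold) / min(len(pred), len(gold))

def reward(pred: set, gold: set, satisfies_condition: bool,
           w_dice=0.4, w_overlap=0.4, w_cond=0.2) -> float:
    """Smoothed semantic reward plus a condition-adherence bonus (illustrative weights)."""
    return (w_dice * dice(pred, gold)
            + w_overlap * overlap(pred, gold)
            + w_cond * float(satisfies_condition))

# entities entailed by a generated hypothesis vs. the observed entities
print(reward({"e1", "e2", "e3"}, {"e2", "e3", "e4"}, satisfies_condition=True))
```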
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2060. Gradient Intrinsic Dimensionality Alignment: Narrowing The Gap Between Low-Rank Adaptation and Full Fine-Tuning
๐ฌ 25
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ PEFT LoRA Gradient Intrinsic Dimension Adaptive Alignment
Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA) and its variants, have emerged as critical tools for adapting large pretrained models under limited computational resources. However, a notable performance gap persists between these LoRA methods and Full Fine-Tuning (FFT).
In this paper, we investigate a key yet overlooked cause of this gap: the relationship between LoRA's low-rank adaptation subspace and the true effective update directions of FFT gradients, whose dimensionality we define as the **gradient intrinsic dimensionality**.
To systematically quantify this dimension, we first propose a novel entropy-based estimator, uncovering substantial discrepancies (up to more than 100$\times$) between the rank of LoRA and the gradient intrinsic dimensionality. Motivated by this finding, we introduce **RaLoRA**, which adaptively aligns the ranks of LoRA adapters with layer-specific gradient intrinsic dimensions, without increasing the number of overall parameters. We further extend this approach into **RaLoRA-Pro**, integrating intra-layer rank alignment and inter-layer parameter reallocation guided by loss sensitivity, enabling finer-grained capacity reallocation under comparable parameters. Extensive experiments demonstrate the effectiveness of our methods. Specifically, compared to vanilla LoRA, our methods achieve more than +5\% improvement on GLUE, +0.57 on MT-Bench, +5.23\% on GSM8K, +5.69\% on HumanEval, and +1.58\% on image classification, confirming consistent and substantial performance gains across diverse tasks and modalities.
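One common entropy-based dimensionality estimate is the effective rank of Roy and Vetterli (2007), the exponential of the Shannon entropy of the normalized singular values; the sketch below uses it as a plausible stand-in for the paper's estimator, which may differ.

```python
import numpy as np

def entropy_effective_rank(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank: exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(grad, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

# toy comparison: a near-rank-1 "gradient" vs. an isotropic one
rng = np.random.default_rng(0)
low_rank = np.outer(rng.normal(size=256), rng.normal(size=256))
full_rank = rng.normal(size=(256, 256))
print(entropy_effective_rank(low_rank))   # close to 1
print(entropy_effective_rank(full_rank))  # roughly two orders of magnitude larger
```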
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2061. Deep Think with Confidence
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Large Language Model Reasoning
Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of tasks and the latest open-source models, including Qwen3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9\% accuracy and reduces generated tokens by up to 84.7\% compared to full parallel thinking. Anonymous code is available at https://anonymous.4open.science/r/deepconf-anonymous-E4E3
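A minimal sketch of confidence-filtered voting, assuming the confidence signal is the mean token log-probability of a trace; DeepConf's actual confidence signals and online filtering during generation are richer than this.

```python
import math
from collections import Counter

def confidence_filtered_vote(traces, keep_fraction=0.5):
    """Keep the most confident traces and majority-vote over their final answers."""
    ranked = sorted(traces, key=lambda t: t["mean_logprob"], reverse=True)
    kept = ranked[: max(1, math.ceil(keep_fraction * len(ranked)))]
    votes = Counter(t["answer"] for t in kept)
    return votes.most_common(1)[0][0]

traces = [
    {"answer": "42", "mean_logprob": -0.21},
    {"answer": "42", "mean_logprob": -0.35},
    {"answer": "17", "mean_logprob": -1.80},
    {"answer": "17", "mean_logprob": -1.75},
    {"answer": "42", "mean_logprob": -0.40},
]
print(confidence_filtered_vote(traces))  # "42"
```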
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2062. Laplacian Kernelized Bandit
๐ฌ 10
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bandit Problem Kernel Method Graph Laplacian Regularization
We study multi-user contextual bandits where users are related by a graph and their reward functions exhibit both non-linear behavior and graph homophily. We introduce a principled joint penalty for the collection of user reward functions $\{f_u\}$, combining a graph smoothness term based on RKHS distances with an individual roughness penalty. Our central contribution is proving that this penalty is equivalent to the squared norm within a single, unified _multi-user RKHS_. We explicitly derive its reproducing kernel, which elegantly fuses the graph Laplacian with the base arm kernel. This unification allows us to reframe the problem as learning a single "lifted" function, enabling the design of principled algorithms, LK-GP-UCB and LK-GP-TS, that leverage Gaussian Process posteriors over this new kernel for exploration. We provide high-probability regret bounds that scale with an _effective dimension_ of the multi-user kernel, replacing dependencies on user count or ambient dimension. Empirically, our methods outperform strong linear and non-graph-aware baselines in non-linear settings and remain competitive even when the true rewards are linear. Our work delivers a unified, theoretically grounded, and practical framework that bridges Laplacian regularization with kernelized bandits for structured exploration.
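The joint penalty and fused kernel described above admit a standard formalization from Laplacian-regularized multi-task learning; the display below is one plausible instantiation, and the exact constants and weighting may differ from the paper.

$$
\Omega\big(\{f_u\}\big) \;=\; \alpha \sum_{(u,v)\in E} w_{uv}\,\lVert f_u - f_v\rVert_{\mathcal H}^{2} \;+\; \beta \sum_{u\in V} \lVert f_u\rVert_{\mathcal H}^{2} \;=\; \lVert F\rVert_{\mathcal H_K}^{2},
\qquad
K\big((u,x),(u',x')\big) \;=\; \big[(\alpha L + \beta I)^{-1}\big]_{uu'}\, k(x,x'),
$$

where $F(u,x)=f_u(x)$ is the lifted function, $L$ is the graph Laplacian with edge weights $w_{uv}$, and $k$ is the base arm kernel.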
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2063. Inferring the Invisible: Neuro-Symbolic Rule Discovery for Missing Value Imputation
๐ฌ 16
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Neuro-symbolic Learning Rule Discovery Interpretable Reasoning
One of the central challenges in artificial intelligence is reasoning under partial observability, where key values are missing but essential for understanding and modeling the system. This paper presents a neuro-symbolic framework for latent rule discovery and missing value imputation. In contrast to traditional latent variable models, our approach treats missing grounded values as latent predicates to be inferred through logical reasoning. By interleaving neural representation learning with symbolic rule induction, the model iteratively discovers both conjunctive and disjunctive rules that explain observed patterns and recover missing entries. Our framework seamlessly handles heterogeneous data, reasoning over both discrete and continuous features by learning soft predicates from continuous values. Crucially, the inferred values not only fill in gaps in the data but also serve as supporting evidence for further rule induction and inference, creating a feedback loop in which imputation and rule mining reinforce one another. Using coordinate gradient descent, the system learns these rules end-to-end, enabling interpretable reasoning over incomplete data. Experiments on both synthetic and real-world datasets demonstrate that our method effectively imputes missing values while uncovering meaningful, human-interpretable rules that govern system dynamics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2064. Pose Prior Learner: Unsupervised Categorical Prior Learning for Pose Estimation
๐ฌ 17
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Prior Learning Pose Estimation
A prior represents a set of beliefs or assumptions about a system, aiding inference and decision-making. In this paper, we introduce the challenge of unsupervised categorical prior learning in pose estimation, where AI models learn a general pose prior for an object category from images in a self-supervised manner. Although priors are effective in estimating pose, acquiring them can be difficult. We propose a novel method, named Pose Prior Learner (PPL), to learn a general pose prior for any object category. PPL uses a hierarchical memory to store compositional parts of prototypical poses, from which we distill a general pose prior. This prior improves pose estimation accuracy through template transformation and image reconstruction. PPL learns meaningful pose priors without any additional human annotations or interventions, outperforming competitive baselines on both human and animal pose estimation datasets. Notably, our experimental results reveal the effectiveness of PPL using learned prototypical poses for pose estimation on occluded images. Through iterative inference, PPL leverages the pose prior to refine estimated poses, regressing them to any prototypical poses stored in memory. Our code, model, and data will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2065. One Patch Doesn't Fit All: Adaptive Patching for Native-Resolution Multimodal Large Language Models
๐ฌ 8
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Native Resolution Multimodal Large Language Models
Real-world visual signals are inherently variable in resolution, and it is natural to endow multimodal large language models (MLLMs) with such native-resolution perception capabilities. In principle, low-resolution images are sufficient for general and straightforward multimodal understanding. For images with nuanced details, such as documents and charts, however, it is crucial to preserve fine-grained details using high-resolution inputs, as naive resizing inevitably results in information loss. Recent advances employ sequence packing to process images of any resolution and aspect ratio. Despite these efforts, model performance degrades at both low and high resolutions, and high-resolution inputs incur substantial computational costs. We argue that the rigid use of a single patch size is the primary cause: when image resolution or information density varies, a fixed patch size is intrinsically suboptimal. To address this issue, we introduce Adaptive Patching (AdaPatch), a simple yet effective strategy that adjusts the patch size according to image resolution and information density and can be seamlessly plugged into pre-trained fixed-patch MLLMs without any training effort. Extensive evaluations demonstrate consistent improvements in native-resolution performance without additional training. In addition, we provide a training-based method to further adapt MLLMs to dynamic patch sizes and enhance performance.
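A minimal sketch of resolution-aware patch-size selection under a token budget, so small images keep fine patches and document-scale images fall back to coarser ones. The candidate sizes, the budget, and the selection rule are assumptions for illustration; AdaPatch's actual policy also accounts for information density.

```python
def choose_patch_size(height, width, candidates=(14, 16, 28, 32), max_tokens=4096):
    """Pick the smallest candidate patch size whose token count fits the budget."""
    for p in sorted(candidates):
        n_tokens = -(-height // p) * -(-width // p)  # ceil division per axis
        if n_tokens <= max_tokens:
            return p, n_tokens
    p = max(candidates)  # fall back to the coarsest patching
    return p, -(-height // p) * -(-width // p)

print(choose_patch_size(336, 336))    # small image -> small patches
print(choose_patch_size(2048, 1536))  # document-scale image -> larger patches
```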
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2066. Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Large Reasoning Models Efficient Reasoning Reinforcement Learning
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel **L**ength-b**A**sed **S**t**E**p **R**eward shaping method (LASER), which employs a step function as the reward based on target length. LASER surpasses previous methods, achieving a superior trade-off between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) the reasoning behavior of the model evolves dynamically during training, necessitating reward specifications that are also adaptive and dynamic; (2) rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (**D**ynamic and **D**ifficulty-aware). Experiments on five open-weight models from 1.5B to 32B demonstrate that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D achieves a **5.3** improvement on AIME2024 while reducing token usage by **64**%. Further analysis reveals that our RL-based compression produces more concise reasoning patterns with less redundant ``self-reflections''.
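A minimal sketch of a length-based step reward and a difficulty-aware budget in the spirit of the description above; the specific reward values and the budget formula are assumptions, not the paper's exact shaping.

```python
def laser_reward(is_correct: bool, length: int, target_length: int) -> float:
    """Step-function reward (sketch): full reward only for correct answers within budget."""
    if not is_correct:
        return 0.0
    return 1.0 if length <= target_length else 0.5

def laser_d_reward(is_correct: bool, length: int, base_target: int, difficulty: float) -> float:
    """Difficulty-aware variant (sketch): easier queries get a tighter length budget.
    `difficulty` in [0, 1] and the budget formula are illustrative assumptions."""
    budget = int(base_target * (0.5 + difficulty))
    return laser_reward(is_correct, length, budget)

print(laser_reward(True, 900, 1024))                     # within budget -> 1.0
print(laser_d_reward(True, 900, 1024, difficulty=0.1))   # over the easy-query budget -> 0.5
```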
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2067. PoSh: Using Scene Graphs to Guide LLMs-as-a-Judge for Detailed Image Descriptions
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ detailed image description metric benchmark art VLMs
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2068. Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ AI agents reasoning planning reinforcement learning
Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by the literature on human cognition, we argue that current AI agents need ``vicarious trial and error'' - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2069. MIMIC: Mask-Injected Manipulation Video Generation with Interaction Control
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ video diffusion model manipulation video
Embodied intelligence faces a fundamental bottleneck from limited large-scale interaction data. Video generation offers a scalable alternative, but manipulation videos remain particularly challenging, as they require capturing subtle, contact-rich dynamics. Despite recent advances, video diffusion models still struggle to balance semantic understanding with fine-grained visual details, restricting their effectiveness in manipulation scenarios. Our key insight is that reference videos provide rich semantic and motion cues that can effectively drive manipulation video generation. Building on this, we propose MIMIC, a two-stage image-to-video diffusion framework. (1) We first introduce an Interaction-Motion-Aware (IMA) module to fuse visual features from the reference video, producing coherent semantic masks that correspond to the target image. (2) We then utilize these masks as semantic control signals to guide the video generation process. Moreover, considering the ambiguity of motion attribution, we introduce a Pair Prompt Control mechanism to disentangle object and camera motion by adding the reference video as an additional input. Extensive experiments demonstrate that MIMIC significantly outperforms existing methods and effectively preserves manipulation intent and motion details, even when handling diverse and deformable objects. Our findings underscore the effectiveness of reference-driven semantics for controllable and realistic manipulation video generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2070. A Unified Federated Framework for Trajectory Data Preparation via LLMs
๐ฌ 30
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Trajectory Data Preparation Federated Learning Large Language Model Trajectory Preprocessing
Trajectory data records the spatio-temporal movements of people and vehicles. However, raw trajectories are often noisy, incomplete, or inconsistent due to sensor errors and transmission failures. To ensure reliable downstream analytics, Trajectory Data Preparation (TDP) has emerged as a critical preprocessing stage, encompassing various tasks such as imputation, map matching, anomaly detection, trajectory recovery, compression, etc. However, existing TDP methods face two major limitations: (i) they assume centralized access to data, which is unrealistic under strict privacy regulations and data silo situations, and (ii) they train task-specific models that lack generalization across diverse or unseen TDP tasks. To this end, we propose FedTDP for Federated Trajectory Data Preparation (F-TDP), where trajectories are vertically partitioned across regions and cannot be directly shared. FedTDP introduces three innovations: (i) lightweight Trajectory Privacy AutoEncoder (TPA) with secret-sharing aggregation, providing formal privacy guarantees; (ii) Trajectory Knowledge Enhancer (TKE) that adapts LLMs to spatio-temporal patterns via trajectory-aware prompts, offsite-tuning, sparse-tuning, and bidirectional knowledge distillation; and (iii) Federated Parallel Optimization (FPO), which reduces communication overhead and accelerates federated training. We conduct experiments on 6 real-world datasets and 10 representative TDP tasks, showing that FedTDP surpasses 13 state-of-the-art baselines in accuracy, efficiency, and scalability, while also generalizing effectively across diverse TDP tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2071. Mitigating Noise Shift in Denoising Generative Models with Noise Awareness Guidance
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ diffusion models generative models training-inference misalignment noise awareness guidance
Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models. Code will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2072. Rectifying LLM Thought from Lens of Optimization
๐ฌ 17
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Large Language Model Reasoning Reinforcement Learning with Verifiable Rewards
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (**Re**ctifying **Pro**cess-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2073. Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ attention sinks compression valleys deep transformer-based LLMs
Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M--120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation validates our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2074. Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
๐ฌ 4
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ dataset bias model bias laion-400m
Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60-70\% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2075. WOW-Seg: A Word-free Open World Segmentation Model
๐ฌ 23
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ vision language model open world segmentation object recognition
Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://anonymous.4open.science/r/WOW-Seg-4C5E.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2076. AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
๐ฌ 20
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large Language Model Reasoning Code LLM Benchmark
Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present **AetherCode**, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2077. OpenPros: A Large-Scale Dataset for Limited View Prostate Ultrasound Computed Tomography
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Ultrasound Computed Tomography Prostate Imaging Benchmark Dataset Medical Imaging
Prostate cancer is one of the most common and lethal cancers among men, making its early detection critically important. Although ultrasound imaging offers greater accessibility and cost-effectiveness compared to MRI, traditional transrectal ultrasound (TRUS) methods suffer from low sensitivity, especially in detecting anteriorly located tumors. Ultrasound computed tomography (USCT) provides quantitative tissue characterization, but its clinical implementation faces significant challenges, particularly under anatomically constrained limited-angle acquisition conditions specific to prostate imaging. To address these unmet needs, we introduce OpenPros, the first large-scale benchmark dataset for limited-angle prostate USCT designed to systematically evaluate ML methods for inverse problems. Our dataset includes over 280,000 paired samples of realistic 2D speed-of-sound (SOS) phantoms and corresponding ultrasound full-waveform data, generated from anatomically accurate 3D digital prostate models derived from real clinical MRI/CT scans and ex vivo ultrasound measurements, annotated by medical experts. Simulations are conducted under clinically realistic configurations using advanced finite-difference time-domain (FDTD) and Runge-Kutta acoustic wave solvers, both provided as open-source components. Through comprehensive benchmarking, we find that deep learning methods significantly outperform traditional physics-based algorithms in inference efficiency and reconstruction accuracy. However, our results also reveal that current machine learning methods fail to deliver clinically acceptable, high-resolution reconstructions, underscoring critical gaps in generalization, robustness, and uncertainty quantification. By publicly releasing OpenPros, we provide the community with a rigorous benchmark that not only enables fair method comparison but also motivates new advances in physics-informed learning, foundation models for scientific imaging, and uncertainty-aware reconstruction, bridging the gap between academic ML research and real-world clinical deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2078. Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
๐ฌ 22
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Reward hacking Robust Reinforcement Learning
Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors.
Recent work formalizes this issue using $r$-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all $r$-correlated proxy rewards.
We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain.
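The max-min problem described above can be stated schematically as follows; the precise definition of the $r$-correlation constraint set follows the paper and may differ from this shorthand.

$$
\max_{\pi} \;\; \min_{\tilde r \in \mathcal R_r} \; J(\pi, \tilde r),
\qquad
\mathcal R_r \;=\; \big\{\, \tilde r \;:\; \operatorname{corr}\big(\tilde r,\, r_{\text{proxy}}\big) \ge r \,\big\},
$$

where $J(\pi,\tilde r)$ denotes the expected return of policy $\pi$ under reward $\tilde r$.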
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2079. How Far Can Unsupervised RLVR Scale LLM Training?
๐ฌ 43
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Unsupervised Reward Reinforcement Learning Reasoning
Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR) offers a pathway for Large Language Models (LLMs) to improve without human supervision.
In particular, many works use model-intrinsic information as rewards for URLVR, showing promising improvements, yet the potential and limitations of these methods remain unclear.
In this work, we revisit URLVR through the lens of intrinsic rewards.
We present a unified theoretical framework showing that intrinsic reward methods share a core mechanism: they trade uncertainty for performance by leveraging the model's prior knowledge to sharpen output distributions.
Empirical analysis confirms this tradeoff, revealing distinct failure modes and showing that collapse is not inevitable in small, domain-specific regimes such as test-time training.
Beyond these findings, early intrinsic reward dynamics also provide a lightweight indicator of model-task priors, complementing $pass@k$ in assessing RL trainability.
These insights highlight both the promise and pitfalls of URLVR, motivating future directions such as external rewards and hybrid supervision strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2080. Polychromic Objectives for Reinforcement Learning
๐ฌ 21
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Exploration
Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the advantage function to reflect the advantage under our new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity show that our method improves success rates by reliably solving a larger set of environment configurations and generalizes better under large perturbations. Moreover, when given multiple attempts in pass@$n$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2081. dParallel: Learnable Parallel Decoding for dLLMs
๐ฌ 21
โญ 5.50
๐ generative models
๐ท๏ธ diffusion language model parallel decoding efficiency
Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet, their parallel decoding potential remains largely underexplored, as existing open-source models still require nearly as many decoding steps as generated tokens to ensure performance. To address this, we introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence for masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while enforcing it to achieve high certainty on masked tokens more rapidly and in parallel. Extensive experiments demonstrate that our method can dramatically reduce the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5$\times$ speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5$\times$ speedup while maintaining accuracy.
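A sketch of the decoding side only: threshold-based parallel unmasking for a masked diffusion LM, where every sufficiently certain position is committed in the same step. The threshold and the fallback rule are assumptions; certainty-forcing distillation itself (the training side) is not shown.

```python
import numpy as np

def parallel_unmask_step(probs, mask, threshold=0.9):
    """Unmask all positions whose top-token probability exceeds the threshold,
    committing at least one position per step to guarantee progress."""
    certainty = probs.max(axis=-1)
    tokens = probs.argmax(axis=-1)
    ready = (certainty >= threshold) & mask
    if not ready.any():
        # fallback: commit the single most certain masked position
        ready[np.where(mask, certainty, -np.inf).argmax()] = True
    return tokens, ready, mask & ~ready

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 50))                     # 8 masked positions, 50-token vocab
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
mask = np.ones(8, dtype=bool)
tokens, ready, mask = parallel_unmask_step(probs, mask, threshold=0.2)
print(ready.sum(), "positions unmasked this step")
```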
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2082. Generative Human Geometry Distribution
๐ฌ 8
โญ 5.50
๐ generative models
๐ท๏ธ 3D Generation Human Generation Geometry Encoding
Realistic human geometry generation is an important yet challenging task, requiring both the preservation of fine clothing details and the accurate modeling of clothing-body interactions. To tackle this challenge, we build upon Geometry distributions, a recently proposed representation that can model a single human geometry with high fidelity using a flow matching model. However, extending a single-geometry distribution to a dataset is non-trivial and inefficient for large-scale learning. To address this, we propose a new geometry distribution model built on two key techniques: (1) encoding distributions as 2D feature maps rather than network parameters, and (2) using SMPL models as the domain instead of a Gaussian and refining the associated flow velocity field. We then design a generative framework adopting a two-stage training paradigm analogous to state-of-the-art image and 3D generative models.
In the first stage, we compress geometry distributions into a latent space using a diffusion flow model; the second stage trains another flow model on this latent space.
We validate our approach on two key tasks: pose-conditioned random avatar generation and avatar-consistent novel pose synthesis.
Experimental results demonstrate that our method outperforms existing state-of-the-art methods, achieving a 57% improvement in geometry quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2083. Diffusion Language Models are Provably Optimal Parallel Samplers
๐ฌ 13
โญ 5.50
๐ learning theory
๐ท๏ธ Theory Diffusion Language Model Large Language Model
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models for faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps.
Consequently, whenever a target distribution can be generated using a small number of sequential steps, a DLM can generate that distribution using the same, optimal number of sequential steps.
However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur large intermediate footprints. We prove that enabling remasking (converting unmasked tokens back to masks) or revision (converting unmasked tokens to other unmasked tokens), together with CoT, further allows DLMs to simulate any parallel sampling algorithm with optimal space complexity.
We further justify the advantage of revision by establishing a strict expressivity gap: DLMs with revision or remasking are strictly more powerful than those without.
Our results not only provide a theoretical justification for the promise of DLMs as the most efficient parallel samplers, but also advocate for enabling revision in DLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2084. SecP-Tuning: Efficient Privacy-Preserving Prompt Tuning for Large Language Models via MPC
๐ฌ 29
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ privacy-preserving secure multi-party computation large-language models prompt tuning
Large Language Models (LLMs) have revolutionized numerous fields, yet their adaptation to specialized tasks in privacy-sensitive domains such as healthcare and finance remains constrained due to the scarcity of accessible training data caused by stringent privacy requirements. Secure Multi-party Computation (MPC)-based privacy-preserving machine learning provides theoretical guarantees for the privacy of model parameters and data. However, its application to LLMs has been predominantly limited to inference, as fine-tuning introduces significant efficiency challenges, particularly in backward propagation, optimizer, and self-attention operations. To address these challenges, we propose SecP-Tuning, the first MPC-based framework designed for efficient, privacy-preserving prompt tuning of LLMs. SecP-Tuning innovatively integrates Forward-only Tuning (FoT) through the ``data owner-server interaction'' paradigm, effectively removing the need for privacy-preserving computations in backward propagation and optimization processes. Furthermore, it devises an efficient privacy-preserving Random Feature Attention (RFA), effectively mitigating the computational complexity of softmax-based self-attention and circumventing MPC-incompatible nonlinear operations. Experimental results demonstrate that, compared to full-parameter Supervised Fine-Tuning (SFT) and gradient-based prompt tuning, SecP-Tuning achieves approximately 12$\times$ and 16$\times$ end-to-end acceleration, as well as 18$\times$ and 20$\times$ reductions in communication overhead, respectively. Moreover, it delivers performance comparable to gradient-based methods across multiple few-shot tasks. Additionally, the ``black-box/API-style'' privacy-preserving tuning paradigm of SecP-Tuning effectively avoids memory leakage risks caused by gradient/parameter transmission, thereby striking an optimal balance between efficiency, accuracy, deployability, and privacy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2085. Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback
๐ฌ 13
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ reward modeling ordinal regression Likert scale preference learning human feedback RLHF discrete ordinal regression Bradley-Terry model ordinal preferences large language models alignment preference data
Reward modeling is crucial for aligning large language models with human preferences, yet current approaches lack a principled mathematical framework for leveraging ordinal preference data. When human annotators provide graded preferences on a Likert scale (e.g., significantly better, better, slightly better, negligibly better), existing methods typically apply ad-hoc heuristics, such as margin terms or scaling factors, to loss functions derived from binary preference models like Bradley-Terry. These approaches lack an underlying mathematical model for how ordinal preference data is generated. We present a theoretically grounded framework that formulates reward modeling with Likert scale preferences as a discrete ordinal regression problem. We derive two loss functions from this formulation: a negative log-likelihood loss and an all-threshold loss, both of which learn threshold parameters that naturally capture the ordinal structure of preferences. Unlike existing heuristic methods that manually specify fixed margins or scaling weights, our approach learns these parameters directly from data within a coherent probabilistic framework. Experimental results on multiple benchmarks demonstrate that our ordinal regression approach consistently achieves competitive or superior performance compared to existing heuristic methods across diverse evaluation categories including chat, reasoning, and safety tasks. Our work provides the first principled mathematical framework for incorporating Likert scale preferences into reward model training, moving beyond ad-hoc modifications of binary preference models to enable more effective utilization of fine-grained human feedback.
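As background, a minimal sketch (not the paper's exact formulation) of an all-threshold ordinal loss over reward margins: K ordered preference levels induce K-1 learnable thresholds, and each threshold contributes a logistic penalty. The function and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def all_threshold_loss(reward_margin, level, thresholds):
    """All-threshold ordinal loss on the reward margin r(chosen) - r(rejected).

    reward_margin: (B,) predicted margins
    level:         (B,) ordinal labels in {0, ..., K-1}, higher = stronger preference
    thresholds:    (K-1,) learnable cut points between adjacent levels (assumed sorted)

    For each threshold t_k, the margin should lie above t_k when level > k and
    below t_k otherwise; every threshold contributes a logistic penalty.
    """
    Km1 = thresholds.shape[0]
    k = torch.arange(Km1).unsqueeze(0)                       # (1, K-1)
    sign = torch.where(level.unsqueeze(1) > k, 1.0, -1.0)    # (B, K-1)
    z = sign * (reward_margin.unsqueeze(1) - thresholds.unsqueeze(0))
    return F.softplus(-z).sum(dim=1).mean()                  # logistic loss per threshold

# toy usage: 4 Likert levels -> 3 thresholds learned jointly with the reward model
thresholds = torch.nn.Parameter(torch.tensor([0.0, 1.0, 2.0]))
margins = torch.randn(8, requires_grad=True)
labels = torch.randint(0, 4, (8,))
loss = all_threshold_loss(margins, labels, thresholds)
loss.backward()
```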
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2086. TreeGrad-Ranker: Feature Ranking via $O(L)$-Time Gradients for Decision Trees
๐ฌ 13
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Shapley value Banzhaf value semi-values decision trees linear treeshap
We revisit the use of probabilistic values, which include the well-known Shapley and Banzhaf values, to rank features for explaining the local predicted values of decision trees. The quality of feature rankings is typically assessed with the insertion and deletion metrics. Empirically, we observe that co-optimizing these two metrics is closely related to a joint optimization that selects a subset of features to maximize the local predicted value while minimizing it for the complement. However, we theoretically show that probabilistic values are generally unreliable for solving this joint optimization. Therefore, we explore deriving feature rankings by directly optimizing the joint objective. As the backbone, we propose TreeGrad, which computes the gradients of the multilinear extension of the joint objective in $O(L)$ time for decision trees with $L$ leaves. Building upon TreeGrad, we introduce TreeGrad-Ranker, which aggregates the gradients while optimizing the joint objective to produce feature rankings, and TreeGrad-Shap, a parallelizable and numerically stable alternative to Linear TreeShap for computing the Shapley value. In practice, we show that the numerical error of Linear TreeShap can be up to $10^{15}$ times larger than that of TreeGrad-Shap. Meanwhile, we also develop TreeProb, which generalizes Linear TreeShap to support all probabilistic values for use as baselines. Empirically, our TreeGrad-Ranker performs significantly better on both insertion and deletion metrics.
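For readers unfamiliar with the multilinear extension referenced above, the following generic Monte Carlo estimator of its gradient (assuming only black-box access to a set value function) illustrates the quantity TreeGrad computes; the paper's contribution is an $O(L)$ tree-specific algorithm, which this brute-force sketch does not reproduce.

```python
import random

def multilinear_grad_mc(value_fn, x, n_samples=256, seed=0):
    """Monte Carlo gradient of the multilinear extension F(x) = E_{S~x}[value_fn(S)].

    value_fn: maps a frozenset of feature indices to a float (black-box set function)
    x:        list of inclusion probabilities, one per feature
    Returns a list with dF/dx_i = E[value_fn(S | {i}) - value_fn(S - {i})].
    """
    rng = random.Random(seed)
    d = len(x)
    grad = [0.0] * d
    for _ in range(n_samples):
        S = frozenset(i for i in range(d) if rng.random() < x[i])
        for i in range(d):
            grad[i] += value_fn(S | {i}) - value_fn(S - {i})
    return [g / n_samples for g in grad]

# toy set function: value of each subset of two features
vals = {(): 0.0, (0,): 1.0, (1,): 0.5, (0, 1): 2.5}
value_fn = lambda S: vals[tuple(sorted(S))]
print(multilinear_grad_mc(value_fn, [0.5, 0.5]))  # approx [1.5, 1.0]
```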
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2087. CSRv2: Unlocking Ultra-Sparse Embeddings
๐ฌ 28
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ text embedding sparse representation contrastive learning
In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability.
Yet widely used dense embeddings are often extremely high-dimensional (e.g., 4096), incurring substantial costs in storage, memory, and inference latency.
To address these costs, Contrastive Sparse Representation (CSR) has recently been proposed as a promising direction, mapping dense embeddings into high-dimensional but $k$-sparse vectors, in contrast to compact dense embeddings such as Matryoshka Representation Learning (MRL).
Despite its promise, CSR suffers severe degradation in the ultra-sparse regime (e.g., $k \leq 4$), where over 80\% of neurons remain inactive, leaving much of its efficiency potential unrealized.
In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable.
CSRv2 stabilizes sparsity learning through progressive $k$-annealing, enhances representational quality via supervised contrastive objectives, and ensures end-to-end adaptability with full backbone finetuning.
CSRv2 reduces dead neurons from 80\% to 20\% and delivers a 14\% accuracy gain at $k=2$, bringing ultra-sparse embeddings on par with CSR at $k=8$ and MRL at 32 dimensions, all with only two active features.
While maintaining comparable performance, CSRv2 delivers a 7$\times$ speedup over MRL, and yields up to 300$\times$ improvements in compute and memory efficiency relative to dense embeddings in e5-mistral-7b-instruct-based text representation.
Extensive experiments across text (MTEB, multiple state-of-the-art LLM embeddings (Qwen and e5-Mistral-7B), SPLADEv3, GraphRAG) and vision (ImageNet-1k) demonstrate that CSRv2 makes ultra-sparse embeddings practical without compromising performance: CSRv2 achieves a 7\%/4\% improvement over CSR at $k=4$ and widens this gap to 14\%/6\% at $k=2$ in text/vision representation.
By making extreme sparsity viable, CSRv2 broadens the design space for large-scale, real-time, and edge-deployable AI systems where both embedding quality and efficiency are critical.
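A minimal sketch of the mechanics described above: a TopK activation that yields k-sparse codes, paired with a progressive k-annealing schedule. The module, dimensions, and schedule are illustrative assumptions, not CSRv2's actual implementation.

```python
import torch
import torch.nn as nn

class TopKSparseHead(nn.Module):
    """Maps dense embeddings to high-dimensional k-sparse codes via a TopK activation."""
    def __init__(self, d_in=4096, d_sparse=16384):
        super().__init__()
        self.proj = nn.Linear(d_in, d_sparse)

    def forward(self, x, k):
        z = torch.relu(self.proj(x))
        topk = torch.topk(z, k, dim=-1)
        # keep at most k active dimensions per embedding (fewer if ReLU zeroed more)
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)

def annealed_k(step, total_steps, k_start=64, k_end=2):
    """Progressive k-annealing: start dense-ish, end ultra-sparse (illustrative schedule)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(k_end, int(round(k_start + frac * (k_end - k_start))))

head = TopKSparseHead(d_in=128, d_sparse=512)
x = torch.randn(4, 128)
for step in (0, 500, 1000):
    k = annealed_k(step, total_steps=1000)
    codes = head(x, k)
    print(step, k, int((codes != 0).sum(dim=-1)[0]))
```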
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2088. Learning to Weight Parameters for Data Attribution
๐ฌ 13
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Training Data attribution; Influence function
We study gradient-based data attribution, aiming to identify which training examples most influence a given output. Existing methods for this task either treat network parameters uniformly or rely on implicit weighting derived from Hessian approximations, which do not fully model functional heterogeneity of network parameters.
To address this, we propose a method to explicitly learn parameter importance weights directly from data, without requiring annotated labels.
Our approach improves attribution accuracy across diverse tasks, including image classification, language modeling, and diffusion, and enables fine-grained attribution for concepts like subject and style.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2089. SCI-Verifier: Scientific Verifier with Thinking
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM-as-a-judge; Large Language Model
As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct **SCI-VerifyBench**, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce **SCI-Verifier**, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2090. More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
๐ฌ 32
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision-Language Models Multimodal Reasoning Reinforcement Learning
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2091. Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
๐ฌ 22
โญ 5.50
๐ generative models
๐ท๏ธ video generation image generation unified framework
Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or several tasks. In this work, we introduce a unified framework, namely \textit{many-for-many}, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning not only leads to a unified generation and manipulation model but also benefits the performance of different tasks. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in different generation and manipulation tasks compared to open-source and even commercial engines. Our models and source codes will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2092. Hot Fuzz: Temperature-Tunable Composition of Diffusion models with Fuzzy Logic
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Score Composition Logical Inference Temperature Scaling Fuzzy Logic
Composing pretrained diffusion models provides a cost-effective mechanism to encode constraints and unlock new generative capabilities. Prior work relies on crafting compositional operators that seek to extend set-theoretic notions such as union and intersection to diffusion models, e.g., using a product or mixture of the underlying energy functions. We expose the inadequacy and inconsistency of combining these operators in terms of limited mode coverage, biased sampling, instability under negation queries, and failure to satisfy basic compositional laws such as idempotency and distributivity.
We introduce a principled calculus grounded in fuzzy logic that resolves these issues. Specifically, we define a general class of conjunction, disjunction and negation operators that generalize the classical mixtures, illustrating how they circumvent various pathologies and enable precise combinatorial reasoning with score models. Beyond existing methods, the proposed *Dombi* operators afford complex generative outcomes such as Exclusive-Union (XOR) of individual scores. We establish rigorous theoretical guarantees on the stability and temperature scaling of Dombi compositions, and derive Feynman-Kac correctors to mitigate the sampling bias in score composition. Empirical results on image generation with stable diffusion and multi-objective molecular generation substantiate the conceptual, theoretical, and methodological benefits. Overall, this work lays foundation for systematic design, analysis, and deployment of diffusion ensembles.
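For reference, a short sketch of the Dombi t-norm/t-conorm family named in the abstract, applied to generic membership values in (0, 1); how these operators are lifted to diffusion score composition is the paper's contribution and is not reproduced here. The XOR construction shown is one plausible way to build an exclusive union from the primitives, not necessarily the paper's.

```python
import numpy as np

def dombi_and(a, b, lam=1.0, eps=1e-8):
    """Dombi t-norm (fuzzy conjunction) for memberships a, b in (0, 1)."""
    a = np.clip(a, eps, 1 - eps)
    b = np.clip(b, eps, 1 - eps)
    inner = ((1 - a) / a) ** lam + ((1 - b) / b) ** lam
    return 1.0 / (1.0 + inner ** (1.0 / lam))

def dombi_or(a, b, lam=1.0, eps=1e-8):
    """Dombi t-conorm (fuzzy disjunction), the De Morgan dual of the t-norm."""
    return 1.0 - dombi_and(1.0 - a, 1.0 - b, lam=lam, eps=eps)

def fuzzy_not(a):
    """Standard fuzzy negation."""
    return 1.0 - a

def dombi_xor(a, b, lam=1.0):
    """Exclusive-union style combination: (a OR b) AND NOT(a AND b)."""
    return dombi_and(dombi_or(a, b, lam), fuzzy_not(dombi_and(a, b, lam)), lam)

print(dombi_and(0.9, 0.8), dombi_or(0.9, 0.8), dombi_xor(0.9, 0.8))
```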
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2093. Dynamic Novel View Synthesis in High Dynamic Range
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ High Dynamic Range 4D Gaussian Splatting Dynamic Scene Modeling Deep Learning
High Dynamic Range Novel View Synthesis (HDR NVS) seeks to learn an HDR 3D model from Low Dynamic Range (LDR) training images captured under conventional imaging conditions. Current methods primarily focus on static scenes, implicitly assuming all scene elements remain stationary and non-living. However, real-world scenarios frequently feature dynamic elements, such as moving objects, varying lighting conditions, and other temporal events, thereby presenting a significantly more challenging scenario.
To address this gap, we propose a more realistic problem named HDR Dynamic Novel View Synthesis (HDR DNVS), where the additional dimension "Dynamic" emphasizes the necessity of jointly modeling temporal radiance variations alongside sophisticated 3D translation between LDR and HDR. To tackle this complex, intertwined challenge, we introduce HDR-4DGS, a Gaussian Splatting-based architecture featuring an innovative dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by dynamically adapting tone-mapping functions according to the evolving radiance distributions across the temporal dimension. As a result, HDR-4DGS achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances.
Extensive experiments demonstrate that HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity. Source code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2094. DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model
๐ฌ 54
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ In-Hand Object Rotation; Sim-to-Real; Neural Dynamics Model
Achieving generalized in-hand object rotation remains a significant challenge in robotics, largely due to the difficulty of transferring policies from simulation to the real world. The complex, contact-rich dynamics of dexterous manipulation create a "reality gap" that has limited prior work to constrained scenarios involving simple geometries, limited object sizes and aspect ratios, constrained wrist poses, or customized hands. We address this sim-to-real challenge with a novel framework that enables a single policy, trained in simulation, to generalize to a wide variety of objects and conditions in the real world. The core of our method is a joint-wise dynamics model that learns to bridge the reality gap by effectively fitting a limited amount of real-world data and then adapting the sim policy's actions accordingly. The model is highly data-efficient and generalizable across different whole-hand interaction distributions by factorizing dynamics across joints, compressing system-wide influences into low-dimensional variables, and learning each joint's evolution from its own dynamic profile, implicitly capturing these net effects. We pair this with a fully autonomous data collection strategy that gathers diverse, real-world interaction data with minimal human intervention. Our complete pipeline demonstrates unprecedented generality: a single policy successfully rotates challenging objects with complex shapes (*e.g.*, animals), high aspect ratios (up to 5.33), and small sizes, all while handling diverse wrist orientations and rotation axes. Comprehensive real-world evaluations and a teleoperation application for complex tasks validate the effectiveness and robustness of our approach. Website: [DexNDM](https://projectwebsitex.github.io/neudyn-reorientation/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2095. STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure TransFormer for Offline Multi-task Multi-agent Reinforcement Learning
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning multi-agent multi-task transformer offline learning
Offline multi-agent reinforcement learning (MARL) with multi-task (MT) datasets poses unique challenges, as input structures vary across tasks due to the varying number of agents. Prior works have adopted transformers and hierarchical skill learning to facilitate coordination, but these methods underutilize the transformer's attention mechanism, focusing instead on extracting transferable skills. Moreover, existing transformer-based approaches compress the entire history into a single token and input this token at the next time step, forming simple recurrent neural network (RNN)-style processing of history tokens. As a result, models rely primarily on current and near-past observations while neglecting long historical information, even though the partially observable nature of MARL makes history information critical. In this paper, we propose STAIRS-Former, a transformer architecture augmented with spatial and temporal hierarchies that enables the model to properly attend to critical tokens while effectively leveraging long history. To further enhance robustness across varying token counts, we incorporate token dropout, which improves generalization to diverse agent populations. Experiments on the StarCraft Multi-Agent Challenge (SMAC) benchmark with diverse multi-task datasets show that STAIRS-Former consistently outperforms prior algorithms, achieving new state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2096. BaseReward: A Strong Baseline for Multimodal Reward Model
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Reward Models Human Preference Alignment Reinforcement Learning
The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods.
Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new state-of-the-art (SOTA) on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous open-source and proprietary models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically backed guide for developing robust reward models for the next generation of MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2097. Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation
๐ฌ 23
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Data Synthesis Large Language Model Knowledge Distillation
Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline of **Knowledge Identifier**, **Organizer**, and **Adapter** (**IOA**) that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process where student models approach the teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7\% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing 19.2\% improvement on MATH and 22.3\% on HumanEval compared with state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2098. Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ large reasoning model reinforcement learning finetuning
Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which online predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt's solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.
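A toy sketch of the kind of belief tracking the abstract describes: a small hidden Markov model over discrete solving-progress states, updated online from rollout success counts via a Bayesian forward step. The transition and emission values below are made-up placeholders, not the paper's model.

```python
import numpy as np

# Hypothetical 3-state HMM over a prompt's solving progress: unsolved / partial / solved.
TRANSITION = np.array([[0.85, 0.14, 0.01],
                       [0.05, 0.80, 0.15],
                       [0.01, 0.09, 0.90]])
# Emission: probability that a single rollout succeeds in each hidden state.
P_SUCCESS = np.array([0.05, 0.45, 0.90])

def forward_update(belief, n_success, n_rollouts):
    """One Bayesian filtering step: propagate the belief through the transition
    model, then condition on the observed successes out of n_rollouts."""
    prior = belief @ TRANSITION
    lik = (P_SUCCESS ** n_success) * ((1 - P_SUCCESS) ** (n_rollouts - n_success))
    post = prior * lik
    return post / post.sum()

def selection_score(belief):
    """Prefer prompts likely in the 'partial' state, where updates are most informative."""
    return belief[1]

belief = np.array([1 / 3, 1 / 3, 1 / 3])
for succ in (0, 2, 5):                  # successes out of 8 rollouts across rounds
    belief = forward_update(belief, succ, 8)
    print(np.round(belief, 3), round(selection_score(belief), 3))
```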
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2099. SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Music Generation Rhythmic Stability and Synchronization Multi-Track Music Generation Audio Generation
Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other unique features. Additionally, we enhance the evaluation of multi-track music quality by introducing rhythmic consistency through three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD). Experiments demonstrate that SyncTrack significantly improves the multi-track music quality by enhancing rhythmic consistency. Code is available at https://anonymous.4open.science/r/SyncTrack/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2100. Don't Just Fine-tune the Agent, Tune the Environment
๐ฌ 20
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models LLM Agents Tool Learning Multi-turn tool use Reinforcement Learning
Large Language Model (LLM) agents show great promise for complex multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce $\textbf{Environment Tuning}$, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. $\textbf{Environment Tuning}$ orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2101. Advancing Complex Video Object Segmentation via Progressive Concept Construction
๐ฌ 11
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ SAM2 LVLM
We propose Segment Concept (SeC), a concept-driven video object segmentation (VOS) framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. To balance semantic reasoning with computational overhead, SeC invokes the LVLM only when a new scene appears, injecting concept-level features at those points.
To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Empirical evaluations demonstrate that SeC substantially outperforms state-of-the-art approaches, including SAM 2 and its advanced variants, on both SeCVOS and standard VOS benchmarks. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware VOS.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2102. Flatter Tokens are More Valuable for Speculative Draft Model Training
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Speculative Decoding Efficient Training
Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2$\times$ training speedup using only 50\% of the data, while keeping the final model's inference speedup within 4\% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves the training efficiency for Speculative Decoding. Our code is available at https://anonymous.4open.science/r/Flatness.
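A minimal sketch of one plausible instantiation of the flatness idea: score each token by the normalized entropy of the target model's predictive distribution and keep the samples with the flattest distributions. The exact definition used by SFDD may differ; all names here are assumptions.

```python
import torch

def token_flatness(target_logits):
    """Per-token flatness as the normalized entropy of the target model's
    next-token distribution (1.0 = uniform/flat, 0.0 = one-hot/peaked)."""
    logp = torch.log_softmax(target_logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)
    return entropy / torch.log(torch.tensor(float(target_logits.shape[-1])))

def filter_samples(batch_logits, keep_ratio=0.5):
    """Keep the candidate training samples with the highest average token flatness.

    batch_logits: (N, T, V) target-model logits for N candidate samples.
    Returns indices of the retained samples."""
    sample_flatness = token_flatness(batch_logits).mean(dim=-1)   # (N,)
    n_keep = max(1, int(keep_ratio * batch_logits.shape[0]))
    return torch.topk(sample_flatness, n_keep).indices

logits = torch.randn(16, 32, 1000)   # 16 samples, 32 tokens, vocabulary of 1000
print(filter_samples(logits, keep_ratio=0.5))
```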
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2103. Trinity: An Evolved LLM Coordinator
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ evolutionary strategies multi-agent LLM systems role-based delegation logits-to-agent mapping
Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. **Trinity** addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model ($\approx 0.6$B parameters) and a lightweight head ($\approx 10$K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. **Trinity** processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (*Thinker*, *Worker*, or *Verifier*) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Extensive experiments demonstrate that **Trinity** consistently outperforms individual models and existing methods in various tasks, including coding, math, reasoning, and domain knowledge, while robustly generalizing to out-of-distribution tasks. On established benchmarks, **Trinity** achieves state-of-the-art performance, including a new record of $86.2\%$ on LiveCodeBench. Theoretical and empirical analyses highlight two key factors driving this success: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy algorithm provides substantial advantages over RL, imitation learning, and random search, leveraging potential block-$\varepsilon$-separability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2104. Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Multimodal mathematical reasoning Mathematical Diagram Understanding
Diagrams represent a form of visual language that encodes abstract concepts and relationships through structured symbols and their spatial arrangements. Unlike natural images, they are inherently symbolic, and entirely artificial. They thus pose unique challenges for Multimodal Large Language Models (MLLMs) distinct from natural image processing. Recent studies have shown that MLLMs often exhibit flawed reasoning and hallucinations when handling diagram inputs. We investigate here whether these limitations stem from shortcomings in the models' ability to interpret diagrams themselves. To this end, we develop a diagnostic test suite that isolates perception from reasoning. Our systematic evaluation reveals that MLLMs perform poorly on basic perceptual tasks, e.g., shape classification, object counting, relationship identification, and object grounding, with near-zero accuracy on fine-grained grounding. Further analysis shows that weak diagram perception leads to "blind faith in text", where models rely on textual shortcuts rather than visual understanding (that is, they are $\textit{Math Blind}$). We hypothesize that enabling models to capture the inherent structural properties of diagrams, represented as graphs of primitives and their interrelationships, is essential for improving diagram understanding. Experiments with 7B and 32B MLLMs validate this assumption, with models trained on such representations achieving a +79\% gain on the grounding task. Crucially, these gains transfer to reasoning, achieving 3-4\% cross-suite improvements on three public benchmarks even without additional chain-of-thought reasoning data. Our findings demonstrate that low-level perception supports faithful high-level reasoning in mathematical MLLMs. We provide both methodological frameworks and empirical evidence to guide future research in this direction. All implementations will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2105. Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
๐ฌ 12
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ sign language translation multimodal generation vision-language model hallucination detection
Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is especially critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are particularly vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision. We argue that hallucination arises when models rely on language priors rather than visual input.
To capture this, we propose a token-level measure of reliability that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding.
We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis further highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay groundwork for more robust hallucination detection in multimodal generation.
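A schematic sketch of the counterfactual signal described above: compare each generated token's log-probability under the clean video versus a degraded (e.g., masked) video and aggregate into a sentence-level reliability score. The inputs stand in for a real SLT model's token scores, and this omits the feature-based sensitivity component.

```python
import numpy as np

def token_reliability(logp_clean, logp_masked):
    """Counterfactual signal per token: how much probability the token loses when
    the visual input is masked. Large drops indicate visually grounded tokens;
    near-zero drops suggest the token was guessed from language priors."""
    return np.maximum(np.asarray(logp_clean) - np.asarray(logp_masked), 0.0)

def sentence_reliability(logp_clean, logp_masked):
    """Aggregate token-level counterfactual signals into one sentence-level score."""
    return float(token_reliability(logp_clean, logp_masked).mean())

# toy numbers standing in for model log-probs of the same output tokens
logp_clean  = [-0.2, -0.4, -0.1, -1.0]
logp_masked = [-2.1, -0.5, -1.9, -1.1]
print(sentence_reliability(logp_clean, logp_masked))   # higher = better grounded
```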
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2106. Universal Value-Function Uncertainties
๐ฌ 11
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Uncertainty Quantification Epistemic Uncertainty Exploration Offline RL Neural Tangent Kernel Multitask RL
Estimating epistemic uncertainty in value functions is a crucial challenge for many aspects of reinforcement learning (RL), including efficient exploration, safe decision-making, and offline RL. While deep ensembles provide a robust method for quantifying value uncertainty, they come with significant computational overhead. Single-model methods, while computationally favorable, often rely on heuristics and typically require additional propagation mechanisms for myopic uncertainty estimates. In this work we introduce universal value-function uncertainties (UVU), which, similar in spirit to random network distillation (RND), quantify uncertainty as squared prediction errors between an online learner and a fixed, randomly initialized target network. Unlike RND, UVU errors reflect policy-conditional $\textit{value uncertainty}$, incorporating the future uncertainties $\textit{any policy}$ may encounter. This is due to the training procedure employed in UVU: the online network is trained using temporal difference learning with a synthetic reward derived from the fixed, randomly initialized target network. We provide an extensive theoretical analysis of our approach using neural tangent kernel (NTK) theory and show that in the limit of infinite network width, UVU errors are exactly equivalent to the variance of an ensemble of independent universal value functions. Empirically, we show that UVU achieves equal performance to large ensembles on challenging multi-task offline RL settings, while offering simplicity and substantial computational savings.
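A condensed sketch of the UVU recipe as summarized above: a fixed, randomly initialized target network defines a synthetic reward, an online network is trained with TD learning against it, and the squared prediction error serves as the uncertainty estimate. Architectures and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

d_state, d_out, gamma = 8, 16, 0.99
target = mlp(d_state, d_out)                 # fixed, randomly initialized
for p in target.parameters():
    p.requires_grad_(False)
online = mlp(d_state, d_out)                 # learner trained with TD
opt = torch.optim.Adam(online.parameters(), lr=1e-3)

def td_step(s, s_next, done):
    """One semi-gradient TD update: the synthetic reward comes from the frozen target."""
    with torch.no_grad():
        r = target(s)                                        # synthetic reward
        bootstrap = r + gamma * (1 - done) * online(s_next)  # TD target
    loss = ((online(s) - bootstrap) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def uncertainty(s):
    """UVU-style estimate: squared error between online and target outputs."""
    with torch.no_grad():
        return ((online(s) - target(s)) ** 2).sum(dim=-1)

s, s_next = torch.randn(32, d_state), torch.randn(32, d_state)
td_step(s, s_next, done=torch.zeros(32, 1))
print(uncertainty(torch.randn(4, d_state)))
```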
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2107. The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Scaling Laws Misalignment Bias-Variance
As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand
the ways extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI's *incoherence* on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, we find that the longer models spend reasoning and taking actions, *the more incoherent* they become. We observe that incoherence changes with model scale in a way that is task and experiment dependent. However, in several settings larger, more capable models are more incoherent than smaller models.
Consequently, scale alone seems unlikely to eliminate incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior.
This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal.
This increases the relative importance of alignment research targeting reward hacking or goal misspecification.
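A small worked example of the incoherence measure as defined above: repeat a task under test-time randomness, decompose the mean squared error of a scalar task outcome into bias^2 + variance, and report the variance fraction. The scalar-outcome framing is a simplification of the paper's setup.

```python
import numpy as np

def incoherence(outcomes, target):
    """Fraction of a model's error attributable to variance rather than bias.

    outcomes: scalar task outcomes from repeated runs of the same task with
              different test-time randomness (e.g., sampling seeds).
    target:   the intended/correct outcome.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    bias_sq = (outcomes.mean() - target) ** 2
    variance = outcomes.var()
    total = bias_sq + variance            # equals the mean squared error vs. target
    return variance / total if total > 0 else 0.0

# systematically wrong but consistent (low incoherence) vs. a noisy "hot mess"
print(incoherence([7.0, 7.1, 6.9, 7.0], target=10.0))    # ~0.0006: mostly bias
print(incoherence([2.0, 14.0, 9.0, 16.0], target=10.0))  # ~0.998: mostly variance
```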
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2108. Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
๐ฌ 12
โญ 5.50
๐ generative models
๐ท๏ธ Image Generation Autoregressive model Tokenizer
Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which creates a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on ImageNet-256. Scaling to 662M, our model reaches a gFID of 1.28, surpassing the SOTA diffusion method with 10x faster sampling. On ImageNet-512, our 318M model also achieves a SOTA gFID of 1.39. Code and weights will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2109. Parallel Multimodal Diffusion Language Models for Thinking-Aware Editing and Generation
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ generation multimodal diffusion language model
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation.
To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image.
To resolve this, we propose a parallel multimodal diffusion framework that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. The model is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our approach significantly improves cross-modal alignment and semantic consistency, achieving a 6.9\% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2110. OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment
๐ฌ 4
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ alignment
Large language model (LLM) alignment faces a critical dilemma when addressing multiple human preferences: improvements in one dimension frequently come at the expense of others, creating unavoidable trade-offs between competing objectives like helpfulness and harmlessness. While prior work mainly focuses on constraint-based optimization algorithms and data selection strategies to mitigate conflicts, these approaches overlook the fundamental issue of resolving conflicts directly at the parameter level. In this paper, we present OrthAlign, an innovative approach that pioneers a new paradigm by leveraging orthogonal subspace decomposition to fundamentally resolve gradient-level conflicts in multi-objective preference alignment. OrthAlign strategically decomposes parameter update spaces into orthogonal subspaces, ensuring that optimization toward different preferences occurs in mathematically non-interfering directions. Building upon this, we provide theoretical guarantees demonstrating that when parameter increments satisfy both orthogonal subspace constraints and spectral norm bounds, the resulting updates exhibit linear Lipschitz growth rather than exponential instability, ensuring stable convergence across all preference dimensions. Extensive experiments show that: I. OrthAlign achieves maximum single-preference improvements ranging from 34.61% to 50.89% after multiple-objective alignment across helpful, harmless, and truthful dimensions. II. With an average overall reward improvement of 13.96%. Our code is available at https://anonymous.4open.science/r/OrthAlign.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2111. GenCP: Towards Generative Modeling Paradigm of Coupled physics with Application to Fluid-Structure Interaction
๐ฌ 19
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Coupled Physics Simulation Flow Matching Operator Splitting FSI
Real-world physical systems are inherently complex, often involving the coupling of multiple physics, making their simulation both highly valuable and challenging. Many mainstream approaches face challenges when dealing with decoupled data. Besides, they also suffer from low efficiency and fidelity in strongly coupled spatio-temporal physical systems. Here we propose GenCP, a novel and elegant generative paradigm for coupled multiphysics simulation. By formulating coupled-physics modeling as a probability modeling problem, our key innovation is to integrate probability density evolution in generative modeling with iterative multiphysics coupling, thereby enabling training on decoupled data and inferring coupled physics during sampling. We also utilize operator-splitting theory in the space of probability evolution to establish error controllability guarantees for this "conditional-to-joint" sampling scheme. Then we evaluate our paradigm on a synthetic setting and two challenging fluid-structure interaction scenarios to demonstrate both principled insight and superior application performance of GenCP.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2112. Exploring Interpretability for Visual Prompt Tuning with Cross-layer Concepts
๐ฌ 15
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ prompt tuning explainable AI knowledge discovery prototype learning
Visual prompt tuning offers significant advantages for adapting pre-trained visual foundation models to specific tasks. However, current research provides limited insight into the interpretability of this approach, which is essential for enhancing AI reliability and enabling AI-driven knowledge discovery. In this paper, rather than learning abstract prompt embeddings, we propose the first framework, named Interpretable Visual Prompt Tuning (IVPT), to explore interpretability for visual prompts by introducing cross-layer concept prototypes. Specifically, visual prompts are linked to human-understandable semantic concepts, represented as a set of category-agnostic prototypes, each corresponding to a specific region of the image. IVPT then aggregates features from these regions to generate interpretable prompts for multiple network layers, allowing the explanation of visual prompts at different network depths and semantic granularities. Comprehensive qualitative and quantitative evaluations on fine-grained classification benchmarks show its superior interpretability and performance over visual prompt tuning methods and existing interpretable methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2113. Massive Editing for Large Language Models Based on Dynamic Weight Generation
๐ฌ 17
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Knowledge Editing; Large Language Models; Weight Generation
Knowledge Editing (KE) is a field that studies how to modify some knowledge in Large Language Models (LLMs) at a low cost (compared to pre-training). Currently, performing large-scale edits on LLMs while ensuring the Reliability, Generality, and Locality metrics of the edits remains a challenge. This paper proposes a Massive editing approach for LLMs based on dynamic weight Generation (MeG). Our MeG involves attaching a dynamic weight neuron to specific layers of the LLMs and using a diffusion model to conditionally generate the weights of this neuron based on the input query required for the knowledge. This allows a single added dynamic weight neuron to achieve the goal of large-scale knowledge editing. Experiments show that our MeG can significantly improve the performance of large-scale KE in terms of Reliability, Generality, and Locality metrics compared to existing knowledge editing methods, particularly with a large absolute percentage-point gain on the Locality metric, demonstrating the advantages of our proposed method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2114. Neural Theorem Proving for Verification Conditions: A Real-World Benchmark
๐ฌ 12
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ neural theorem proving program verification ai for verification automated theorem proving Lean Isabelle Rocq
Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification, particularly VC proving, remains largely unexplored.
Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving.
This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC) and presents the first real-world multi-lingual benchmark for this task. Specifically, from real-world projects such as the Linux and Contiki-OS kernels, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across the formal languages Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2115. Learning Unified Representation of 3D Gaussian Splatting
๐ฌ 15
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Representation Learning 3D Gaussian Splatting
A well-designed vectorized representation is crucial for the learning systems natively based on 3D Gaussian Splatting. While 3DGS enables efficient and explicit 3D reconstruction, its parameter-based representation remains hard to learn as features, especially for neural-network-based models. Directly feeding raw Gaussian parameters into learning frameworks fails to address the non-unique and heterogeneous nature of the Gaussian parameterization, yielding highly data-dependent models. This challenge motivates us to explore a more principled approach to represent 3D Gaussian Splatting in neural networks that preserves the underlying color and geometric structure while enforcing unique mapping and channel homogeneity. In this paper, we propose an embedding representation of 3DGS based on continuous submanifold fields that encapsulate the intrinsic information of Gaussian primitives, thereby benefiting the learning of 3DGS.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2116. Set Representation Auxiliary Learning with Adversarial Encoding Perturbation and Optimization
๐ฌ 27
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Set Representation Learning Auxiliary Learning Adversarial Encoding Perturbation
Sets are a fundamental data structure, and learning their vectorized representations is crucial for many computational problems. Existing methods typically focus on intra-set properties such as permutation invariance and cardinality independence. While effective at preserving basic intra-set semantics, these approaches may be insufficient in explicitly modeling inter-set correlations, which are critical for tasks requiring fine-grained comparisons between sets. In this work, we propose SRAL, a Set Representation Auxiliary Learning framework for capturing inter-set correlations that is compatible with various downstream tasks. SRAL conceptualizes sets as high-dimensional distributions and leverages the 2-Sliced-Wasserstein distance to derive their distributional discrepancies into set representation encoding. More importantly, we introduce a novel adversarial auxiliary learning scheme. Instead of manipulating the input data, our method perturbs the set encoding process itself and compels the model to be robust against worst-case perturbations through a min-max optimization. Our theoretical analysis shows that this objective, in expectation, directly optimizes for the set-wise Wasserstein distances, forcing the model to learn highly discriminative representations. Comprehensive evaluations across four downstream tasks examine SRAL's performance relative to baseline methods, showing consistent effectiveness in both inter-set relation-sensitive retrieval and intra-set information-oriented processing tasks.
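A compact sketch of the 2-Sliced-Wasserstein distance between two sets of vectors treated as empirical distributions, estimated with random 1-D projections; this is the discrepancy SRAL injects into its encoding, here simplified by assuming equal set sizes.

```python
import numpy as np

def sliced_wasserstein_2(X, Y, n_projections=128, seed=0):
    """Monte Carlo estimate of the 2-Sliced-Wasserstein distance between two
    equally sized sets X, Y of d-dimensional points (rows = set elements)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)    # unit directions
    proj_x = np.sort(X @ theta.T, axis=0)                    # (n, L) sorted projections
    proj_y = np.sort(Y @ theta.T, axis=0)
    # 1-D W2^2 between sorted samples, averaged over samples and directions, then sqrt
    return float(np.sqrt(np.mean((proj_x - proj_y) ** 2)))

A = np.random.default_rng(1).normal(size=(50, 16))
B = np.random.default_rng(2).normal(loc=0.5, size=(50, 16))
print(sliced_wasserstein_2(A, A), sliced_wasserstein_2(A, B))  # 0.0 vs. > 0
```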
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2117. RNE: plug-and-play diffusion inference-time control and energy-based training
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ diffusion generative models SDE SMC sequential monte carlo
Diffusion models generate data by removing noise gradually, which corresponds to the time-reversal of a noising process.
However, access to only the denoising kernels is often insufficient.
In many applications, we need the knowledge of the marginal densities along the generation trajectory, which enables tasks such as inference-time control.
To address this gap, in this paper, we introduce the Radon-Nikodym Estimator (RNE).
Based on the concept of the density ratio between path distributions, it reveals a fundamental connection between marginal densities and transition kernels, providing a flexible plug-and-play framework that unifies diffusion density estimation, inference-time control, and energy-based diffusion training under a single perspective.
Experiments demonstrated that RNE delivers strong results in inference-time control applications, such as annealing and model composition, with promising inference-time scaling performance.
Moreover, RNE provides a simple yet efficient regularisation for training energy-based diffusion.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2118. ToolTree: Efficient LLM Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Tool Planning Monte-Carlo Tree Search Agent Tool Use
Large Language Model (LLM) agents are increasingly applied to complex, multi-step tasks that require interaction with diverse external tools across various domains. However, current LLM agent tool planning methods typically rely on greedy, reactive tool selection strategies that lack foresight and fail to account for inter-tool dependencies.
In this paper, we present ToolTree, a novel Monte-Carlo tree search-inspired paradigm for tool planning. ToolTree explores possible tool usage trajectories using a dual-stage LLM evaluation and bidirectional pruning mechanism that enables the agent to make informed, adaptive decisions over extended tool-use sequences while pruning less promising branches before and after tool execution. Empirical evaluations on both open-set and closed-set tool planning tasks across 4 benchmarks demonstrate that ToolTree consistently improves performance while maintaining the highest efficiency, achieving an average gain of around 10\% over the state-of-the-art planning paradigm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2119. SIM-CoT: Supervised Implicit Chain-of-Thought
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Chain-of-Thought large language model math reasoning
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption.
We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses.
Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods.
To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information.
The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead.
It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis.
SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2\% on GPT-2 and CODI by +3.0\% on LLaMA-3.1 8B.
It further surpasses the explicit CoT baseline on GPT-2 by 2.1\% with 2.3$\times$ greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B.
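As a rough illustration of step-level supervision through an auxiliary decoder, the PyTorch sketch below decodes each implicit latent token into the tokens of its corresponding explicit reasoning step with a cross-entropy loss; the single-linear-layer decoder, the fixed step length, and all shapes are simplifying assumptions of ours, not the paper's architecture.
```python
# Hedged sketch of step-level supervision for implicit CoT: each latent reasoning
# token is decoded by an auxiliary head trained to reproduce its explicit step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxStepDecoder(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int, step_len: int):
        super().__init__()
        self.step_len = step_len
        self.vocab_size = vocab_size
        # Maps one latent token to logits for each token of its explicit step.
        self.proj = nn.Linear(hidden_dim, step_len * vocab_size)

    def forward(self, latent_tokens: torch.Tensor) -> torch.Tensor:
        # latent_tokens: (batch, n_steps, hidden_dim)
        b, k, _ = latent_tokens.shape
        return self.proj(latent_tokens).view(b, k, self.step_len, self.vocab_size)

def step_supervision_loss(decoder, latent_tokens, explicit_step_ids):
    # explicit_step_ids: (batch, n_steps, step_len) token ids of each explicit step.
    logits = decoder(latent_tokens)
    return F.cross_entropy(logits.flatten(0, 2), explicit_step_ids.flatten())

# This auxiliary loss would be added to the implicit-CoT objective during training,
# and the decoder simply dropped at inference time.
dec = AuxStepDecoder(hidden_dim=768, vocab_size=32000, step_len=16)
latents = torch.randn(2, 4, 768)                # 4 implicit reasoning tokens
targets = torch.randint(0, 32000, (2, 4, 16))   # tokens of the 4 explicit steps
loss = step_supervision_loss(dec, latents, targets)
```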
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2120. SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy
๐ฌ 12
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ large language models distributed training silent data corruption fault-tolerance activation checkpointing parallelism
Dual Modular Redundancy (DMR) is a highly effective mechanism for detecting silent data corruption (SDC), a critical reliability concern in large language model (LLM) training, by executing each operation twice. However, its high computation overhead has prevented practical deployment at scale. In this paper, we present SpareTrain, an LLM training system that achieves complete DMR with minimal overhead by repurposing the activation checkpointing mechanism and exploiting idle GPU time. Evaluations on up to 32 H200 GPUs show that SpareTrain improves throughput by 12-35\% over naive DMR, corresponding to only 3-14\% overhead compared to unprotected training, while maintaining full DMR error detection capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2121. PALC: Preference Alignment via Logit Calibration
๐ฌ 12
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI alignment Representation Editing
Aligning Large Language Models with human preferences typically requires computationally intensive training or complex reward architectures. We introduce PALC (Preference Alignment via Logit Calibration), a parameter-efficient framework that achieves test-time alignment through a novel intervention strategy: direct calibration in vocabulary space. Unlike existing methods that manipulate entangled hidden representations or rely on external reward models, PALC operates at the logit layer where each dimension corresponds to a distinct token, providing interpretable and efficient control. Our approach employs a bottleneck architecture that learns to compress the base model's hidden states and generate position-dependent calibration vectors, requiring only a fraction of the base model's parameters. Through this design, PALC sidesteps the superposition problem inherent in representation engineering while eliminating the computational overhead of guided decoding methods. A single scaling factor enables runtime adjustment of alignment strength without retraining, allowing practitioners to balance between preserving model capabilities and enforcing preferences. Experiments demonstrate that PALC outperforms most test-time alignment methods while maintaining near-baseline inference speed. Our ablations reveal that human preferences concentrate on surprisingly low-dimensional manifolds, validating our architectural choices. By establishing vocabulary-space intervention as an effective alignment paradigm, PALC makes preference alignment accessible for resource-constrained deployments where traditional methods are infeasible, opening new avenues for scalable and adaptive AI alignment.
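The sketch below illustrates the kind of logit-space calibration the abstract describes: a small bottleneck reads the frozen model's hidden state and emits a vocabulary-sized calibration vector that is added to the base logits, scaled by a runtime factor. The module name, dimensions, and the tanh nonlinearity are our assumptions, not details from the paper.
```python
# Hedged sketch of vocabulary-space calibration with a runtime strength factor.
import torch
import torch.nn as nn

class LogitCalibrator(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int, vocab_size: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden_dim) -> per-position calibration logits
        return self.up(torch.tanh(self.down(hidden_states)))

def calibrated_logits(base_logits, hidden_states, calibrator, alpha: float = 1.0):
    # alpha adjusts alignment strength at inference time without retraining.
    return base_logits + alpha * calibrator(hidden_states)

# Example shapes for a hypothetical 32k-vocabulary model.
calib = LogitCalibrator(hidden_dim=4096, bottleneck_dim=64, vocab_size=32000)
h = torch.randn(1, 8, 4096)
base = torch.randn(1, 8, 32000)
out = calibrated_logits(base, h, calib, alpha=0.5)
```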
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2122. WebDS: An End-to-End Benchmark for Web-based Data Science
๐ฌ 11
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large Language Model Benchmark Data Science LLM Agents
Many real-world data science tasks involve complex web-based interactions: finding appropriate data available on the internet, synthesizing multimodal data from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions and rarely require diverse tool-using capabilities. Conversely, traditional data science benchmarks typically concentrate on static, highly structured datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step, tool-based operations across heterogeneous data formats, better reflecting the realities of modern data analytics. Evaluations of current SOTA LLM agents indicate significant performance gaps in accomplishing these tasks. For instance, Browser Use, which accomplishes 80% of tasks on Web Voyager, completes only 15% of tasks in WebDS, which our analysis attributes to new failure modes that agents display on WebDS' tasks, such as poor information grounding, repetitive behavior, and shortcut-taking. By contrast, humans achieve around 90% accuracy, highlighting a substantial gap between current agents and human performance. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2123. reAR: Rethinking Visual Autoregressive Models via Token-wise Consistency Regularization
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Visual Generation Autoregressive Model
Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
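A hedged reading of the regularization objective is sketched below: alongside next-token prediction, the transformer's hidden state at position t is also asked to reconstruct the tokenizer embedding of token t and to predict the embedding of token t+1. The head names, MSE losses, and weights are assumptions for illustration, not the paper's exact formulation.
```python
# Sketch of a token-wise consistency regulariser added to the usual AR loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenConsistencyHeads(nn.Module):
    def __init__(self, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.recover_current = nn.Linear(hidden_dim, embed_dim)
        self.predict_next = nn.Linear(hidden_dim, embed_dim)

def consistency_regulariser(heads, hidden, tok_embed, lam_cur=1.0, lam_next=1.0):
    # hidden: (B, T, H) transformer states; tok_embed: (B, T, E) tokenizer embeddings.
    cur_loss = F.mse_loss(heads.recover_current(hidden[:, :-1]), tok_embed[:, :-1])
    next_loss = F.mse_loss(heads.predict_next(hidden[:, :-1]), tok_embed[:, 1:])
    return lam_cur * cur_loss + lam_next * next_loss

heads = TokenConsistencyHeads(hidden_dim=1024, embed_dim=256)
h = torch.randn(2, 16, 1024)
e = torch.randn(2, 16, 256)
aux = consistency_regulariser(heads, h, e)  # added to the AR cross-entropy loss
```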
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2124. Cooperative Sheaf Neural Networks
๐ฌ 13
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ sheaves message-passing graphs neural networks
Sheaf neural networks (SNNs) leverage cellular sheaves to induce flexible diffusion processes on graphs, generalizing the diffusion mechanism of classical graph neural networks. While SNNs have been shown to cope well with heterophilic tasks and alleviate oversmoothing, we argue that there is further room for improving sheaf diffusion. More specifically, we show that SNNs do not allow nodes to independently choose how they cooperate with their neighbors, i.e., whether they convey and/or gather information to/from their neighbors. To address this issue, we first introduce the notion of cellular sheaves over directed graphs and characterize their in- and out-degree Laplacians. We then leverage our construction to propose Cooperative Sheaf Neural Network (CSNN). Additionally, we formally characterize its receptive field and prove that it allows nodes to selectively attend (listen) to arbitrarily far nodes while ignoring all others in their path, which is key to alleviating oversquashing. Our results on synthetic data empirically substantiate our claims, showing that CSNN can handle long-range interactions while avoiding oversquashing. We also show that CSNN performs strongly in heterophilic node classification and long-range graph classification benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2125. IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment
๐ฌ 17
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Instruction-guided video editing Benchmark suite Multimodal large language models Evaluation metrics
Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to adequately support the evaluation of instruction-guided video editing and further suffer from limited source diversity, narrow task coverage and incomplete evaluation metrics. To address these limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes. All data and code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2126. SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ VLM Hallucination Training Free
Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.
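One ingredient mentioned above, contrastive decoding against a perturbed visual input, can be summarized in a few lines. The sketch below is a generic formulation of that idea, not SHIELD's code; the weight alpha and the way the perturbed logits are obtained are assumptions.
```python
# Hedged sketch of contrastive decoding: logits under the clean visual input are
# contrasted with logits under a perturbed (e.g., adversarially attacked) input,
# down-weighting tokens the model prefers only because of encoder bias.
import torch

def contrastive_logits(clean_logits: torch.Tensor,
                       perturbed_logits: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    return (1 + alpha) * clean_logits - alpha * perturbed_logits

# Example with dummy logits over a hypothetical 32k vocabulary.
next_token = contrastive_logits(torch.randn(1, 32000),
                                torch.randn(1, 32000)).argmax(-1)
```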
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2127. Enabling True Global Perception in State Space Models for Visual Tasks
๐ฌ 25
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ State Space Model Frequency Domain Modulation Global Image Modeling Mathematical Definition
Despite the importance of global contextual modeling in visual tasks, a rigorous mathematical definition remains absent, and the concept is still largely described in heuristic or empirical terms. Existing methods either rely on computationally expensive attention mechanisms or are constrained by the recursive modeling nature of State Space Models (SSMs), making it challenging to achieve both efficiency and true global perception. To address this, we first propose a mathematical definition of global modeling for visual images, providing a theoretical foundation for designing globally-aware and interpretable models. Based on in-depth analysis of SSMs and frequency-domain modeling principles, we construct a complete theoretical framework that overcomes the limitations imposed by SSMs' recursive modeling mechanism from a frequency perspective, thereby adapting SSMs for global perception in image modeling. Guided by this framework, we design the Global-aware SSM (GSSM) module and formally prove that it satisfies the definitional requirements of global image modeling. GSSM leverages a Discrete Fourier Transform (DFT)-based modulation mechanism, providing precise front-end control over the SSM's modeling behavior, and enabling efficient global image modeling with linear-logarithmic complexity. Building upon GSSM, we develop GMamba, a plug-and-play module that can be seamlessly integrated at any stage of Convolutional Neural Networks (CNNs). Extensive experiments across multiple tasks, including object detection, semantic segmentation, and instance segmentation, and across diverse model architectures demonstrate that GMamba consistently outperforms existing global modeling modules, validating both the effectiveness of our theoretical framework and the rigor of the proposed definition.
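The following sketch shows one way DFT-based front-end modulation can look: the token sequence is modulated in the frequency domain by a learnable complex filter before it is handed to the SSM, so every position has access to global structure. The filter parameterisation and shapes are illustrative assumptions in the spirit of the abstract, not the GSSM design itself.
```python
# Hedged sketch of frequency-domain modulation ahead of an SSM layer.
import torch
import torch.nn as nn

class FrequencyModulation(nn.Module):
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # Learnable complex filter over (frequency, channel).
        self.filter = nn.Parameter(torch.ones(n_freq, dim, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        X = torch.fft.rfft(x, dim=1)
        return torch.fft.irfft(X * self.filter, n=x.shape[1], dim=1)

mod = FrequencyModulation(seq_len=196, dim=64)
tokens = torch.randn(2, 196, 64)
out = mod(tokens)   # globally modulated tokens, then passed to the SSM
```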
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2128. MAS$^2$: Self-Generative, Self-Configuring, Self-Rectifying Multi-Agent Systems
๐ฌ 5
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-Agent System LLM Agent
The past two years have witnessed the meteoric rise of Large Language Model (LLM)-powered multi-agent systems (MAS), which harness collective intelligence and exhibit a remarkable trajectory toward self-evolution. This paradigm has rapidly progressed from manually engineered systems that require bespoke configuration of prompts, tools, roles, and communication protocols toward frameworks capable of automated orchestration. Yet, dominant automatic multi-agent systems, whether generated by external modules or a single LLM agent, largely adhere to a rigid \textit{generate-once-and-deploy} paradigm, rendering the resulting systems brittle and ill-prepared for the dynamism and uncertainty of real-world environments.
To transcend this limitation, we introduce MAS$^2$, a paradigm predicated on the principle of recursive self-generation: a multi-agent system that autonomously architects bespoke multi-agent systems for diverse problems. Technically, we devise a ``\textit{generator-implementer-rectifier}'' tri-agent team capable of dynamically composing and adaptively rectifying a target agent system in response to real-time task demands. Collaborative Tree Optimization is proposed to train and specialize these meta-agents. Extensive evaluation across seven benchmarks reveals that MAS$^2$ achieves performance gains of up to $19.6\%$ over state-of-the-art MAS in complex scenarios such as deep research and code generation. Moreover, MAS$^2$ exhibits superior cross-backbone generalization, effectively leveraging previously unseen LLMs to yield improvements of up to $15.1\%$. Crucially, these gains are attained without incurring excessive token costs, as MAS$^2$ consistently resides on the Pareto frontier of cost-performance trade-offs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2129. MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Token Compression Efficient Video Understanding
The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. Nevertheless, visual language models (VLMs) still face significant computational overhead when scaled from images to the video domain.
When video data is too large (due to high frame rates and long durations), the inference cost of models increases sharply. This severely hinders their deployment and application in environments that require rapid responses and have limited computational resources.
Token compression for input videos is one of the promising directions, as effective compression schemes can significantly reduce computational overhead.
Most existing compression methods are based on training-free token merging strategies in either the spatial or temporal dimension. Although these methods reduce computational overhead, their training-free nature inevitably leads to information loss during token compression, resulting in a significant performance drop.
To address these challenges, we propose a Memory-Augmented Reinforcement Learning-based Token Compression (MARC) method for efficient video understanding that integrates structured retrieval with RL-based distillation.
Our proposed MARC is a retrieve-then-compress method, which employs a Visual Memory Retriever (VMR) tool and a Compression Group Relative Policy Optimization (C-GRPO) training strategy.
The Visual Memory Retriever first segments videos into event-level fragments and selects query-relevant clips. C-GRPO distills reasoning ability from a teacher network to a student network by encouraging the student's outputs to match the teacher's.
Extensive experiments on six video benchmarks demonstrate that our compression method achieves nearly identical accuracy to the 64-frame Qwen2.5-VL-3B baseline while using only one frame's worth of tokens as input, resulting in a 95% reduction in visual tokens. Moreover, our approach reduces GPU memory usage by 72% and generation latency by 23.9%.
These results demonstrate the strong potential of our compression method as a robust solution for RL-based post-training compression of large-scale models, enabling practical deployment in latency-sensitive and resource-constrained applications such as real-time video question answering, surveillance, and autonomous driving.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2130. Negative Pre-activations Differentiate Syntax
๐ฌ 18
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ entanglement Wasserstein distance negative pre-activation syntax interpretability
Modern large language models increasingly use smooth activation functions such as GELU or SiLU, allowing negative pre-activations to carry both signal and gradient. Nevertheless, many neuron-level interpretability analyses have historically focused on large positive activations, often implicitly treating the negative region as less informative, a carryover from the ReLU-era. We challenge this assumption and ask whether and how negative pre-activations are functionally utilized. We address this question by studying a sparse subpopulation of Wasserstein neurons whose output distributions deviate strongly from a Gaussian baseline and that functionally differentiate similar inputs. We show that this negative region plays a functional role rather than reflecting a mere gradient optimization side effect. A minimal, sign-specific intervention that zeroes only the negative pre-activations of a small set of Wasserstein neurons substantially increases perplexity and sharply degrades grammatical performance on BLiMP and TSE, whereas both random and perplexity-matched ablations of many more non-Wasserstein neurons in their negative pre-activations leave grammatical performance largely intact. Conversely, on a suite of non-grammatical benchmarks, the perplexity-matched control ablation is more damaging than the Wasserstein neuron ablation, yielding a double dissociation between syntax and other capabilities. Part-of-speech analysis localizes the excess surprisal to syntactic scaffolding tokens, layer-specific interventions show that small local degradations accumulate across depth, and training-dynamics analysis reveals that the same sign-specific ablation becomes more harmful as Wasserstein neurons emerge and stabilize. Together, these results identify negative pre-activations in a sparse subpopulation of Wasserstein neurons as an actively used substrate for syntax in smooth-activation language models.
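The sign-specific intervention described above can be expressed as a simple forward hook. The sketch below clamps only the negative pre-activations of a chosen set of neurons to zero while leaving everything else untouched; the hook interface and the module it attaches to are assumptions depending on the model implementation, not the authors' code.
```python
# Hedged sketch of the negative-pre-activation ablation for selected neurons.
import torch

def make_negative_ablation_hook(neuron_idx: torch.Tensor):
    def hook(module, inputs, output):
        # `output` is the pre-activation of the MLP's first linear layer: (..., d_ff)
        ablated = output.clone()
        cols = ablated[..., neuron_idx]
        ablated[..., neuron_idx] = torch.clamp(cols, min=0.0)  # zero only negatives
        return ablated
    return hook

# Hypothetical usage on a HuggingFace-style GPT-2 block (layer and attribute names
# depend on the model implementation and are assumptions here):
# target = model.transformer.h[6].mlp.c_fc
# handle = target.register_forward_hook(make_negative_ablation_hook(wasserstein_ids))
# ... run the evaluation ...
# handle.remove()
```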
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2131. Discovering heterogeneous synaptic plasticity rules via large-scale neural evolution
๐ฌ 19
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Synaptic plasticity Evolutionary computation Computational neuroscience
Synaptic plasticity is a fundamental substrate for learning and memory, where different synapse types exhibit distinct plasticity mechanisms. However, how functional behaviors emerge from heterogeneous synaptic plasticity mechanisms remains poorly understood. Here, we introduce a computational framework that harnesses Darwinian evolutionary principles to discover biologically plausible, heterogeneous synaptic plasticity rules within a biologically realistic model of the mouse primary visual cortex. Specifically, we parameterize several key factors related to synaptic plasticity, including presynaptic and postsynaptic spikes, their associated eligibility traces, and neuromodulatory signals. By integrating these factors via a truncated Taylor expansion, we construct a large-scale search space of candidate plasticity rules, with each rule containing over 2.6k optimizable parameters. Each rule is subsequently evaluated on both cross-domain visual task performance and biological validity. Leveraging a multi-objective evolutionary algorithm, we effectively navigate this high-dimensional search space to identify plasticity rules that are both biologically plausible and yield high task performance. We uncover diverse families of high-performing plasticity rules that achieve similar behavioral outcomes despite markedly different mathematical formulations, suggesting that real-world synaptic learning mechanisms may exhibit computational degeneracy. We further show that these biologically plausible rules are not only robust across network scales but also enable few-shot learning, offering a computational explanation for the emergence of innate ability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2132. Robustify Spiking Neural Networks via Dominant Singular Deflation under Heterogeneous Training Vulnerability
๐ฌ 20
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Neuromorphic computing Spiking neural networks robustness
Spiking Neural Networks (SNNs) process information via discrete spikes, enabling them to operate at remarkably low energy levels. However, our experimental observations reveal a striking vulnerability when SNNs are trained using the mainstream method of direct encoding combined with backpropagation through time (BPTT): even a single backward pass on data drawn from a slightly different distribution can lead to catastrophic network collapse. We refer to this phenomenon as the heterogeneous training vulnerability of SNNs. Our theoretical analysis attributes this vulnerability to the repeated inputs inherent in direct encoding and the gradient accumulation characteristic of BPTT, which together produce an exceptionally large Hessian spectral radius. To address this challenge, we develop a hyperparameter-free method called Dominant Singular Deflation (DSD). By orthogonally projecting the dominant singular components of gradients, DSD effectively reduces the Hessian spectral radius, thereby preventing SNNs from settling into sharp minima. Extensive experiments demonstrate that DSD not only mitigates the vulnerability of SNNs under heterogeneous training, but also significantly enhances overall robustness compared to key baselines, providing strong support for safer SNNs.
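One way to read "deflating the dominant singular component of a gradient" is the rank-1 subtraction sketched below, applied to each 2-D weight gradient before the optimizer step. This is our interpretation for illustration; the paper's exact projection may differ.
```python
# Hedged sketch of dominant singular deflation on a weight-matrix gradient.
import torch

def deflate_dominant_singular(grad: torch.Tensor) -> torch.Tensor:
    # grad: 2-D gradient of a weight matrix.
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    u0 = U[:, :1]            # dominant left singular vector
    v0 = Vh[:1, :]           # dominant right singular vector
    return grad - S[0] * (u0 @ v0)   # subtract the rank-1 dominant component

# Example: apply to every 2-D parameter gradient before optimizer.step().
w = torch.nn.Linear(64, 32)
loss = w(torch.randn(8, 64)).pow(2).mean()
loss.backward()
with torch.no_grad():
    w.weight.grad = deflate_dominant_singular(w.weight.grad)
```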
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2133. SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Oriented Object Detection
A consistent trend throughout the research of oriented object detection (OOD) has been the pursuit of maintaining comparable performance with fewer and weaker annotations. This is particularly crucial in the remote sensing domain, where the dense object distribution and a wide variety of categories contribute to prohibitively high costs. Based on the supervision level, existing OOD algorithms can be broadly grouped into fully supervised, semi-supervised, and weakly supervised methods. Within the scope of this work, we further categorize them to include sparsely supervised and partially weakly-supervised methods. To address the challenges of large-scale labeling, we introduce the first Sparse Partial Weakly-Supervised Oriented Object Detection (SPWOOD) framework, designed to efficiently leverage only a small amount of sparse weakly-labeled data and plenty of unlabeled data. Our framework incorporates three key innovations: (1) We design a Sparse-annotation-Orientation-and-Scale-aware Student (SOS-Student) model to separate unlabeled objects from the background in a sparsely-labeled setting, and learn orientation and scale information from orientation-agnostic or scale-agnostic weak annotations. (2) We construct a novel Multi-level Pseudo-label Filtering (MPF) strategy that leverages the distribution of model predictions, informed by the model's multi-layer predictions. (3) We propose a unique sparse partitioning approach, ensuring equal treatment for each category. Extensive experiments on the DOTA-v1.0 and v1.5 datasets show that the SPWOOD framework achieves a significant performance gain over the traditional OOD methods mentioned above, offering a highly cost-effective solution. Our code will be public soon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2134. Product-Quantised Image Representation for High-Quality Image Synthesis
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Vector Quantisation Representation Learning Diffusion Models Generative Models Image Synthesis
Product quantisation (PQ) is a classical method for scalable vector encoding, yet it has seen limited usage for latent representations in high-fidelity image generation.
In this work, we introduce PQGAN, a quantised image autoencoder that integrates PQ into the well-known vector quantisation (VQ) framework of VQGAN.
PQGAN achieves a noticeable improvement over state-of-the-art methods in terms of reconstruction performance, including both quantisation methods and their continuous counterparts. We achieve a PSNR score of 37dB, where prior work achieves 27dB, and reduce the FID, LPIPS, and CMMD scores by up to 96\%.
Our key to success is a thorough analysis of the interaction between codebook size, embedding dimensionality, and subspace factorisation, with vector and scalar quantisation as special cases. We obtain novel findings, for example that the performance of VQ and PQ behaves in opposite ways when scaling the embedding dimension. Furthermore, our analysis shows performance trends for PQ that help guide optimal hyperparameter selection.
Finally, we demonstrate that PQGAN can be seamlessly integrated into pre-trained diffusion models. This enables either a significantly faster and more compute-efficient generation, or a doubling of the output resolution at no additional cost, positioning PQ as a strong extension for discrete latent representation in image synthesis.
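For readers unfamiliar with the building block itself, the sketch below shows plain product quantisation: the latent vector is split into M subvectors and each is snapped to its nearest entry in a per-subspace codebook. Codebook sizes and dimensions are illustrative, and this is the generic PQ operation rather than PQGAN's trained autoencoder.
```python
# Minimal NumPy sketch of product quantisation (encode/decode).
import numpy as np

def pq_encode(z: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # z: (d,), codebooks: (M, K, d // M). Returns one code index per subspace.
    M, K, sub = codebooks.shape
    z_split = z.reshape(M, sub)
    # Nearest codeword per subspace under Euclidean distance.
    dists = ((z_split[:, None, :] - codebooks) ** 2).sum(-1)   # (M, K)
    return dists.argmin(axis=1)

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    return np.concatenate([codebooks[m, c] for m, c in enumerate(codes)])

rng = np.random.default_rng(0)
books = rng.normal(size=(4, 256, 8))   # M=4 subspaces, K=256 codes, 8 dims each
z = rng.normal(size=32)
codes = pq_encode(z, books)
z_hat = pq_decode(codes, books)        # quantised reconstruction of z
```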
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2135. GeoFAR: Geography-Informed Frequency-Aware Super-Resolution for Climate Data
๐ฌ 27
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ climate downscaling image super-resolution implicit neural representation earth observation environmental science
Super-resolving climate data is crucial for fine-grained decision-making in various domains, ranging from agriculture to environmental conservation. However, existing super-resolution approaches struggle to generate the high-frequency spatial information present in climate data, especially over regions showing complex terrain variability. A key obstacle lies in a frequency bias existing in both deep neural networks (DNNs) and climate data: DNNs exhibit such bias by overfitting to low-frequency information, which is further exacerbated by the prevalence of low-frequency components in climate data (e.g., plains, oceans). As a consequence, geography-dependent high-frequency details are hard to reconstruct from coarse climate inputs with DNNs. To improve the fidelity of climate super-resolution (SR), we introduce GeoFAR: by explicitly encoding climatic patterns at different frequencies, while learning implicit geographical neural representations (i.e., related to location and elevation), our approach provides frequency-aware and geography-informed representations for climate SR, thereby reconstructing fine-grained climate information at high resolution. Experiments show that GeoFAR is a model-agnostic approach that can mitigate high-frequency prediction errors in both deterministic and generative SR models, demonstrating state-of-the-art performance across various spatial resolutions, atmospheric variables, and downscaling ratios. Datasets and code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2136. ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Generation 3D Reconstruction
Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstruction incompleteness. Recent advancements in diffusion-based 3D generative techniques offer the potential to address these limitations by leveraging learned generative priors to ``hallucinate'' invisible parts of objects, thereby generating plausible 3D structures. However, the stochastic nature of the inference process limits the accuracy and reliability of generation results, preventing existing reconstruction frameworks from integrating such 3D generative priors. In this work, we comprehensively analyze the reasons why diffusion-based 3D generative methods fail to achieve high consistency, including (a) the insufficiency in constructing and leveraging cross-view connections when extracting multi-view image features as conditions, and (b) the poor controllability of iterative denoising during local detail generation, which easily leads to fine geometric and texture details that are plausible but inconsistent with the inputs. Accordingly, we propose ReconViaGen to innovatively integrate reconstruction priors into the generative framework and devise several strategies that effectively address these issues. Extensive experiments demonstrate that our ReconViaGen can reconstruct complete and accurate 3D models consistent with input views in both global structure and local details.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2137. One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration
๐ฌ 17
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Planning Domain Inference PDDL Robot Task Planning Task and Motion Planning LLMs for Planning Embodied AI
Pre-trained large language models (LLMs) show promise for robotic task planning but often struggle to guarantee correctness in long-horizon problems. Task and motion planning (TAMP) addresses this by grounding symbolic plans in low-level execution, yet it relies heavily on manually engineered planning domains. To improve long-horizon planning reliability and reduce human intervention, we present Planning Domain Derivation with LLMs (PDDLLM), a framework that automatically induces symbolic predicates and actions directly from demonstration trajectories by combining LLM reasoning with physical simulation roll-outs. Unlike prior domain-inference methods that rely on partially predefined domains or natural-language descriptions of planning domains, PDDLLM constructs domains without manual domain initialization and automatically integrates them with motion planners to produce executable plans, enhancing long-horizon planning automation. Across 1,200 tasks in nine environments, PDDLLM outperforms six LLM-based planning baselines, achieving at least 20% higher success rates, reduced token costs, and successful deployment on multiple physical robot platforms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2138. Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computational Pathology Whole Slide Image Classification Few-shot Learning Manifold Hypothesis
Few-shot Whole Slide Image (WSI) classification is severely hampered by overfitting. We argue that this is not merely a data-scarcity issue but a fundamentally geometric problem. Grounded in the manifold hypothesis, our analysis shows that features from pathology foundation models exhibit a low-dimensional manifold geometry that is easily perturbed by downstream models. This insight reveals a key potential issue in downstream multiple instance learning models: linear layers are geometry-agnostic and, as we show empirically, can distort the manifold geometry of the features. To address this, we propose the Manifold Residual (MR) block, a plug-and-play module that is explicitly geometry-aware. The MR block reframes the linear layer as residual learning and decouples it into two pathways: (1) a fixed, random matrix serving as a geometric anchor that approximately preserves topology while also acting as a spectral shaper to sharpen the feature spectrum; and (2) a trainable, low-rank residual pathway that acts as a residual learner for task-specific adaptation, with its structural bottleneck explicitly mirroring the low effective rank of the features. This decoupling imposes a structured inductive bias and reduces learning to a simpler residual fitting task. Through extensive experiments, we demonstrate that our approach achieves state-of-the-art results with significantly fewer parameters, offering a new paradigm for few-shot WSI classification. Code will be released upon acceptance.
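A hedged PyTorch sketch of a Manifold Residual-style block as described above is given here: a frozen random projection acts as the geometric anchor, and a trainable low-rank pathway adds the task-specific residual. The rank, dimensions, and initialization scale are illustrative assumptions, not the paper's configuration.
```python
# Sketch of a geometry-aware residual block: fixed random anchor + low-rank residual.
import torch
import torch.nn as nn

class ManifoldResidualBlock(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        anchor = torch.randn(in_dim, out_dim) / in_dim ** 0.5
        self.register_buffer("anchor", anchor)            # fixed, never trained
        self.down = nn.Linear(in_dim, rank, bias=False)   # trainable low-rank path
        self.up = nn.Linear(rank, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.anchor + self.up(self.down(x))

block = ManifoldResidualBlock(in_dim=1024, out_dim=512, rank=8)
feats = torch.randn(32, 1024)   # patch features from a pathology foundation model
out = block(feats)
```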
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2139. ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
๐ฌ 17
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ LLM evaluation MLLM evaluation ASCII art Visual Perception
Perceiving visual semantics embedded within consecutive characters is a crucial yet under-explored capability for both Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs). In this work, we select ASCII art as a representative artifact. It depicts concepts through careful arrangement of characters, which can be formulated in both text and image modalities. We frame the problem as a recognition task, and construct a novel benchmark, ASCIIEval. It covers over 3K samples with an elaborate categorization tree, along with a training set for further enhancement. Through a comprehensive analysis of tens of models across different input modalities, our benchmark demonstrates its multi-faceted diagnostic power. Given textual input, language models show their visual perception ability on ASCII art concepts. Proprietary models achieve over 70\% accuracy on certain categories, with GPT-5 topping the ranking. For image inputs, we reveal that open-source MLLMs suffer from a trade-off between fine-grained text recognition and collective visual perception. They exhibit limited generalization ability to this special kind of art, leading to a dramatic accuracy gap of over 20.01\% compared with their proprietary counterparts. Another critical finding is that model performance is sensitive to the length of the ASCII art, with this sensitivity varying across input modalities. Unfortunately, none of the models could successfully benefit from the simultaneous provision of both modalities, highlighting the need for more flexible modality-fusion approaches. We also introduce approaches for further enhancement and discuss future directions. Resources are available at https://anonymous.4open.science/r/VisionInText-9EE5.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2140. Practical estimation of the optimal classification error with soft labels and calibration
๐ฌ 14
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Bayes error irreducible error uncertainty quantification soft labels calibration evaluation
While the performance of machine learning systems has experienced significant improvement in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a means of answering this question in the setting of binary classification, which is practical and theoretically supported. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with _corrupted_ soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that _calibration guarantee is not enough_, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is _instance-free_, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory.
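For context, the soft-label Bayes-error estimator this line of work builds on is tiny: with clean soft labels approximating the class posterior, the binary Bayes error is estimated as the mean of min(p, 1-p). The sketch below shows only this baseline quantity; the corruption handling and isotonic calibration discussed in the abstract are omitted.
```python
# Minimal sketch of the clean-soft-label Bayes error estimate (not the paper's code).
import numpy as np

def bayes_error_from_soft_labels(soft_labels: np.ndarray) -> float:
    # soft_labels[i] approximates P(y = 1 | x_i) for each instance i.
    p = np.clip(soft_labels, 0.0, 1.0)
    return float(np.mean(np.minimum(p, 1.0 - p)))

# Example: well-separated classes give a low estimated optimal error.
p = np.array([0.95, 0.9, 0.02, 0.1, 0.85, 0.05])
print(bayes_error_from_soft_labels(p))   # ~0.078
```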
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2141. Uncovering Robot Vulnerabilities through Semantic Potential Fields
๐ฌ 23
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Multimodal Embedding
Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23\% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2142. MergeTune: Continued Fine-Tuning of Vision-Language Models
๐ฌ 17
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Vision-Language Models Continue Learning Parameter-Efficient Fine-Tuning Robust Fine-Tuning
Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MergeTune) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MergeTune improves the harmonic mean of CoOp by +5.6\% on base-novel generalisation without adding parameters.
On robust fine-tuning evaluations, the LMC-merged model from MergeTune surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model.
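The linear mode connectivity idea underlying this search can be made concrete with a small check: evaluate the loss along the linear interpolation between two parameter sets (e.g., zero-shot and fine-tuned soft prompts); low loss along the whole path is what the LMC constraint asks of the continued solution. The function names, the partial state-dict loading, and the number of interpolation points are our assumptions for illustration.
```python
# Hedged sketch of a linear-mode-connectivity path check between two solutions.
import torch

def interpolate_params(theta_a: dict, theta_b: dict, t: float) -> dict:
    return {k: (1 - t) * theta_a[k] + t * theta_b[k] for k in theta_a}

@torch.no_grad()
def path_losses(model, theta_a, theta_b, loss_fn, batch, n_points: int = 5):
    # loss_fn(model, batch) should run a forward pass and return a scalar loss.
    losses = []
    for t in torch.linspace(0.0, 1.0, n_points):
        model.load_state_dict(interpolate_params(theta_a, theta_b, float(t)),
                              strict=False)   # only the interpolated subset is loaded
        losses.append(float(loss_fn(model, batch)))
    return losses
```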
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2143. A^2TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computer Vision; Novel View Synthesis; Neural Rendering; 3D Gaussian Splatting
Gaussian Splatting has emerged as a powerful representation for high-quality, real-time 3D scene rendering. While recent works extend Gaussians with learnable textures to enrich visual appearance, existing approaches allocate a fixed square texture per primitive, leading to inefficient memory usage and limited adaptability to scene variability. In this paper, we introduce adaptive anisotropic textured Gaussians (A$^2$TG), a novel representation that generalizes textured Gaussians by equipping each primitive with an anisotropic texture. Our method employs a gradient-guided adaptive rule to jointly determine texture resolution and aspect ratio, enabling non-uniform, detail-aware allocation that aligns with the anisotropic nature of Gaussian splats. This design significantly improves texture efficiency, reducing memory consumption while enhancing image quality. Experiments on multiple benchmark datasets demonstrate that A$^2$TG consistently outperforms fixed-texture Gaussian Splatting methods, achieving comparable rendering fidelity with substantially lower memory requirements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2144. Time Optimal Execution of Action Chunk Policies Beyond Demonstration Speed
๐ฌ 22
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Accelerating Execution Speed of Imitation Policies Time Optimal Path Parameterization Test-time Search
Achieving both speed and accuracy is a central challenge for real-world robot manipulation. While recent imitation learning approaches, including vision-language-action (VLA) models, have achieved remarkable precision and generalization, their execution speed is often limited by slow demonstration via teleoperation and by inference latency. In this work, we introduce a method to accelerate any imitation policy that predicts action chunks, enabling speeds that surpass those of the original demonstration. A naive approach of simply increasing the execution frequency of predicted actions leads to significant state errors and task failure, as it alters the underlying transition dynamics and encounters physical reachability constraints over shorter time horizons. These errors are further amplified by misaligned actions based on outdated robot state when using asynchronous inference to accelerate execution. Our method $\textbf{\textit{RACE}}$ addresses these challenges with a three-part solution: 1) using desired states as imitation targets instead of commanded actions, 2) replanning the timing of action chunks to execute them as fast as the robot's physical limits allow, and 3) employing a test-time search for an aligned action chunk that maximizes controllability from the current state. Through extensive experiments in both simulation and the real world, we show that our method achieves up to a 4x acceleration over the original policy while maintaining a high success rate.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2145. OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework
๐ฌ 18
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Dynamic Data Pruning; Training Acceleration; Convergence Analysis; Bias Analysis
Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of training samples according to a well-defined pruning method while striving for near-lossless performance. However, existing approaches, which commonly select highly informative samples, can lead to biased gradient estimation compared to full-dataset training. Furthermore, the analysis of this bias and its impact on final performance remains ambiguous. To address these challenges, we propose OrderDP, a plug-and-play framework that aims to obtain stable, unbiased, and near-lossless training acceleration with theoretical guarantees. Specifically, OrderDP first randomly selects a subset and then chooses the top-$q$ samples, where unbiasedness is established with respect to a surrogate loss. This ensures that OrderDP conducts unbiased training in terms of the surrogate objective. We further establish convergence and generalization analyses, elucidating how OrderDP affects optimal performance and enables well-controlled acceleration while ensuring guaranteed final performance. Empirically, we evaluate OrderDP against comprehensive baselines on CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrating competitive accuracy, stable convergence, and exact control---all with a simpler design and faster runtime, while reducing training cost by over 40\%. Delivering both strong performance and computational efficiency, our method serves as a robust and easily adaptable tool for data-efficient learning.
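The two-stage selection described above admits a compact sketch: uniformly sample a random candidate subset, then keep the top-q samples by a per-sample score. The subset size, q, and the use of per-sample loss as the score are illustrative assumptions, not the paper's exact configuration.
```python
# Hedged sketch of random-subset-then-top-q sample selection.
import torch

def orderdp_style_select(scores: torch.Tensor, subset_size: int, q: int) -> torch.Tensor:
    # scores: (N,) per-sample informativeness scores for the current epoch.
    candidate = torch.randperm(scores.numel())[:subset_size]   # random stage
    top = scores[candidate].topk(q).indices                    # ordered stage
    return candidate[top]        # indices of samples used for this training step

losses = torch.rand(50_000)                  # e.g., per-sample losses
keep = orderdp_style_select(losses, subset_size=4096, q=1024)
```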
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2146. ARFlow: Auto-regressive Optical Flow Estimation for Arbitrary-Length Videos via Progressive Next-Frame Forecasting
๐ฌ 5
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Optical Flow Estimation
Optical flow estimation is a fundamental computer vision task that predicts per-pixel displacements from consecutive images. Recent works attempt to exploit temporal cues to improve the estimation performance. However, their temporal modeling is restricted to short video sequences due to the unaffordable computational burden, thereby suffering from restricted temporal receptive fields. Moreover, their group-wise paradigm in one forward pass undermines inter-group information exchange, leading to modest performance improvement. To address these problems, we propose a novel multi-frame optical flow network based on an auto-regressive paradigm, named ARFlow. Unlike previous multi-frame methods, our method scales to arbitrary-length videos with marginal computational overhead. Specifically, we design an Auto-regressive Flow Initialization (AFI) module and an Auto-regressive Multi-stride Flow Refinement (AMFR) module to forecast the next-frame flow based on multi-stride history observations. Our ARFlow achieves state-of-the-art performance, ranking 1st on both the KITTI-2015 and Spring official benchmarks and 2nd on the MPI-Sintel (Final) benchmark among all open-sourced methods. Furthermore, due to its auto-regressive nature, our method generalizes to arbitrary video lengths with a constant GPU memory usage of 2.1GB. The code will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2147. MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Medical AI Agentic AI
Modern clinical diagnosis relies on the comprehensive analysis of multi-modal patient data, drawing on medical expertise to ensure systematic and rigorous reasoning. Recent advances in Vision-Language Models (VLMs) and agent-based methods are reshaping medical diagnosis by effectively integrating multi-modal information. However, they often output direct answers and empirically driven conclusions without clinical evidence supported by quantitative analysis, which compromises their reliability and hinders clinical usability.
Here we propose MedAgent-Pro, an agentic reasoning paradigm that mirrors modern diagnosis principles via a hierarchical diagnostic workflow, consisting of disease-level standardized plan generation and patient-level personalized step-by-step reasoning. To support disease-level planning, a retrieval-augmented generation agent is designed to access medical guidelines for alignment with clinical standards. For patient-level reasoning, MedAgent-Pro leverages professional tools such as visual models to take various actions to analyze multi-modal input, and performs evidence-based reflection to iteratively adjust memory, enforcing rigorous reasoning throughout the process. Extensive experiments across a wide range of anatomical regions, imaging modalities, and diseases demonstrate the superiority of MedAgent-Pro over mainstream VLMs, agentic systems and leading expert models. Ablation studies and expert evaluation further confirm its robustness and clinical relevance. Anonymized code link is available in the reproducibility statement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2148. SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
๐ฌ 16
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Imitation Learning Reward Modeling Robotics Manipulation
Large-scale robot learning has made progress on complex manipulation tasks, yet long-horizon, contact-rich problems, especially those involving deformable objects, remain challenging due to inconsistent demonstration quality. We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural-language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame-index-based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates. Experiments show that our method significantly outperforms baselines in both real-world rollouts and human validation. On T-shirt folding, we achieve 83\% success from the flattened state and 67\% from the crumpled state, compared to 8\% and 0\% with vanilla BC. Overall, our results highlight reward modeling as a scalable and annotation-efficient solution for long-horizon robotic manipulation. Project website: https://qianzhong-chen.github.io/sarm.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2149. Membership Privacy Risks of Sharpness Aware Minimization
๐ฌ 17
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ membership inference attack sam sharpness aware minimization memorization benign overfitting
Optimization algorithms that seek flatter minima, such as Sharpness-Aware Minimization (SAM), are credited with improved generalization and robustness to noise. We ask whether such gains impact membership privacy. Surprisingly, we find that SAM is more prone to Membership Inference Attacks (MIA) than classical SGD across multiple datasets and attack methods, despite achieving lower test error. This suggests that the very geometric mechanism of SAM that improves generalization simultaneously exacerbates membership leakage. We investigate this phenomenon through extensive analysis of memorization and influence scores. Our results reveal that SAM is more capable of capturing atypical subpatterns, leading to higher memorization scores of samples. Conversely, SGD depends more heavily on majority features, exhibiting worse generalization on atypical subgroups and lower memorization. Crucially, this characteristic of SAM can be linked to lower variance in the output confidence for unseen samples, thereby amplifying membership signals. Finally, we model SAM under a perfectly interpolating linear regime and theoretically prove that geometric mechanism of SAM inherently reduces variance, guaranteeing a higher MIA advantage for confidence and likelihood ratio attacks.
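For reference, the SAM update the paper analyses follows the standard two-step recipe sketched below (our sketch, not code from the paper): perturb the weights toward the ascent direction, recompute gradients at the perturbed point, then step from the original weights. The perturbation radius rho and the loop structure are the usual defaults, assumed here.
```python
# Hedged sketch of one Sharpness-Aware Minimization (SAM) update step.
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho: float = 0.05):
    loss = loss_fn(model, batch)
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    with torch.no_grad():
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (norm + 1e-12)   # ascent direction, scaled to radius rho
            p.add_(e)
            eps.append(e)
    model.zero_grad()
    loss_fn(model, batch).backward()            # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                       # restore original weights
    base_optimizer.step()                       # step with the perturbed-point gradient
    base_optimizer.zero_grad()
```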
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2150. UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
๐ฌ 18
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Unified understanding and generation; Large language models; 3D generation; 3D vision; Spatial understanding
Despite the impressive progress on understanding and generating images shown by the recent unified architectures, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our unified framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations. This allows for the generation and imagination of 3D scenes based on a reference image and an arbitrary view transformation, while retaining support for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder. This design jointly captures the input's semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2151. Flow Map Learning via Games
๐ฌ 12
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ flow map flow matching diffusion density ode
Diffusion and flow-based models benefit from simple regression losses, but inference (i.e., producing samples) incurs significant computational overhead because it requires integration. Consistency models address this overhead by directly learning the flow maps along the ODE trajectory, revealing a design space for the learning problem between one-step and many-step approaches. However, existing consistency training methods feature computational challenges such as requiring model inverses or backpropagation through iterated model calls, and do not always prove that the desired ODE flow map is a solution to the loss. We introduce CurlFlow, an approach for learning flow maps that bypasses explicit invertibility constraints and expensive differentiation through model iteration. CurlFlow trains a model to compute both the ODE solutions and the implied velocity from scratch by following non-conservative dynamics (i.e., those with curl) with a stationary point at the desired flow map. On the CIFAR image benchmark, CurlFlow attains a favorable relationship of FID to step count, relative to flow matching, MeanFlow, and several other flow map learning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2152. Data Aware and Scalable Sensitivity Analysis for Decision Tree Ensembles
๐ฌ 5
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Robustness verification Sensitivity analysis SAT solvers efficient encodings NP-hardness fairness confidence decision tree ensembles MultiClass
Decision tree ensembles are widely used in critical domains, making robustness and sensitivity analysis essential to their trustworthiness. We study the feature sensitivity problem, which asks whether an ensemble is "sensitive" to a specified subset of features (such as protected attributes) whose manipulation can alter model predictions. Existing approaches often yield examples of sensitivity that lie far from the training distribution, limiting their interpretability and practical value. We propose a data-aware sensitivity framework that constrains the sensitive examples to remain close to the dataset, thereby producing realistic and interpretable evidence of model weaknesses. To this end, we develop novel techniques for data-aware search using a combination of mixed-integer linear programming (MILP) and satisfiability modulo theories (SMT) encodings. Our contributions are fourfold. Firstly, we strengthen the NP-hardness result for sensitivity verification, showing it holds even for trees of depth 1. Secondly, we develop MILP optimizations that significantly speed up sensitivity verification for single ensembles and, for the first time, can also handle multiclass tree ensembles. Thirdly, we introduce a data-aware framework generating realistic examples near the training distribution. Finally, we conduct an extensive experimental evaluation on large tree ensembles, demonstrating scalability to ensembles with up to 800 trees of depth 8, achieving substantial improvements over the state of the art. This framework provides a practical foundation for analyzing the reliability and fairness of tree-based models in high-stakes applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2153. VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model
๐ฌ 20
โญ 5.50
๐ generative models
๐ท๏ธ Reasoning Energy-based Diffusion model Monte Carlo tree search Test-time scaling
Inspired by human System 2 thinking, LLMs excel at complex reasoning tasks via extended Chain-of-Thought. However, similar test-time scaling for diffusion models to tackle complex reasoning remains largely unexplored. From existing work, two primary challenges emerge in this setting: (i) dependence on an external verifier, which marks a notable gap from the intrinsic reasoning of human intelligence that requires no external feedback, and (ii) the lack of an efficient search algorithm. In this paper, we introduce the Verifier-free Test-time Scalable Diffusion Model (VFScale) to achieve scalable intrinsic reasoning, which performs number-of-samples test-time scaling using the intrinsic energy function of diffusion models as the verifier. Concretely, VFScale comprises two key innovations to address the aforementioned challenges. On the training side, VFScale consists of a novel MRNCL loss and a KL regularization to improve the energy landscape, ensuring that the learned energy function itself serves as a reliable verifier. On the inference side, VFScale integrates the denoising process with a novel hybrid Monte Carlo Tree Search (hMCTS) to improve search efficiency. On challenging reasoning tasks of Maze and Sudoku, we demonstrate the effectiveness of VFScale's training objective and scalable inference method. In particular, trained with Maze sizes of up to 6×6, our VFScale solves 88\% of Maze problems with much larger sizes of 15×15, while a standard diffusion model completely fails.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2154. Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ generative super-resolution; vector-quantization
Vector-quantization-based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with the nearest codebook items and train the index predictor with code-level supervision. Due to the richness of the visual signal, VQ encoding often incurs large quantization error. Furthermore, training the predictor with code-level supervision cannot take the final reconstruction error into consideration, resulting in sub-optimal prior modeling accuracy.
In this paper, we address these two issues and propose a Texture Vector-Quantization strategy and a Reconstruction-Aware Prediction strategy. The texture vector-quantization strategy leverages the character of the super-resolution task and introduces a codebook only to model the prior of missing textures, while the reconstruction-aware prediction strategy uses the straight-through estimator to train the index predictor directly with image-level supervision. Our proposed generative SR model TVQ&RAP delivers photo-realistic SR results at a small computational cost.
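Because the reconstruction-aware prediction hinges on the straight-through estimator, here is a minimal sketch of that mechanism under generic assumptions; the codebook lookup and tensor shapes are illustrative, not the paper's implementation.

```python
import torch

def straight_through_quantize(z, codebook):
    """Nearest-neighbor quantization whose backward pass treats quantization as
    the identity: the forward pass uses the selected code, while gradients from
    an image-level loss flow back to the continuous features z (and to whatever
    predictor produced them)."""
    dists = torch.cdist(z, codebook)        # (N, K) distances from features to codes
    indices = dists.argmin(dim=-1)          # (N,) nearest-code indices
    z_q = codebook[indices]                 # quantized features
    return z + (z_q - z).detach(), indices
```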
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2155. When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM for Social Science Mandela Effect Multi-agent System Cognitive Bias
Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, contributing factors, and mitigation strategies. We propose ManBench, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on ManBench to quantify the Mandela effect, and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2156. Repurposing Foundation Model for Generalizable Medical Time Series Classification
๐ฌ 18
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Medical Time Series Classification Time Series Foundation Model
Medical time series (MedTS) classification suffers from poor generalizability
in real-world deployment due to inter- and intra-dataset heterogeneity, such as varying
numbers of channels, signal lengths, task definitions, and patient characteristics.
To address this, we propose FORMED, a novel framework for repurposing a backbone foundation model, pre-trained on generic time series, to enable highly generalizable MedTS classification on unseen datasets.
FORMED combines the backbone with a novel classifier comprising two components: (1) task-specific channel embeddings and label queries, dynamically sized to match any number of channels and target classes, and (2) a shared decoding attention layer, jointly trained across datasets to capture medical domain knowledge through task-agnostic feature-query interactions. After repurposing, FORMED achieves seamless adaptation to unseen MedTS datasets through lightweight label query training (0.1\% of parameters), eliminating the need for full fine-tuning or architectural redesign.
We evaluate FORMED on 5 diverse MedTS datasets, benchmarking against 11 Task-Specific Models (TSM) and 4 Task-Specific Adaptation (TSA) methods. Our results demonstrate FORMED's dominant performance, achieving up to 35\% absolute improvement in F1-score (on ADFTD dataset) over specialized baselines.
By decoupling domain-invariant representation learning from task-specific adaptation, FORMED establishes a scalable and resource-efficient paradigm for foundation model repurposing in healthcare. This approach prioritizes clinical adaptability over rigid task-centric design, offering a practical pathway for real-world implementation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2157. AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
๐ฌ 20
โญ 5.50
๐ generative models
๐ท๏ธ Viewpoint Planning in 4D Scenes; Video Model
Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning in given 4D scenes, since videos inherently pair dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction, in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is further introduced onto the pre-trained T2V model, taking the generated video and 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work demonstrates the potential of video generation models toward 4D interaction in the real world.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2158. Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer
๐ฌ 33
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting 3D Editing 3D Colorization 3D Generation
In this work, we present Color3D, a highly adaptable framework for colorizing both static and dynamic 3D scenes from monochromatic inputs, delivering visually diverse and chromatically vibrant reconstructions with flexible user-guided control. In contrast to existing methods that focus solely on static scenarios and enforce multi-view consistency by averaging color variations, which inevitably sacrifices both chromatic richness and controllability, our approach is able to preserve color diversity and steerability while ensuring cross-view and cross-time consistency. In particular, the core insight of our method is to colorize only a single key view and then fine-tune a personalized colorizer to propagate its color to novel views and time steps. Through personalization, the colorizer learns a scene-specific deterministic color mapping underlying the reference view, enabling it to consistently project corresponding colors to the content in novel views and video frames via its inherent inductive bias. Once trained, the personalized colorizer can be applied to infer consistent chrominance for all other images, enabling direct reconstruction of colorful 3D scenes with a dedicated Lab color space Gaussian splatting representation. The proposed framework ingeniously recasts complicated 3D colorization as a more tractable single-image paradigm, allowing seamless integration of arbitrary image colorization models with enhanced flexibility and controllability. Extensive experiments across diverse static and dynamic 3D colorization benchmarks substantiate that our method can deliver more consistent and chromatically rich renderings with precise user control. The code will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2159. AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
๐ฌ 25
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ UAV vision-language-action model robot learning
Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9\% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2160. Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
๐ฌ 16
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ AI for Healthcare mental health fairness bias dataset language models decision-making uncertainty expert annotation
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.
In psychiatry especially, these challenges are worsened by fairness and bias issues, since models can be swayed by patient demographics even when those factors should not influence clinical decisions.
Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage.
This U.S. centric dataset โ created without any LM assistance โ is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets.
Almost all base questions, each with five answer options, have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., for age or ethnicity), and are available for male-, female-, or non-binary-coded patients.
This design enables systematic evaluations of model performance and bias by studying how demographic factors affect decision-making.
For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations.
We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating sixteen off-the-shelf
and six (mental) health fine-tuned LMs on category-specific task accuracy, on the fairness impact of patient demographic information on decision-making, and on how consistently free-form responses deviate from human-annotated samples.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2161. TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video summarization Multimodal
The exponential growth of video content highlights the importance of video summarization, a task that efficiently extracts key information from long videos. However, existing video summarization studies face inherent limitations in understanding complex, multimodal videos. This limitation stems from the fact that most existing architectures employ static or modality-agnostic fusion, which fails to account for the dynamic and frame-dependent variation in modality saliency that naturally occurs within a video. To overcome these limitations, we propose a novel architecture, TripleSumm, which adaptively weights and fuses the contributions of the three modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Our proposed TripleSumm demonstrates its superiority by achieving state-of-the-art performance by a large margin on four video summarization benchmarks, including MoSu.
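As a rough illustration of the frame-level adaptive fusion described above, the following is a toy sketch under generic assumptions; the module name, feature dimensions, and gating design are hypothetical and not drawn from the paper.

```python
import torch
import torch.nn as nn

class FrameLevelModalityFusion(nn.Module):
    """Toy sketch of frame-level adaptive fusion of several modality streams:
    a small gate predicts per-frame weights over the modalities and mixes their
    projected features, so each frame can lean on whichever modality is most
    salient at that moment."""

    def __init__(self, dims, hidden=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.gate = nn.Linear(hidden * len(dims), len(dims))

    def forward(self, feats):                                   # list of (T, d_m) tensors
        projected = [p(f) for p, f in zip(self.proj, feats)]    # each (T, hidden)
        weights = torch.softmax(self.gate(torch.cat(projected, dim=-1)), dim=-1)
        stacked = torch.stack(projected, dim=-1)                # (T, hidden, M)
        return (stacked * weights.unsqueeze(1)).sum(dim=-1)     # (T, hidden)
```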
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2162. Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ off-policy evaluation; ranking; common support; deterministic logging
Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS) and Click-based Doubly Robust (CDR), which exploit the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias compared to strong baselines, particularly in settings with completely deterministic logging policies.
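To make the contrast concrete, here is a hedged numpy sketch of a standard position-wise IPS estimator alongside the click-probability-weighted idea sketched in the abstract; the function names and exact weighting are illustrative, and the true CIPS/CDR estimators are defined in the paper.

```python
import numpy as np

def position_wise_ips(rewards, target_probs, logging_probs):
    """Standard position-wise IPS: reweight logged position-level rewards by the
    ratio of target to logging propensities. This breaks down when logging_probs
    contains zeros, i.e. under a deterministic logging policy."""
    w = target_probs / np.clip(logging_probs, 1e-12, None)
    return np.mean(np.sum(w * rewards, axis=1))

def click_weighted_estimate(clicks, target_click_probs, logged_click_probs):
    """Hypothetical sketch of the click-based idea: use the stochasticity of user
    clicks (rather than of the logging policy) as the importance weight, so the
    estimator stays well-defined even when logging is deterministic."""
    w = target_click_probs / np.clip(logged_click_probs, 1e-12, None)
    return np.mean(np.sum(w * clicks, axis=1))
```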
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2163. Long Chain-of-Thought Reasoning Across Languages
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multilingual Reasoning Long Chain-of-Thought
While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development (scaling, pretraining, post-training, and inference) to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces
automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes
in CoTs. We release models, datasets, and code to foster further research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2164. VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
๐ฌ 20
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Vision-language Models Multimodal Large Language Models Comparative Reasoning Benchmark Visual Question Answering
The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce **VLM-SubtleBench**, a benchmark designed to evaluate VLMs on *subtle comparative reasoning*. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action) and curates paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2165. DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
๐ฌ 24
โญ 5.50
๐ reinforcement learning
๐ท๏ธ text-to-image generation reinforcement learning flow matching preference alignment group relative policy optimization
Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation.
Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, resulting in a mismatch between the global feedback signals and the exact fine-grained contributions at intermediate denoising steps.
To address this issue, we introduce \textbf{DenseGRPO}, a novel framework that aligns human preference with dense rewards, which evaluates the fine-grained contribution of each denoising step.
Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as the dense reward of each denoising step, applying a reward model to the intermediate clean images obtained via an ODE-based approach. This ensures alignment between feedback signals and the contributions of individual steps, facilitating effective training;
and (2) based on the estimated dense rewards, we reveal a mismatch between the uniform exploration setting of existing GRPO-based methods and the time-varying noise intensity, which leads to an inappropriate exploration space.
Thus, we propose a reward-aware scheme to calibrate the exploration space by adaptively adjusting a timestep-specific stochasticity injection in the SDE sampler, ensuring a suitable exploration space at all timesteps.
Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of the proposed DenseGRPO and highlight the critical role of the valid dense rewards in flow matching model alignment.
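As a rough illustration of turning per-step dense rewards into group-relative advantages, here is a minimal numpy sketch; the reward-gain definition via first differences and the normalization are illustrative assumptions, since the paper's ODE-based reward-gain estimator is more involved.

```python
import numpy as np

def group_relative_advantages(step_rewards):
    """step_rewards: array of shape (G, T) holding a dense reward at each of T
    denoising steps for each of G trajectories sampled for the same prompt.

    Per-step reward gains are normalized within the group, so each denoising
    step receives feedback matched to its own contribution instead of a single
    terminal reward shared by all steps."""
    gains = np.diff(step_rewards, axis=1, prepend=step_rewards[:, :1])
    mean = gains.mean(axis=0, keepdims=True)
    std = gains.std(axis=0, keepdims=True)
    return (gains - mean) / (std + 1e-8)
```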
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2166. Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmark Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks.
However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability.
Building such high-quality evaluation benchmarks is challenging, given the physical complexity of the human body and the difficulty of annotating granular structures.
In this paper, we propose Human-MME, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features:
**(1) Diversity in human scenes**, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage.
**(2) Progressive and diverse evaluation dimensions**, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional multi-target and causal reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite.
**(3) High-quality annotations with rich data paradigms**, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. Our benchmark extends the single-person and single-image understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex question-answer pairs of their combination. The extensive experiments on 20 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding and reasoning. Data and code are available at [https://anonymous.4open.science/r/Human-MME-FDE7](https://anonymous.4open.science/r/Human-MME-FDE7).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2167. Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Attribute Transfer Portrait Animation
We present Durian, the first method for generating portrait animation videos with cross-identity attribute transfer from one or more reference images to a target portrait. Training such models typically requires attribute pairs of the same individual, which are rarely available at scale. To address this challenge, we propose a self-reconstruction formulation that leverages ordinary portrait videos to learn attribute transfer without explicit paired data. Two frames from the same video act as a pseudo pair: one serves as an attribute reference and the other as an identity reference. To enable this self-reconstruction training, we introduce a Dual ReferenceNet that processes the two references separately and then fuses their features via spatial attention within a diffusion model. To make sure each reference functions as a specialized stream for either identity or attribute information, we apply complementary masking to the reference images.
Together, these two components guide the model to reconstruct the original video, naturally learning cross-identity attribute transfer.
To bridge the gap between self-reconstruction training and cross-identity inference, we introduce a mask expansion strategy and augmentation schemes, enabling robust transfer of attributes with varying spatial extent and misalignment. Durian achieves state-of-the-art performance on portrait animation with attribute transfer. Moreover, its dual reference design uniquely supports multi-attribute composition and smooth attribute interpolation within a single generation pass, enabling highly flexible and controllable synthesis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2168. TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
๐ฌ 23
โญ 5.50
๐ generative models
๐ท๏ธ Text-to-Video Generation Test-Time Optimization Memory
Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relations).
In this work, we introduce **Test-Time Optimization and Memorization (TTOM)**, a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment.
Rather than directly intervening on latents or attention per sample as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective.
Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete.
Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization.
Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2169. UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autonomous Driving Feed-Forward Scene Reconstruction 3D Gaussian Splatting
Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics.
We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2170. Interleaving Reasoning for Better Text-to-Image Generation
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Interleaving Reasoning Text-to-Image Generation
Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o.
Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve text-to-image (T2I) generation.
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces a text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics.
To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image.
We curate IRGL-300K, a 300K-scale dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories.
Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data.
Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity.
As an early exploration, our results demonstrate that interleaving reasoning is a powerful paradigm for advancing T2I.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2171. SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ multimodal mobile agent offline evaluation
VLM-based mobile agents are increasingly popular due to their ability to interact with smartphone GUIs and XML-structured texts and to complete daily tasks. However, existing online benchmarks fail to obtain stable critical reward signals under dynamic environmental changes, and neglect the influence of noise components and interactive instructions. Offline benchmarks evaluate agents through single-path trajectories, which stand in contrast to the inherently multi-solution nature of GUI tasks. To address these limitations, we introduce SMAN-Bench, a benchmark designed to evaluate agents under Single-path, Multi-path, Ambiguous, and Noisy task settings. We employ a slot-based instruction generation method to match templates with GUI trajectories from an existing, graph-structured, unlabeled mobile corpus. SMAN-Bench includes a common task split, with offline multi-path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ad apps, and a contaminated split named AITZ-Noise to simulate a realistic noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent's proactive interaction capabilities. Our evaluation covers mobile agent frameworks like AppAgent-v1, Mobile-Agent-v2, and Mobile-Agent-E, and includes both open-source and closed-source mobile foundation models, as well as several multimodal reasoning models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2172. Primary-Fine Decoupling for Action Generation in Robotic Imitation
๐ฌ 17
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ dexterous manipulation multi-modal policy MeanFlow action decoupling robot imitation learning
The multi-modal distribution of robotic manipulation action sequences poses critical challenges for imitation learning.
To address this, existing approaches often model the action space as either a discrete set of tokens or a continuous, latent-variable distribution.
However, both approaches present trade-offs: methods that discretize actions into tokens lose fine-grained action variations, while those that generate continuous actions in a single stage tend to produce unstable mode transitions.
To address these limitations, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. First, we compress action chunks into a small set of discrete modes, enabling a lightweight policy to select consistent coarse modes and avoid mode bouncing. Second, a mode conditioned MeanFlow policy is learned to generate high-fidelity continuous actions.
Theoretically, we prove that PF-DAG's two-stage design achieves a strictly lower MSE bound than single-stage generative policies.
Empirically, PF-DAG outperforms state-of-the-art baselines across 56 tasks from Adroit, DexArt, and MetaWorld benchmarks. It further generalizes to real-world tactile dexterous manipulation tasks. Our work demonstrates that explicit mode-level decoupling enables both robust multi-modal modeling and reactive closed-loop control for robotic manipulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2173. Frequency-aware Dynamic Gaussian Splatting
๐ฌ 12
โญ 5.50
๐ generative models
๐ท๏ธ 4D reconstruction Gaussian Splatting Deformation network
We present \textbf{Frequency-Aware Dynamic Gaussian Splatting (FAGS)}, a novel approach to mitigating motion blur in 4D reconstruction, particularly under novel viewpoints. This blur stems from a fundamental spectral conflict in existing methods, which struggle to \textbf{balance high-frequency rendering details with high-frequency motion.}
FAGS addresses this challenge with two key innovations. First, we introduce a frequency-differentiated Gaussian kernel that refines the alpha-blending process of 3D Gaussian Splatting. By adaptively classifying Gaussians into two types (a slowly varying kernel for smooth, low-frequency regions and a sharp-transitioning kernel for high-frequency boundaries), our method explicitly separates representation responsibilities, preserving fine details without sacrificing continuity.
Second, we propose a Fourier-Deformation Network that enhances motion expressiveness. This network employs high-frequency Fourier embeddings to capture diverse motion patterns by learning amplitudes across frequency components. To further improve accuracy, we integrate a frequency-aware gate into the fusion module, which predicts and regulates the relative deformation of each Gaussian.
Extensive experiments on both synthetic and real-world 4D benchmarks demonstrate that FAGS significantly reduces motion blur and enhances structural details, achieving state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2174. Flow Caching for Autoregressive Video Generation
๐ฌ 18
โญ 5.50
๐ generative models
๐ท๏ธ Autoregressive video generation chunkwise caching KV cache compression ultra-long video synthesis video acceleration
Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps.
In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by an importance-based KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality.
Our method achieves remarkable speedups of $\textbf{2.38}\times$ on MAGI-1 and $\textbf{6.7}\times$ on SkyReels-V2, with negligible quality degradation (VBench: $0.87\uparrow$ and $0.79\downarrow$ respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation, establishing a new benchmark for efficient video synthesis at scale. The code is available at https://anonymous.4open.science/r/FlowCache-23495iclrAnonymous
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2175. Boosting Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LLM judge tool-integrated reasoning
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a Python executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that enables bootstrapping directly from a base model without distillation. On six public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distillation, matches the performance of the distilled variants, showing that tool-augmented judges can self-improve through iterative reinforcement learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2176. Steering MoE LLMs via Expert (De)Activation
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Steering MoE Mixture-of-Experts LLM
Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework to steer MoE models by detecting and controlling behavior-associated experts. We detect key experts by comparing how often they activate between paired inputs that demonstrate opposite behaviors (e.g., safe vs. unsafe). By selectively activating or deactivating such experts during inference, we control behaviors like faithfulness and safety without fine-tuning. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. Conversely, unsafe steering alone drops safety by 41%, and by 100% when combined with existing jailbreak methods, bypassing all safety guardrails. Overall, SteerMoE offers a lightweight, effective, and widely applicable test-time control, while revealing unique vulnerabilities in MoE LLMs.
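As a rough sketch of the expert-detection step described above, the snippet below ranks experts by how much more often they fire on one side of a behavior-contrastive input pair; the trace format, counting statistic, and `top_k` cutoff are illustrative assumptions rather than the paper's exact procedure.

```python
from collections import Counter

def find_behavior_experts(router_traces_pos, router_traces_neg, top_k=16):
    """router_traces_*: lists of expert-id sequences recorded by the router while
    the model processes paired inputs exhibiting opposite behaviors
    (e.g., safe vs. unsafe). Returns expert ids most associated with the
    'positive' behavior, ranked by activation-rate difference."""
    pos, neg = Counter(), Counter()
    for trace in router_traces_pos:
        pos.update(trace)
    for trace in router_traces_neg:
        neg.update(trace)
    n_pos = sum(pos.values()) or 1
    n_neg = sum(neg.values()) or 1
    score = {e: pos[e] / n_pos - neg[e] / n_neg for e in set(pos) | set(neg)}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```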
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2177. Counterfactual Structural Causal Bandits
๐ฌ 14
โญ 5.50
๐ causal reasoning
๐ท๏ธ causal inference counterfactual inference structural causal bandits causal decision making
Causal reasoning lies at the heart of robust and generalizable decision-making, and the *Pearl Causal Hierarchy* provides a formal language for distinguishing between observational ($\mathcal{L}_1$), interventional ($\mathcal{L}_2$), and counterfactual ($\mathcal{L}_3$) levels of reasoning. Existing bandit algorithms that leverage causal knowledge have primarily operated within the $\mathcal{L}_1$ and $\mathcal{L}_2$ regimes, treating each realizable and physical intervention as a distinct arm. That is, they have largely excluded counterfactual quantities due to their perceived inaccessibility. In this paper, we introduce a *counterfactual structural causal bandit* (ctf-SCB) framework which expands the agent's feasible action space beyond conventional observational and interventional arms to include a class of realizable counterfactual actions. Our framework offers a principled extension of structural causal bandits and paves the way for integrating counterfactual reasoning into sequential decision-making.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2178. MotionStream: Real-Time Video Generation with Interactive Motion Controls
๐ฌ 12
โญ 5.50
๐ generative models
๐ท๏ธ Interactive Video Generation Motion Control Real-Time Generation Causal Generation
Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons -- (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.
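To illustrate the fixed-memory rollout idea (attention sinks plus a rolling KV window), here is a minimal sketch under assumed tensor shapes; the window size, number of sinks, and cache layout are illustrative and not the paper's settings.

```python
import torch

def roll_kv_cache(k_cache, v_cache, new_k, new_v, window=1024, num_sink=4):
    """Append new keys/values, then keep the first `num_sink` positions
    (attention sinks) plus the most recent `window` positions, so memory and
    per-step attention cost stay constant for arbitrarily long rollouts.
    Tensors are assumed to have shape (batch, heads, time, head_dim)."""
    k = torch.cat([k_cache, new_k], dim=2)
    v = torch.cat([v_cache, new_v], dim=2)
    if k.shape[2] > num_sink + window:
        k = torch.cat([k[:, :, :num_sink], k[:, :, -window:]], dim=2)
        v = torch.cat([v[:, :, :num_sink], v[:, :, -window:]], dim=2)
    return k, v
```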
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2179. MIRACLE: Model-free Imitation and Reinforcement Learning for Adaptive Cut-Selection
๐ฌ 41
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Model-Based Reinforcement Learning Adversarial Reward Learning Proximal Policy Optimization Mixed-Integer Programming Combinatorial Optimization
Mixed-Integer Programming (MIP) solvers rely heavily on cutting planes to tighten LP relaxations, but traditional approaches generate thousands of cuts that consume gigabytes of memory while providing minimal benefit. We present an intelligent cut selection framework that achieves a 98.1\% reduction in memory usage while maintaining competitive solving performance, with an objective gap of approximately 0.08\%. Within this RL framework, we use Proximal Policy Optimization (PPO) to learn a behavioral model that imitates the expert solver's decisions. The adversarially imitated behavioral model drives an agent comprising two key innovations: (i) a cut-selection policy trained via curriculum learning; and (ii) adaptive inference that dynamically adjusts computational budgets. Through comprehensive evaluation across SetCover and diverse MIPLIB problems, we demonstrate consistent speedups (3.78$\times$ average on MIPLIB) and achieve a 100\% success rate on instances where traditional SCIP fails 47-53\% of the time. Our method also reduces peak memory consumption from 3.03GB to 46 MB, enabling optimization in previously inaccessible, resource-constrained environments where traditional solvers face fundamental limitations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2180. Null-Space Filtering for Data-free Continual Model Merging: Preserving Transparency, Promoting Fidelity
๐ฌ 21
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Model Merging Model Merging
Data-free continual model merging (DFCMM) aims to fuse independently fine-tuned models into a single backbone that evolves with incoming tasks without accessing task data. This paper formulates two fundamental desiderata for DFCMM: transparency, avoiding interference with earlier tasks, and fidelity, adapting faithfully to each new task. This poses a challenge that existing approaches fail to address: how to bridge data-level desiderata with parameter-space optimization to ensure transparency and fidelity in the absence of task data. To this end, we propose NUFILT (NUll-space FILTering), a data-free framework that directly links these desiderata to optimization. Our key observation is that task vectors approximately align with representation subspaces, providing structural surrogates for enforcing transparency and fidelity. Accordingly, we design a null-space projector that preserves prior responses by filtering out overlapping components of new task vectors, thereby ensuring transparency, and a lightweight LoRA adapter that injects complementary task-specific signals, enabling fidelity in adapting to new tasks. The adapter is trained with a projection-based surrogate loss to retain consistency with previous knowledge while introducing novel directions. This joint filtering-adaptation process allows the backbone to absorb new knowledge while retaining existing behaviors, and the updates are finally fused back in a layer-wise linear fashion without extra parameters or inference cost. Theoretically, we establish approximate subspace alignment guarantees that justify null-space filtering. Empirically, NUFILT achieves state-of-the-art performance with minimal forgetting on both vision and NLP benchmarks, improving average accuracy by 4-7\% over OPCM and WUDI-Merging, while narrowing the gap to fine-tuning and reducing computation overhead.
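For intuition about the null-space filtering step, here is a hedged per-layer sketch that removes from a new task vector the components lying in the subspace spanned by prior task vectors; the SVD-based construction and the rank threshold are illustrative assumptions, and the actual NUFILT projector and its LoRA adapter are defined in the paper.

```python
import torch

def null_space_filter(new_task_vector, prior_task_vectors, rank=None):
    """Project out the part of a new (flattened, per-layer) task vector that
    overlaps with the subspace spanned by prior task vectors, keeping only the
    directions orthogonal to them."""
    P = torch.stack(prior_task_vectors)               # (num_prior, d)
    U, S, Vh = torch.linalg.svd(P, full_matrices=False)
    r = rank or int((S > 1e-6 * S.max()).sum().item())
    basis = Vh[:r]                                     # orthonormal rows spanning prior tasks
    overlap = basis.T @ (basis @ new_task_vector)      # component inside the prior subspace
    return new_task_vector - overlap                   # null-space (non-interfering) component
```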
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2181. Concept Insertion Success over Time in Diffusion Models through Prompt-Conditioned Interventions
๐ฌ 18
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ temporal analysis concept emergence diffusion models explainability
Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images.
Yet, generation unfolds along a trajectory, and understanding this dynamic process is crucial for explaining how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: *when* does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of *Concept Insertion Success* (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting *when* interventions are most effective without requiring access to model internals or training, and yield quantitatively stronger edits that achieve a better balance of semantic accuracy and content preservation than strong baselines.
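To make the CIS definition concrete, here is a hedged sketch of how such a probability could be estimated empirically; `sample_with_insertion` (run the sampler with the base prompt and switch to the concept prompt from step t onward) and `concept_detector` are hypothetical callables, not an API from the paper.

```python
def concept_insertion_success(sample_with_insertion, concept_detector,
                              base_prompt, concept_prompt, timesteps, n_samples=32):
    """Estimate CIS(t): the fraction of generations in which a concept inserted
    at denoising step t survives into the final image."""
    cis = {}
    for t in timesteps:
        hits = sum(
            concept_detector(sample_with_insertion(base_prompt, concept_prompt, t))
            for _ in range(n_samples)
        )
        cis[t] = hits / n_samples
    return cis
```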
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2182. Universal Properties of Activation Sparsity in Modern Large Language Models
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs activation sparsity efficiency representations
Activation sparsity is an intriguing property of deep neural networks that has been extensively studied in ReLU-based models, due to its advantages for efficiency, robustness, and interpretability.
However, methods relying on exact zero activations do not directly apply to modern Large Language Models (LLMs), leading to fragmented, model-specific strategies for LLM activation sparsity and a gap in its general understanding.
In this work, we introduce a general framework for evaluating sparsity robustness in contemporary LLMs and conduct a systematic investigation of this phenomenon in their feedforward (FFN) layers.
Our results uncover universal properties of activation sparsity across diverse model families and scales.
Importantly, we observe that the potential for effective activation sparsity grows with model size, highlighting its increasing relevance as models scale.
Furthermore, we present the first study of activation sparsity in diffusion-based LLMs.
Overall, our work provides a comprehensive perspective and practical guidance for harnessing activation sparsity in LLM design and acceleration.
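As a rough illustration of probing activation sparsity in models without exact-zero (ReLU) activations, the snippet below keeps only the largest-magnitude fraction of FFN hidden activations; the `keep_ratio` knob and top-k thresholding are illustrative assumptions, not the paper's evaluation protocol.

```python
import torch

def sparsify_ffn_activations(h, keep_ratio=0.1):
    """Zero out all but the largest-magnitude fraction of hidden activations in
    an FFN layer, so downstream quality can be compared against the dense model
    to measure sparsity robustness."""
    k = max(1, int(keep_ratio * h.shape[-1]))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]   # kth-largest magnitude per row
    return torch.where(h.abs() >= thresh, h, torch.zeros_like(h))
```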
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2183. Group Critical-token Policy Optimization for Autoregressive Image Generation
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Autoregressive Image Generation Text-to-Image Generation Reinforcement learning
Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress.
However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens to RLVR training remain unexplored.
In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them.
To tackle this challenge, we propose $\textbf{G}$roup $\textbf{C}$ritical-token $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{GCPO}$), which facilitates effective policy optimization on critical tokens.
We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically:
$\textbf{(1)}$ Causal dependency: early tokens fundamentally determine the later tokens and the final image due to the unidirectional dependency;
$\textbf{(2)}$ Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions;
$\textbf{(3)}$ RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity.
For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model.
By leveraging 30\% of the image tokens, GCPO achieves better performance than GRPO with full tokens.
Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2184. UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Inversion Image Editing Rectified Flow Models Iterative Generation Models Diffusion Models
Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2185. KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Embedding Model LLMs Retriever
Recent advancements in Large Language Model (LLM)-based text embedding models primarily focus on data scaling or synthesis, yet offer limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs by superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation, to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters. The code, data, and models will be publicly available to facilitate academic research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2186. Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense
๐ฌ 17
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Jailbreak Defense Self-Alignment Intrinsic Safety
Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal: models intrinsically "know" when to refuse. We introduce Safety Instincts Reinforcement Learning (*SIRL*), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. *SIRL* teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, *SIRL* maintains 89\%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to automated attacks. Using only 15,000 unlabeled prompts, *SIRL* surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
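A minimal sketch of the kind of entropy-derived self-reward described above, assuming per-token log-probabilities from the policy's own forward pass are available; the function names and the exponential mapping are illustrative, not the paper's exact formulation:

```python
import numpy as np

def sequence_entropy(token_logprobs):
    """Mean per-token entropy of the next-token distributions, given a
    (seq_len, vocab) array of log-probabilities."""
    probs = np.exp(token_logprobs)
    return float((-(probs * token_logprobs).sum(axis=-1)).mean())

def entropy_self_reward(token_logprobs, tau=1.0):
    """Self-generated reward: confident (low-entropy) responses, e.g. firm
    refusals to harmful prompts, receive higher reward -- no external validator."""
    return float(np.exp(-sequence_entropy(token_logprobs) / tau))
```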
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2187. Cache-to-Cache: Direct Semantic Communication Between Large Language Models
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Multi-LLM
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 6.4–14.2% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.1–5.4%, while delivering an average 2.5× speedup in latency.
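A minimal PyTorch sketch of the projection-plus-gating idea for a single layer's cache (dimensions, module structure, and names are assumptions for illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class CacheFuser(nn.Module):
    """Illustrative cache-to-cache fusion for one layer: project the source
    model's K/V tensors into the target's dimension, then blend with a gate."""
    def __init__(self, src_dim, tgt_dim):
        super().__init__()
        self.proj_k = nn.Linear(src_dim, tgt_dim)
        self.proj_v = nn.Linear(src_dim, tgt_dim)
        self.gate = nn.Sequential(nn.Linear(2 * tgt_dim, 1), nn.Sigmoid())

    def forward(self, src_k, src_v, tgt_k, tgt_v):
        # src_*: (batch, seq, src_dim); tgt_*: (batch, seq, tgt_dim)
        k_hat, v_hat = self.proj_k(src_k), self.proj_v(src_v)
        g = self.gate(torch.cat([tgt_k, k_hat], dim=-1))  # per-position gate in [0, 1]
        fused_k = g * k_hat + (1 - g) * tgt_k
        fused_v = g * v_hat + (1 - g) * tgt_v
        return fused_k, fused_v
```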
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2188. SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
๐ฌ 17
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ TinyML uncertainty quantification single-pass inference depth-wise next-activation prediction selective prediction on-device monitoring
We introduce SNAP-UQ, a single-pass, label-free uncertainty method for TinyML that estimates risk from depth-wise next-activation prediction: tiny int8 heads forecast the statistics of the next layer from a compressed view of the previous one, and a lightweight monotone mapper turns the resulting surprisal into an actionable score. The design requires no temporal buffers, auxiliary exits, or repeated forward passes, and adds only a few tens of kilobytes to MCU deployments. Across vision and audio backbones, SNAP-UQ consistently reduces flash and latency relative to early-exit and deep ensembles (typically $\sim$40--60\% smaller and $\sim$25--35\% faster), with competing methods of similar accuracy often exceeding memory limits. In corrupted streams it improves accuracy-drop detection by several AUPRC points and maintains strong failure detection (AUROC $\approx$0.9) in a single pass. Grounding uncertainty in layer-to-layer dynamics yields a practical, resource-efficient basis for on-device monitoring in TinyML.
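A minimal sketch of depth-wise next-activation surprisal, assuming each tiny head outputs a Gaussian mean and log-variance for the next layer's activation statistics; the logistic mapper and all names are illustrative:

```python
import numpy as np

def layer_surprisal(pred_mean, pred_logvar, observed):
    """Negative log-likelihood (up to a constant) of the observed next-layer
    statistics under the head's Gaussian prediction."""
    var = np.exp(pred_logvar)
    return float(np.mean(0.5 * (pred_logvar + (observed - pred_mean) ** 2 / var)))

def risk_score(surprisals, a=1.0, b=0.0):
    """Lightweight monotone mapper turning aggregated surprisal into an
    actionable score in [0, 1] for selective prediction and on-device monitoring."""
    s = float(np.mean(surprisals))
    return 1.0 / (1.0 + np.exp(-(a * s + b)))
```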
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2189. How Stable is the Next Token? A Geometric View of LLM Prediction Stability
๐ฌ 11
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Post Training
Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM's internal state to perturbations. We introduce the Token Constraint Bound (TCB), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, TCB provides insights into the stability of the model's internal predictive commitment. Our experiments show TCB correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. TCB offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
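One natural way to make such a bound concrete (an illustration consistent with the abstract's description, not necessarily the paper's exact definition) is the margin of the final hidden state $h$ to the nearest logit decision boundary in the output embedding space:

$$
\mathrm{TCB}(h) \;=\; \min_{j \neq y^{\star}} \frac{(w_{y^{\star}} - w_j)^{\top} h + (b_{y^{\star}} - b_j)}{\lVert w_{y^{\star}} - w_j \rVert_2},
$$

where $w_k, b_k$ are the rows and biases of the output embedding (unembedding) matrix and $y^{\star}$ is the current top-1 token; any perturbation of $h$ with $\ell_2$ norm below this value leaves the argmax prediction unchanged.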
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2190. How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use
๐ฌ 25
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reinforcement Learning Imperfect Information Game Strategic Reasoning
As Large Language Models (LLMs) are increasingly applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical. Poker provides a rigorous testbed, requiring not only strong actions but also principled, game-theoretic reasoning. In this paper, we conduct a systematic study of LLMs in multiple realistic poker tasks, evaluating both gameplay outcomes and reasoning traces. Our analysis reveals LLMs fail to compete against traditional algorithms and identifies three recurring flaws: reliance on heuristics, factual misunderstandings, and a "knowing-doing" gap where actions diverge from reasoning. An initial attempt with behavior cloning and step-level reinforcement learning improves reasoning style but remains insufficient for accurate game-theoretic play. Motivated by these limitations, we propose ToolPoker, a tool-integrated reasoning framework that combines external solvers for GTO-consistent actions with more precise professional-style explanations. Experiments demonstrate that ToolPoker achieves state-of-the-art gameplay while producing reasoning traces that closely reflect game-theoretic principles.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2191. Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification
๐ฌ 22
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multimodal Learning Dynamic Neural Network Missing Modality
Multimodal deep learning (MDL) has achieved remarkable success across various domains, yet its practical deployment is often hindered by incomplete multimodal data. Existing incomplete MDL methods either discard missing modalities, risking the loss of valuable task-relevant information, or recover them, potentially introducing irrelevant noise, leading to the discarding-imputation dilemma. To address this dilemma, in this paper, we propose DyMo, a new inference-time dynamic modality selection framework that adaptively identifies and integrates reliable recovered modalities, fully exploring task-relevant information beyond the conventional discard-or-impute paradigm. Central to DyMo is a novel selection algorithm that maximizes multimodal task-relevant information for each test sample. Since direct estimation of such information at test time is intractable due to the unknown data distribution, we theoretically establish a connection between information and the task loss, which we compute at inference time as a tractable proxy. Building on this, a novel principled reward function is proposed to guide modality selection. In addition, we design a flexible multimodal network architecture compatible with arbitrary modality combinations, alongside a tailored training strategy for robust representation learning. Extensive experiments on diverse natural and medical datasets show that DyMo significantly outperforms state-of-the-art incomplete/dynamic MDL methods across various missing-data scenarios. Our code will be available at https://github.com/anonymous.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2192. ODE-GS: Latent ODEs for Dynamic Scene Extrapolation with 3D Gaussian Splatting
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ gaussian splatting latent ODE extrapolation reconstruction
We introduce ODE-GS, a novel approach that integrates 3D Gaussian Splatting with latent neural ordinary differential equations (ODEs) to enable future extrapolation of dynamic 3D scenes. Unlike existing dynamic scene reconstruction methods, which rely on time-conditioned deformation networks and are limited to interpolation within a fixed time window, ODE-GS eliminates timestamp dependency by modeling Gaussian parameter trajectories as continuous-time latent dynamics. Our approach first learns an interpolation model to generate accurate Gaussian trajectories within the observed window, then trains a Transformer encoder to aggregate past trajectories into a latent state evolved via a neural ODE. Finally, numerical integration produces smooth, physically plausible future Gaussian trajectories, enabling rendering at arbitrary future timestamps. On the D-NeRF, NVFi, and HyperNeRF benchmarks, ODE-GS achieves state-of-the-art extrapolation performance, improving metrics by 19.8% compared to leading baselines, demonstrating its ability to accurately represent and predict 3D scene dynamics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2193. ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computer-using agents AI4Research Multimodal LLM
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2194. FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ generative modelling faster inference.
Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without using the full neural network models used for velocity predictions. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancements along the denoising path at zero compute cost. This enables skipping computation at intermediary steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over $2.6\times$ while maintaining high-quality outputs.
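A minimal sketch of the two ingredients named above (finite-difference velocity extrapolation, and an epsilon-greedy bandit over the number of steps to skip); the reward definition and all names are illustrative assumptions:

```python
import numpy as np

def extrapolate(x, v_prev, v_prev2, dt, n_skip):
    """Advance the flow state for n_skip steps with a finite-difference velocity
    estimate instead of full network evaluations (zero extra model compute)."""
    v_hat = 2.0 * v_prev - v_prev2             # linear extrapolation of velocity
    return x + n_skip * dt * v_hat

class SkipBandit:
    """Epsilon-greedy bandit whose arms are 'skip k steps before the next full
    model call'; the reward should balance speed against output quality."""
    def __init__(self, max_skip=4, eps=0.1, seed=0):
        self.values = np.zeros(max_skip + 1)
        self.counts = np.zeros(max_skip + 1)
        self.eps = eps
        self.rng = np.random.default_rng(seed)

    def choose(self):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.values)))
        return int(np.argmax(self.values))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```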
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2195. FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory
๐ฌ 15
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large Language model Neural Theorem Proving Machine Learning Theory
Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4,937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2196. Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
๐ฌ 20
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Generalization Large Language Models
This paper presents *the first study of grokking in practical LLM pretraining*. Specifically, we investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Unlike existing works studying when a small model generalizes to limited and specified tasks during thousands of epochs of training on algorithmic data, we focus on a practical setting for LLMs, i.e., near single-pass pretraining of next-token prediction on a cross-domain, large-scale corpus, and generalization on diverse benchmark tasks covering math/commonsense reasoning, code generation, and domain-specific retrieval. Our study, *for the first time, verifies that grokking still emerges in pretraining mixture-of-experts (MoE) LLMs*, though different local data groups may enter their grokking stages asynchronously due to the heterogeneity of their distributions and attributions to others. To find a mechanistic interpretation of this local grokking, we investigate the dynamics of training data's pathways (i.e., expert choices across layers in MoE). Our primary discovery is that *the pathways evolve from random, non-smooth across layers, instance-specific to more structured and transferable across samples*, despite the converged pretraining loss. This depicts a transition from memorization to generalization. Two novel metrics are developed to quantify these patterns: one computes the pathway similarity between samples, while the other measures the consistency of aggregated experts between subsequent layers for each sample. These training-data-based metrics incur near-zero cost but can faithfully track and monitor the generalization of LLMs on downstream tasks, reducing reliance on costly instruction tuning and benchmark evaluations. We also ground our findings in a theoretical analysis of one-layer MoE, showing that more structured pathways improve the generalization bound.
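A minimal sketch of the two pathway metrics described above, assuming each sample's pathway is available as a list of per-layer sets of chosen experts; the exact definitions in the paper may differ:

```python
import numpy as np

def pathway_similarity(path_a, path_b):
    """Between-sample metric: mean Jaccard overlap of the experts chosen at each
    layer (path_a, path_b are lists of per-layer sets of expert indices)."""
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(path_a, path_b)]))

def layer_consistency(path):
    """Within-sample metric: how consistent the chosen experts are between
    subsequent layers of one sample's pathway."""
    return float(np.mean([len(path[i] & path[i + 1]) / len(path[i] | path[i + 1])
                          for i in range(len(path) - 1)]))
```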
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2197. NewtonGen: Physics-consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
๐ฌ 19
โญ 5.50
๐ generative models
๐ท๏ธ Generative Models Video Generation
A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control. All data and code will be public.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2198. VideoNSA: Native Sparse Attention Scales Video Understanding
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Efficient Video Understanding Sparse Attention
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. **Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video.** Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global–local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2199. Self-Improving Loops for Visual Robotic Planning
๐ฌ 22
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ visual planning self-improvement video models
Video generative models trained on expert demonstrations have been utilized as performant text-conditioned visual planners for solving robotic tasks. However, generalization to unseen tasks remains a challenge. While improved generalization may be facilitated by leveraging prior knowledge learned from additional pre-collected offline data sources, such as web-scale video datasets, in the era of experience we aim to design agents that continuously improve online from self-collected behaviors. In this work, we thus propose the Self-Improving Loops for Visual Robotic Planning (SILVR), where an in-domain video model iteratively updates itself on self-produced trajectories, and steadily improves its performance for a specified task of interest. We apply SILVR to a diverse suite of MetaWorld tasks, as well as two manipulation tasks on a real robot arm, and find that performance improvements continuously emerge over multiple iterations for novel tasks unseen during initial in-domain video model training. We demonstrate that SILVR is robust in the absence of human-provided ground-truth reward functions or expert-quality demonstrations, and is preferable to alternate approaches that utilize online experience in terms of performance and sample efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2200. 3D Aware Region Prompted Vision Language Model
๐ฌ 4
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Language Models Spatial Reasoning
We present Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space on scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements. We show more qualitative results anonymously at https://sr3d-iclr.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2201. Value Gradient Flow: Behavior-Regularized RL without Regularization
๐ฌ 16
โญ 5.50
๐ reinforcement learning
๐ท๏ธ behavior-regularized rl offline rl rlhf optimal transport
We study behavior-regularized reinforcement learning (RL), which encompasses offline RL and RL from human feedback (RLHF). In both settings, regularization toward a reference distribution (offline data in offline RL or the supervised-finetuned policy in RLHF) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods typically add distance or divergence penalties on the learning objective, which introduces optimization challenges and over-conservatism. In this paper, we propose Value Gradient Flow (VGF), a new paradigm for behavior-regularized RL. VGF formulates an optimal transport problem from the reference distribution to the optimal policy distribution induced by the value function. This problem is solved via discrete gradient flow, where value gradients guide particles sampled from the reference distribution. Our theoretical analysis shows that an implicit behavior regularization is imposed by controlling the transport budget. This formulation avoids unnecessary restrictions on the optimization problem, enabling better reward maximization. Moreover, VGF operates without explicit policy parameterization while remaining expressive and flexible, allowing adaptively test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and challenging RLHF tasks.
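A minimal sketch of the discrete gradient-flow step with an explicit transport budget, the mechanism the abstract describes; the step size, budget accounting, and names are assumptions:

```python
import numpy as np

def value_gradient_flow(particles, grad_q, step=0.05, budget=1.0, max_iters=1000):
    """Move particles drawn from the reference distribution along the value
    gradient grad_q until the total transport budget (sum of mean step lengths)
    is spent; the budget implicitly plays the role of behavior regularization."""
    x = np.array(particles, dtype=float)
    spent = 0.0
    for _ in range(max_iters):
        if spent >= budget:
            break
        g = grad_q(x)                                    # dQ/dx for each particle
        x = x + step * g
        spent += step * float(np.mean(np.linalg.norm(g, axis=-1)))
    return x
```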
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2202. VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Reference-based reward bench Reward for reinforcement learning
Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2203. Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
๐ฌ 17
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph machine learning node classification
One of the most challenging problems in graph machine learning is generalizing across graphs with diverse properties. Graph neural networks (GNNs) face a fundamental limitation: they require separate training for each new graph, preventing universal generalization across diverse graph datasets. A critical challenge facing GNNs lies in their reliance on labeled training data for each individual graph, a requirement that hinders the capacity for universal node classification due to the heterogeneity inherent in graphs --- differences in homophily levels, community structures, and feature distributions across datasets. Inspired by the success of large language models (LLMs) that achieve in-context learning through massive-scale pre-training on diverse datasets, we introduce NodePFN. This universal node classification method generalizes to arbitrary graphs without graph-specific training. NodePFN learns posterior predictive distributions (PPDs) by training only on thousands of synthetic graphs generated from carefully designed priors. Our synthetic graph generation covers real-world graphs through the use of random networks with controllable homophily levels and structural causal models for complex feature-label relationships. We develop a dual-branch architecture combining context-query attention mechanisms with local message passing to enable graph-aware in-context learning. Extensive evaluation on 23 benchmarks demonstrates that a single pre-trained NodePFN achieves 71.27% average accuracy. These results validate that universal graph learning patterns can be effectively learned from synthetic priors, establishing a new paradigm for generalization in node classification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2204. Bridging Degradation Discrimination and Generation for Universal Image Restoration
๐ฌ 24
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Degradation Discrimination Universal Image Restoration
Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model's capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2205. ARTDECO: Toward High-Fidelity On-the-Fly Reconstruction with Hierarchical Gaussian Structure and Feed-Forward Guidance
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D reconstruction On-the-fly reconstruction
On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2206. Beyond Uniformity: Regularizing Implicit Neural Representations through a Lipschitz Lens
๐ฌ 13
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Implicit Neural Representations Neural Fields Lipschitz Regularization
Implicit Neural Representations (INRs) have shown great promise in solving inverse problems, but their lack of inherent regularization often leads to a trade-off between expressiveness and smoothness. While Lipschitz continuity presents a principled form of implicit regularization, it is often applied as a rigid, uniform 1-Lipschitz constraint, limiting its potential in inverse problems. In this work, we reframe Lipschitz regularization as a flexible *Lipschitz budget framework*. We propose a method to first derive a principled, task-specific total budget $K$, then proceed to distribute this budget *non-uniformly* across all network components, including linear weights, activations, and embeddings. Across extensive experiments on deformable registration and image inpainting, we show that non-uniform allocation strategies provide a measure to balance regularization and expressiveness within the specified global budget. Our *Lipschitz lens* introduces an alternative, interpretable perspective to Neural Tangent Kernel (NTK) and Fourier analysis frameworks in INRs, offering practitioners actionable principles for improving network architecture and performance.
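A minimal sketch of one way a global Lipschitz budget $K$ could be split non-uniformly across linear layers and enforced by spectral rescaling; the allocation rule here is an illustrative assumption, not the paper's method:

```python
import numpy as np

def allocate_and_rescale(weights, total_budget, alloc=None):
    """Assign layer i a constant c_i = K^{alloc_i} with sum(alloc) = 1, so that
    prod_i c_i = K, then rescale each weight matrix to that spectral norm."""
    n = len(weights)
    alloc = np.full(n, 1.0 / n) if alloc is None else np.asarray(alloc) / np.sum(alloc)
    constants = total_budget ** alloc
    rescaled = [W * (c / np.linalg.norm(W, 2))       # norm(., 2) = largest singular value
                for W, c in zip(weights, constants)]
    return rescaled, constants

# e.g. give the first layer twice the share of the budget:
# layers, cs = allocate_and_rescale([W1, W2, W3], total_budget=4.0, alloc=[2, 1, 1])
```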
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2207. Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting
๐ฌ 28
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting Representation Learning Alignment
Although contrastive and other representation-learning methods have long been explored in vision and NLP, their adoption in modern time series forecasters remains limited. We believe they hold strong promise for this domain. To unlock this potential, we explicitly align past and future representations, thereby bridging the distributional gap between input histories and future targets. To this end, we introduce TimaAlign, a lightweight, plug-and-play framework that establishes a new representation paradigm, distinct from contrastive learning, by aligning auxiliary features via a simple reconstruction task and feeding them back into any base forecaster. Extensive experiments across eight benchmarks verify its superior performance. Further studies indicate that the gains arise primarily from correcting frequency mismatches between historical inputs and future outputs. Additionally, we provide two theoretical justifications for how reconstruction improves forecasting generalization and how alignment increases the mutual information between learned representations and predicted targets. Code is in supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2208. Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
๐ฌ 5
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ MLLM Self-Distillation Fine-Grained Perception
Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive.
While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process.
In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations.
To validate our approach, we integrate the framework into the LLaVA-1.5 architecture. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10\% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2209. Towards Spatial Supersensing in Video
๐ฌ 28
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Model Super Sensing Model Spatial Understanding Video Understanding Memory
We frame spatial supersensing in video as an overarching goal for multimodal intelligence and argue that progress requires a shift from long-context brute force to predictive sensing. Using a four-level taxonomy (semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling), we audit existing benchmarks and show they focus heavily on the first tier, with only partial coverage of streaming and spatial cognition, and almost never test true world modeling. To ground these gaps, we introduce VSI-Super, a two-part benchmark for continual spatial sensing: VSO (long-horizon spatial observation and recall) and VSC (continual counting under changing viewpoints and scenes). These tasks admit arbitrarily long video inputs and are specifically built so that simply scaling tokens or context length isn't enough. Within the current paradigm, we push spatial cognition by curating VSI-590K and training a new family of video MLLMs that deliver a 30% absolute gain on VSI-Bench without sacrificing general semantic perception. Yet these models still underperform on VSI-Super, exposing a paradigm gap. We then prototype predictive sensing: a self-supervised next latent-frame predictor whose surprise (prediction error) drives long-horizon memory and event segmentation. On VSI-Super, this approach substantially outperforms leading video MLLMs, evidencing that advancing spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2210. Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond
๐ฌ 18
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ normalizing constant free energy Jarzynski equality annealed importance sampling reverse diffusion samplers
Given an unnormalized probability density $\pi\propto\mathrm{e}^{-V}$, estimating its normalizing constant $Z=\int_{\mathbb{R}^d}\mathrm{e}^{-V(x)}\mathrm{d}x$ or free energy $F=-\log Z$ is a crucial problem in Bayesian statistics, statistical mechanics, and machine learning. It is challenging especially in high dimensions or when $\pi$ is multimodal. To mitigate the high variance of conventional importance sampling estimators, annealing-based methods such as Jarzynski equality and annealed importance sampling are commonly adopted, yet their quantitative complexity guarantees remain largely unexplored. We take a first step toward a non-asymptotic analysis of annealed importance sampling. In particular, we derive an oracle complexity of $\widetilde{O}\left(\frac{d\beta^2{\mathcal{A}}^2}{\varepsilon^4}\right)$ for estimating $Z$ within $\varepsilon$ relative error with high probability, where $\beta$ is the smoothness of $V$ and $\mathcal{A}$ denotes the action of a curve of probability measures interpolating $\pi$ and a tractable reference distribution. Our analysis, leveraging Girsanov theorem and optimal transport, does not explicitly require isoperimetric assumptions on the target distribution. Finally, to tackle the large action of the widely used geometric interpolation, we propose a new algorithm based on reverse diffusion samplers, establish a framework for analyzing its complexity, and empirically demonstrate its efficiency in tackling multimodality.
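For reference, a compact annealed importance sampling estimator of $\log Z$ along the geometric path $\pi_t \propto p_0^{1-t}\,\mathrm{e}^{-tV}$, the object whose complexity is analyzed above; the single Metropolis transition per temperature is only illustrative:

```python
import numpy as np

def ais_log_z(log_p0, sample_p0, neg_v, n_chains=256, n_temps=100, step=0.1, seed=0):
    """Estimate log Z for the unnormalized density exp(-V(x)); p0 is a normalized,
    easy-to-sample reference. log_p0 and neg_v (= -V) map (n, d) arrays to length-n arrays."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    x = sample_p0(n_chains)                                # (n_chains, d) from p0
    log_w = np.zeros(n_chains)
    for b0, b1 in zip(betas[:-1], betas[1:]):
        log_w += (b1 - b0) * (neg_v(x) - log_p0(x))        # incremental AIS weights

        def log_pi(y, b=b1):                               # unnormalized log pi_b
            return (1.0 - b) * log_p0(y) + b * neg_v(y)

        prop = x + step * rng.standard_normal(x.shape)     # one Metropolis move
        accept = np.log(rng.random(n_chains)) < log_pi(prop) - log_pi(x)
        x[accept] = prop[accept]
    return float(np.logaddexp.reduce(log_w) - np.log(n_chains))   # log of mean weight
```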
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2211. RiskPO: Risk-based Policy Optimization with Verifiable Reward for LLM Post-Training
๐ฌ 4
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning with Verifiable Reward Risk-Sensitive RL
Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from entropy collapse and limited reasoning gains. We argue that these issues stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. To address these challenges, we propose Risk-based Policy Optimization (RiskPO), which substitutes classical mean-based objectives with principled risk measures. Specifically, we introduce a Mixed Value-at-Risk objective that integrates weighted attention over multiple regions of the reward distribution, thereby amplifying gradient signals on challenging instances and preventing overconfident convergence. We further design a bundling scheme that aggregates multiple questions into bundles, thus enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, we prove that the risk-averse update alleviates entropy collapse and promotes exploration. Numerically, RiskPO achieves consistent and significant improvements in mathematical reasoning, multi-modal reasoning, and code generation benchmarks, surpassing GRPO and its variants on both Pass@1 and Pass@k metrics. Our results demonstrate that risk-based optimization provides a rigorous and effective paradigm for enhancing LLM reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2212. Test-Time Accuracy-Cost Control in Neural Simulators via Recurrent-Depth
๐ฌ 15
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Simulator Recurrent Depth AI4Simulation
Accuracy-cost trade-offs are a fundamental aspect of scientific computing. Classical numerical methods inherently offer such a trade-off: increasing resolution, order, or precision typically yields more accurate solutions at higher computational cost. We introduce the \textbf{Recurrent-Depth Simulator} (\textbf{RecurrSim}), an architecture-agnostic framework that enables explicit test-time control over accuracy-cost trade-offs in neural simulators without requiring retraining or architectural redesign. By setting the number of recurrent iterations $K$, users can generate fast, less-accurate simulations for exploratory runs or real-time control loops, or increase $K$ for more-accurate simulations in critical applications or offline studies. We demonstrate RecurrSim's effectiveness across fluid dynamics benchmarks (Burgers, Korteweg-De Vries, Kuramoto-Sivashinsky), achieving physically faithful simulations over long horizons even in low-compute settings. On high-dimensional 3D compressible Navier-Stokes simulations with 262k points, a 0.8B parameter RecurrFNO outperforms 1.6B parameter baselines while using 13.5\% less training memory. RecurrSim consistently delivers superior accuracy-cost trade-offs compared to alternative adaptive-compute models, including Deep Equilibrium and diffusion-based approaches. We further validate broad architectural compatibility: RecurrViT reduces error accumulation by 90\% compared to standard Vision Transformers on Active Matter, while RecurrUPT matches UPT performance on ShapeNet-Car using 44\% fewer parameters.
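A minimal PyTorch sketch of the recurrent-depth mechanism itself: one weight-tied core block applied $K$ times between an encoder and a decoder, so $K$ can be dialed up or down at test time. The module structure is an assumption for illustration:

```python
import torch
import torch.nn as nn

class RecurrentDepthBlock(nn.Module):
    """One shared core applied K times; increasing K at inference buys accuracy
    at the cost of compute, with no retraining."""
    def __init__(self, dim=128):
        super().__init__()
        self.encode = nn.Linear(dim, dim)
        self.core = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.decode = nn.Linear(dim, dim)

    def forward(self, state, k=4):
        h = self.encode(state)
        for _ in range(k):            # test-time accuracy-cost knob
            h = h + self.core(h)      # residual recurrence on the latent
        return self.decode(h)

# same trained weights, different budgets:
# model = RecurrentDepthBlock(); fast = model(x, k=2); accurate = model(x, k=16)
```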
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2213. Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
๐ฌ 17
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning with Verifiable Rewards Group Relative Policy Optimization LLM Reasoning
This paper examines the exploration–exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: \textit{spurious rewards}, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and \textit{entropy minimization}, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2214. MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Multi-view generation Customization Personalization
Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models.
However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify.
Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization.
Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts.
To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity.
In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds.
Extensive experiments demonstrate that MVCustom achieves the most balanced and competitive performance across both multi-view consistency and customization fidelity, providing an effective solution to this multi-objective generation task.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2215. NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks
๐ฌ 18
โญ 5.50
๐ generative models
๐ท๏ธ 3D Computer Vision 3D Editing 3D Generation Flow Image Editing
3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose \textbf{Nano3D}, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing datasets \textbf{Nano3D-Edit-100k}, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2216. Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training
๐ฌ 34
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Scale Anchoring Zero-Shot Super-Resolution Spatiotemporal Forecasting Frequency Representation
Zero-Shot Super-Resolution Spatiotemporal Forecasting requires a deep learning model to be trained on low-resolution data and deployed for inference at high resolution. Existing studies consider **maintaining** similar error across different resolutions as indicative of successful multi-resolution generalization performance. However, deep learning models serving as alternatives to numerical solvers should **reduce** error as resolution increases. The fundamental limitation is that the upper bound of physical law frequencies that low-resolution data can represent is constrained by its Nyquist frequency, making it difficult for models to process signals containing unseen frequency components during high-resolution inference. *This results in errors being anchored at low resolution, incorrectly interpreted as successful generalization.* We define this fundamental phenomenon as a new problem distinct from existing issues: **Scale Anchoring**. Therefore, we propose architecture-agnostic Frequency Representation Learning. It alleviates Scale Anchoring through resolution-aligned frequency representations and spectral consistency training: on grids with higher Nyquist frequencies, the frequency response in high-frequency bands of FRL-enhanced variants is more stable. This allows errors to decrease with resolution, significantly outperforming baselines within our task and resolution range, while incurring only modest computational overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2217. The Pensieve Paradigm: Stateful Language Models with Learned Memory Management
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM memory management
In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory.
This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manipulate their own state. We equip our model with a suite of tools, such as dynamic indexing, context pruning, and note-taking, and train it to actively manage this loop. By learning to dynamically construct its own context, our model breaks free from the architectural prison of a fixed window. The results are prominent: our state-management approach decouples performance from context window size, delivering strong accuracy and sustainability under extremely long contexts with linear inference cost. We demonstrate this by showing StateLM reliably retrieves a "needle" from a 1-million-token haystack, a task far beyond the reach of conventional models. On practical document QA tasks from NovelQA and $\infty$Bench, StateLM outperforms strong instruct baselines while using only 1/4 of their active context. An ablation further shows that our curated training pipeline is more effective for learning memory management than agent-like prompting. Together, these results mark a shift from passive predictors to state-aware systems where reasoning becomes a stateful and manageable process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2218. FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents
๐ฌ 9
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Mobile Agent LLM Agent GUI Proactive Agent Personalization
Mobile GUI agents are becoming critical tools to improve user experience on smart devices, with multimodal large language models (MLLMs) emerging as the dominant paradigms in this domain. Current agents, however, rely on explicit human instructions, overlooking the potential to leverage the contextual information (like location, time, user profile) and historical data for proactive task suggestions. Besides, previous works focus on optimizing the success rate during task execution, but pay less attention to the personalized execution trajectory, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip 20K benchmark. We collected 20K unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users' long-term usage in their real lives, and encompass essential user-related contextual information. The benchmark contains two new tracks: proactive task suggestions by analyzing environment observation and users' previous intents, and personalized task execution by catering to users' action preferences. Our experiments reveal that the tracks we propose pose significant challenges for leveraging user-related information in GUI tasks. We also performed a human study to show that there exists a huge gap between existing agents and humans. The model fine-tuned with the data we collected effectively utilized user information and achieved good results, highlighting the potential of our approach in building more user-oriented mobile LLM agents. Our code is open-source at \url{https://anonymous.4open.science/r/FingerTip-57B8} for reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2219. Incomplete Data, Complete Dynamics: A Diffusion Approach
๐ฌ 20
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ diffusion models missing data
Learning physical dynamics from data is a fundamental challenge in machine learning and scientific modeling. Real-world observational data are inherently incomplete and irregularly sampled, posing significant challenges for existing data-driven approaches. In this work, we propose a principled diffusion-based framework for learning physical systems from incomplete training samples. To this end, our method strategically partitions each such sample into observed context and unobserved query components through a carefully designed splitting strategy, then trains a conditional diffusion model to reconstruct the missing query portions given available contexts. This formulation enables accurate imputation across arbitrary observation patterns without requiring complete data supervision. Specifically, we provide theoretical analysis demonstrating that our diffusion training paradigm on incomplete data achieves asymptotic convergence to the true complete generative process under mild regularity conditions. Empirically, we show that our method significantly outperforms existing baselines on synthetic and real-world physical dynamics benchmarks, including fluid flows and weather systems, with particularly strong performance in limited and irregular observation regimes. These results demonstrate the effectiveness of our theoretically principled approach for learning and imputing partially observed dynamics.
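A minimal sketch of the context/query partitioning step the abstract describes, assuming a 1-D boolean observation mask over a flattened sample; the split ratio and names are illustrative:

```python
import numpy as np

def split_context_query(obs_mask, query_frac=0.5, rng=None):
    """Partition the observed entries of an incomplete sample into a context set
    (conditioning input) and a query set (diffusion reconstruction target);
    missing entries belong to neither. obs_mask: 1-D boolean array."""
    rng = np.random.default_rng() if rng is None else rng
    obs_idx = rng.permutation(np.flatnonzero(obs_mask))
    n_query = int(query_frac * len(obs_idx))
    context_mask = np.zeros_like(obs_mask, dtype=bool)
    query_mask = np.zeros_like(obs_mask, dtype=bool)
    query_mask[obs_idx[:n_query]] = True
    context_mask[obs_idx[n_query:]] = True
    return context_mask, query_mask
```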
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2220. CLUE: Conflict-guided Localization for LLM Unlearning Framework
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM unlearning circuit discovery conjunctive normal form interpretability
LLM unlearning aims to eliminate the influence of undesirable data without affecting causally unrelated information.
This process typically involves using a **forget set** to remove target information, alongside a **retain set** to maintain non-target capabilities. While recent localization-based methods show promise in identifying important nodes (neurons) to be unlearned, they fail to disentangle nodes responsible for forgetting undesirable knowledge from those responsible for retaining essential skills, often treating them as a single entangled group. As a result, these methods apply uniform interventions, risking catastrophic over-forgetting or incomplete erasure of the target knowledge. To address this, we turn to circuit discovery, a mechanistic interpretability technique, and propose the **C**onflict-guided **L**ocalization for LLM **U**nlearning fram**E**work (**CLUE**). This framework identifies the forget and retain circuits composed of important nodes, then transforms these circuits into conjunctive normal form (CNF). The assignment of each node in the CNF satisfiability solution reveals whether it should be forgotten or retained. We then provide targeted fine-tuning strategies for the different categories of nodes. Extensive experiments demonstrate that, compared to existing localization methods, CLUE achieves superior forget efficacy and retain utility through precise neural localization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2221. Masked Skill Token Training for Hierarchical Off-Dynamics Transfer
๐ฌ 20
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Transfer Learning Skills Hierarchical RL Embodied AI
Generalizing policies across environments with altered dynamics remains a key challenge in reinforcement learning, particularly in offline settings where direct interaction or fine-tuning is impractical. We introduce Masked Skill Token Training (MSTT), a fully offline hierarchical RL framework that enables policy transfer using observation-only demonstrations. MSTT constructs a discrete skill space via unsupervised trajectory tokenization and trains a skill-conditioned value function using masked Bellman updates, which simulate dynamics shifts by selectively disabling skills. A diffusion-based trajectory generator, paired with feasibility-based filtering, enables the agent to execute valid, temporally extended actions without requiring action labels or access to the target environment. Our results in both discrete and continuous domains demonstrate the potential of mask-guided planning for robust generalization under dynamics shifts. To our knowledge, MSTT is the first work to explore masking as a mechanism for simulating and generalizing across off-dynamics environments. It marks a promising step toward scalable, structure-aware transfer and opens avenues to explore multi-goal conditioning, and extensions to more complex, real-world scenarios.
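The masked Bellman update can be pictured with a small sketch: the bootstrap maximum over next skills is restricted to skills that are not masked out, mimicking a dynamics shift in which some skills become infeasible. The shapes, masking rule, and function name below are illustrative assumptions rather than the paper's code.

```python
import torch

def masked_bellman_target(q_next, reward, skill_mask, gamma=0.99):
    """Toy masked Bellman backup for a skill-conditioned Q-function.

    q_next:     [batch, num_skills] next-state Q-values per skill.
    skill_mask: True for skills that are *disabled* in this update,
                simulating a dynamics shift where those skills become infeasible.
    The bootstrap max only ranges over the remaining skills."""
    masked_q = q_next.masked_fill(skill_mask, float("-inf"))
    return reward + gamma * masked_q.max(dim=-1).values

# Example: 2 states, 4 skills, skill 0 disabled everywhere.
q_next = torch.tensor([[1.0, 0.5, 0.2, 0.9], [2.0, 0.1, 0.3, 0.4]])
mask = torch.tensor([[True, False, False, False]] * 2)
print(masked_bellman_target(q_next, torch.tensor([0.0, 1.0]), mask))  # tensor([0.8910, 1.3960])
```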
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2222. QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autoregressive Quad Mesh Generation Reinforcement Learning Topology Optimization
The generation of quadrilateral-dominant meshes is a cornerstone of professional 3D content creation.
However, existing generative models produce quad meshes by first generating triangle meshes and then merging triangles into quadrilaterals according to specific rules, which typically yields quad meshes with poor topology.
In this paper, we introduce QuadGPT, the first autoregressive framework for generating quadrilateral meshes in an end-to-end manner.
QuadGPT formulates this as a sequence prediction paradigm, distinguished by two key innovations: a unified tokenization method to handle mixed topologies of triangles and quadrilaterals, and a specialized reinforcement learning fine-tuning method, tDPO, for better generation quality.
Extensive experiments demonstrate that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality.
Our work establishes a new benchmark for native quad-mesh generation and showcases the power of combining large-scale autoregressive models with topology-aware RL refinement for creating structured 3D assets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2223. SRT: Super-Resolution for Time Series via Disentangled Rectified Flow
๐ฌ 18
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Super-Resolution Rectified Flow Temporal Disentanglement Implicit Neural Representations
Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be tackled by reconstructing high-resolution signals from low-resolution inputs based on specific priors, known as super-resolution. While extensively studied in computer vision, directly transferring image super-resolution techniques to time series is not trivial. To address this challenge at a fundamental level, we propose **S**uper-**R**esolution for **T**ime series (SRT), a novel framework that reconstructs temporal patterns lost in low-resolution inputs via disentangled rectified flow. SRT decomposes the input into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and leverages a novel cross-resolution attention mechanism to guide the generation of high-resolution details. We further introduce SRT-large, a scaled-up version with extensive pretraining, which enables strong zero-shot super-resolution capability. Extensive experiments on nine public datasets demonstrate that SRT and SRT-large consistently outperform existing methods across multiple scale factors, showing both robust performance and the effectiveness of each component in our architecture.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2224. PredNext: Explicit Cross-View Temporal Prediction for Unsupervised Learning in Spiking Neural Networks
๐ฌ 22
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Network Brain inspired Neuromorphic computing Unsupervised learning
Spiking Neural Networks (SNNs), with their temporal processing capabilities and biologically plausible dynamics, offer a natural platform for unsupervised representation learning. However, current unsupervised SNNs predominantly employ shallow architectures or localized plasticity rules, limiting their ability to model long-range temporal dependencies and maintain temporal feature consistency. This results in semantically unstable representations, thereby impeding the development of deep unsupervised SNNs for large-scale temporal video data. We propose PredNext, which explicitly models temporal relationships through cross-view future Step Prediction and Clip Prediction. This plug-and-play module integrates seamlessly with diverse self-supervised objectives. We first establish standard benchmarks for SNN self-supervised learning on UCF101, HMDB51, and MiniKinetics, which are substantially larger than conventional DVS datasets. PredNext delivers significant performance improvements across different tasks and self-supervised methods, and achieves performance comparable to ImageNet-pretrained supervised weights through unsupervised training solely on UCF101. Additional experiments demonstrate that PredNext, distinct from forced consistency constraints, substantially improves temporal feature consistency while enhancing network generalization capabilities. This work provides an effective foundation for unsupervised deep SNNs on large-scale temporal video data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2225. Enhancing Agentic Search via Data Synthesis on Hierarchical Constraint Satisfaction
๐ฌ 17
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ data synthesis agentic search large language models
Deep research is becoming increasingly important as people seek to solve complex problems that require gathering and synthesizing information from diverse sources. A key capability in this process is agentic search, where an LLM agent iteratively retrieves relevant information across multiple sources while performing multi-step reasoning. However, developing effective agentic search systems is challenging due to the lack of high-quality training data that reflects the complexity of real-world research tasks.
To address this gap, we introduce InfoSeek, a novel data synthesis framework that conceptualizes agentic search as a Hierarchical Constraint Satisfaction Problem (HCSP), where solving a task requires satisfying layered constraints across multiple levels of sub-problems.
InfoSeek employs a Diffusion-Retrospection process: in the diffusion phase, the framework expands outward from a seed webpage, generating constraints that connect to neighboring pages and forming an exploration tree; in the retrospection phase, a subtree is sampled and backtracking constraints are introduced, which are then blurred and integrated into an HCSP instance.
As a generic framework, InfoSeek can be easily extended to other domains beyond web, facilitating ad-hoc optimization of deep research. To our knowledge, InfoSeek is the first publicly released framework in this area, complete with open-source code and well-curated datasets. Extensive experiments on diverse information-seeking benchmarks show that training on InfoSeek-generated data substantially improves agentic search performance, delivering significantly larger gains than traditional datasets across diverse model backends and training strategies, thereby validating the effectiveness of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2226. Semantic-Aware Diffusion LLM Inference With Adaptive Block Size
๐ฌ 19
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Large Language Models Non-Autoregressive Decoding
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, blockwise semi-autoregressive (semi-AR) approaches are widely adopted due to their natural support for KV caching and their favorable accuracy-speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed; and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.
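As a loose illustration of runtime block-size adaptation, the toy rule below ends the current block at the first masked position whose confidence falls into a volatile band; the thresholds and the band definition are assumptions and far simpler than the paper's VB analysis.

```python
import numpy as np

def adaptive_block_size(confidences, lo=0.25, hi=0.75, min_block=2, max_block=16):
    """Toy block-size rule for semi-AR diffusion-LM decoding.

    `confidences` holds the current denoising confidence of the masked
    positions following the committed prefix.  A position whose confidence
    falls inside the volatile band (lo, hi) is treated as a semantic
    boundary: the block ends just before the first such position.
    The thresholds are illustrative, not the paper's VB statistic."""
    block = max_block
    for i, c in enumerate(confidences[:max_block]):
        if lo < c < hi:          # volatile token -> likely semantic step boundary
            block = i
            break
    return int(np.clip(block, min_block, max_block))

# Example: a confident run followed by a volatile region -> a small block of 3 tokens.
print(adaptive_block_size([0.95, 0.91, 0.88, 0.42, 0.15, 0.9]))  # -> 3
```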
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2227. Next Visual Granularity Generation
๐ฌ 13
โญ 5.50
๐ generative models
๐ท๏ธ image generation
We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently outperforms the VAR series in terms of FID scores (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2228. HEEGNet: Hyperbolic Embeddings for EEG
๐ฌ 6
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ geometric deep learning transfer learning source-free adaptation electroencephalography neurology brain-computer interfaces
Electroencephalography (EEG)-based brain-computer interfaces facilitate direct communication with a computer, enabling promising applications in human-computer interactions. However, their utility is currently limited because EEG decoding often suffers from poor generalization due to distribution shifts across domains (e.g., subjects). Learning robust representations that capture underlying task-relevant information would mitigate these shifts and improve generalization. One promising approach is to exploit the underlying hierarchical structure in EEG, as recent studies suggest that hierarchical cognitive processes, such as visual processing, can be encoded in EEG. Yet, most existing decoding methods rely on Euclidean embeddings, which are not well-suited for capturing hierarchical structures.
In contrast, hyperbolic spaces, regarded as the continuous analogue of tree structures, provide a natural geometry for representing hierarchical data. In this study, we first demonstrate that EEG data exhibit hyperbolicity and show that hyperbolic embeddings improve generalization. Motivated by these findings, we propose HEEGNet, a hybrid hyperbolic network architecture to capture the hierarchical structure in EEG and learn domain-invariant hyperbolic embeddings. To this end, HEEGNet combines both Euclidean and hyperbolic encoders and employs a novel coarse-to-fine domain adaptation strategy. Extensive experiments on multiple public EEG datasets, covering visual evoked potentials, emotion recognition, and intracranial EEG, demonstrate that HEEGNet achieves state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2229. D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping
๐ฌ 16
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Real-to-Sim-to-Real; Differentiable Simulation; Learning Robotic Policies from Videos; System Identification;
Simulation provides a cost-effective and flexible platform for data generation and policy learning to develop robotic systems. However, bridging the gap between simulation and real-world dynamics remains a significant challenge, especially in physical parameter identification. In this work, we introduce a real-to-sim-to-real engine that leverages Gaussian Splat representations to build a differentiable engine, enabling object mass identification from real-world visual observations and robot control signals while simultaneously learning grasping policies. By optimizing the mass of the manipulated object, our method automatically builds high-fidelity and physically plausible digital twins. Additionally, we propose a novel approach to train force-aware grasping policies from limited data by transferring feasible human demonstrations into simulated robot demonstrations. Through comprehensive experiments, we demonstrate that our engine achieves accurate and robust mass identification across various object geometries and mass values. These optimized mass values facilitate force-aware policy learning, achieving high performance in object grasping and effectively reducing the sim-to-real gap. Our code is included in the Supplementary Material and will be open-sourced to facilitate reproducibility. An anonymous project page is available at https://robot-drex-engine.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2230. Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
๐ฌ 17
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ context attribution mechanistic interpretability RAG
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments (context attribution) remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient calculation, or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs at different scales demonstrate superior accuracy and significant computational efficiency improvements compared to previous baselines. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours.
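A minimal sketch of JSD-driven attribution is given below: each context sentence is scored by how much its removal shifts the model's response distribution. `response_dist` is a stand-in for a call into the underlying LLM, and the leave-one-out loop is an illustrative reading of the abstract rather than the exact ARC-JSD procedure.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_sentences(response_dist, context_sents):
    """Score each context sentence by how much its removal shifts the
    model's response-token distribution (higher JSD = more responsible).

    `response_dist(sents)` is a stand-in for a call that returns the model's
    averaged token distribution over the response given the context `sents`."""
    full = response_dist(context_sents)
    scores = []
    for i in range(len(context_sents)):
        ablated = context_sents[:i] + context_sents[i + 1:]
        scores.append(jsd(full, response_dist(ablated)))
    return scores
```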
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2231. DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
๐ฌ 14
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ VLA World model End-to-end autonomous driving
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence.
However, VLA models are limited by a ``supervision deficit'': the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized.
To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images.
This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment.
We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features.
Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment.
Extensive experiments on the NAVSIM benchmark and a 700x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines.
Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
Code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2232. BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
๐ฌ 19
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Models Flow Matching RLHF GRPO Efficient Training
Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, and suffer from unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising.
In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths.
BranchGRPO introduces three contributions:
(1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity;
(2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and
(3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected.
On HPSv2.1 image alignment, BranchGRPO improves alignment scores by up to \textbf{16\%} over DanceGRPO, while reducing per-iteration training time by nearly \textbf{55\%}.
A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7$\times$ faster than DanceGRPO without degrading alignment.
On WanX video generation, it further achieves higher motion quality reward with sharper and temporally consistent frames.
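One plausible reading of the reward fusion and depth-wise advantage estimation in contribution (2) is sketched below: terminal rewards are averaged up the branching rollout tree, and each node's advantage is taken relative to its siblings at the same depth. The tree layout and fusion rule are assumptions for illustration only.

```python
import numpy as np

class Node:
    def __init__(self, depth, children=None, reward=None):
        self.depth, self.children, self.reward = depth, children or [], reward
        self.value, self.advantage = None, None

def fuse_rewards(node):
    """Propagate terminal rewards up the branching rollout tree by averaging
    over descendants (one simple fusion rule; the paper's may differ)."""
    if not node.children:                  # leaf: terminal reward from the reward model
        node.value = node.reward
    else:
        node.value = float(np.mean([fuse_rewards(c) for c in node.children]))
    return node.value

def depthwise_advantages(node):
    """Group-relative advantage at every depth: each child is compared with
    the mean fused value of its siblings, yielding a dense step-level signal."""
    if node.children:
        vals = np.array([c.value for c in node.children])
        for c in node.children:
            c.advantage = float(c.value - vals.mean())
            depthwise_advantages(c)

# Tiny 2-level tree: two branches at depth 1, two leaves each.
root = Node(0, [Node(1, [Node(2, reward=1.0), Node(2, reward=0.2)]),
                Node(1, [Node(2, reward=0.1), Node(2, reward=0.3)])])
fuse_rewards(root)
depthwise_advantages(root)
print([round(c.advantage, 2) for c in root.children])  # [0.2, -0.2]
```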
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2233. Mitigating Mismatch within Reference-based Preference Optimization
๐ฌ 15
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ machine learning language models alignment preference optimization offline preference alignment
Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension.
DPO weighs each update relative to a reference, which stabilizes training by regularizing updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin ($\Delta_\theta$) merely beats the reference margin ($\Delta_{\mathrm{ref}}$), even if the policy is still wrong ($\Delta_{\theta}<0$). We name this failure premature satisfaction, a concrete form of the training-inference mismatch.
Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing signal of the reference. We resolve this tension with Hybrid-DPO (HyPO), a drop-in modification to DPO that applies the reference conditionally: HyPO behaves exactly like DPO when the reference is optimistic or neutral, and treats the reference as neutral when it is pessimistic by replacing $\Delta_\theta-\Delta_{\mathrm{ref}}$ with $\Delta_\theta-\max(0,\Delta_{\mathrm{ref}})$. This one-line change strictly strengthens per-example learning signals on pessimistic pairs while preserving DPO's objective form and computational cost. By conditionally debiasing the pessimistic reference signal, HyPO mitigates premature satisfaction; empirically, across preference alignment benchmarks, HyPO improves inference-aligned metrics and achieves higher pairwise win rates. Our results provide evidence that direct preference alignment can be enhanced by conditionally debiasing the reference signal rather than discarding it.
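Because the abstract spells out the margin substitution, the one-line change can be sketched directly; the sigmoid DPO loss form and log-probability inputs below are assumptions beyond what the abstract states.

```python
import torch
import torch.nn.functional as F

def hypo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sketch of the Hybrid-DPO objective as described in the abstract.

    delta_theta / delta_ref are log-probability margins of the chosen over the
    rejected response under the policy and the reference model.  Standard DPO
    uses (delta_theta - delta_ref); HyPO clamps a *pessimistic* reference
    margin (delta_ref < 0) to zero, so the reference is treated as neutral
    whenever it prefers the rejected response."""
    delta_theta = policy_chosen - policy_rejected
    delta_ref = ref_chosen - ref_rejected
    logits = delta_theta - torch.clamp(delta_ref, min=0.0)   # the one-line change
    return -F.logsigmoid(beta * logits).mean()
```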
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2234. Vision-Language-Action Instruction Tuning: From Understanding to Manipulation
๐ฌ 19
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ vision-language-action model vision-language model large language model
To operate effectively in the real world, robots should integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce **InstructVLA**, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance with the help of embodied reasoning. InstructVLA introduces a novel training paradigm, *Vision-Language-Action Instruction Tuning (VLA-IT)*, which employs multimodal training with mixture-of-experts adaptation to jointly optimize embodied reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 33.3% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 96% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2235. The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
๐ฌ 21
โญ 5.50
๐ generative models
๐ท๏ธ Motion Generation Generalizable
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228k, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2236. GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Encoder Multimodal Large Language Model Fine-Grain Perception
Vision encoders are indispensable for the impressive performance of Multimodal Large Language Models (MLLMs) in vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations and overlook fine-grained regional analysis. Their fine-grained perception is limited by the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 29 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework along with a self-distillation mechanism to train fine-grained GranViT on Gran-29M. We exploit the fine-grained annotations of Gran-29M through bounding-box-to-caption regression to enhance the localized visual representations of the vision encoder during pretraining, and through caption-to-bounding-box regression to improve vision feature utilization and localization for the LLM during adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and attains strong transferability to varying LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2237. Is On-Policy Data always the Best Choice for Direct Preference Optimization-Based LM Alignment?
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ DPO Preference Candidates On-policy Sampling
The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data, and was further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show that on-policy data is not always optimal, with systematic effectiveness differences emerging between static and on-policy preference candidates. For example, on-policy data can be $3\times$ as effective as static data for Llama-3, but only $0.4\times$ as effective for Zephyr. To explain this phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundaries between them. We perform experiments on $5$ models (Llama, Zephyr, Phi-2, Qwen, Pythia) and
$2$ alignment methods (DPO, SLiC-HF) to show the generalizability of the alignment stage assumption and the boundary measurement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2238. EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Subject-driven Image Generation; Autoregressive Generation
Subject-driven generation is a critical task in creative AI, yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds.
Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension.
To bridge this gap, we introduce EchoGen,
a pioneering framework that empowers VAR models with subject-driven generation capabilities.
The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity.
We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation.
To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2239. Cannistraci-Hebb Training on Ultra-Sparse Spiking Neural Networks
๐ฌ 17
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Sparse Spiking Neural Network Dynamic Sparse Training Pruning and Regrowth
Inspired by the brain's spike-based computation, spiking neural networks (SNNs) inherently possess temporal activation sparsity. However, when it comes to the sparse training of SNNs in the structural connection domain, existing methods fail to achieve ultra-sparse network structures without significant performance loss, thereby hindering progress in energy-efficient neuromorphic computing. This limitation presents a critical challenge: how to achieve high levels of structural connection sparsity while maintaining performance comparable to fully connected networks. To address this challenge, we propose the Cannistraci-Hebb Spiking Neural Network (CH-SNN), a novel and generalizable dynamic sparse training framework for SNNs consisting of four stages. First, we propose a sparse spike correlated topological initialization (SSCTI) method to initialize a sparse network based on node correlations. Second, temporal activation sparsity and structural connection sparsity are integrated via a proposed sparse spike weight initialization (SSWI) method. Third, a hybrid link removal score (LRS) is applied to prune redundant weights and inactive neurons, improving information flow. Finally, the CH3-L3 network automaton framework inspired by Cannistraci-Hebb learning theory is incorporated to perform link prediction for potential synaptic regrowth. These mechanisms enable CH-SNN to achieve sparsification across all linear layers. We have conducted extensive experiments on six datasets including CIFAR-10 and CIFAR-100, evaluating various network architectures such as spiking convolutional neural networks and Spikformer. The proposed method achieves a maximum sparsity of 97.75% and outperforms the fully connected (FC) network by 0.16% in accuracy. Furthermore, we apply CH-SNN within an SNN training algorithm deployed on an edge neuromorphic processor. The experimental results demonstrate that, compared to the FC baseline without CH-SNN, the sparse CH-SNN architecture achieves up to 98.84% sparsity, an accuracy improvement of 2.27%, and a 97.5$\times$ reduction in synaptic operations, and the energy consumption is reduced by an average of 55$\times$ across four datasets. To comply with double-blind review requirements, our code will be made publicly available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2240. Beyond Distributions: Geometric Action Control for Continuous Reinforcement Learning
๐ฌ 13
โญ 5.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning geometric control spherical normalization bounded action spaces continuous control action generation distribution-aware policy optimization
Gaussian policies have dominated continuous control in deep reinforcement learning (RL), yet they suffer from a fundamental mismatch: their unbounded support requires ad-hoc squashing functions that distort the geometry of bounded action spaces.
While von Mises-Fisher (vMF) distributions offer a theoretically grounded alternative on the sphere, their reliance on Bessel functions and rejection sampling hinders practical adoption.
We propose \textbf{Geometric Action Control (GAC)}, a novel action generation paradigm that preserves the geometric benefits of spherical distributions while \textit{simplifying computation}.
GAC decomposes action generation into a direction vector and a learnable concentration parameter, enabling efficient interpolation between deterministic actions and uniform spherical noise.
This design reduces parameter count from \(2d\) to \(d+1\), and avoids the \(O(dk)\) complexity of vMF rejection sampling, achieving simple \(O(d)\) operations.
Empirically, GAC consistently matches or exceeds state-of-the-art methods across six MuJoCo benchmarks, achieving 37.6\% improvement over SAC on Ant-v4 and up to 112\% on complex DMControl tasks, demonstrating strong performance across diverse benchmarks.
Our ablation studies reveal that both \textbf{spherical normalization} and \textbf{adaptive concentration control} are essential to GAC's success.
These findings suggest that robust and efficient continuous control does not require complex distributions, but a principled respect for the geometry of action spaces. Code and pretrained models are available in supplementary materials.
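A minimal sketch of the direction-plus-concentration idea is shown below, assuming one simple way to interpolate between a deterministic direction and uniform spherical noise; the mixing rule, `softplus` parameterization, and function names are illustrative guesses, not GAC's exact formulation.

```python
import torch
import torch.nn.functional as F

def gac_action(direction_logits, concentration, action_scale=1.0):
    """Illustrative spherical action sampler in the spirit of GAC.

    The policy outputs a d-dim direction and a scalar concentration kappa
    (d+1 parameters).  Uniform spherical noise is mixed with the normalized
    direction: large kappa -> nearly deterministic, kappa near 0 -> nearly
    uniform on the sphere.  The mixing rule is an assumption, not the
    paper's exact formula."""
    mu = F.normalize(direction_logits, dim=-1)               # mean direction on the sphere
    eps = F.normalize(torch.randn_like(mu), dim=-1)          # uniform spherical noise
    kappa = F.softplus(concentration)                        # positive concentration
    a = F.normalize(kappa * mu + eps, dim=-1)                # O(d) interpolation
    return action_scale * a

# Example: a 6-dimensional action with moderate concentration.
print(gac_action(torch.randn(6), torch.tensor(2.0)).shape)   # torch.Size([6])
```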
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2241. VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
๐ฌ 23
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multimodal Alignment Vision Language Model
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration that integrates two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. Our full implementation will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2242. LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Super-Resolution Linear Attention Training Stability
Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity ($O(N^2)$) creates a major computational bottleneck. Linear Attention offers an $O(N)$ solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental training instability that causes catastrophic model divergence using our novel ''knee point''-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our ''precision-over-volume'' principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2243. GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
๐ฌ 41
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Mathematical Reasoning Reinforcement Learning Large Language Models Decentralized Training Heterogeneous Computing
As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. To address this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show that GEPO achieves superior stability, with only a 3\% performance drop from online training to 1800s latency, and reduces the best-to-last gap by 85\% versus GSPO ($\Delta$=1.8 vs. 12.0) while attaining the highest scores, highlighting its effectiveness in decentralized, resource-heterogeneous environments.
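A rough sketch of group expectation weighting follows: the per-sample behavior probability in the importance ratio is replaced by its group-level expectation, which damps the weight variance under stale rollouts. This is one plausible reading of the abstract; the actual GEPO estimator may differ.

```python
import torch

def gepo_weights(logp_new, logp_old):
    """Illustrative group-expectation importance weights.

    Standard per-sample ratios exp(logp_new - logp_old) can blow up under the
    high policy lag of decentralized training.  Here the per-sample behavior
    probability in the denominator is replaced by its *group expectation*
    (mean over the rollout group), which damps the variance of the weights.
    This is a plausible reading of the abstract, not the paper's exact estimator."""
    group_expect_old = torch.logsumexp(logp_old, dim=-1) - torch.log(
        torch.tensor(float(logp_old.shape[-1])))
    return torch.exp(logp_new - group_expect_old)

# Example: a group of 4 responses with stale behavior log-probs.
logp_new = torch.tensor([-1.0, -2.0, -0.5, -3.0])
logp_old = torch.tensor([-1.2, -2.5, -0.4, -2.9])
print(gepo_weights(logp_new, logp_old))
```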
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2244. Data Provenance for Image Auto-Regressive Generation
๐ฌ 43
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ data provenance image autoregressive models
Image autoregressive models (IARs) have recently demonstrated remarkable capabilities in visual content generation, achieving photorealistic quality and rapid synthesis through the next-token prediction paradigm adapted from large language models. As these models become widely accessible, robust data provenance is required to reliably trace IAR-generated images to the source model that synthesized them. This is critical to prevent the spread of misinformation, detect fraud, and attribute harmful content. We find that although IAR-generated images often appear visually identical to real images, their generation process introduces characteristic patterns in the outputs, which serve as a reliable provenance signal for the generated images. Leveraging this, we present a post-hoc framework that enables robust detection of such patterns for provenance tracing. Notably, our framework does not require modifications to the generative process or its outputs. It is therefore applicable in contexts where prior watermarking methods cannot be used, such as for generated content that has already been published without additional marks and for models that do not integrate watermarking. We demonstrate the effectiveness of our approach across a wide range of IARs, highlighting its high potential for robust data provenance tracing in autoregressive image generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2245. InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
๐ฌ 26
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Scalable Vector Graphic Multimodal Large Language Models Dataset and Benchmark
General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2246. GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Evaluation Unified Multimodal Model Visual Generation
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks.
To this end, we introduce \textbf{GIR-Bench}, a comprehensive benchmark that evaluates unified models across three complementary perspectives.
Firstly, we explore whether models can consistently leverage the same knowledge for both understanding and generation (GIR-Bench-Uni).
Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I).
Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit).
For each subset, we carefully design task-specific evaluation pipelines. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm.
Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable at reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at
\url{https://anonymous.4open.science/r/GIR-Bench-7E40}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2247. DiCache: Let Diffusion Model Determine Its Own Cache
๐ฌ 22
โญ 5.50
๐ generative models
๐ท๏ธ diffusion model generative model inference acceleration
Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: _"When to cache"_ and _"How to use cache"_, typically relying on predefined empirical laws or dataset-level priors to determine caching timings and adopting handcrafted rules for multi-step cache utilization. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail to cope with diverse samples. In this paper, a strong sample-specific correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of deep-layer features. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present **DiCache**, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) _Online Probe Profiling Scheme_ leverages a shallow-layer online probe to obtain an on-the-fly indicator for the caching error in real time, enabling the model to dynamically customize the caching schedule for each sample. (2) _Dynamic Cache Trajectory Alignment_ adaptively approximates the deep-layer feature output from multi-step historical caches based on the shallow-layer feature trajectory, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved fidelity over state-of-the-art approaches on various leading diffusion models including WAN 2.1, HunyuanVideo and Flux.
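The probe-then-decide idea can be pictured with the toy step below: a shallow-layer probe measures how much the features have drifted since the last full computation, and the cached deep output is reused when the drift is small. `model_shallow`, `model_deep`, and the threshold are placeholder assumptions, and the sketch omits the trajectory-alignment component.

```python
import torch

def dicache_step(model_shallow, model_deep, x, t, cache, tau=0.05):
    """Toy probe-then-decide caching step (illustrative, not the released code).

    `model_shallow` runs only the first few blocks and acts as the probe;
    `model_deep` runs the remaining blocks.  If the shallow features moved
    little since the cached step, the cached deep output is reused and the
    expensive deep computation is skipped for this timestep."""
    probe = model_shallow(x, t)
    if cache is not None:
        rel_change = (probe - cache["probe"]).norm() / (cache["probe"].norm() + 1e-8)
        if rel_change < tau:                       # probe says the cache is still accurate
            return cache["deep"], cache
    deep = model_deep(probe, t)                    # full computation, refresh the cache
    return deep, {"probe": probe.detach(), "deep": deep.detach()}
```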
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2248. iFusion: Integrating Dynamic Interest Streams via Diffusion Model for Click-Through Rate Prediction
๐ฌ 14
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ User Behavior Modeling Diffusion Models Dynamic Interest Fusion
Click-through rate (CTR) prediction is crucial for recommendation systems and online advertising, relying heavily on effective user behavior modeling. While existing methods separately refine long-term and short-term interest representations, the fusion of these behaviors remains a critical yet understudied challenge due to misaligned feature spaces, disjointed modeling, and noise propagation in short-term interests. To address these limitations, we propose iFusion, a diffusion-based generative user interest fusion method, which reformulates interest fusion as a conditional generation process. iFusion leverages short-term interests as conditional guidance and progressively integrates long-term representations through denoising, eliminating reliance on linear fusion assumptions. Our framework introduces two key components: (1) the Disentangled Classifier-Free Diffusion Guidance (DCFG) Mechanism, which adaptively disentangles core preferences from transient fluctuations, and (2) the Mixture AutoRegressive Denoising Network (MARN), which enables joint interest modeling and fusion through autoregressive denoising. Experiments demonstrate that iFusion outperforms baselines across public and industrial datasets, as well as in online A/B tests, validating its effectiveness in robust CTR prediction. This work establishes a new paradigm for generative user interest fusion in CTR prediction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2249. Learning Hierarchical and Geometry-Aware Graph Representations for Text-to-CAD
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Text-to-CAD Graph Representations Large Language Models Curriculum Learning
Text-to-CAD code generation is a long-horizon task, requiring the translation of instructions into a long sequence of interdependent operations. This process is exceptionally fragile, as minor early errors can propagate through the sequence and ultimately invalidate an entire complex assembly. Existing methods typically decode instructions directly into executable code (e.g., bpy) without an explicit representation of assembly hierarchy or geometric constraints. This flat decoding strategy vastly expands the search space, amplifying local errors and leading to cascading failures in contextual operations. We address this gap by learning an intermediate representation: a hierarchical and geometry-aware graph. The graph represents an assembly-based decomposition, with multi-level nodes modeling the product's parts and components, and edges defining the explicit geometric constraints between them. Rather than mapping text directly to code, our graph paradigm first predicts high-level structure and constraints, then conditions the sequencing of operations and program generation, thereby narrowing the search space and improving both geometric fidelity and constraint satisfaction. Furthermore, we introduce a structure-aware progressive curriculum learning mechanism to enhance the model's ability to generate sophisticated decomposition graphs, allowing it to handle more complex assemblies. The mechanism constructs graded tasks via controlled edits to object structure, probes the model's capability boundary, and synthesizes boundary examples for subsequent training rounds. We also introduce a 12K-instruction dataset annotated with instructions, geometric decomposition graphs, action sequences, and bpy code, together with metrics for node- and hierarchy-level graph accuracy and a measure of constraint satisfaction. Extensive experiments show that our approach outperforms existing methods in terms of both geometric fidelity and accurate fulfillment of geometric constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2250. Heterogeneous Agent Q-weighted Policy Optimization
๐ฌ 28
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Multi-agent Reinforcement Learning Heterogeneous Agent Reinforcement Learning Diffusion Policy Policy Optimization
Multi-agent reinforcement learning (MARL) confronts a fundamental tension between stability and expressiveness. Stability requires avoiding divergence under non-stationary updates, while expressiveness demands capturing multimodal strategies for heterogeneous coordination. Existing methods sacrifice one for the other: value-decomposition and trust-region approaches ensure stability but assume restrictive unimodal policies, while expressive generative models lack optimization guarantees. To address this challenge, we introduce **H**eterogeneous **A**gent **Q**-weighted Policy **O**ptimization (**HAQO**), a framework unifying sequential advantage-aware updates, Q-weighted variational surrogates, and entropy regularization. Our analysis establishes monotone improvement guarantees under bounded critic bias, extending trust-region theory to diffusion-based policies with intractable log-likelihoods. HAQO achieves superior returns and reduced variance compared to policy-gradient baselines across diverse benchmarks. The ablation studies confirm sequential updates ensure stability, expressive policies enable multimodality, and entropy regularization prevents collapse. HAQO reconciles stability and expressiveness in MARL with theoretical rigor and practical effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2251. SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
๐ฌ 25
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision-Language Models Spatial Reasoning
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2252. QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Generation Model Quantization Attention Sparsification
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose **QuantSparse**, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce *Multi-Scale Salient Attention Distillation*, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop *Second-Order Sparse Attention Reparameterization*, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a **3.68$\times$** reduction in storage and **1.88$\times$** acceleration in end-to-end inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2253. Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ privacy llm adaptations auditing differential privacy
Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.
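As a reference point for the attack family the benchmark builds on, the following is a minimal loss-threshold membership-inference sketch; the paper uses stronger, robust attacks, and the Gaussian per-example losses below are synthetic placeholders.

```python
# Minimal sketch of a loss-based membership-inference audit: members of the
# adaptation set tend to have lower loss, so -loss serves as a membership score
# and the AUC of that score quantifies empirical leakage.
import numpy as np

def mia_auc(member_losses, nonmember_losses):
    scores = np.concatenate([-member_losses, -nonmember_losses])
    labels = np.concatenate([np.ones_like(member_losses), np.zeros_like(nonmember_losses)])
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)          # rank 1 = lowest score
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    # Mann-Whitney U statistic converted to an AUC
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
# synthetic losses: members slightly lower on average => AUC above 0.5 signals leakage
print(mia_auc(rng.normal(1.0, 0.5, 500), rng.normal(1.3, 0.5, 500)))
```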
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2254. Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Knowledge Graph Large Language Models Knowledge-enhanced reasoning reinforcement learning
The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks.
A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrain LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, this confines the reasoning patterns of LLMs to the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems.
To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs.
To incentivize exploration and discovery of novel reasoning paths, we propose to introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths' final answers.
To enhance the efficiency and meaningfulness of the exploration, we propose to incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts.
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but even closed-source LLMs.
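A minimal sketch of a path-refined reward in the spirit described above: the terminal signal is answer correctness, plus a small shaping bonus that favors short paths covering the question entities. The bonus weight and the coverage/brevity heuristic are assumptions for illustration, not the paper's exact reward.

```python
# Minimal sketch of a path-refined reward: answer correctness plus a small
# path-quality bonus that discourages long, futile explorations of the KG.
def path_refined_reward(predicted_answer, gold_answers, reasoning_path,
                        entity_coverage, path_bonus_weight=0.2):
    """reasoning_path: list of (head, relation, tail) triples explored on the KG.
    entity_coverage: fraction of question entities touched by the path (0..1)."""
    answer_reward = 1.0 if predicted_answer in gold_answers else 0.0
    brevity = 1.0 / (1.0 + len(reasoning_path))          # assumed shaping heuristic
    path_reward = path_bonus_weight * entity_coverage * brevity
    return answer_reward + path_reward

print(path_refined_reward("Paris", {"Paris"}, [("France", "capital", "Paris")], 1.0))
```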
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2255. CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Model Reasoning Video Object Segmentation
Reasoning Video Object Segmentation is a challenging task, aiming at generating a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail in video inputs given complex temporally-sensitive queries, indicating their lack of temporal and spatial integration in complex scenarios. In this paper, we propose **CoT-RVS**, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these complex challenges by **temporal-semantic reasoning**: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. Our framework's training-free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2256. Learning to Orchestrate Agents in Natural Language with the Conductor
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ RL reasoning LLM tool use prompting
Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation.
More broadly, ours is among the early works demonstrating that language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2257. Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Image-to-3D generation Textured 3D Morphing
Textured 3D morphing seeks to generate smooth and plausible transitions between two 3D assets, preserving both structural coherence and fine-grained appearance. This ability is crucial not only for advancing 3D generation research but also for practical applications in animation, editing, and digital content creation. Existing approaches either operate directly on geometry, limiting them to shape-only morphing while neglecting textures, or extend 2D interpolation strategies into 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges underscore the necessity to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition process.
To address this, we propose \textbf{Interp3D}, a novel training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence.
Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency via SLAT (Structure Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluation, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results in terms of fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our proposed approach over previous methods. Our code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2258. Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences
๐ฌ 30
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ peer review review dynamic aiml community
Submissions are rising fast, and venues use different rules, data formats, and update times. As a result, signals of progress get split across places, and key moments (rebuttal, discussion, final decision) are easy to miss, making analysis hard. We present Paper Copilot, a system and scalable peer-review archive that pulls data from official sites, OpenReview, and opt-in forms into a single, standardized, versioned record with timestamps. This lets us track trends over time and compare venues, institutions, and countries in a consistent way. Using the archive for ICLR 2024/2025, we see larger score changes after rebuttal for higher-tier papers, reviewer agreement that dips during active discussion and tightens by the end, and in 2025 a sharper, mean-score-driven assignment of tiers with lower decision uncertainty than expected at that scale. We also state simple rules for ethics: clear sourcing and consent, privacy protection, and limits on use for closed venues. Together, we provide a clear, reusable base for tracking AI/ML progress, and, with this data, enable validation, benchmarking, and otherwise hard-to-run studies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2259. Causal Discovery via Quantile Partial Effect
๐ฌ 19
โญ 5.50
๐ causal reasoning
๐ท๏ธ causality causal discovery causal order identifiability normalizing flow
Quantile Partial Effect (QPE) is a statistic associated with conditional quantile regression, measuring the effect of covariates at different levels. Our theory demonstrates that when the QPE of cause on effect is assumed to lie in a finite linear span, cause and effect are identifiable from their observational distribution. This generalizes previous identifiability results based on Functional Causal Models (FCMs) with additive, heteroscedastic noise, etc. Meanwhile, since QPE resides entirely at the observational level, this parametric assumption does not require considering mechanisms, noise, or even the Markov assumption, but rather directly utilizes the asymmetry of shape characteristics in the observational distribution. By performing basis function tests on the estimated QPE, causal directions can be distinguished, which is empirically shown to be effective in experiments on a large number of bivariate causal discovery datasets. For multivariate causal discovery, leveraging the close connection between QPE and score functions, we find that Fisher Information is sufficient as a statistical measure to determine causal order when assumptions are made about the second moment of QPE. We validate the feasibility of using Fisher Information to identify causal order on multiple synthetic and real-world multivariate causal discovery datasets.
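For illustration, a minimal sketch of estimating the QPE of X on Y with linear quantile regression, where the QPE at level tau reduces to the fitted slope; the toy heteroscedastic data and the linear model class are assumptions, not the paper's estimator or test.

```python
# Minimal sketch: estimate the Quantile Partial Effect of X on Y at several
# quantile levels via linear quantile regression (slope = QPE under linearity).
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 1.5 * x + (0.5 + 0.5 * np.abs(x)) * rng.normal(size=2000)  # toy heteroscedastic data

for tau in (0.1, 0.5, 0.9):
    qr = QuantileRegressor(quantile=tau, alpha=0.0, solver="highs").fit(x[:, None], y)
    print(f"QPE estimate at tau={tau}: slope = {qr.coef_[0]:.3f}")
```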
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2260. A foundation model with multi-variate parallel attention to generate neuronal activity
๐ฌ 23
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ time-series ieeg neurology foundation model attention transformer
Learning from multi-variate time-series with heterogeneous channel configurations remains a fundamental challenge for deep neural networks, particularly in clinical domains such as intracranial electroencephalography (iEEG), where channel setups vary widely across subjects. In this work, we introduce multi-variate parallel attention (MVPA), a novel self-attention mechanism that disentangles content, temporal, and spatial attention, enabling flexible, generalizable, and efficient modeling of time-series data with varying channel counts and configurations. We use MVPA to build MVPFormer, a generative foundation model for human electrophysiology, trained to predict the evolution of iEEG signals across diverse subjects. To support this and future efforts by the community, we release the Long-term iEEG dataset, the largest publicly available iEEG dataset to date, comprising nearly 10,000 hours of recordings from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong generalization across subjects, demonstrating expert-level performance in several iEEG tasks. MVPFormer surpasses state-of-the-art (SOTA) Transformer baselines in seizure detection across the Long-term, the MAYO, and the FNUSA datasets, while also achieving SOTA performance on four Brain TreeBank iEEG decoding tasks (volume, pitch, onset, and speech). We further validate MVPA on standard time-series forecasting and classification tasks, where it matches or exceeds the performance of existing attention-based models. Together, our contributions establish MVPA as a general-purpose attention mechanism for heterogeneous time-series and MVPFormer as the first open-source, open-weights, and open-data iEEG foundation model with SOTA clinical performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2261. UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels
๐ฌ 14
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ contrastive learning representation learning multimodal alignment
Contrastive objectives power state-of-the-art multimodal models, but their training remains slow, relying on long stochastic optimization.
We propose a Unified Framework for Efficient Contrastive Alignment via Kernels (UniCon), which spans linear and nonlinear encoders as well as one-to-one and many-to-many alignments.
At its core, UniCon introduces the contrastive similarity weight matrix $S(\gamma)$, which enables closed-form global solutions that provably replace minibatch back-propagation with exact updates. Through the lens of reproducing kernel Hilbert spaces (RKHS), UniCon provides a kernelized perspective that unifies contrastive alignment and reveals its connection to spectral methods.
To validate the theory, we conduct experiments on synthetic, unimodal, multimodal, and zero-shot tasks, demonstrating that UniCon achieves substantial efficiency gains while preserving generality and strong empirical performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2262. Panoptic Pairwise Distortion Graph
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ distortion analysis low-level vision iqa graphs
In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2263. Stop Guessing: Choosing the Optimization-Consistent Uncertainty Measurement for Evidential Deep Learning
๐ฌ 18
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ uncertainty estimation;
Evidential Deep Learning (EDL) has emerged as a promising framework for uncertainty estimation in classification tasks by modeling predictive uncertainty with a Dirichlet prior. Despite its empirical success, prior work has primarily focused on the probabilistic properties of the Dirichlet distribution, leaving the role of optimization dynamics during training underexplored. In this paper, we revisit EDL through the lens of optimization and establish a non-trivial connection: minimizing the expected cross-entropy loss over the Dirichlet prior implicitly encourages solutions akin to multi-class Support Vector Machines, maximizing decision margins. Motivated by this observation, we introduce the \emph{optimization-consistency principle}, which deems an uncertainty measure valid if its value decreases as samples approach the global optimum of the training objective. This principle provides a new criterion for evaluating and designing uncertainty measures that are consistent with the optimization dynamics. Building on this foundation, we further propose a novel measure, \emph{Margin-aware Predictive Uncertainty (MPU)}, which directly captures the separation between target and non-target evidence. Extensive experiments on out-of-distribution detection and classification-with-rejection benchmarks demonstrate the effectiveness of our propositions.
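A minimal sketch of a margin-style uncertainty read-out from Dirichlet evidence in the spirit of MPU: uncertainty grows as the gap between the strongest and runner-up evidence shrinks. The exact MPU definition in the paper may differ; the normalization below is an assumption.

```python
# Minimal sketch of a margin-aware uncertainty read-out from per-class evidence
# (illustrative assumption; not the paper's exact MPU formula).
import numpy as np

def margin_uncertainty(evidence):
    """evidence: non-negative per-class evidence vector (alpha = evidence + 1)."""
    e = np.sort(np.asarray(evidence, dtype=float))
    top, runner_up = e[-1], e[-2]
    margin = (top - runner_up) / (e.sum() + 1e-8)   # normalized evidence margin
    return 1.0 - margin                             # larger value = more uncertain

print(margin_uncertainty([9.0, 0.5, 0.5]))  # confident sample -> low uncertainty
print(margin_uncertainty([3.0, 2.8, 0.2]))  # ambiguous sample -> high uncertainty
```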
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2264. GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
๐ฌ 14
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ 3D Scene Understanding
Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale, annotated 3D data. We argue that this limitation stems from the dominant \textit{segmentation-and-matching} paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose \textbf{GeoPurify} that applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure the semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only \textbf{$\sim$1.5\%} of the training data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2265. Figma2Code: Automating Multimodal Design to Code in the Wild
๐ฌ 35
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Code Generation Design to Code
Front-end development constitutes a substantial portion of software engineering, yet converting design mockups into production-ready *User Interface* (UI) code remains tedious and time-costly.
While recent work has explored automating this process with *Multimodal Large Language Models* (MLLMs), existing approaches typically rely solely on design images. As a result, they must infer complex UI details from images alone, often leading to degraded results.
In real-world development workflows, however, design mockups are usually delivered as files from Figma, a widely used front-end design tool; these files embed rich multimodal information (e.g., metadata and assets) essential for generating high-quality UI.
To bridge this gap, we introduce Figma2Code, a new task that generalizes *design-to-code* into a multimodal setting and aims to automate *design-to-code* in the wild.
Specifically, we collect paired design images and their corresponding metadata files from the Figma community. We then apply a series of processing operations, including rule-based filtering, human and MLLM-based annotation and screening, and metadata refinement. This process yields 3,055 samples, from which designers curate a balanced dataset of 213 high-quality cases.
Using this dataset, we benchmark ten state-of-the-art open-source and proprietary MLLMs. Our results show that while proprietary models achieve superior visual fidelity, they remain limited in layout responsiveness and code maintainability.
Further experiments across modalities and ablation studies corroborate this limitation, partly due to models' tendency to directly map primitive visual attributes from Figma metadata.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2266. Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
๐ฌ 19
โญ 5.50
๐ reinforcement learning
๐ท๏ธ deep reinforcement learning q-learning actor-critic function approximation
The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best out of both worlds by introducing a novel update rule that computes the target using the **MIN**imum estimate between the **T**arget and **O**nline network, giving rise to our method, **MINTO**. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
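A minimal sketch of the MINTO-style bootstrap target for discrete-action Q-learning; taking the elementwise minimum over the two networks before maximizing over actions is one plausible instantiation, and the ordering of min and max here is an assumption.

```python
# Minimal sketch of a MINTO-style target: bootstrap with the minimum of the
# target-network and online-network estimates at the next state.
import numpy as np

def minto_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """next_q_online / next_q_target: arrays of Q(s', a') over actions."""
    bootstrap = np.minimum(next_q_online, next_q_target).max()  # min over networks, max over actions
    return reward + (0.0 if done else gamma * bootstrap)

print(minto_target(1.0, np.array([2.0, 3.5]), np.array([2.5, 3.0])))  # 1 + 0.99 * 3.0
```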
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2267. Healthcare Insurance Fraud Detection via Continual Fiedler Vector Graph Model
๐ฌ 16
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ online learning semi-supervised fraud detection
Healthcare insurance fraud detection presents unique machine learning challenges: labeled data are scarce due to delayed verification processes, and fraudulent behaviors evolve rapidly, often manifesting in complex, graph-structured interactions. Existing methods struggle in such settings. Pretraining routines typically overlook structural anomalies under limited supervision, while online models often fail to adapt to changing fraud patterns without labeled updates.
To address these issues, we propose the Continual Fiedler Vector Graph model (ConFVG), a fraud detection framework designed for label-scarce and non-stationary environments. The framework comprises two key components. To mitigate label scarcity, we develop a Fiedler Vector-guided graph autoencoder that leverages spectral graph properties to learn structure-aware node representations. The Fiedler Vector, the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian, captures global topological signals such as community boundaries and connectivity bottlenecks, which are patterns frequently associated with collusive fraud. This enables the model to identify structurally anomalous nodes without relying on labels. To handle evolving graph streams, we propose a Subgraph Attention Fusion (SAF) module that constructs neighborhood subgraphs and applies attention-based reweighting to emphasize emerging high-risk structures. This design allows the model to adapt to new fraud patterns in real time. A Mean Teacher mechanism further stabilizes online updates and prevents forgetting of previously acquired knowledge.
Experiments on real-world medical fraud datasets demonstrate that the Continual Fiedler Vector Graph model outperforms state-of-the-art baselines in both low-label and distribution-shift scenarios, offering a scalable and structure-sensitive solution for real-time fraud detection.
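For reference, computing the Fiedler vector itself is straightforward; the sketch below uses a dense eigendecomposition of the unnormalized graph Laplacian on a toy graph (how ConFVG injects this signal into the autoencoder is not reproduced here).

```python
# Minimal sketch of the Fiedler vector: the eigenvector of the graph Laplacian
# associated with its second-smallest eigenvalue, whose sign pattern exposes
# community boundaries.
import numpy as np

def fiedler_vector(adjacency):
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A              # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvecs[:, 1]                        # second-smallest eigenvalue's eigenvector

# two triangles joined by a single bridge edge (2-3): the sign of the Fiedler
# vector separates the two communities
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(np.sign(fiedler_vector(A)))
```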
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2268. One for Two: A Unified Framework for Imbalanced Graph Classification via Dynamic Balanced Prototype
๐ฌ 22
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph classification; graph imbalance learning; graph neural networks; Graph data mining; long-tail learning
Graph Neural Networks (GNNs) have advanced graph classification, yet they remain vulnerable to graph-level imbalance, encompassing class imbalance and topological imbalance. To address both types of imbalance in a unified manner, we propose UniImb, a novel framework for imbalanced graph classification. Specifically, UniImb first captures multi-scale topological features and enhances data diversity via learnable personalized graph perturbations. It then employs a dynamic balanced prototype module to learn representative prototypes from graph instances, improving the quality of graph representations. Concurrently, a prototype load-balancing optimization term mitigates dominance by majority samples to equalize sample influence during training. We justify these design choices theoretically using the Information Bottleneck principle. Extensive experiments on 19 datasets and 23 baselines demonstrate that UniImb achieves dominant performance across various imbalanced scenarios. Our code is available at Anonymous GitHub.
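A minimal sketch of a prototype load-balancing term of the kind described: penalize uneven average assignment of samples to prototypes. The entropy-based form below is a common choice and an assumption, not necessarily the exact term used by UniImb.

```python
# Minimal sketch of a load-balancing penalty over soft prototype assignments:
# it is zero when prototypes are used evenly and grows as a few prototypes
# absorb most samples.
import numpy as np

def load_balance_penalty(assign_probs):
    """assign_probs: (n_samples, n_prototypes) soft assignments, rows sum to 1."""
    usage = assign_probs.mean(axis=0)                     # average load per prototype
    entropy = -np.sum(usage * np.log(usage + 1e-8))
    return np.log(len(usage)) - entropy                   # 0 when perfectly balanced

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8), size=256)                   # roughly balanced toy assignments
print(load_balance_penalty(p))
```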
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2269. Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Verifiers Verification Digital Agents Web Agents GUI Agents Robotics Multimodal Large Language Models Test Time Scaling WebArena OSWorld
Verifiers, functions assigning rewards to agent behavior, have been key for AI progress in domains such as math, code and games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is nontrivial. Multimodal Large Language Models (MLLMs)
emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: a strong tendency to over-validate agent behavior, a phenomenon we call agreement bias. We show that agreement bias is pervasive across models, resilient to test-time scaling, and can impact existing methods relying on MLLMs as evaluators. We discuss metrics to measure and strategies to mitigate this bias, and introduce Self-Grounded Verification (SGV), a lightweight method that harnesses MLLMs' own sampling mechanisms by modulating (un)conditional generation to better leverage their knowledge, alignment, and reasoning. SGV operates in two steps: first, the MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. SGV improves verification across models, metrics, and benchmarks, leading to more human-aligned responses, with gains of up to 25 pp in failure identification, 14 pp in accuracy, and benefits extending to downstream applications. In self-refinement and online supervision, SGV boosts task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena, setting a new state of the art, surpassing the previous best by 20 pp. Finally, we release an updated version of (Visual)WebArena featuring more human-aligned evaluators, environment parallelization with improved execution fidelity, and runtime speedups of over 10x.
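A minimal sketch of the two-step SGV flow, with `llm` standing in for any chat-style MLLM call; the prompts and the text-only interface are hypothetical simplifications (a real multimodal call would attach trajectory screenshots).

```python
# Minimal sketch of Self-Grounded Verification: elicit priors unconditionally,
# then evaluate the candidate trajectory conditioned on those self-generated priors.
def self_grounded_verify(llm, task_description, trajectory_summary):
    # Step 1: broad priors about desired behavior, *without* showing the trajectory.
    priors = llm(
        f"Task: {task_description}\n"
        "Before seeing any attempt, list the conditions a successful trajectory must satisfy."
    )
    # Step 2: evaluate the candidate trajectory conditioned on the self-generated priors.
    verdict = llm(
        f"Task: {task_description}\n"
        f"Success conditions you stated earlier:\n{priors}\n"
        f"Candidate trajectory:\n{trajectory_summary}\n"
        "Reason over the conditions one by one, then answer SUCCESS or FAILURE."
    )
    return priors, verdict
```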
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2270. Communication-Efficient Decentralized Optimization via Double-Communication Symmetric ADMM
๐ฌ 19
โญ 5.50
๐ optimization
๐ท๏ธ Decentralized Optimization Symmetric ADMM Multi-Communication
This paper focuses on decentralized composite optimization over networks without a central coordinator. We propose a novel decentralized Symmetric ADMM algorithm that incorporates multiple communication rounds within each iteration, derived from a new constraint formulation that enables information exchange beyond immediate neighbors. While increasing per-iteration communication, our approach significantly reduces the total number of iterations and overall communication cost. We further design optimal communication rules that minimize the number of rounds and variables transmitted per iteration. The proposed algorithms are shown to achieve linear convergence under standard assumptions. Extensive experiments on regression and classification tasks validate the theoretical results and demonstrate superior performance compared to existing decentralized optimization methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2271. Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
๐ฌ 18
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous Driving Driving World Model Perception Tasks Synthetic Data
Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos.
Existing methods primarily focus on metrics related to generation quality and controllability.
However, they often overlook the evaluation of downstream perception tasks, which are crucial for the performance of autonomous driving.
Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only).
When we double the epochs in the baseline, the benefit of synthetic data becomes negligible.
To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks.
Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps.
Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models.
Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving.
To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing.
We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs.
The code and dataset will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2272. Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution
๐ฌ 23
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Real-World Video Super-Resolution One-Step Diffusion Improved Adversarial Diffusion Compression Diffusion Distillation
While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this by condensing generation into a single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved **ADC** method for Real-**VSR**. Our approach distills a large diffusion Transformer (DiT) teacher, DOVE, equipped with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone, augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed **AdcVSR** model reduces complexity by **95%** in parameters and achieves an **8$\times$** acceleration over its DiT teacher DOVE, while maintaining competitive video quality and efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2273. Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Hallucination Long-form Hallucination Large Language Models
Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form generation where redundant retrieved contexts and lengthy reasoning chains amplify factual errors. Recent studies highlight a critical phenomenon: the closer key information appears to the model outputs, the higher the factual accuracy. However, existing retrieval-augmented language models (RALMs) lack effective mechanisms to ensure this proximity: external evidence is injected into reasoning via multi-turn retrieval, but this cannot ensure key information stays close to the outputs. We propose Micro-Macro Retrieval ($M^2R$), a novel retrieve-while-generate framework to fill this gap. At the macro level, $M^2R$ retrieves coarse-grained evidence from external sources; at the micro level, it extracts essential results from a key information repository built during reasoning and reuses them while generating answers. This design directly addresses the key-information-to-output proximity bottleneck, effectively reducing hallucination in long-form tasks. $M^2R$ is trained with a curriculum learning-based reinforcement learning strategy using customized rule-based rewards, enabling stable acquisition of retrieval and grounding skills. Extensive experiments across different benchmarks demonstrate the effectiveness of $M^2R$, especially in lengthy-context settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2274. Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for One-/Two-step High-Fidelity Audio Generation
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Flow2GAN audio generation Flow Matching GAN multi-resolution
Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence and potential mode collapse during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, given audio's unique properties, we first improve Flow Matching for audio modeling through: 1) reformulating the objective as endpoint estimation, avoiding velocity estimation difficulties in empty regions; 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these Flow Matching adaptations, we demonstrate that a further stage of lightweight GAN fine-tuning enables us to obtain a one-step generator that produces high-quality audio. In addition, we develop a multi-branch network architecture that processes Fourier coefficients at different time-frequency resolutions, which improves the modeling capabilities compared to prior single-resolution designs. Experimental results indicate that our Flow2GAN delivers high-fidelity audio generation from Mel-spectrograms or discrete audio tokens, achieving better quality-efficiency trade-offs than existing state-of-the-art GAN-based and Flow Matching-based methods.
Online demo samples are available at \url{https://flow2gan.github.io}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2275. Stochastic Self-Organization in Multi-Agent Systems
๐ฌ 23
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ multi-agent systems contribution estimation
Large Language Models (LLMs) have enabled multi-agent systems (MAS) where agents collaborate to solve tasks beyond the reach of a single model. Yet most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or external LLM judges, thereby adding complexity. We introduce a response-conditioned framework that adapts communication on the fly. Agents independently generate answers and assess peer contributions using a Shapley value-inspired approximation. A directed acyclic graph (DAG) is then constructed to route information from high-contribution agents to others, ensuring stable and efficient message passing without the need for additional supervision or training. We provide a theoretical analysis showing that multiple agents increase the chance of correctness and that the correct answers naturally dominate information flow. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse.
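A minimal sketch of the routing step: given per-agent contribution scores (a stand-in for the Shapley value-inspired approximation), build a DAG that sends information from higher- to lower-contribution agents. The complete ordering used below is an assumption; a real system might keep only the top contributors as senders.

```python
# Minimal sketch of contribution-based DAG routing among agents.
import numpy as np

def build_routing_dag(contributions):
    """contributions: per-agent contribution scores. Returns directed edges
    (sender -> receiver) from higher- to lower-contribution agents, which is
    acyclic by construction."""
    order = np.argsort(contributions)[::-1]        # best contributors first
    edges = []
    for rank, sender in enumerate(order):
        for receiver in order[rank + 1:]:
            edges.append((int(sender), int(receiver)))
    return edges

print(build_routing_dag(np.array([0.2, 0.9, 0.5])))  # agent 1 routes to 2 and 0, agent 2 to 0
```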
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2276. FreeViS: Training-free Video Stylization with Inconsistent References
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Style Transfer Video Stylization
Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2277. ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Material Reconstruction Large Reconstruction Model Multi-View Illumination
Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present MIDR-3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination
from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object's structure and appearance, while the second path predicts the environment illumination from image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. Further, with our mixed-domain training protocol, combining synthetic PBR datasets with real-world RGB captures, we establish generalizable results across geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2278. T-TAMER: Provably Taming Trade-offs in ML Serving
๐ฌ 20
โญ 5.50
๐ optimization
๐ท๏ธ Cascaded Inference Early-Exit Models Multi-Model Serving Provable Optimality
As machine learning models continue to grow in size and complexity, efficient serving faces increasingly broad trade-offs spanning accuracy, latency, resource usage, and other objectives. Multi-model serving further complicates these trade-offs; for example, in cascaded models, each early-exit decision balances latency reduction against potential accuracy loss. Despite the pervasiveness and importance of such trade-offs, current strategies remain largely heuristic and case-specific, limiting both their theoretical guarantees and general applicability.
We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process, where the objective is to determine both when to exit and which model to consult. Our main result shows that recall (i.e., the ability to revisit earlier models) is both necessary and sufficient for achieving provable performance guarantees. In particular, we prove that strategies without recall cannot obtain any constant-factor approximation to the optimal trade-off, whereas recall-based strategies provably attain the optimal trade-off in polynomial time.
We validate our analysis through experiments on synthetic datasets and early-exit workloads for vision and NLP benchmarks. The results show that recall-based strategies consistently yield efficient accuracy-latency trade-offs. We hope this work provides a principled foundation for bridging heuristic practice with theoretical guarantees in the design of early-exit and cascaded models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2279. GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Mobile GUI Agent Reinforcement Learning Vision Language Model
Training effective Vision-Language Models (VLMs) for GUI agents typically depends on large-scale annotated datasets, whose collection is both labor-intensive and error-prone. We introduce K-step GUI Transition, a self-supervised inverse dynamics task in which VLMs learn GUI dynamics by predicting the initial action that causes a transition between two GUI states. This approach eliminates the need for natural language instructions and enables scalable dataset construction from existing GUI trajectories or automated exploration. Building on this task, we propose GUI-Shift, a reinforcement learning (RL) framework that combines rule-based optimization with data filtering to improve VLM performance. We conduct extensive experiments using multiple VLM backbones across four benchmarks, spanning GUI task automation (AndroidControl, GUI Odyssey) and GUI grounding (ScreenSpot-v2, ScreenSpot-Pro). Our results show that training on GUI-Shift generalizes well to both GUI automation and grounding tasks, yielding up to an 11.2% increase in GUI automation accuracy. This study underscores the potential of self-supervised RL to leverage unlabeled GUI trajectories and offers a scalable alternative to training with annotated samples.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2280. Near-Optimal Online Deployment and Routing for Streaming LLMs
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ online learning bandits LLM routing staged deployment streaming model arrivals regret bounds budget/capacity constraints
The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples *stage-wise deployment* (at fixed maintenance windows) with *per-query routing* among live models. We introduce *StageRoute*, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: *StageRoute* tracks a strong oracle under tight budgets across diverse workloads.
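A minimal sketch of the optimistic deployment step: rank models by a reward upper-confidence bound and admit them while a cost lower-confidence bound fits the budget. The confidence radius and the greedy admission rule are simplifications; the per-query routing subproblem is omitted.

```python
# Minimal sketch of optimistic model selection under a concurrency cap and cost budget.
import numpy as np

def select_models(mean_reward, mean_cost, counts, t, m_max, cost_budget):
    radius = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))  # assumed UCB radius
    reward_ucb = mean_reward + radius
    cost_lcb = np.maximum(mean_cost - radius, 0.0)
    chosen, spent = [], 0.0
    for i in np.argsort(-reward_ucb):              # most optimistic models first
        if len(chosen) < m_max and spent + cost_lcb[i] <= cost_budget:
            chosen.append(int(i))
            spent += cost_lcb[i]
    return chosen

print(select_models(np.array([0.6, 0.4, 0.7]), np.array([1.0, 0.2, 2.0]),
                    np.array([30, 5, 30]), t=100, m_max=2, cost_budget=2.5))
```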
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2281. From Natural Alignment to Conditional Controllability in Multimodal Dialogue
๐ฌ 15
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Multimodal dialogue dataset Multimodal conditional dialogue generation Spoken dialogue generation
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multi-modal dialogue.
While current methods impressively generate realistic dialogue in speech and vision modalities, challenges remain in multi-modal conditional dialogue generation.
This paper focuses on the natural alignment between speech, vision, and text, aiming at expressive dialogue generation through multi-modal conditional control.
Since existing datasets lack the richness and diversity in dialogue expressiveness, we introduce a novel multi-modal dialogue annotation pipeline to exploit meaningful dialogues from movies and TV series with fine-grained annotations across multi-modalities.
The resultant dataset, MM-Dia, provides over 360 hours and 54,700 dialogues, facilitating the Multimodal Dialogue Generation task through explicit control over style-controllable dialogue speech synthesis.
The proposed benchmark, MM-Dia-Bench, contains 309 highly expressive dialogues with visible dual- or single-speaker scenes, supporting the evaluation of implicit cross-modal control through downstream multi-modal dialogue generation tasks that assess audio-visual style consistency across modalities.
Our experiments demonstrate the effectiveness of our data in enhancing style controllability and reveal limitations in current frameworks' ability to replicate human interaction expressiveness, providing new insights and challenges for multi-modal conditional dialogue generation. Code, demo and data will be released at: https://mmdiaiclr26.github.io/mmdiaiclr26/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2282. Understanding the Implicit Biases of Design Choices for Time Series Foundation Models
๐ฌ 18
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ time series foundation models inductive bias frequency uncertainty geometry
Time series foundation models (TSFMs) are a potential class of powerful, general-purpose tools for forecasting and related temporal tasks, but their behavior is strongly shaped by subtle inductive biases in their design.
Rather than developing a new model and claiming that it is better than existing TSFMs, e.g., by winning on existing benchmarks, our objective is to understand how the various "knobs" of the training process affect model quality.
Using a mix of theory and controlled empirical evaluation, we identify and show how various design choices (e.g., patch size, embedding choice, training objective, etc.) lead to implicit biases in fundamental model properties (e.g., temporal behavior, geometric structure, how aggressively or not the model regresses to the mean, etc.), and how these biases can be intuitive or counterintuitive, depending on properties of the model and data.
We illustrate in a case study on outlier handling how multiple biases interact in complex ways.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2283. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autonomous Driving Vision-Language Models Reinforcement Learning
Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel **Re**inforced **Cog**nitive framework for end-to-end autonomous **Driv**ing, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. All code, model weights, and datasets will be made publicly available to facilitate subsequent research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2284. Pragma-VL: Towards a Pragmatic Arbitration of Safety and Helpfulness in MLLMs
๐ฌ 16
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ safety alignment MLLM VLM safety-helpfulness trade-off
Multimodal Large Language Models (MLLMs) pose critical safety challenges, as they are susceptible not only to adversarial attacks such as jailbreaking but also to inadvertently generating harmful content for benign users. While internal safety alignment via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is a primary mitigation strategy, current methods often face a safety-utility trade-off: they either refuse benign queries out of excessive caution or overlook latent risks in cross-modal interactions. To resolve this, we introduce Pragma-VL, an end-to-end alignment algorithm that enables MLLMs to pragmatically arbitrate between safety and helpfulness. First, we enhance visual risk perception with a novel cold-start SFT stage. This is achieved by applying risk-aware clustering to the visual encoder and using an interleaved dataset of risk descriptions and high-quality data. Second, we introduce a theoretically-guaranteed reward model that leverages synergistic learning. We train it with a novel data augmentation method that assigns dynamic weights based on the queries, enabling contextual arbitration between safety and helpfulness. Extensive experiments show that Pragma-VL effectively balances safety and helpfulness, outperforming baselines by 5% to 20% on most multimodal safety benchmarks while preserving its general capabilities in areas such as mathematics and knowledge reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2285. DeepEyesV2: Toward Agentic Multimodal Model
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ DeepEyesV2 Agentic Multimodal Model
Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We validate DeepEyesV2 across real-world understanding, mathematical reasoning, and search-intensive benchmarks, demonstrating that systematic tool integration enables reliable and extensible multimodal reasoning behaviour. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on problem context. We hope our study can provide guidance for the community in developing agentic multimodal models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2286. TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Video generation Diffusion model
Generating high-quality videos from complex temporal descriptions, which refer to prompts containing multiple sequential actions, remains a significant challenge. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt following capability. We attribute this problem to two primary causes: temporal misalignment between video content and the prompt, and conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and video demos are available in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2287. Epistemic Uncertainty Quantification To Improve Decisions From Black-Box Models
๐ฌ 14
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Epistemic uncertainty Excess risk Uncertainty quantification Aleatoric LLM Deferral Calibration
Distinguishing a model's lack of knowledge (epistemic uncertainty) from inherent task randomness (aleatoric uncertainty) is crucial for reliable AI. However, standard evaluation metrics of confidence scores target different aspects. AUC and accuracy capture predictive signal, proper scoring rules capture overall uncertainty, and calibration metrics isolate part of the epistemic uncertainty but ignore heterogeneity of the errors within bins, known as grouping loss. We close this evaluation gap by introducing asymptotically consistent and sample-efficient lower-bound estimators for the grouping loss and excess risk, i.e. suboptimality of a prediction. Our estimators complement existing calibration metrics to provide a more complete, fine-grained assessment of epistemic uncertainty. Applied to LLM question-answering with inherent aleatoric noise, our estimator reveals substantial grouping loss which decreases with model scale but is amplified by instruction tuning. The local nature of our estimators provides actionable insights: they automatically identify subgroups with systematic over- or under-confidence for interpretable audits. We also demonstrate that our estimator better reveals the need for post-training. Finally, we leverage our estimator to design efficient LLM cascades that defer to stronger models, achieving higher accuracy at a lower cost than competing approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2288. Delving into Spectral Clustering with Vision-Language Representations
๐ฌ 6
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Spectral Clustering Vision-Language Models Neural Tangent Kernel
Spectral clustering is known as a powerful technique in unsupervised data analysis.
The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped.
Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime.
Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models.
By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap.
We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures.
In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts.
Extensive experiments on \textbf{16} benchmarks---including classical, large-scale, fine-grained and domain-shifted datasets---demonstrate that our method consistently outperforms the state-of-the-art by a large margin.
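A minimal sketch of the affinity idea, under simplifying assumptions: it replaces the paper's NTK-anchored kernel with raw CLIP-style image and "positive noun" embeddings, couples visual proximity with semantic overlap, and hands the result to an off-the-shelf spectral clustering routine. Function names, the softmax-based noun assignment, and the elementwise coupling are illustrative choices, not the paper's exact formulation.

```python
# Hypothetical sketch: couple visual proximity with text-anchored semantic overlap
# into one affinity matrix, then run spectral clustering on it.
import numpy as np
from sklearn.cluster import SpectralClustering

def coupled_affinity(img_feats, txt_feats):
    """img_feats: (N, D) image embeddings; txt_feats: (M, D) embeddings of
    'positive noun' prompts; both assumed L2-normalized (CLIP-style)."""
    visual = img_feats @ img_feats.T                              # cosine similarity between images
    sem = img_feats @ txt_feats.T                                 # image-to-noun similarities
    sem = np.exp(sem) / np.exp(sem).sum(axis=1, keepdims=True)    # soft assignment over nouns
    semantic = sem @ sem.T                                        # overlap of noun distributions
    aff = np.clip(visual, 0, None) * semantic                     # couple the two signals
    return (aff + aff.T) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(100, 512)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(20, 512)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
labels = SpectralClustering(n_clusters=5, affinity="precomputed",
                            random_state=0).fit_predict(coupled_affinity(img, txt))
print(labels[:10])
```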
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2289. Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection
๐ฌ 38
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Large Language Models Hallucination Detection Cross-Domain Generalization Multi-Turn Continuation
Hallucination detection is critical for deploying large language models (LLMs) in real-world applications. Existing hallucination detection methods achieve strong performance when the training and test data come from the same domain, but they suffer from poor cross-domain generalization. In this paper, we study an important yet overlooked problem, termed generalizable hallucination detection (GHD), which aims to train hallucination detectors on data from a single domain while ensuring robust performance across diverse related domains. In studying GHD, we simulate multi-turn dialogues following the LLM's initial response and observe an interesting phenomenon: hallucination-initiated multi-turn dialogues universally exhibit larger uncertainty fluctuations than factual ones across different domains. Based on this phenomenon, we propose a new score, SpikeScore, which quantifies abrupt fluctuations in multi-turn dialogues. Through both theoretical analysis and empirical validation, we demonstrate that SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments across multiple LLMs and benchmarks demonstrate that the SpikeScore-based detection method outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods, verifying the effectiveness of our method in cross-domain hallucination detection.
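A minimal sketch of the spike idea, assuming one plausible instantiation: track a scalar uncertainty per simulated dialogue turn and score the largest turn-to-turn jump. The exact SpikeScore definition in the paper may differ; the per-turn uncertainty measure and the max-of-differences aggregation below are assumptions.

```python
# Hypothetical sketch of a spike-style score: quantify abrupt fluctuations in
# per-turn uncertainty over a simulated multi-turn continuation of a response.
import numpy as np

def spike_score(turn_uncertainties):
    """turn_uncertainties: scalar uncertainties (e.g., mean token entropy)
    measured at each simulated dialogue turn."""
    u = np.asarray(turn_uncertainties, dtype=float)
    jumps = np.abs(np.diff(u))          # turn-to-turn fluctuation magnitudes
    return jumps.max() if len(jumps) else 0.0

factual      = [0.40, 0.42, 0.39, 0.41]     # stable uncertainty across turns
hallucinated = [0.45, 0.90, 0.35, 0.80]     # abrupt spikes across turns
print(spike_score(factual), spike_score(hallucinated))   # low vs. high
```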
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2290. Trace Anything: Representing Any Video in 4D via Trajectory Fields
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Representation 4D Scene Representation
Building 4D video representations to model underlying spacetime constitutes a crucial step toward understanding dynamic scenes, yet there is no consensus on the paradigm: current approaches resort to additional estimators such as depth, flow, or tracking, or to heavy per-scene optimization, making them brittle and hard to generalize. In a video, its atomic unit, the pixel, follows a continuous 3D trajectory that unfolds over time, acting as the atomic primitive of dynamics. Recognizing this, we propose to represent any video as a Trajectory Field: a dense mapping that assigns each pixel in each frame to a parametric 3D trajectory. To this end, we introduce Trace Anything, a neural network that predicts the trajectory field in a feed-forward manner. Specifically, for each video frame, the model outputs a series of control point maps, defining parametric trajectories for each pixel. Together, our representation and model directly construct a 4D video representation in a single forward pass, without additional estimators or global alignment. We develop a synthetic data platform to construct a training dataset and a benchmark for trajectory field estimation. Experiments show that Trace Anything surpasses existing methods or performs competitively on the new benchmark and established point tracking benchmarks, with significant efficiency gains. Moreover, it facilitates downstream applications such as goal-conditioned manipulation, simple motion extrapolation, and spatio-temporal fusion. We will release the code, the model weights, and the data platform.
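A minimal sketch of decoding a trajectory field from per-pixel control-point maps, under an assumed parameterization: the abstract only says trajectories are defined by control points, so the Bezier/Bernstein blend below is an illustrative assumption, as are the array shapes.

```python
# Hypothetical sketch: each pixel has K 3D control points, and its position at
# time t is a Bezier-style blend of those points (the paper's exact form may differ).
import numpy as np
from math import comb

def eval_trajectory_field(control_maps, t):
    """control_maps: (K, H, W, 3) control-point maps for one frame, t in [0, 1].
    Returns (H, W, 3) 3D positions of every pixel at time t."""
    K = control_maps.shape[0]
    weights = np.array([comb(K - 1, k) * (1 - t) ** (K - 1 - k) * t ** k
                        for k in range(K)])                 # Bernstein basis
    return np.tensordot(weights, control_maps, axes=1)      # blend control points

ctrl = np.random.rand(4, 64, 64, 3)           # K=4 control points per pixel (toy data)
positions = [eval_trajectory_field(ctrl, t) for t in np.linspace(0, 1, 8)]
print(positions[0].shape)                     # (64, 64, 3)
```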
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2291. Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Medical Image Analysis Medical VQA Medical VLM
Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2292. Constraint Matters: Multi-Modal Representation for Reducing Mixed-Integer Linear Programming
๐ฌ 19
โญ 5.50
๐ optimization
๐ท๏ธ Mixed-integer Linear Programming Learning to Optimize Model Reduction
Model reduction, which aims to learn a simpler model of the original mixed-integer linear program (MILP), can solve large-scale MILP problems much faster. Most existing model reduction methods are based on variable reduction, which predicts a solution value for a subset of variables. From a dual perspective, constraint reduction that transforms a subset of inequality constraints into equalities can also reduce the complexity of MILP, but has been largely ignored. Therefore, this paper proposes a novel constraint-based model reduction approach for MILPs. Constraint-based MILP reduction has two challenges: 1) which inequality constraints are critical such that reducing them can accelerate MILP solving while preserving feasibility, and 2) how to predict these critical constraints efficiently. To identify critical constraints, we label the tight-constraints at the optimal solution as potential critical constraints and design an information theory-guided heuristic rule to select a subset of critical tight-constraints. Theoretical analyses indicate that our heuristic mechanism effectively identifies the constraints most instrumental in reducing the solution space and uncertainty. To learn the critical tight-constraints, we propose a multi-modal representation that integrates information from both instance-level and abstract-level MILP formulations. The experimental results show that, compared to the state-of-the-art MILP solvers, our method improves the quality of the solution by over 50\% and reduces the computation time by 17.47\%.
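A minimal sketch of the tight-constraint labeling step on a toy LP, under stated assumptions: the paper learns to predict which constraints are tight and works with MILPs, whereas the example below simply solves a small LP, reads off the constraints that are active at the optimum, and re-solves with those rows fixed as equalities.

```python
# Hypothetical sketch of constraint-based reduction: solve a small LP, label the
# inequality constraints that are tight at the optimum, and fix them as equalities.
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])
A_ub = np.array([[1.0, 1.0], [1.0, 3.0], [3.0, 1.0]])
b_ub = np.array([4.0, 6.0, 9.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
slack = b_ub - A_ub @ res.x
tight = slack < 1e-6                     # constraints active at the optimum
print("tight constraints:", np.where(tight)[0])

# Reduced problem: tight rows become equalities, the rest stay as inequalities.
res_red = linprog(c, A_ub=A_ub[~tight], b_ub=b_ub[~tight],
                  A_eq=A_ub[tight], b_eq=b_ub[tight],
                  bounds=[(0, None), (0, None)])
print("original objective:", res.fun, "reduced objective:", res_red.fun)
```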
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2293. Detective SAM: Adaptive AI-Image Forgery Localization
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Forgery Localization Diffusion Models Dataset Creation Generative Models
Image forgery localization in the generative AI era poses new challenges, as modern editing pipelines produce photorealistic, semantically coherent manipulations that evade conventional detectors while model capabilities evolve rapidly.
In response, we develop Detective SAM, a framework built on SAM2, a foundation model for image segmentation, that integrates perturbation-driven forensic clues with lightweight feature adapters and a mask adapter to convert forensic clues into forgery masks via automatic prompting.
Moreover, to keep up with the rapidly evolving capabilities of diffusion models, we introduce AutoEditForge: an automated diffusion edit generation pipeline spanning four edit types. This supplies high-quality data to maintain localization accuracy under newly released editors and enables continual fine-tuning for Detective SAM.
Across seven benchmark datasets and seven baselines, Detective SAM delivers stable out-of-distribution performance, averaging 36.99 IoU / 44.19 F1, a 33.67% relative IoU gain over the best baseline. Further, we show that state-of-the-art edits cause localization systems to collapse.
With 500 AutoEditForge samples, Detective SAM quickly adapts and restores performance, enabling practical, low-friction updates as editing models improve.
AutoEditForge, Detective SAM's pretrained weights and training script are available at the anonymized repository: https://anonymous.4open.science/r/Detective-SAM-9057/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2294. Equilibrium Language Models
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Model Compression On-Device Inference Fixed-Point Network
Large Language Models (LLMs) excel across diverse applications but remain impractical for edge deployment due to severe memory bottlenecks on edge devices. We propose Equilibrium Language Models (ELMs), a novel compression framework that replaces groups of Transformer layers with a lightweight fixed-point network, reinterpreting deep computation as solving for an equilibrium state. To achieve ELMs, we introduce *Group Pruning Policy Optimization*, which automatically learns optimal pruning intervals. Moreover, we propose *One-Step KV-Cache*, which drastically reduces memory overhead by storing only the final iteration cache without compromising accuracy, to enable effective deployment on edge devices. Across different tasks such as common sense reasoning, mathematical problem solving, and code generation, ELMs prune 28\% of parameters while retaining 99\% of the accuracy of dense fine-tuned LLMs, establishing a new direction for memory-efficient edge deployment of large models.
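A minimal sketch of the fixed-point idea, assuming a deep-equilibrium-style block: a group of layers is replaced by one lightweight function iterated until its output stops changing. The block design, solver, stopping rule, and KV-cache handling below are assumptions, not the paper's specification.

```python
# Hypothetical sketch: a lightweight block iterated to an equilibrium state z* = f(z*, x),
# standing in for a pruned group of Transformer layers.
import torch
import torch.nn as nn

class FixedPointBlock(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, hidden), nn.GELU(),
                               nn.Linear(hidden, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, max_iter=30, tol=1e-4):
        z = torch.zeros_like(x)
        for _ in range(max_iter):
            z_new = self.norm(self.f(torch.cat([z, x], dim=-1)) + x)  # one fixed-point update
            if (z_new - z).norm() < tol * (z.norm() + 1e-6):
                return z_new                                          # converged
            z = z_new
        return z

block = FixedPointBlock(dim=64, hidden=256)
x = torch.randn(2, 10, 64)          # (batch, tokens, dim) hidden states
print(block(x).shape)               # torch.Size([2, 10, 64])
```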
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2295. Training-Free Determination of Network Width via Neural Tangent Kernel
๐ฌ 16
โญ 5.50
๐ learning theory
๐ท๏ธ neural tangent kernel kernel regression smallest eigenvalue generalization error
Determining an appropriate size for an artificial neural network under computational constraints is a fundamental challenge. This paper introduces a practical metric, derived from Neural Tangent Kernel (NTK), for estimating the minimum necessary network width with respect to test loss -- prior to training. We provide both theoretical and empirical evidence that the smallest eigenvalue of the NTK strongly influences test loss in wide but finite-width neural networks. Based on this observation, we define an NTK-based metric computed at initialization to identify what we call cardinal width, i.e., the width of a network at which generalization performance saturates. Our experiments across multiple datasets and architectures demonstrate the effectiveness of this metric in estimating the cardinal width.
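A minimal sketch of the quantity at the heart of the paper, computed the brute-force way: form the empirical NTK Gram matrix of a freshly initialized two-layer network and read off its smallest eigenvalue, repeating across widths. The paper's precise metric and its rule for declaring the "cardinal width" are not reproduced here; this only shows how the smallest NTK eigenvalue can be measured before training.

```python
# Hypothetical sketch: smallest eigenvalue of the empirical NTK at initialization,
# evaluated for several widths of a toy two-layer network.
import torch
import torch.nn as nn

def ntk_min_eig(width, X):
    net = nn.Sequential(nn.Linear(X.shape[1], width), nn.ReLU(), nn.Linear(width, 1))
    grads = []
    for i in range(X.shape[0]):                       # per-sample gradient of the output
        out = net(X[i : i + 1]).squeeze()
        g = torch.autograd.grad(out, net.parameters())
        grads.append(torch.cat([p.reshape(-1) for p in g]))
    J = torch.stack(grads)                            # (n, n_params) Jacobian at init
    K = J @ J.T                                       # empirical NTK Gram matrix
    return torch.linalg.eigvalsh(K).min().item()

torch.manual_seed(0)
X = torch.randn(32, 16)
for w in [8, 32, 128, 512]:
    print(w, ntk_min_eig(w, X))                       # smallest NTK eigenvalue per width
```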
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2296. Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning
๐ฌ 18
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Meta Reinforcement Learning Skill-based Reinforcement Learning Hierarchical Noisy Demonstration Skill Refinement Generalization
Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2297. Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Post-training Reinforcement Learning Alignment
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish excellent responses from merely great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2298. Neural Networks Learn Multi-Index Models Near the Information-Theoretic Limit
๐ฌ 12
โญ 5.50
๐ learning theory
๐ท๏ธ Representation Learning Multi-Index Models Two-Layer Network Gradient Descent Sample Complexity
In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing that the inner weights can perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise and recovers this span. It surprisingly indicates that optimal results can only be achieved if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to effectively learn hierarchical functions with respect to both sample and time efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2299. Unleashing Perception-Time Scaling to Multimodal Reasoning Models
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Vision-Language Models Inference-Time Scaling Fine-grained Perception
Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2300. All That Glitters Is Not Gold: Key-Secured 3D Secrets within 3D Gaussian Splatting
๐ฌ 15
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Gaussian Splatting 3D Steganography
Recent advances in 3D Gaussian Splatting (3DGS) have revolutionized scene reconstruction, opening new possibilities for 3D steganography by hiding 3D secrets within 3D covers. The key challenge in steganography is ensuring imperceptibility while maintaining high-fidelity reconstruction. However, existing methods often suffer from detectability risks and utilize only suboptimal 3DGS attributes, limiting their full potential. We propose a novel end-to-end key-secured 3D steganography framework (KeySS) that jointly optimizes a 3DGS model and a key-secured decoder for secret reconstruction. Our approach reveals that Gaussian attributes contribute unequally to secret hiding. The framework incorporates a key-controllable mechanism enabling multi-secret hiding and unauthorized access prevention, while systematically exploring optimal attribute update to balance fidelity and security. To rigorously evaluate steganographic imperceptibility beyond conventional 2D metrics, we introduce 3D-Sinkhorn distance analysis, which quantifies distributional differences between original and steganographic Gaussian parameters in the representation space. Extensive experiments show that our method achieves state-of-the-art performance in 3D reconstruction while ensuring high levels of steganographic security. The framework is highly efficient and readily extensible to multi-GPU training. Our code will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2301. Plan then Act: Bi-level CAD Command Sequence Generation
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ CAD Command Sequence Generation; LLMs
Computer-Aided Design (CAD), renowned for its flexibility and precision, serves as the foundation of digital design. Recently, some efforts adopt Large Language Models (LLMs) for generating parametric CAD command sequences from text instructions. However, our study reveals that LLMs pre-trained on large-scale general data are not proficient at directly outputting task-specific CAD sequences. Instead of relying on direct generation, we introduce a Plan then Act process where user instructions are first parsed into a chain-like operational plan via an LLM, which is then used to generate accurate command sequences. Specifically, we propose PTA, a new bi-level CAD command sequence generation method. The PTA consists of two critical stages: high-level plan generation and low-level command generation. During the high-level stage, an LLM-based Planner completes the planning process, parsing user instructions into a high-level operation plan. Following this, at the low-level generation stage, we introduce an Actioner equipped with a requirement-aware mechanism to extract design requirements (e.g., dimensions, geometric relationships) from user instructions. This extracted information is used to guide the low-level command sequence generation, improving the alignment of the generated sequences with user requirements. Experimental results demonstrate that our PTA outperforms existing methods in both quantitative and qualitative evaluations. Our source code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2302. Measuring the Intrinsic Dimension of Earth Representations
๐ฌ 13
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Intrinsic Dimension Implicit Neural Representations Location Encoding Positional Encoding Earth Observation
Within the context of representation learning for Earth observation, geographic Implicit Neural Representations (INRs) embed low-dimensional location inputs (longitude, latitude) into high-dimensional embeddings, through models trained on geo-referenced satellite, image or text data. Despite the common aim of geographic INRs to distill Earth's data into compact, learning-friendly representations, we lack an understanding of how much information is contained in these Earth representations, and where that information is concentrated. The intrinsic dimension of a dataset measures the number of degrees of freedom required to capture its local variability, regardless of the ambient high-dimensional space in which it is embedded. This work provides the first study of the intrinsic dimensionality of geographic INRs. Analyzing INRs with ambient dimension between 256 and 512, we find that their intrinsic dimensions fall roughly between 2 and 10 and are sensitive to changing spatial resolution and input modalities during INR pre-training. Furthermore, we show that the intrinsic dimension of a geographic INR correlates with downstream task performance and can capture spatial artifacts, facilitating model evaluation and diagnostics. More broadly, our work offers an architecture-agnostic, label-free metric of information content that can enable unsupervised evaluation, model selection, and pre-training design across INRs.
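A minimal sketch of one standard way to estimate intrinsic dimension from embeddings, using a TwoNN-style maximum-likelihood estimator based on the ratio of distances to the two nearest neighbors. This is a common estimator; the paper may use a different one, and the synthetic data below are purely illustrative.

```python
# Hypothetical sketch: TwoNN-style intrinsic-dimension estimate for embeddings that
# live on a low-dimensional manifold inside a high-dimensional ambient space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(embeddings):
    """embeddings: (N, D) array, e.g., geographic INR outputs at sampled locations."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(embeddings).kneighbors(embeddings)
    r1, r2 = dists[:, 1], dists[:, 2]        # distances to 1st and 2nd neighbors
    mu = r2 / np.maximum(r1, 1e-12)
    return len(mu) / np.sum(np.log(np.maximum(mu, 1.0 + 1e-12)))

rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 3))          # data with 3 true degrees of freedom
W = rng.normal(size=(3, 256))
ambient = np.tanh(latent @ W)                # embedded in a 256-dim ambient space
print(round(two_nn_id(ambient), 2))          # close to 3, far below 256
```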
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2303. Learning From Dictionary: Enhancing Robustness of Machine-Generated Text Detection in Zero-Shot Language via Adversarial Training
๐ฌ 40
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Adversarial Training Machine Generated Text Detection
Machine-generated text (MGT) detection is critical for safeguarding online content integrity and preventing the spread of misleading information.
Although existing detectors achieve high accuracy in monolingual settings, they exhibit severe performance degradation on zero-shot languages and are vulnerable to adversarial attacks.
To tackle these challenges, we propose a robust adversarial training framework named \textbf{T}ranslation-based \textbf{A}ttacker \textbf{S}trengthens Mul\textbf{T}ilingual Def\textbf{E}nder (\detectorname).
\detectorname comprises two core components: an attacker that performs code-switching by querying translation dictionaries to generate adversarial examples, and a detector trained to resist these attacks while generalizing to unseen languages.
We further introduce a novel Language-Agnostic Adversarial Loss (LAAL), which encourages the detector to learn language-invariant feature representations and thus enhances zero-shot detection performance and robustness against unseen attacks.
Additionally, the attacker and detector are synchronously updated, enabling continuous improvement of defensive capabilities.
Experimental results on 9 languages and 8 attack types show that our \detectorname surpasses 8 SOTA detectors, improving the average F1 score by \textbf{0.064} and reducing the average Attack Success Rate (ASR) by \textbf{3.8\%}.
Our framework offers a promising approach for building robust, multilingual MGT detectors with strong generalization to real-world adversarial scenarios.
We will release our code, models, and dataset upon acceptance.
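A minimal sketch of the attacker side of this framework, under illustrative assumptions: a fraction of words in a machine-generated text is swapped with translations looked up in bilingual dictionaries to produce a code-switched adversarial example. The dictionary contents, swap rate, and language choices below are made up for the example.

```python
# Hypothetical sketch of a dictionary-based code-switching attacker.
import random

def code_switch_attack(text, dictionaries, swap_rate=0.3, seed=0):
    """dictionaries: {language: {english_word: translated_word}}."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if rng.random() < swap_rate:
            lang = rng.choice(list(dictionaries))
            words[i] = dictionaries[lang].get(w.lower(), w)   # keep word if no entry
    return " ".join(words)

dicts = {"de": {"model": "Modell", "text": "Text", "language": "Sprache"},
         "fr": {"model": "modele", "generated": "genere", "detection": "detection"}}
print(code_switch_attack("the language model generated this text", dicts))
```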
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2304. Generalizable End-to-End Tool-Use RL with Synthetic CodeGym
๐ฌ 4
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Agent Reinforcement Learning Synthetic Dataset Generalizability
Tool-augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine-tuning (SFT) over static trajectories or reinforcement learning (RL) on narrow tasks, which generalize poorly beyond development settings and lead to brittleness with new tools and unseen workflows. Because code execution reflects many structures of real-world workflows, we use coding problems as a structured substrate to build tool-use agent training environments with diverse task configurations. To this end, we introduce **CodeGym**, a scalable framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL, enabling LLM agents to explore and master various workflows actively. CodeGym converts static coding problems into interactive environments by extracting atomic functions or logic into callable tools, yielding verifiable tasks that span various tool-execution workflows. Models of varying sizes and chain-of-thought configurations, trained in CodeGym, exhibit consistent out-of-distribution generalizability; for example, Qwen2.5-32B-Instruct achieves an absolute accuracy gain of 8.7 points on the OOD benchmark $\tau$-Bench. These results highlight CodeGym as a step toward scalable general-purpose RL environments for training tool-use behaviors that align with real-world agent workflows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2305. Unveiling Super Experts in Mixture-of-Experts Large Language Models
๐ฌ 17
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ MoE LLM compression attention
Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to enhance the efficiency of Mixture-of-Experts (MoE) large language models (LLMs).
However, existing approaches often rely on empirical heuristics to identify critical experts, while lacking a deeper understanding of the heterogeneous importance of experts and the inner workings of MoE LLMs.
In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in the model's forward inference.
These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 experts causes Qwen3-30B-A3B to generate repetitive and uninformative outputs).
We refer to these experts as Super Experts (SEs).
Our comprehensive analysis provides progressively deeper insights into SEs:
(i) SEs are characterized by rare but extreme activation outliers in the output of the down\_proj, which give rise to massive activations in the hidden states between decoder layers.
Moreover, the distribution of SEs is model-specific, data-agnostic, and remains unaffected by post-training processes.
(ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning.
(iii) We further investigate why compressing SEs exerts such a pronounced impact.
We show that, in MoE LLMs, SEs serve as the primary source of the systematic outlier mechanism in Transformers, and that compressing them profoundly disrupts this process, ultimately causing the collapse of attention sinks.
These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in the current knowledge.
In addition, we developed an automated tool for rapid and accurate SE profiling.
The code is provided in the supplementary materials.
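A minimal sketch of how such expert profiling could look in practice, on a toy MoE layer rather than a real MoE LLM: hook each expert's down-projection, record the largest absolute output activation it produces, and flag experts whose peaks are extreme relative to the rest. The toy modules, the injected outlier, and the 5x-median threshold are assumptions for illustration.

```python
# Hypothetical sketch: profile experts by their down_proj activation outliers.
import torch
import torch.nn as nn

class ToyExpert(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        self.up_proj, self.down_proj = nn.Linear(d, h), nn.Linear(h, d)
    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))

experts = nn.ModuleList([ToyExpert(32, 128) for _ in range(8)])
peaks = [0.0] * len(experts)

def make_hook(idx):
    def hook(module, inputs, output):
        peaks[idx] = max(peaks[idx], output.abs().max().item())  # track per-expert peak
    return hook

for i, e in enumerate(experts):
    e.down_proj.register_forward_hook(make_hook(i))

with torch.no_grad():
    experts[3].down_proj.weight.data *= 50.0      # inject an artificial outlier expert
    x = torch.randn(64, 32)
    for e in experts:
        e(x)

threshold = 5.0 * sorted(peaks)[len(peaks) // 2]  # e.g., 5x the median peak
print([i for i, p in enumerate(peaks) if p > threshold])   # -> [3]
```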
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2306. Master Skill Learning with Policy-Grounded Synergy of LLM-based Reward Shaping and Exploring
๐ฌ 17
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot Skill Acquisition Dexterous Manipulation Automatic Reward Design
The acquisition of robotic skills via reinforcement learning (RL) is crucial for advancing embodied intelligence, but designing effective reward functions for complex tasks remains challenging. Recent methods using large language models (LLMs) can generate reward functions from language instructions, but they often produce overly goal-oriented rewards that neglect state exploration, causing robots to get stuck in local optima. Traditional RL addresses this by adding exploration bonuses, but these are typically generic and inefficient, wasting resources on exploring task-irrelevant areas. To address these limitations, we propose Policy-grounded Synergy of Reward Shaping and Exploration (PoRSE), a novel and unified framework that guides LLMs to generate task-aware reward functions while constructing an abstract affordance space for efficient exploration bonuses. Given the vast number of possible reward-bonus combinations, it is impractical to exhaustively train a policy from scratch for each configuration to identify the best one. Instead, PoRSE employs an in-policy-improvement grounding process, dynamically and continuously generating and filtering out reward-bonus pairs along the policy improvement process. This approach accelerates skill acquisition and fosters a mutually reinforcing relationship between reward shaping, exploration and policy enhancement through close feedback. Experiments show that PoRSE is highly effective, achieving significant improvement in average returns across all robotic tasks compared to previous state-of-the-art methods. It also achieves initial success in two highly challenging manipulation tasks, marking a significant breakthrough.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2307. AttriCtrl: A Generalizable Framework for Controlling Semantic Attribute Intensity in Diffusion Models
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion; Control Generation
Diffusion models have recently become the dominant paradigm for image generation, yet existing systems struggle to interpret and follow numeric instructions for adjusting semantic attributes.
In real-world creative scenarios, especially when precise control over aesthetic attributes is required, current methods fail to provide such controllability.
This limitation partly arises from the subjective and context-dependent nature of aesthetic judgments, but more fundamentally stems from the fact that current text encoders are designed for discrete tokens rather than continuous values.
Meanwhile, efforts on aesthetic alignment, often leveraging reinforcement learning, direct preference optimization, or architectural modifications, primarily align models with a global notion of human preference. While these approaches improve user experience, they overlook the multifaceted and compositional nature of aesthetics, underscoring the need for explicit disentanglement and independent control of aesthetic attributes.
To address this gap, we introduce AttriCtrl, a lightweight framework for continuous aesthetic intensity control in diffusion models.
It first decomposes relevant aesthetic attributes, then quantifies them through a hybrid strategy that maps both concrete and abstract dimensions onto a unified $[0,1]$ scale. A plug-and-play value encoder is then used to transform user-specified values into model-interpretable embeddings for controllable generation.
Experiments show that AttriCtrl achieves accurate and continuous control over both single and multiple aesthetic attributes, significantly enhancing personalization and diversity.
Crucially, it is implemented as a lightweight adapter while keeping the diffusion model frozen, ensuring seamless integration with existing frameworks such as ControlNet at negligible computational cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2308. Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM deception Long-horizon interaction
Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception under pressure, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce the first simulation framework for probing and evaluating deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. Our framework instantiates a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed- and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal distinct strategies of concealment, equivocation, and falsification. Our findings establish deception as an emergent risk in long-horizon interactions and provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2309. CLIP-FMoE: Scalable CLIP via Fused Mixture-of-Experts with Enforced Specialization
๐ฌ 21
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Vision-Language Models Mixture of Experts Fine-tuning
Mixture-of-Experts (MoE) architectures have emerged as a promising approach for scaling deep learning models while maintaining computational efficiency. However, existing MoE adaptations for Contrastive Language-Image Pre-training (CLIP) models suffer from significant computational overhead during sequential training and degradation of zero-shot capabilities. To address these limitations, we propose CLIP-FMoE, a novel approach that integrates MoE architecture into CLIP fine-tuning. Our method uses Isolated Constrained Contrastive Learning, a pipeline that trains specialized experts on cluster-based data partitions to accelerate expert specialization. Additionally, we introduce a Fusion Gate mechanism to mitigate catastrophic forgetting of pre-trained knowledge. Extensive experiments across multiple benchmarks demonstrate that our approach achieves consistent improvements on downstream tasks while preserving zero-shot capabilities. Furthermore, our method demonstrates robust performance across varying context lengths, making it particularly suitable for diverse real-world applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2310. Bridging Radiology and Pathology Foundation Models via Concept-Based Multimodal Co-Adaptation
๐ฌ 26
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal learning concept-based learning foundation models parameter-efficient fine-tuning medical imaging survival analysis
Pretrained medical foundation models (FMs) have shown strong generalization across diverse imaging tasks, such as disease classification in radiology and tumor grading in histopathology. While recent advances in parameter-efficient finetuning have enabled effective adaptation of FMs to downstream tasks, these approaches are typically designed for a single modality. In contrast, many clinical workflows rely on joint diagnosis from heterogeneous domains, such as radiology and pathology, where fully leveraging the representation capacity of multiple FMs remains an open challenge. To address this gap, we propose Concept Tuning and Fusing (CTF), a parameter-efficient framework that uses clinically grounded concepts as a shared semantic interface to enable cross-modal co-adaptation before fusion. By incorporating task-specific concepts that are relevant across modalities, CTF aligns radiology and pathology representations, thereby enhancing their complementarity and enabling interpretation. We further design a Global-Context-Shared Prompt (GCSP) mechanism, which employs a small set of learnable tokens to capture domain-specific priors, shared patient-level information, and cross-domain context. The resulting concept alignment scores from each modality are then fused to produce a final prediction. Extensive experiments demonstrate that CTF outperforms strong unimodal, latent-fusion, and adapter-based baselines (e.g., AUC 0.903 on TCGA-GBMLGG). Notably, CTF achieves these gains without finetuning the full FMs, requiring only 0.15\% additional parameters, thus highlighting the effectiveness of concept-based multimodal co-adaptation. Our code is anonymously available at: https://anonymous.4open.science/r/CTF-27C2.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2311. COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics
๐ฌ 24
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Conformal Prediction Segmentation Imaging Uncertainty Quantification
In clinical applications, the utility of segmentation models is often based on the accuracy of derived downstream metrics such as organ size, rather than by the pixel-level accuracy of the segmentation masks themselves. Thus, uncertainty quantification for such metrics is crucial for decision-making. Conformal prediction (CP) is a popular framework to derive such principled uncertainty guarantees, but applying CP naively to the final scalar metric is inefficient because it treats the complex, non-linear segmentation-to-metric pipeline as a black box. We introduce COMPASS, a practical framework that generates efficient, metric-based CP intervals for image segmentation models by leveraging the inductive biases of their underlying deep neural networks. COMPASS performs calibration directly in the model's representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric. We prove that COMPASS achieves valid marginal coverage under the assumption of exchangeability. Empirically, we demonstrate that COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation of skin lesions and anatomical structures. Furthermore, we show that leveraging learned internal features to estimate importance weights allows COMPASS to also recover target coverage under covariate shifts. COMPASS paves the way for practical, metric-based uncertainty quantification for medical image segmentation.
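For context, a minimal sketch of the naive baseline the abstract contrasts with: split conformal prediction applied directly to the scalar downstream metric (for example a predicted area), treating the segmentation-to-metric pipeline as a black box. The calibration data and the absolute-error nonconformity score below are illustrative; COMPASS itself works in the model's feature space and is not shown here.

```python
# Minimal split conformal prediction on a scalar metric (the naive baseline).
import numpy as np

def split_conformal_interval(pred_cal, true_cal, pred_test, alpha=0.1):
    """pred_cal/true_cal: predicted and reference metric values on a calibration set."""
    scores = np.abs(pred_cal - true_cal)                       # nonconformity scores
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return pred_test - q, pred_test + q                        # marginal 1-alpha coverage

rng = np.random.default_rng(0)
true_cal = rng.uniform(50, 500, size=200)                      # e.g., areas in mm^2 (toy data)
pred_cal = true_cal + rng.normal(0, 20, size=200)              # model-derived estimates
lo, hi = split_conformal_interval(pred_cal, true_cal, pred_test=np.array([120.0, 310.0]))
print(lo, hi)
```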
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2312. S$^2$-Guidance: Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models
๐ฌ 25
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Models; Classifier-free Guidance
Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for generating high-quality samples. However, through an empirical analysis on both Gaussian mixture models with closed-form solutions and real-world data distributions, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to low fidelity and semantic incoherence. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself, without requiring additional training or the integration of external modules. Building on this insight, we propose **$S^2$-Guidance ($S$tochastic $S$elf-Guidance)**, a novel method that leverages stochastic block-dropping during the denoising process to construct sub-networks. This approach effectively guides the model away from potential low-quality predictions, thereby improving sample quality. Extensive qualitative and quantitative experiments across multiple standard benchmarks for text-to-image and text-to-video generation tasks demonstrate that **$S^2$-Guidance** delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
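A minimal sketch of the stochastic block-dropping idea on a toy backbone, with a clearly assumed combination rule: the sub-network prediction is formed by randomly skipping blocks at inference time, and the full prediction is extrapolated away from it in a CFG-like fashion. The drop probability, the extrapolation formula, and the scale are assumptions, not the paper's exact guidance rule.

```python
# Hypothetical sketch: build a sub-network via stochastic block-dropping and steer
# the denoising prediction away from it.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    def __init__(self, dim=64, n_blocks=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_blocks))
    def forward(self, x, drop_prob=0.0):
        for blk in self.blocks:
            if torch.rand(()) >= drop_prob:     # stochastic block-dropping
                x = x + blk(x)
        return x

model = Backbone().eval()
x_t = torch.randn(4, 64)                        # noisy latents at one denoising step
with torch.no_grad():
    pred_full = model(x_t)                      # full-network prediction
    pred_sub = model(x_t, drop_prob=0.15)       # stochastic sub-network prediction
    s = 1.0                                     # self-guidance scale (assumed)
    pred_guided = pred_full + s * (pred_full - pred_sub)
print(pred_guided.shape)
```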
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2313. Two-Layer Convolutional Autoencoders Trained on Normal Data Provably Detect Unseen Anomalies
๐ฌ 4
โญ 5.50
๐ learning theory
๐ท๏ธ Learning Theory
Anomaly detection refers to the techniques that identify (probably unseen) rare or suspicious data that deviate significantly from the pre-defined normal data (Chalapathy & Chawla, 2019; Ruff et al., 2021). Empirical studies have observed that generative models trained on normal data tend to produce larger reconstruction errors when reconstructing anomalies. Based on this observation, researchers have developed various anomaly detection methods, referred to as reconstruction-based anomaly detection (RBAD) (Lv et al., 2024; Li et al., 2024) in the literature.
Despite the empirical success of RBAD, the theoretical understanding of RBAD is still limited. This paper provides a theoretical analysis of RBAD. We analyze the training dynamics of a 2-layer convolutional autoencoder and introduce the cone set of the features. We prove that the cone sets of the normal features would absorb the (convolutional) kernels of the autoencoder during training and use these absorbed kernels to reconstruct the inputs. The absorbed kernels are more aligned with the normal features, which explains the cause of the reconstruction error gap between the normal data and the anomalies. Synthetic experiments are provided to validate our theoretical findings. We also visualize the training dynamics of the autoencoder on real-world data, demonstrating our proposed cone set intuition.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2314. LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference
๐ฌ 24
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Information Retrieval Efficient Deploy Fast Query Inference LLM-based Text Retrieval
Large Language Models (LLMs)-based text retrieval retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real-time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full LLM on an A800 GPU, our method achieves over a 1000$\times$ speedup in query encoding and over a 10$\times$ increase in end-to-end retrieval throughput. Extensive experiments on large-scale retrieval benchmarks show that LightRetriever generalizes well across diverse tasks, maintaining an average of 95% retrieval performance.
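A minimal sketch of the asymmetric design, with stand-ins throughout: documents go through a heavy encoder offline, while a query is encoded online by nothing more than an embedding lookup plus pooling. The toy vocabulary, mean pooling, and the small transformer standing in for the full LLM are assumptions for illustration.

```python
# Hypothetical sketch: heavy offline document encoder, lookup-only online query encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
doc_encoder = nn.Sequential(nn.Embedding(vocab, dim),
                            nn.TransformerEncoder(
                                nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
                                num_layers=2))
query_table = nn.Embedding(vocab, dim)          # the entire online query "encoder"

def encode_docs(doc_ids):                       # heavy, but done offline
    return F.normalize(doc_encoder(doc_ids).mean(dim=1), dim=-1)

def encode_query(query_ids):                    # online: just lookups + mean pooling
    return F.normalize(query_table(query_ids).mean(dim=1), dim=-1)

docs = torch.randint(0, vocab, (5, 32))         # 5 documents, 32 tokens each
query = torch.randint(0, vocab, (1, 6))
scores = encode_query(query) @ encode_docs(docs).T
print(scores)
```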
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2315. Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Audio-video speech separation vector quantization lightweight network discrete semantic units
Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named **Dolphin**. For visual feature extraction, we develop **DP-LipCoder**, a dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50\% fewer parameters, more than 2.4$\times$ reduction in MACs, and over 6$\times$ faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at https://dolphin-avss.github.io/Dolphin.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2316. ResearchRubrics: A Benchmark of Prompts and Rubrics For Deep Research Agents
๐ฌ 10
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Deep Research Large Language Models Benchmarks Rubrics LLM-as-a-judge Multi-step Reasoning Cross-document Synthesis Long-form Question Answering Evidence-based Reasoning Evaluation Frameworks Natural Language Processing
Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with more than 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2317. PCPO: Proportionate Credit Policy Optimization for Preference Alignment of Image Generation Models
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Text-to-image generation Reinforcement Learning Preference Alignment
While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2318. OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ MCP Computer-use Agent LMM
With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3\% to 20.4\% for OpenAI o3 at 15 steps, from 40.1\% to 43.3\% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3\%), indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. We will release all code and data to the community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2319. STEM: SCALING TRANSFORMERS WITH EMBEDDING MODULES
๐ฌ 18
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Sparse Transformer Parametric scaling Embedding Layers Foundation Models Pre-training Model Architecture
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce \textbf{STEM} (\emph{Scaling Transformers with Embedding Modules}), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread, which enhances its knowledge storage capacity. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to $\sim$3--4\% improvements in average downstream performance, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while remaining simpler to train and deploy than existing fine-grained sparse models.
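A minimal sketch of a STEM-style FFN under stated assumptions: the up-projection output is fetched by a token-indexed embedding lookup while the gate and down projections stay dense. The gated-SiLU form, per-layer table, and exact placement are assumptions drawn from the abstract's description, not the paper's full specification.

```python
# Hypothetical sketch: gated FFN whose up-projection is a static, token-indexed lookup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StemFFN(nn.Module):
    def __init__(self, vocab_size, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)     # dense, as in a gated FFN
        self.down = nn.Linear(d_ff, d_model, bias=False)     # dense
        self.up_table = nn.Embedding(vocab_size, d_ff)       # replaces the dense up-projection

    def forward(self, x, token_ids):
        up = self.up_table(token_ids)                        # static, token-indexed lookup
        return self.down(F.silu(self.gate(x)) * up)

ffn = StemFFN(vocab_size=32000, d_model=64, d_ff=256)
x = torch.randn(2, 10, 64)                                   # hidden states
tok = torch.randint(0, 32000, (2, 10))                       # the tokens at those positions
print(ffn(x, tok).shape)                                     # torch.Size([2, 10, 64])
```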
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2320. Every Language Model Has a Forgery-Resistant Signature
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ fingerprint watermark language model signature accountability cryptography forgery security
The ubiquity of closed-weight language models with public-facing APIs has generated interest in forensic methods, both for extracting hidden model details (e.g., parameters) and identifying models by their outputs. One successful approach to these goals has been to exploit the geometric constraints imposed by the language model architecture and parameters. In this work, we show that a lesser-known geometric constraint, namely that language model outputs lie on the surface of a high-dimensional ellipse, functions as a signature for the model, which can be used to identify which model an output came from. This ellipse signature has unique properties that distinguish it from existing model-output association methods like language model watermarks. First, the signature is hard to forge: without direct access to model parameters, it is practically infeasible to produce logprobs on the ellipse. Second, the signature is naturally occurring, since all language models have these elliptical constraints. Third, the signature is self-contained, in that it is detectable without access to the model input or full weights. Finally, the signature is exceptionally redundant, as it is independently detectable in every single logprob output from the model. We evaluate a novel technique for extracting the ellipse on small models, and discuss the practical hurdles that make it infeasible for production-size models, making the signature hard to forge. Finally, we use ellipse signatures to propose a protocol for language model output verification, which is analogous to cryptographic symmetric-key message authentication systems.
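A simplified sketch of the geometric intuition only: because final-layer outputs are W @ h with h of dimension d, a batch of output vectors concentrates in a subspace of dimension about d, far below the vocabulary size. The paper's ellipse test is stricter than this rank check, and the toy "model" below (random unembedding matrix, norm-constrained hidden states) is an assumption used purely to illustrate why outputs carry a model-specific geometric signature.

```python
# Simplified proxy: reveal the low-dimensional constraint on model output vectors.
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n = 64, 5000, 512
W = rng.normal(size=(vocab, d))                  # unembedding matrix of the toy "model"
H = rng.normal(size=(n, d))
H /= np.linalg.norm(H, axis=1, keepdims=True)    # layer-norm-like constraint on hidden states
logits = H @ W.T                                 # n output vectors in vocab-dim space

centered = logits - logits.mean(axis=0)
sv = np.linalg.svd(centered, compute_uv=False)
effective_rank = int((sv > 1e-8 * sv[0]).sum())
print(effective_rank, "<<", vocab)               # ~d, exposing the geometric constraint
```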
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2321. LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Transformer; Generative Models; Image Restoration
Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) to restoration with minimal parameter overhead. LucidFlux introduces a lightweight \emph{dual-branch conditioner} that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. A timestep- and layer-adaptive modulation schedule routes these cues across the backbone's hierarchy, yielding coarse-to-fine, context-aware updates that protect global structure while recovering texture. To avoid the latency and instability of text prompts or VLM captions, we enforce \emph{caption-free semantic alignment} via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision.
Across synthetic and in-the-wild benchmarks, LucidFlux consistently surpasses strong open-source and commercial baselines across seven metrics, with clear visual gains in realism, detail, and artifact suppression. Ablations confirm that, for large DiTs, when, where, and what to condition, rather than scaling parameters or relying on text prompts, is the key lever for robust, prompt-free restoration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2322. GRO-RAG: Gradient-aware Re-rank Optimization for Multi-source Retrieval-Augmented Generation
๐ฌ 5
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Retrieval-Augmented Generation LLM
Retrieval-Augmented Generation (RAG) systems often rely on information retrieved from heterogeneous sources to support generation tasks. However, existing approaches typically either aggregate all sources uniformly or statically select a single source, neglecting semantic complementarity. Moreover, they commonly employ re-ranking models to obtain the Top-k documents without accounting for each document's actual contribution to the generation objective.
In this paper, we propose GRO-RAG, a training-free, gradient-aware re-ranking framework for multi-source RAG.
Our method performs Top-k document selection by reading gradients from the language model, estimating each documentโs contribution to the generation loss through a single backward pass.
This enables re-ranking not by heuristic relevance, but by direct feedback from the LLM's generation objective.
At the source level, we incorporate inter-source redundancy and query relevance to select a source combination prior to re-ranking.
Theoretically, we prove that this gradient-based Top-k selection approximates the optimal subset minimizing the generation loss, and aligns with minimizing the leave-one-out loss upper bound.
Experiments across multi-source QA and open-domain generation tasks demonstrate consistent improvements in generation quality, highlighting the importance of generation-aware retrieval selection in multi-source RAG.
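A hedged sketch of the core mechanism as described above: one backward pass of the generation loss, then ranking candidate documents by the gradient mass their input-embedding spans receive. The gpt2 stand-in, the example texts, and the gradient-norm scoring rule are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: rank retrieved documents by how much gradient their embedding spans
# receive from a single backward pass of the answer-generation loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

query = "Q: Who wrote Hamlet?\n"
docs = ["Doc1: Hamlet is a tragedy by William Shakespeare.\n",
        "Doc2: The Eiffel Tower is located in Paris.\n"]
answer = "A: William Shakespeare."

# Build one sequence and remember where each piece's tokens live.
pieces, spans, ids = [query] + docs + [answer], [], []
for p in pieces:
    t = tok(p, return_tensors="pt").input_ids[0]
    spans.append((len(ids), len(ids) + len(t)))
    ids.extend(t.tolist())
input_ids = torch.tensor([ids])

# Only the answer span contributes to the loss; all other positions are masked out.
labels = torch.full_like(input_ids, -100)
a_start, a_end = spans[-1]
labels[0, a_start:a_end] = input_ids[0, a_start:a_end]

emb = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
loss = model(inputs_embeds=emb, labels=labels).loss
loss.backward()  # the single backward pass

# Score each document by the gradient norm over its embedding span (a relevance heuristic).
for i, (s, e) in enumerate(spans[1:-1], start=1):
    print(pieces[i].split(":")[0], emb.grad[0, s:e].norm().item())
```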
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2323. WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Novel View Synthesis Monocular Dynamic Reconstruction Gaussian Splatting
Dynamic reconstruction has achieved remarkable progress, but monocular input remains challenging for more practical applications. The prevailing works attempt to construct efficient motion representations, but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising a Temporal Partition Tree (TPT), which enables coarse-to-fine optimization based on an inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC), which recursively query the ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves an $8.26\%$ improvement in LPIPS on NVIDIA-LS and a $9.09\%$ improvement in mLPIPS on DyCheck compared to the second-best method. The code will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2324. DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation
๐ฌ 30
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Text-Line Generation Online Handwriting Latent Diffusion Transformer
Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2325. CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
๐ฌ 20
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ multi-hop KGQA neuro-symbolic reasoning agentic system context engineering
Knowledge graphs provide structured context for multi-hop question answering, but deployed systems must balance answer accuracy with strict latency and cost targets while preserving provenance. Static $k$-hop expansions and ``think-longer'' prompting often over-retrieve, inflate context, and yield unpredictable runtime. Thus, we introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to keep, and when to stop. Latency (interaction steps) and prompt cost (selected tokens) are exposed as user-specified budgets or prices, allowing per-query adaptation to trade-offs among accuracy, latency, and cost without retraining. CLAUSE employs the proposed Lagrangian-Constrained Multi-Agent Proximal Policy Optimization (LC-MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction, reasoning-path discovery, and evidence selection are jointly optimized under per-query resource budgets on edge edits, interaction steps, and selected tokens. Across HotpotQA, MetaQA, and FactKG, CLAUSE yields higher EM@1 while reducing subgraph growth and end-to-end latency at equal or lower token budgets. On MetaQA-2-hop, relative to the strongest RAG baseline (GraphRAG), CLAUSE achieves $+39.3$ EM@1 with 18.6% lower latency and 40.9% lower edge growth. The resulting contexts are compact, provenance-preserving, and deliver predictable performance under deployment constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2326. Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents
๐ฌ 11
โญ 5.50
๐ reinforcement learning
๐ท๏ธ temporal reasoning reinforcement learning memory selection multi-session dialogue
Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. As dialogue histories grow in length and accumulate noise, existing long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce **Memory-T1**, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set with temporal and retriever filters, followed by an RL agent that selects the precise evidence. The RL training is guided by a multi-level reward function optimizing (i) accuracy, (ii) evidence grounding, and (iii) temporal consistency. This temporal consistency reward provides a dense signal by evaluating alignment at both the session level (range proximity) and the utterance level (evidence density), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0\%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2\%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute to a 15.0\% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories.
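A hedged sketch of what a multi-level reward with components (i)-(iii) could look like; the weights, the F1-style grounding term, and the proximity/density split are illustrative assumptions rather than the paper's exact reward.

```python
# Hedged sketch of a multi-level reward combining accuracy, evidence grounding,
# and temporal consistency (session-level proximity x utterance-level density).
def memory_reward(pred_answer, gold_answer, selected_ids, gold_ids,
                  selected_sessions, gold_session_range, w=(1.0, 0.5, 0.5)):
    # (i) accuracy: exact match on the final answer
    acc = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    # (ii) evidence grounding: F1 between selected and gold evidence utterances
    sel, gold = set(selected_ids), set(gold_ids)
    p = len(sel & gold) / max(len(sel), 1)
    r = len(sel & gold) / max(len(gold), 1)
    grounding = 2 * p * r / max(p + r, 1e-8)
    # (iii) temporal consistency: how well selected sessions fall in the gold range,
    #       combined with how dense the selection is in gold evidence
    lo, hi = gold_session_range
    in_range = [s for s in selected_sessions if lo <= s <= hi]
    proximity = len(in_range) / max(len(selected_sessions), 1)
    density = len(sel & gold) / max(len(sel), 1)
    temporal = 0.5 * proximity + 0.5 * density
    return w[0] * acc + w[1] * grounding + w[2] * temporal

print(memory_reward("Paris", "paris", [3, 7], [3, 7, 9], [2, 2], (1, 3)))
```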
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2327. Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
๐ฌ 41
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Reinforcement Learning from Verifier Rewards Mathematical Reasoning
Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the ``learning cliff'' phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3\% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
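A hedged sketch of the failure mode and the intervention: group-relative advantages collapse to zero when every rollout in a group fails, and a tiered hint is injected into the prompt only in that case. The hint texts and the stagnation test are illustrative assumptions, not the paper's diagnosis rule.

```python
# Hedged sketch: GRPO advantage collapse on the "learning cliff" and tiered in-prompt hints.
import numpy as np

def grpo_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    return np.zeros_like(r) if std == 0 else (r - r.mean()) / std

HINT_TIERS = ["Hint: recall the relevant theorem.",        # abstract concept
              "Hint: first simplify the expression.",       # strategy
              "Hint: substitute x = 2 and solve for y."]    # concrete step

def scaffolded_prompt(problem, group_rewards, tier):
    adv = grpo_advantages(group_rewards)
    if np.all(np.asarray(group_rewards) == 0):   # learning cliff: zero advantage everywhere
        return problem + "\n" + HINT_TIERS[min(tier, len(HINT_TIERS) - 1)], adv
    return problem, adv                           # the model is making progress; no hint needed

print(scaffolded_prompt("Solve for y: 3x + y = 10.", [0, 0, 0, 0], tier=0))
print(scaffolded_prompt("Solve for y: 3x + y = 10.", [0, 1, 0, 1], tier=0))
```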
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2328. LogiConBench: Benchmarking Logical Consistencies of LLMs
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ llm logical reasoning
Logical consistency, the requirement that statements remain non-contradictory under logical rules, is fundamental for trustworthy reasoning, yet current LLMs often fail to maintain it even on simple inference tasks. Existing benchmarks for LLM logical consistency are not scalable, not diverse, and not challenging, with state-of-the-art models already surpassing 95% accuracy. LogiConBench is the first benchmark that (1) generates unlimited logical rule combinations with precise labels, (2) provides controllable-depth graphs with explicit reasoning paths, and (3) remains challenging for state-of-the-art LLMs. To achieve this, LogiConBench automatically generates logical graphs where nodes represent symbolic propositions and edges denote reasoning relations. From these graphs, it samples lists of propositions, extracts reasoning paths, determines all consistent label lists, and translates them into diverse natural language expressions. While we release a 280K-sample corpus in this work, the framework can be scaled to generate unlimited data. To strengthen its evaluative significance, we evaluate 14 frontier LLMs on two tasks with varying difficulty levels, and find that the Enumerative task remains extremely challenging, with the best exact accuracy at only 34%. Our code and data are available at https://anonymous.4open.science/r/LogiConBench-11D1/.
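A hedged sketch of the generation idea on a toy scale: build a small implication graph over symbolic propositions and enumerate all truth assignments consistent with it. The benchmark itself uses richer rule types, controllable-depth graphs, and natural-language translation; the propositions and edges below are toy stand-ins.

```python
# Hedged sketch: enumerate all label lists consistent with a small implication graph.
from itertools import product

props = ["P0", "P1", "P2", "P3"]
edges = [("P0", "P1"), ("P1", "P2"), ("P0", "P3")]   # each edge encodes "antecedent implies consequent"

def consistent_labelings(props, edges):
    out = []
    for values in product([False, True], repeat=len(props)):
        assign = dict(zip(props, values))
        if all((not assign[a]) or assign[b] for a, b in edges):   # a -> b holds
            out.append(assign)
    return out

labelings = consistent_labelings(props, edges)
print(len(labelings), "consistent label lists, e.g.", labelings[0])
```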
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2329. Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Editing
Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks fine-grained control over the edit's subsequent temporal evolution. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video models for flexible video editing.
Our key innovation is using a spatiotemporal mask to strategically guide the LoRA fine-tuning process. This teaches the model two distinct skills: first, to interpret the mask as a command to either preserve content from the source video or generate new content in designated regions; second, for these generated regions, to synthesize either temporally consistent motion inherited from the video or novel appearances guided by user-provided reference frames.
This dual-capability LoRA grants users control over the edit's entire temporal evolution, allowing complex transformations like an object rotating or a flower blooming. Experimental results show our method achieves superior video editing performance compared to baseline methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2330. EditAnyShape: Shape-Aware Image Editing via Trajectory-Guided Region Control
๐ฌ 12
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ image editing
While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios---particularly those involving large-scale shape transformations.
When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality.
We propose $\textbf{EditAnyShape}$, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content.
Motivated by the divergence between inversion and editing trajectories, we compute a $\textbf{Trajectory Divergence Map (TDM)}$ by comparing token-wise velocity differences between the inversion and denoising paths.
The TDM enables precise localization of editable regions and guides a $\textbf{Scheduled KV Injection}$ mechanism that ensures stable and faithful editing.
To facilitate a rigorous evaluation, we introduce $\textbf{ReShapeBench}$, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing.
Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2331. Heterogeneous Federated Fine-Tuning with Parallel One-Rank Adaptation
๐ฌ 21
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Learning Low-Rank Adaptation Large Language Models Resource Heterogeneity
Large Language Models (LLMs) have demonstrated remarkable effectiveness in adapting to downstream tasks through fine-tuning. Federated Learning (FL) extends this capability by enabling collaborative fine-tuning across distributed clients using Low-Rank Adaptation (LoRA), while keeping data private. However, practical deployments face challenges when clients have heterogeneous resources and thus adopt different LoRA ranks, leading to substantial initialization and aggregation noise that undermines performance. To address these challenges, we propose Fed-PLoRA, a novel lightweight heterogeneous federated fine-tuning (FFT) framework. Fed-PLoRA introduces Parallel One-Rank Adaptation (PLoRA), a new LoRA variant that replaces the classic multi-rank LoRA module with multiple parallel one-rank modules, and a novel Select-N-Fold strategy that folds untrained PLoRA modules into the pre-trained weights before local training, thereby accommodating heterogeneous client resources. We provide a unified analysis of initialization and aggregation noise of Fed-PLoRA and demonstrate how it addresses the limitations of state-of-the-art methods. Extensive experiments on diverse LLM fine-tuning tasks demonstrate that Fed-PLoRA consistently outperforms existing methods in both accuracy and efficiency. Our code will be openly available.
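A hedged sketch of a linear layer with parallel one-rank adapters and a fold operation that merges a subset of modules into the frozen base weight; dimensions, initialization, and the selection rule are illustrative, and the paper's Select-N-Fold strategy and aggregation details may differ.

```python
# Hedged sketch of a PLoRA-style layer: several parallel rank-1 adapters over a frozen base,
# with a fold() that merges chosen modules into the base weight and stops training them.
import torch, torch.nn as nn

class PLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, n_modules=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                      # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(n_modules, d_in) * 0.01)  # one rank-1 pair per module
        self.B = nn.Parameter(torch.zeros(n_modules, d_out))        # zero-init, as in LoRA
        self.active = list(range(n_modules))                        # modules this client trains

    def forward(self, x):
        out = self.base(x)
        for i in self.active:
            out = out + (x @ self.A[i]).unsqueeze(-1) * self.B[i]   # rank-1 update b_i * (a_i . x)
        return out

    def fold(self, idx):
        """Merge the listed modules into the frozen base weight and stop training them."""
        with torch.no_grad():
            for i in idx:
                self.base.weight += torch.outer(self.B[i], self.A[i])
                self.A[i].zero_(); self.B[i].zero_()
        self.active = [i for i in self.active if i not in set(idx)]

layer = PLoRALinear(16, 8, n_modules=4)
layer.fold([2, 3])              # a resource-limited client keeps only two trainable rank-1 modules
print(layer(torch.randn(5, 16)).shape, "active modules:", layer.active)
```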
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2332. BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Experimental Design Bayesian BED EIG Information Gain LLM Conversational clarification questions input ambiguities uncertainty
We propose a general-purpose approach for improving the ability of large language models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED).
This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments.
Our approach, which we call BED-LLM (Bayesian experimental design with large language models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated (and then estimated) in a principled way using a probabilistic model derived from the LLM's predictive distributions and provide detailed insights into key decisions in its construction and updating procedure. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20 Questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
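A hedged sketch of EIG-based question selection in a 20-Questions-style setting; the hypothesis set, the yes/no likelihoods, and the candidate questions below are toy stand-ins, whereas BED-LLM derives the probabilistic model from the LLM's predictive distributions.

```python
# Hedged sketch: pick the yes/no question with the largest expected information gain.
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def eig(prior, likelihood_yes):
    """Expected information gain of a yes/no question.
    prior[i]: P(hypothesis i); likelihood_yes[i]: P(answer = yes | hypothesis i)."""
    p_yes = sum(p * l for p, l in zip(prior, likelihood_yes))
    p_no = 1 - p_yes
    post_yes = [p * l / p_yes for p, l in zip(prior, likelihood_yes)] if p_yes > 0 else prior
    post_no = [p * (1 - l) / p_no for p, l in zip(prior, likelihood_yes)] if p_no > 0 else prior
    return entropy(prior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))

hypotheses = ["cat", "dog", "eagle", "salmon"]
prior = [0.25] * 4
questions = {"Is it a mammal?": [1.0, 1.0, 0.0, 0.0],
             "Can it fly?":     [0.0, 0.0, 1.0, 0.0]}
best = max(questions, key=lambda q: eig(prior, questions[q]))
print(best, {q: round(eig(prior, l), 3) for q, l in questions.items()})
```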
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2333. AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Agent Evaluation LLM
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g. "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and to discover new metrics for analyzing agents. We also present two applications of AutoLibra in agent improvement: first, we show that AutoLibra can help human prompt engineers diagnose agent failures and improve prompts iteratively; second, we find that AutoLibra can induce metrics for automatic agent optimization, enabling agents to improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2334. Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in its Latent Thoughts
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Latent representation learning scaling test-time compute
Large Language Models (LLMs) excel at problem solving by generating chains of thought in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture, Huginn-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of its latent thinking processes. In this paper, we provide a systematic study of how Huginn-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that the LRM is highly effective in detecting incorrect latent thinking patterns, and that LTO can significantly improve the latent thinking processes. Furthermore, we show that the LRM can generalize across diverse domains, and LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
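A hedged sketch of the selection step: score sampled latent-thought candidates with a latent reward model and keep the highest-scoring one. The classifier, latent dimensions, and the pure best-of-N selection below are illustrative simplifications of the probabilistic LTO procedure.

```python
# Hedged sketch: use a latent classifier as a reward model over candidate latent thoughts.
import torch, torch.nn as nn

d, n_samples = 16, 8
lrm = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))  # latent reward model (toy)

def select_latent_thought(latent_samples):
    """Score each candidate latent thought with the LRM and keep the highest-scoring one."""
    with torch.no_grad():
        scores = lrm(latent_samples).squeeze(-1)        # predicted correctness score per candidate
    best = scores.argmax()
    return latent_samples[best], scores

candidates = torch.randn(n_samples, d)                   # N sampled latent thoughts
chosen, scores = select_latent_thought(candidates)
print(scores.round(decimals=2), "-> picked index", scores.argmax().item())
```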
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2335. EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Multi-turn editing benchmark image editing
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images, resulting in limited coverage and inheriting biases from prior generative models, or (ii) rely *solely* on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise.
To address this, we introduce **EdiVal-Agent**, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, **EdiVal-Agent** first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: 1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; 2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and 3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models.
Instantiating this pipeline, we build **EdiVal-Bench**, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning In-Context, flow-matching, and diffusion paradigms. We further conduct experiments comparing multi-turn editing with single-shot complex editing, highlighting the distinctive characteristics of different model paradigms. We demonstrate that EdiVal can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2336. LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mixture of Experts Mixture of LoRA Experts Dynamic routing Fully differentiable LoRA MoE
Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to the downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines, across a diverse set of benchmarks. Our method not only achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
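A hedged sketch of one way TopK routing can be replaced by a differentiable operation with a closed-form solution and naturally sparse, token-dependent output: sparsemax, the Euclidean projection onto the probability simplex. LD-MoLE defines its own routing function and analytical sparsity-control objective; sparsemax is shown here only to make the mechanism concrete.

```python
# Hedged sketch: differentiable sparse routing via sparsemax (not LD-MoLE's exact router).
import torch

def sparsemax(z):
    """Projection of z onto the probability simplex; produces exact zeros (inactive experts)."""
    z_sorted, _ = torch.sort(z, descending=True, dim=-1)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cssv = z_sorted.cumsum(dim=-1) - 1
    support = (z_sorted - cssv / k > 0).sum(dim=-1, keepdim=True)   # size of the active set
    tau = cssv.gather(-1, support - 1) / support.to(z.dtype)
    return torch.clamp(z - tau, min=0)

logits = torch.tensor([[2.0, 1.9, 0.1, -1.0],     # token 1: two experts stay active
                       [3.0, 0.0, -0.5, -1.0]])   # token 2: a single expert stays active
weights = sparsemax(logits)
print(weights, "active experts per token:", (weights > 0).sum(dim=-1).tolist())
# The number of non-zero weights can be monitored per token and layer; LD-MoLE adds an
# analytical sparsity-control objective to regularize how many experts are activated.
```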
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2337. Does Weak-to-strong Generalization Happen under Spurious Correlations?
๐ฌ 20
โญ 5.50
๐ learning theory
๐ท๏ธ Weak-to-Strong Generalization Spurious Correlation
We initiate a unified theoretical and algorithmic study of a key problem in weak-to-strong (W2S) generalization: when fine-tuning a strong pre-trained student with pseudolabels from a weaker teacher on a downstream task with spurious correlations, does W2S happen, and how to improve it upon failures? We consider two sources of spurious correlations caused by group imbalance: (i) a weak teacher fine-tuned on group-imbalanced labeled data with a minority group of fraction $\eta_\ell$, and (ii) a group-imbalanced unlabeled set pseudolabeled by the teacher with a minority group of fraction $\eta_u$. Theoretically, a precise characterization of W2S gain at the proportional asymptotic limit shows that W2S always happens with sufficient pseudolabels when $\eta_u = \eta_\ell$ but may fail when $\eta_u \ne \eta_\ell$, where W2S gain diminishes as $(\eta_u - \eta_\ell)^2$ increases. Our theory is corroborated by extensive experiments on various spurious correlation benchmarks and teacher-student pairs. To boost W2S performance upon failures, we further propose a simple, effective algorithmic remedy that retrains the strong student on its high-confidence data subset after W2S fine-tuning. Our algorithm is group-label-free and achieves consistent, substantial improvements over vanilla W2S fine-tuning.
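A hedged sketch of the proposed remedy on a toy problem: fine-tune the student on weak-teacher pseudolabels, then retrain it only on examples where its own predictions are highly confident. The linear models, noise level, and confidence threshold are illustrative assumptions.

```python
# Hedged sketch: group-label-free high-confidence retraining after W2S fine-tuning.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlab = rng.normal(size=(2000, 10))
true_w = rng.normal(size=10)
y_true = (X_unlab @ true_w > 0).astype(int)

# Weak teacher: noisy pseudolabels (stand-in for a teacher affected by group imbalance).
pseudo = np.where(rng.random(2000) < 0.8, y_true, 1 - y_true)

student = LogisticRegression().fit(X_unlab, pseudo)          # vanilla W2S fine-tuning
conf = student.predict_proba(X_unlab).max(axis=1)
keep = conf > 0.9                                             # high-confidence subset
student_retrained = LogisticRegression().fit(X_unlab[keep], student.predict(X_unlab)[keep])

for name, m in [("W2S", student), ("W2S + retrain", student_retrained)]:
    print(name, (m.predict(X_unlab) == y_true).mean())
```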
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2338. Can Transformers Really Do It All? On the Compatibility of Inductive Biases Across Tasks
๐ฌ 10
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Transformers language models inductive biases length generalization activation functions
Transformers are remarkably versatile and their design is largely consistent across a variety of applications. But are they optimal for any given task or dataset? The answer may be key for pushing AI beyond the mere scaling of current designs.
**Method.** We present a method to optimize a transformer architecture for a given dataset, which we use as a tool to study optimal task-specific inductive biases. The method replaces the most important non-linearities (GeLUs, softmax) with components optimized on held-out data. We then use each resulting new architecture with other datasets as a way to evaluate the compatibility between pairs of tasks.
**Findings.** On a range of popular algorithmic tasks, our method identifies new architectures with dramatic improvements in learning speed, generalization, and stability across seeds. These designs prove very task-specific, which means that the tasks require inductive biases very different from those of standard transformers. On a range of code and language modeling datasets, we also find architectures with consistent, yet smaller improvements. These designs now transfer much better across datasets, domains (English vs. computer code), and tokenizations.
**Implications.** These results show that standard transformers are rarely a local optimum in the space of architectures. We show that alternative designs can perform better, but they often sacrifice universality. This calls for future work on architectures that could serve multiple objectives such as fluency and robust reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2339. Falcon: Fast Proximal Linearization of Normalized Cuts for Unsupervised Image Segmentation
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Unsupervised Segmentation Graph Cut Normalized Cut Proximal Gradient Method Kurdyka–Łojasiewicz (KL) Convergence
Current zero-shot unsupervised segmentation methods based on normalized cuts (NCut) face three key limitations. First, they rely on recursive bipartitions with repeated eigen-decompositions, making them prohibitively expensive at scale. Second, each split requires spectral relaxation followed by rounding, introducing layers of approximation where the final partition may diverge from the true NCut objective. Third, existing heuristics lack convergence guarantees, and recursive bipartitioning offers no principled assurance of producing a stable $K$-way segmentation. We propose \textbf{Falcon}, a proximal-gradient solver that directly optimizes the discrete $K$-way NCut objective without spectral relaxation. We prove linear convergence in the number of tokens. Falcon computes closed-form gradient scores weighted by cluster volumes and performs row-wise one-hot proximal updates stabilized by inertia. A monotone backtracking scheme adaptively tunes the proximal parameter, ensuring non-decreasing NCut values. This design preserves discrete feasibility, removes repeated eigen-decomposition, and guarantees convergence under the Kurdyka–Łojasiewicz framework. Across six benchmarks, Falcon outperforms the strongest official baseline (DiffCut) by wide margins, e.g., +13.2 mIoU on VOC, +27.7 on COCO-Object, and +3.1 on Cityscapes, while remaining competitive on Pascal Context. It also runs up to an order of magnitude faster than recursive NCut. By pairing pretrained foundation models with a principled NCut solver, Falcon sets a new state of the art across six benchmarks and achieves the best performance on 17 of 18 benchmark-encoder pairs, underscoring both its robustness and its generality in bridging the gap between unsupervised and supervised segmentation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2340. Medical thinking with multiple images
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Multimodal diagnostic reasoning Vision language models (VLMs) Medical VQA Thinking with images
Large language models and vision-language models score high on many medical QA benchmarks; however, real-world clinical reasoning remains challenging because cases often involve multiple images and require cross-view fusion. We present MedThinkVQA, a benchmark that asks models to think with multiple images: read each image, merge evidence across views, and pick a diagnosis with stepwise supervision. We make three parts explicit: multi-image questions, expert-annotated stepwise supervision, and beyond-accuracy evaluation. Only MedThinkVQA combines all these parts in one expert-annotated benchmark. The dataset has 8,481 cases in total, with 751 test cases, and on average 6.51 images per case; it is expert-annotated and, at this level, larger and more image-dense than prior work (earlier maxima < 1.43 images per case). On the test set, GPT-5 achieves 57.39% accuracy, approximately 15 percentage points below the strongest result on the most challenging prior benchmark of a similar kind, while other strong models are lower (Qwen2.5-VL-32B: 39.54%, MedGemma-27B: 37.55%, InternVL3.5-38B: 43.14%). Giving expert findings and summaries brings clear gains, but using models' self-generated ones brings small or negative gains. Step-level evaluation shows where models stumble: errors center on image reading and cross-view integration in both decisive and non-decisive steps (>70%); when a step is decisive for the final choice, reasoning slips become more common (32.26%), while scenario and pure-knowledge slips are relatively rare (<10%). These patterns isolate and quantify the core obstacle: extracting and integrating cross-image evidence, rather than language-only inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2341. Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Pretraining
In the past year, distillation has seen a renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on new paradigms key to modern LLMs, such as test-time scaling and in-context learning, remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models that exhibit remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning capabilities, particularly the one modeled via induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox of a bigram model, which helps us isolate the common principal factor behind our observations. Finally, using these insights, we shed light on various design choices for pretraining that should help practitioners going forward.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2342. Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring
๐ฌ 13
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement learning
When the environment contains an unlearnable source of randomness (a noisy-TV), a naive intrinsic-reward-driven exploring agent gets stuck at that source of randomness and fails at exploration.
Intrinsic reward based on uncertainty estimation or distribution similarity, while it eventually escapes noisy-TVs as time unfolds, suffers from poor sample efficiency and high computational cost.
Inspired by recent findings from neuroscience that humans monitor their improvements during exploration, we propose a novel method for intrinsically-motivated exploration, named Learning Progress Monitoring (LPM).
During exploration, LPM rewards model improvements instead of prediction error or novelty, effectively rewarding the agent for observing learnable transitions rather than unlearnable ones.
We introduce a dual-network design that uses an error model to predict the expected prediction error of the dynamics model in its previous iteration, and use the difference between the model errors of the current iteration and previous iteration to guide exploration.
We theoretically show that the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and that the error model is necessary to achieve monotonicity correspondence with IG.
We empirically compared LPM against state-of-the-art baselines in noisy environments based on MNIST, 3D maze with 160x120 RGB inputs, and Atari.
Results show that LPM's intrinsic reward converges faster, explores more states in the maze experiment, and achieves higher extrinsic reward in Atari.
This conceptually simple approach marks a paradigm shift in noise-robust exploration.
Code to reproduce our experiments will be provided.
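A hedged sketch of the dual-network reward: an error model predicts the previous iteration's dynamics error, and the intrinsic reward is the gap between that prediction and the current model's error, so only learnable transitions are rewarded. Architectures, the exact reward form, and the update schedule are illustrative assumptions.

```python
# Hedged sketch of a learning-progress intrinsic reward with a dynamics model and an error model.
import torch, torch.nn as nn

s_dim, a_dim = 8, 2
dynamics = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, s_dim))
error_model = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def intrinsic_reward(s, a, s_next):
    x = torch.cat([s, a], dim=-1)
    with torch.no_grad():
        prev_err = error_model(x).squeeze(-1)                    # predicted error of the *previous* model
        cur_err = (dynamics(x) - s_next).pow(2).mean(dim=-1)     # error of the *current* model
    return prev_err - cur_err   # positive on learnable transitions, near zero on unlearnable noise

# After each dynamics update, the error model would be refit to the current model's errors
# so that it lags one iteration behind (training loops omitted in this sketch).
s, a, s_next = torch.randn(4, s_dim), torch.randn(4, a_dim), torch.randn(4, s_dim)
print(intrinsic_reward(s, a, s_next))
```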
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2343. (U)NFV: Supervised and Unsupervised Neural Finite Volume Methods for Solving Hyperbolic PDEs
๐ฌ 14
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Neural PDE solvers Hyperbolic conservation laws Finite volume methods Physics-informed learning Traffic modeling
We introduce (U)NFV, a modular neural network architecture that generalizes classical finite volume (FV) methods for solving hyperbolic conservation laws. Hyperbolic partial differential equations (PDEs) are challenging to solve, particularly conservation laws whose physically relevant solutions contain shocks and discontinuities. FV methods are widely used for their mathematical properties: convergence to entropy solutions, flow conservation, or total variation diminishing, but often lack accuracy and flexibility in complex settings. Neural Finite Volume addresses these limitations by learning update rules over extended spatial and temporal stencils while preserving conservation structure. It supports both supervised training on solution data (NFV) and unsupervised training via weak-form residual loss (UNFV). Applied to first-order conservation laws, (U)NFV achieves up to 10x lower error than Godunov's method, outperforms ENO/WENO, and rivals discontinuous Galerkin solvers with lower implementation burden. On traffic modeling problems, both from PDEs and from experimental highway data, (U)NFV captures nonlinear wave dynamics with significantly higher fidelity and scalability than traditional FV approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2344. FAME: $\underline{F}$ormal $\underline{A}$bstract $\underline{M}$inimal $\underline{E}$xplanation for neural networks
๐ฌ 20
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ abductive explanations abstract interpretation robustness NN verification
We propose $\textbf{FAME}$ (Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first method to scale to large neural networks while reducing explanation size. Our main contribution is the design of dedicated perturbation domains that eliminate the need for traversal order. FAME progressively shrinks these domains and leverages LiRPA-based bounds to discard irrelevant features, ultimately converging to a $\textbf{formal abstract minimal explanation}$. To assess explanation quality, we introduce a procedure that measures the worst-case distance between an abstract minimal explanation and a true minimal explanation. This procedure combines adversarial attacks with an optional $VERI{\large X}+$ refinement step. We benchmark FAME against $VERI{\large X}+$ and demonstrate consistent gains in both explanation size and runtime on medium- to large-scale neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2345. SWERank: Software Issue Localization with Code Ranking
๐ฌ 11
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Software Issue Localization Automated Code Repair Retrieve-and-Rerank
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and relying on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SWERank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SWELoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SWERank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SWELoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2346. Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Referring Video Object Segmentation Flow Matching
Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and continuously segment them through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic `locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g., points), and struggles to maintain temporal consistency as the segmentation process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventionally generating a mask from noise or directly predicting the mask, we reformulate the task by learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks. Specifically, it achieves a J&F of 51.1 on MeViS (+1.6 over the prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2347. Revenue Maximization Under Sequential Price Competition Via The Estimation Of $s$-Concave Demand Functions
๐ฌ 18
โญ 5.50
๐ learning theory
๐ท๏ธ Dynamic pricing Nash Equilibrium Nonlinear Demand Learning Regret Analysis Sequential Competition Shape Constraints
We consider price competition among multiple sellers over a selling horizon of $T$ periods. In each period, sellers simultaneously offer their prices (which are made public) and subsequently observe their respective demand (not made public). The demand function of each seller depends on all sellers' prices through a private, unknown, and nonlinear relationship. We propose a dynamic pricing policy that uses semi-parametric least-squares estimation and show that when the sellers employ our policy, their prices converge at a rate of $O(T^{-1/7})$ to the Nash equilibrium prices that sellers would reach if they were fully informed. Each seller incurs a regret of $O(T^{5/7})$ relative to a dynamic benchmark policy. A theoretical contribution of our work is proving the existence of equilibrium under shape-constrained demand functions via the concept of $s$-concavity and establishing regret bounds of our proposed policy. Technically, we also establish new concentration results for the least squares estimator under shape constraints. Our findings offer significant insights into dynamic competition-aware pricing and contribute to the broader study of non-parametric learning in strategic decision-making.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2348. From Predictors to Samplers via the Training Trajectory
๐ฌ 13
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ sampling energy based models discrete sampling synergistic interactions markov chain monte carlo
Sampling from trained predictors is fundamental for interpretability and as a compute-light alternative to diffusion models, but local samplers struggle on the rugged, high-frequency functions such models learn. We observe that standard neural-network training implicitly produces a coarse-to-fine sequence of models. Early checkpoints suppress high-degree/high-frequency components (Boolean monomials; spherical harmonics under NTK), while later checkpoints restore detail. We exploit this by running a simple annealed sampler across the training trajectory, using early checkpoints for high-mobility proposals and later ones for refinement. In the Boolean domain, this can turn the exponential bottleneck arising from rugged landscapes or needle gadgets into a near-linear one. In the continuous domain, under the NTK regime, this corresponds to smoothing under the NTK kernel. Requiring no additional compute, our method shows strong empirical gains across a variety of synthetic and real-world tasks, including constrained sampling tasks that diffusion models are unable to natively handle.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2349. Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval
๐ฌ 11
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Audio-Text Retrieval Cross-Modal Matching
Cross-modal matching tasks have achieved significant progress, yet remain limited by mini-batch subsampling and scarce labelled data. Existing objectives, such as contrastive losses, focus solely on instance-level alignment and implicitly assume that all feature dimensions contribute equally. Under small batches, this assumption amplifies noise, making alignment signals unstable and biased. We propose DART (Dual-level Alignment via Robust Transport), a framework that augments instance-level alignment with feature-level regularization based on the Unbalanced Wasserstein Distance (UWD). DART constructs reliability-weighted marginals that adaptively reweight channels according to their cross-modal consistency and variance statistics, highlighting stable and informative dimensions while down-weighting noisy or modality-specific ones. From a theoretical perspective, we establish concentration bounds showing that instance-level objectives scale with the maximum distance across presumed aligned pairs, while feature-level objectives are governed by the Frobenius norm of the transport plan. By suppressing unmatched mass and sparsifying the transport plan, DART reduces the effective transport diameter and tightens the bound, yielding greater robustness under small batches. Empirically, DART achieves state-of-the-art retrieval performance on three audio-text benchmarks, with particularly strong gains under scarce labels and small batch sizes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2350. Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment
๐ฌ 21
โญ 5.50
๐ generative models
๐ท๏ธ Large language Model Alignment Reinforcement learning with human preference Wasserstein distance Sinkhorn distance
Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this method, LLM policies are generally optimized through reward maximization with Kullback-Leibler (KL) divergence regularization of the reference policy. However, KL and its $f$-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, which yield a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and $f$-divergence-based baselines, demonstrating the benefits of semantic-aware policy distances for alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2351. Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
๐ฌ 9
โญ 5.50
๐ generative models
๐ท๏ธ reinforcement learning discrete diffusion diffusion language models
Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce **Group Diffusion Policy Optimization (GDPO)**, a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective *Semi-deterministic Monte Carlo* schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2352. Bridging the performance-gap between target-free and target-based reinforcement learning
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ deep reinforcement learning Q-learning function approximation
The use of target networks in deep reinforcement learning is a widely popular solution to mitigate the brittleness of semi-gradient approaches and stabilize learning. However, target networks notoriously require additional memory and delay the propagation of Bellman updates compared to an ideal target-free approach. In this work, we step out of the binary choice between target-free and target-based algorithms. We introduce a new method that uses a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network. This simple modification enables us to keep the target-free's low-memory footprint while leveraging the target-based literature. We find that combining our approach with the concept of iterated $Q$-learning, which consists of learning consecutive Bellman updates in parallel, helps improve the sample-efficiency of target-free approaches. Our proposed method, iterated Shared $Q$-Learning (iS-QL), bridges the performance gap between target-free and target-based approaches across various problems while using a single $Q$-network, thus stepping towards resource-efficient reinforcement learning algorithms.
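A hedged sketch of the shared-torso idea: only the last linear layer is copied as a target, while the torso is shared with the online network, so target values always use up-to-date features. The iterated variant that learns consecutive Bellman updates in parallel is omitted; hyperparameters are illustrative.

```python
# Hedged sketch: Q-network whose target is a frozen copy of the last linear layer only.
import copy, torch, torch.nn as nn

class SharedTargetQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_actions)                 # online last layer
        self.target_head = copy.deepcopy(self.head)              # the only extra parameters
        for p in self.target_head.parameters():
            p.requires_grad_(False)

    def q_online(self, obs):
        return self.head(self.torso(obs))

    def q_target(self, obs):
        # Target values reuse the up-to-date torso but the frozen copy of the last layer.
        with torch.no_grad():
            return self.target_head(self.torso(obs))

    def sync_target(self):
        self.target_head.load_state_dict(self.head.state_dict())

net, gamma = SharedTargetQNet(4, 2), 0.99
obs, act = torch.randn(8, 4), torch.randint(0, 2, (8,))
rew, next_obs = torch.randn(8), torch.randn(8, 4)
td_target = rew + gamma * net.q_target(next_obs).max(dim=1).values
loss = (net.q_online(obs).gather(1, act.unsqueeze(1)).squeeze(1) - td_target).pow(2).mean()
print(loss.item())
```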
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2353. Task-free Adaptive Meta Black-box Optimization
๐ฌ 22
โญ 5.50
๐ optimization
๐ท๏ธ Meta Black-box Optimization Evolutionary Algorithms
Handcrafted optimizers become prohibitively inefficient for complex black-box optimization (BBO) tasks. MetaBBO addresses this challenge by meta-learning to automatically configure optimizers for low-level BBO tasks, thereby eliminating heuristic dependencies. However, existing methods typically require extensive handcrafted training tasks to learn meta-strategies that generalize to target tasks, which poses a critical limitation for realistic applications with unknown task distributions. To overcome the issue, we propose the Adaptive meta Black-box Optimization Model (ABOM), which performs online parameter adaptation using solely optimization data from the target task, obviating the need for predefined task distributions. Unlike conventional metaBBO frameworks that decouple meta-training and optimization phases, ABOM introduces a closed-loop adaptive parameter learning mechanism, where parameterized evolutionary operators continuously self-update by leveraging generated populations during optimization. This paradigm shift enables zero-shot optimization: ABOM achieves competitive performance on synthetic BBO benchmarks and realistic unmanned aerial vehicle path planning problems without any handcrafted training tasks. Visualization studies reveal that parameterized evolutionary operators exhibit statistically significant search patterns, including natural selection and genetic recombination.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2354. Precise and Interpretable Editing of Code Knowledge in Large Language Models
๐ฌ 16
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Programming Languages Code-to-code Translation Knowledge Editing code LLMs Software Engineering
Large Language Models (LLMs) have demonstrated outstanding capabilities in various code-related tasks, including code completion, translation, or summarization. However, these pretrained models are static, posing a challenge to incorporating new knowledge into an LLM to correct erroneous behavior. Approaches such as retraining or fine-tuning demand extensive labeled datasets and might be computationally expensive, while prompt engineering fails to change models permanently. Knowledge Editing (KE) techniques offer a more efficient alternative, enabling model updates with minimal data, even just a single example. Nevertheless, existing KE methods often manipulate parameters within the Transformer's multi-layer perceptrons (MLPs), where neuronal polysemanticity hinders both the precision and interpretability of the edits. To address these limitations, we exploit TransCoder, an MLP-like model component with a wide and sparsely activated hidden feature vector. Specifically, we introduce **TransCoder-based Precise Editing** (**TCPE**), a novel method that leverages the sparsity and monosemanticity of the TransCoder's neurons for highly localized knowledge editing. TCPE exhibits neuron-level mechanistic interpretability characteristics, revealing the correspondence between the edited neurons and the specific code-related knowledge. Furthermore, we present KECode, a new evaluation benchmark for code-to-code translation based on functional equivalence. Using KECode, we conduct a systematic evaluation of representative KE methods in the context of code-to-code translation. Our experimental results demonstrate that TCPE outperforms existing KE methods, achieving a substantial improvement of translation accuracy of CodeLlama-7b-Instruct from 57.5% to 64.0% in a low-resource scenario of Java-to-D translation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2355. Sharp asymptotic theory for Q-learning with \texttt{LD2Z} learning rate and its generalization
๐ฌ 21
โญ 5.50
๐ learning theory
๐ท๏ธ Q-learning Stochastic approximation central limit theory strong invariance principle
Despite the sustained popularity of Q-learning as a practical tool for policy determination, a majority of the relevant theoretical literature deals with either constant ($\eta_t\equiv \eta$) or polynomially decaying ($\eta_t = \eta t^{-\alpha}$) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (\texttt{LD2Z}: $\eta_t=\eta(1-t/n)$) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero schedules (\texttt{PD2Z}-$\nu$: $\eta_t=\eta(1-t/n)^{\nu}$). Proceeding step-by-step, we present a sharp non-asymptotic error bound for Q-learning with the \texttt{PD2Z}-$\nu$ schedule, which is then used to derive a central limit theory for a new \textit{tail} Polyak-Ruppert averaging estimator. Finally, we also provide a novel time-uniform Gaussian approximation (also known as a \textit{strong invariance principle}) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that \texttt{LD2Z}, and more generally \texttt{PD2Z}-$\nu$, achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step-sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of \texttt{LD2Z} while providing practical guidelines for inference through our results.
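A minimal sketch of the three learning-rate schedules under discussion, with $\nu = 1$ recovering LD2Z; the step-size constants and horizon are illustrative.

```python
# Hedged sketch of the schedules: constant, polynomial decay, and power-law decay to zero.
def constant(t, n, eta=0.5):               return eta
def polynomial(t, n, eta=0.5, alpha=0.7):  return eta * (t + 1) ** (-alpha)
def pd2z(t, n, eta=0.5, nu=1.0):           return eta * (1 - t / n) ** nu   # nu = 1 is LD2Z

n = 10
for t in (0, n // 2, n - 1):
    print(t, constant(t, n), round(polynomial(t, n), 3), round(pd2z(t, n), 3))
```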
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2356. Jailbreaking on Text-to-Video Models via Scene Splitting Strategy
๐ฌ 21
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreak Safety Text-to-Video
Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories from T2VSafetyBench on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, 78.2% on Veo2, 78.6% on Kling v1.0, and 68.6% on Sora2, significantly outperforming the existing baselines. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2357. A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models
๐ฌ 16
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Discrete Diffusion Safety NLP
Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals.
We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if an unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80\% to near-zero (1.3\% on LLaDA-8B-Instruct, 0.0\% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3× faster safe termination.
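A minimal sketch of the thresholded-[EOS] early-rejection idea (an illustration under stated assumptions, not A2D's implementation: the per-position rule, the threshold value, and the logits interface are all placeholders):

```python
import torch

def unsafe_early_stop(step_logits: torch.Tensor, eos_id: int,
                      threshold: float = 0.5) -> bool:
    """Return True if generation should terminate as unsafe: some position in
    the current (partially denoised) sequence assigns the [EOS] refusal token a
    probability above `threshold`.

    step_logits: [seq_len, vocab_size] logits from one decoding/denoising step.
    """
    probs = torch.softmax(step_logits, dim=-1)          # [seq_len, vocab_size]
    return bool((probs[:, eos_id] > threshold).any())
```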
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2358. Imagine How To Change: Explicit Procedure Modeling for Change Captioning
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ dynamic procedure understanding confidence-guided sampling change captioning
Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thereby ignoring the rich temporal dynamics of the change procedure, which are key to understanding not only what has changed but also how it occurred. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling.
ProCap features a two-stage design: The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes.
These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy.
Then the encoder learns to capture the latent dynamics of these keyframes via a caption-conditioned, masked reconstruction task.
The second stage integrates this trained encoder within an encoder-decoder model for captioning.
Instead of relying on explicit frames from the previous stage---a process incurring computational overhead and sensitivity to visual noise---we introduce learnable procedure queries to prompt the encoder for inferring the latent procedure representation, which the decoder then translates into text. The entire model is then trained end-to-end with a captioning loss, ensuring the encoder's output is both temporally coherent and captioning-aligned. Experiments on three datasets demonstrate the effectiveness of ProCap.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2359. STAR: Strategy-driven Automatic Jailbreak Red-teaming For Large Language Model
๐ฌ 21
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Model Jailbreak Attack Red-teaming
Jailbreaking refers to techniques that bypass the safety alignment of large language models (LLMs) to elicit harmful outputs, and automated red-teaming has become a key approach for detecting such vulnerabilities before deployment. However, most existing red-teaming methods operate directly in text space, where they tend to generate semantically similar prompts and thus fail to probe the broader spectrum of latent vulnerabilities within a model. To address this limitation, we shift the exploration of jailbreak strategies from conventional text space to the modelโs latent activation space and propose STAR (**ST**rategy-driven **A**utomatic Jailbreak **R**ed-teaming), a black-box framework for systematically generating jailbreak prompts. STAR is composed of two modules: (i) strategy generation module, which extracts the principal components of existing strategies and recombines them to generate novel ones; and (ii) prompt generation module, which translates abstract strategies into concrete jailbreak prompts with high success rates. Experimental results show that STAR substantially outperforms state-of-the-art baselines in terms of both attack success rate and strategy diversity. These findings highlight critical vulnerabilities in current alignment techniques and establish STAR as a more powerful paradigm for comprehensive LLM security evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2360. VLMgineer: Vision-Language Models as Robotic Toolsmiths
๐ฌ 13
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ robotic manipulation robotic tool use vision language models
Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, they are often regarded as a measurable indicator of cognitive intelligence across biological species. While much of today's research on robotics intelligence focuses on generating better control strategies, inventing smarter tools offers a complementary form of physical intelligence: moving the problem-solving onus into the tool's geometry so that control becomes simpler. This motivates us to ask: can today's foundation models offer useful priors to automatically invent, and effectively wield, such tools? We present VLMgineer, the first fully automatic framework that designs tools and actions from scratch by harnessing the creativity of Vision-Language Models (VLMs) together with evolutionary search. We evaluate VLMgineer on a diverse benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also consistently outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. We further demonstrate that VLMgineer's automatically designed tools and action policies transfer seamlessly to real-world task execution on a physical robot. To facilitate future research on automated tool invention, we will release our benchmark and code. Project Website: https://vlmgineer.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2361. Rapid Training of Hamiltonian Graph Networks Using Random Features
๐ฌ 10
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Graph neural networks physics-informed machine learning random feature methods gradient-descent-free training Hamiltonian neural network
Learning dynamical systems that respect physical symmetries and constraints remains a fundamental challenge in data-driven modeling. Integrating physical laws with graph neural networks facilitates principled modeling of complex N-body dynamics and yields accurate and permutation-invariant models. However, training graph neural networks with iterative, gradient-descent-based optimization algorithms (e.g., Adam, RMSProp, LBFGS) often leads to slow training, especially for large, complex systems. Comparing against 15 different optimizers, we demonstrate that Hamiltonian Graph Networks (HGN) can be trained 150-600× faster, with comparable accuracy, by replacing iterative optimization with random feature-based parameter construction. We show robust performance in diverse simulations, including N-body mass-spring and molecular dynamics systems in up to $3$ dimensions and 10,000 particles with different geometries, while retaining essential physical invariances with respect to permutation, rotation, and translation. Our proposed approach is benchmarked using a NeurIPS 2022 Datasets and Benchmarks Track publication to further demonstrate its versatility. We reveal that even when trained on minimal 8-node systems, the model can generalize in a zero-shot manner to systems as large as 4096 nodes without retraining. Our work challenges the dominance of iterative gradient-descent-based optimization algorithms for training neural network models for physical systems.
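The gradient-descent-free construction can be illustrated with a generic random-feature regressor: hidden weights are drawn at random and frozen, and only a linear readout is obtained in closed form. This sketch deliberately omits the graph structure, Hamiltonian formulation, and physical invariances of the actual HGN; all names and hyperparameters are illustrative.

```python
import numpy as np

def fit_random_feature_regressor(X, Y, n_features=512, scale=1.0, reg=1e-6, seed=0):
    """Random-feature ridge regression: no gradient-descent iterations.
    X: [N, d_in] inputs, Y: [N, d_out] targets."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=scale, size=(X.shape[1], n_features))   # frozen random weights
    b = rng.uniform(-np.pi, np.pi, size=n_features)
    H = np.tanh(X @ W + b)                                       # random features
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_features), H.T @ Y)  # closed form
    return W, b, beta

def predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```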
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2362. Towards Lossless Memory-efficient Training of Spiking Neural Networks via Gradient Checkpointing and Spike Compression
๐ฌ 13
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Network Training Memory Optimization Gradient Checkpointing
Deep spiking neural networks (SNNs) hold immense promise for low-power event-driven computing, but their direct training via backpropagation through time (BPTT) incurs prohibitive memory cost, which limits their scalability. Existing memory-saving approaches, such as online learning, BPTT-to-BP, and reversible networks, compromise accuracy, training speed, or applicability. In this work, we propose a novel and broadly applicable pipeline for memory-efficient SNN training that preserves BPTT's accuracy. Our pipeline integrates layer-wise gradient checkpointing with lossless spike compression to eliminate internal state storage and reduce the memory cost of per-layer input spikes. We also introduce a multi-stage checkpoint adjustment strategy that adaptively refines checkpoint placement based on profiling results to further optimize memory usage and improve training speed. Wrapped in an optimization pass, the pipeline automatically restructures the computation flow before training with minimal user effort. Extensive experiments on diverse architectures and tasks demonstrate up to $8\times$ memory efficiency gains with $\le 20\%$ speed reduction and no accuracy loss. Our method provides a practical solution for efficient and scalable SNN training. Code will be available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2363. Octax: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Benchmarking CHIP-8 JAX Environments Simulation Acceleration
Reinforcement learning (RL) research requires diverse, challenging environments that are both tractable and scalable.
While modern video games may offer rich dynamics, they are computationally expensive and poorly suited for large-scale experimentation due to their CPU-bound execution.
We introduce Octax, a high-performance suite of classic arcade game environments implemented in JAX, based on CHIP-8 emulation, a predecessor to Atari, which is widely adopted as a benchmark in RL research.
Octax provides the JAX community with a long-awaited end-to-end GPU alternative to Atari games, offering image-based environments, spanning puzzle, action, and strategy genres, all executable at massive scale on modern GPUs.
Our JAX-based implementation achieves orders-of-magnitude speedups over traditional CPU emulators.
We demonstrate Octax's capabilities by training RL agents across multiple games, showing significant improvements in training speed and scalability compared to existing solutions. The environment's modular design enables researchers to easily extend the suite with new games or generate novel environments using large language models, making it an ideal platform for large-scale RL experimentation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2364. DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick
๐ฌ 19
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Vector Quantization Differentiability Backpropagation Differentiable Vector Quantization Gradient Collapse Codebook Learning
Vector quantization is common in deep models, yet its hard assignments block gradients and hinder end-to-end training. We propose DiVeQ, which treats quantization as adding an error vector that mimics the quantization distortion, keeping the forward pass hard while letting gradients flow. We also present a space-filling variant (SF-DiVeQ) that assigns to a curve constructed by the lines connecting codewords, resulting in less quantization error and full codebook usage. Both methods train end-to-end without requiring auxiliary losses or temperature schedules. On VQ-VAE compression and VQGAN generation across various data sets, they improve reconstruction and sample quality over alternative quantization approaches.
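The "hard forward pass, additive error vector for gradients" mechanism can be sketched with a straight-through-style quantizer; DiVeQ's exact error model (and the space-filling SF-DiVeQ variant) may differ, so treat this as an illustration of the general idea.

```python
import torch

def error_vector_quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Forward pass outputs the hard nearest codeword; the quantization error is
    added as a detached vector, so gradients flow to the encoder output z.
    z: [batch, dim] encoder outputs, codebook: [K, dim] codewords."""
    dists = torch.cdist(z, codebook)            # [batch, K] pairwise distances
    nearest = codebook[dists.argmin(dim=1)]     # hard assignment, [batch, dim]
    err = (nearest - z).detach()                # quantization distortion, no gradient
    return z + err                              # equals `nearest` in the forward pass
```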
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2365. VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation
๐ฌ 15
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Bimanual Manipulation Single Demonstration Learning Vision-Language Grounding Skill Generalization
Achieving generalizable bimanual manipulation requires systems that can learn efficiently from minimal human input while adapting to real-world uncertainties and diverse embodiments. Existing approaches face a dilemma: imitation policy learning demands extensive demonstrations to cover task variations, while modular methods often lack flexibility in dynamic scenes. We introduce VLBiMan, a framework that derives reusable skills from a single human example through task-aware decomposition, preserving invariant primitives as anchors while dynamically adapting adjustable components via vision-language grounding. This adaptation mechanism resolves scene ambiguities caused by background changes, object repositioning, or visual clutter without policy retraining, leveraging semantic parsing and geometric feasibility constraints. Moreover, the system inherits human-like hybrid control capabilities, enabling mixed synchronous and asynchronous use of both arms. Extensive experiments validate VLBiMan across tool-use and multi-object tasks, demonstrating: (1) a drastic reduction in demonstration requirements compared to imitation baselines, (2) compositional generalization through atomic skill splicing for long-horizon tasks, (3) robustness to novel but semantically similar objects and external disturbances, and (4) strong cross-embodiment transfer, showing that skills learned from human demonstrations can be instantiated on different robotic platforms without retraining. By bridging human priors with vision-language anchored adaptation, our work takes a step toward practical and versatile dual-arm manipulation in unstructured settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2366. MuonBP: Faster Muon via Block-Periodic Orthogonalization
๐ฌ 9
โญ 5.50
๐ optimization
๐ท๏ธ muon orthogonalization
Gradient orthogonalization is a simple strategy that shows great utility in speeding up gradient descent. The Muon optimizer (Keller et al., 2024b) combines gradient orthogonalization with first-order momentum and achieves significant improvement in data efficiency over Adam/AdamW for language model training. However, when using model parallelism, gradient orthogonalization introduces additional overhead compared to coordinate-wise optimizers (such as AdamW) due to additional gather and scatter operations on gradient matrix shards from different devices. This additional communication can amount to a throughput hit of 5\%-10\% compared to Adam/AdamW. To remedy this, we propose Muon with Block-Periodic Orthogonalization (MuonBP), which applies orthogonalization independently to matrix shards on each device and periodically performs full orthogonalization to maintain training stability at scale. We show how to adjust the learning rate from the baseline to MuonBP and give convergence guarantees for this algorithm. Crucially, our theory dictates that we use two stepsizes: one for the blockwise orthogonalization steps, and one for the full orthogonalization steps. Our method is simple, requires minimal hyperparameter adjustments, and achieves competitive iteration complexity compared with the baselines Muon and Dion while providing per-iteration throughput comparable to coordinate-wise methods such as AdamW. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer state sharding, MuonBP achieves 8\% throughput increase compared to Muon with no degradation in performance.
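A simplified sketch of block-periodic orthogonalization follows. Assumptions for illustration: a cubic Newton-Schulz iteration (Muon uses a tuned polynomial variant), column-wise shards standing in for per-device tensor-parallel shards, and a fixed full-orthogonalization period; MuonBP additionally prescribes separate step sizes for the blockwise and full steps, which are omitted here.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix via Newton-Schulz iterations."""
    X = G / (G.norm() + 1e-7)                  # scale so singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def block_periodic_orthogonalize(G: torch.Tensor, step: int,
                                 n_blocks: int = 8, full_period: int = 10) -> torch.Tensor:
    """Orthogonalize each column shard independently on most steps (no cross-device
    gathers needed); every `full_period` steps, orthogonalize the full matrix."""
    if step % full_period == 0:
        return newton_schulz_orthogonalize(G)
    shards = torch.chunk(G, n_blocks, dim=1)
    return torch.cat([newton_schulz_orthogonalize(s) for s in shards], dim=1)
```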
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2367. From Samples to Scenarios: A New Paradigm for Probabilistic Forecasting
๐ฌ 16
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Probabilistic Time Series Forecasting Probabilistic Scenarios Time Series Analysis Sampling-Free
Most state-of-the-art probabilistic time series forecasting models rely on sampling to represent future uncertainty. However, this paradigm suffers from inherent limitations, such as lacking explicit probabilities, inadequate coverage, and high computational costs. In this work, we introduce **Probabilistic Scenarios**, an alternative paradigm designed to address the limitations of sampling. It operates by directly producing a finite set of {Scenario, Probability} pairs, thus avoiding Monte Carlo-like approximation. To validate this paradigm, we propose **TimePrism**, a simple model composed of only three parallel linear layers. Surprisingly, TimePrism achieves 9 out of 10 state-of-the-art results across five benchmark datasets on two metrics. The effectiveness of our paradigm comes from a fundamental reframing of the learning objective. Instead of modeling an entire continuous probability space, the model learns to represent a set of plausible scenarios and corresponding probabilities. Our work demonstrates the potential of the Probabilistic Scenarios paradigm, opening a promising research direction in forecasting beyond sampling.
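The {Scenario, Probability} output format can be sketched with simple parallel linear heads; TimePrism's exact three-layer layout is not reproduced here, so the module, its head structure, and its hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScenarioForecaster(nn.Module):
    """Map a lookback window of length L to K scenario trajectories of horizon H
    plus an explicit probability for each scenario, i.e., a finite set of
    {Scenario, Probability} pairs with no Monte Carlo sampling involved."""
    def __init__(self, lookback: int, horizon: int, n_scenarios: int = 8):
        super().__init__()
        self.n_scenarios, self.horizon = n_scenarios, horizon
        self.scenario_head = nn.Linear(lookback, n_scenarios * horizon)
        self.prob_head = nn.Linear(lookback, n_scenarios)

    def forward(self, x: torch.Tensor):
        # x: [batch, lookback] -> scenarios: [batch, K, horizon], probs: [batch, K]
        scenarios = self.scenario_head(x).view(-1, self.n_scenarios, self.horizon)
        probs = torch.softmax(self.prob_head(x), dim=-1)
        return scenarios, probs
```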
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2368. Full-Graph vs. Mini-Batch Training: Comprehensive Analysis from a Batch Size and Fan-Out Size Perspective
๐ฌ 22
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Graph Neural Network
Full-graph and mini-batch Graph Neural Network (GNN) training approaches have distinct system design demands, making it crucial to choose the appropriate approach to develop. A core challenge in comparing these two GNN training approaches lies in characterizing their model performance (i.e., convergence and generalization) and computational efficiency. While batch size has been an effective lens for analyzing such behaviors in deep neural networks (DNNs), GNNs extend this lens by introducing a fan-out size, as full-graph training can be viewed as mini-batch training with the largest possible batch size and fan-out size. However, the impact of the batch and fan-out size for GNNs remains insufficiently explored. To this end, this paper systematically compares full-graph vs. mini-batch training of GNNs through empirical and theoretical analyses from the view of the batch size and fan-out size. Our key contributions include: 1) We provide a novel generalization analysis using the Wasserstein distance to study the impact of the graph structure, especially the fan-out size. 2) We uncover the non-isotropic effects of the batch size and the fan-out size in GNN convergence and generalization, providing practical guidance for tuning these hyperparameters under resource constraints. Finally, full-graph training does not always yield better model performance or computational efficiency than well-tuned smaller mini-batch settings. The implementation can be found in the anonymous link: https://anonymous.4open.science/r/GNN_fullgraph_minibatch_training-8040/README.md.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2369. Quantization-Aware Diffusion Models For Maximum Likelihood Training
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ diffusion model dequantization
Diffusion models are powerful generative models for continuous signals, such as images and videos.
However, real-world digital data are quantized; hence, they do not take continuous values but only a finite set of discrete values.
For example, pixels in 8-bit images can take only 256 discrete values.
In existing diffusion models, quantization is either ignored by treating data as continuous, or handled by adding small noise to make the data continuous.
Neither approach guarantees that samples from the model will converge to the finite set of quantized points.
In this work, we propose a methodology to explicitly account for quantization within diffusion models.
Specifically, by adopting a particular form of parameterization, we guarantee that samples from the reverse diffusion process converge to quantized points.
In experiments, we demonstrate that our quantization-aware model can substantially improve the performance of diffusion models for density estimation, and achieve state-of-the-art results on pixel-level image generation in likelihood evaluation.
In particular, for CIFAR-10 image generation, the negative log-likelihood improves substantially from 2.42 to 0.27, approaching the theoretical lower bound.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2370. DiSRouter: Distributed Self-Routing for LLM Selections
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ query routing model selection distributed system self-awareness of LLM
The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router cannot fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness, i.e., its ability to judge its competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM's self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM's intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2371. Benefits and Limitations of Communication in Multi-Agent Reasoning
๐ฌ 19
โญ 5.50
๐ learning theory
๐ท๏ธ chain-of-thought prompting multi-agent systems reasoning expressivity inter-agent communication scalability large language models algorithmic analysis
Chain-of-thought prompting has popularized step-by-step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi-agent paradigms offer a promising near-term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi-agent systems. We apply our framework to three algorithmic families: state tracking, recall, and $k$-hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi-agent reasoning systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2372. Routing, Cascades, and User Choice for LLMs
๐ฌ 16
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM routing; human-AI interaction; game theory
To mitigate the trade-offs between performance and costs, LLM providers route user tasks to different models based on task difficulty and latency. We study the effect of LLM routing on user behavior. We propose a game between an LLM provider with two models (standard and reasoning) and a user who can re-prompt or abandon tasks if the routed model cannot solve them. The user's goal is to maximize their utility minus the delay from using the model, while the provider minimizes the cost of servicing the user. We solve this Stackelberg game by fully characterizing the user's best response and simplifying the provider's problem. We observe that in nearly all cases, the optimal routing policy is a static policy with no cascading that depends on the expected utility of the models to the user.
Furthermore, we reveal a misalignment gap between the provider-optimal and user-preferred routes when the user's and provider's rankings of the models with respect to utility and cost differ. Finally, we demonstrate conditions for extreme misalignment where providers are incentivized to throttle the latency of the models to minimize their costs, consequently depressing user utility. The results yield simple threshold rules for single-provider, single-user interactions and clarify when routing, cascading, and throttling help or harm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2373. FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Aspect-level Evaluation Dataset Unified Fine-grained Evaluation
Evaluating open-ended outputs of Multimodal Large Language Models has become a bottleneck as model capabilities, task diversity, and modality rapidly expand. Existing ``MLLM-as-a-Judge'' evaluators, though promising, remain constrained to specific tasks and aspects (i.e., specific evaluation criteria such as fluency for text and image quality for images). In this paper, we argue that, on one hand, based on the interconnected nature of criteria, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual criteria and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks --- Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, and aspect-level resource. To address this gap, we introduce FRABench, a comprehensive fine-grained evaluation dataset. Specifically, (1) We first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the aforementioned four tasks. (2) Based on this taxonomy, we create FRABench, comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations. (3) Finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and joint learning to assess diverse visual tasks and aspects can lead to substantial mutual benefits.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2374. Polynomial, trigonometric, and tropical activations
๐ฌ 11
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Orthogonal function bases Tropical polynomials Polynomial mapping Deep neural networks ImageNet-1K OpenWebText Transformers GPT2 Convolutional networks ConvNeXt Initialization scheme PyTorch
Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through a simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2375. CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ CogFlow MLLM RLHF visual mathematics problem
Despite recent advances, multimodal large language models continue to struggle with visual mathematical problem solving.
Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs.
Notably, they all ignore the key issue of whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning.
Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception$\Rightarrow$internalization$\Rightarrow$reasoning.
In line with this hierarchical flow, we holistically enhance all its stages.
We devise synergistic visual rewards to boost perception capabilities in parametric and semantic spaces, jointly improving visual information extraction from symbols and diagrams.
To guarantee faithful integration of extracted visual cues into subsequent reasoning, we introduce a visual-anchored reward model in the internalization stage, bridging perception and reasoning.
Moreover, we design a visual-gated policy optimization algorithm to further enforce that the reasoning is grounded in the visual knowledge, preventing models from taking shortcuts that produce coherent-looking but visually ungrounded reasoning chains.
Additionally, we contribute MathCog, a new dataset for model training that contains samples with over 120K high-quality perception-reasoning aligned annotations.
Comprehensive experiments and analysis on three commonly used visual mathematical reasoning benchmarks validate the superiority of the proposed CogFlow.
Our data, code and models will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2376. PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
๐ฌ 8
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Model
High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose \textbf{PreciseCache}, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of $2.6\times$ speedup without noticeable quality loss. Source code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2377. GTA1: GUI Test-time Scaling Agent
๐ฌ 12
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ GUI Agent; Multimodal Large Language Model
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., the action proposal sequence) under an expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the aforementioned challenges with our \textbf{G}UI \textbf{T}est-time Scaling \textbf{A}gent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled, evaluated, and selected by a judge model, trading off computation for better decision quality via concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to their corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2378. Trajectory-aware Shifted State Space Models for Online Video Super-Resolution
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Super-resolution Online Mamba Trajectory
Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of proposed shifted SSMs blocks is employed to aggregate the selected tokens. The shifted SSMs blocks are designed based on Hilbert scannings and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba will be available at https://github.com.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2379. Shift-and-Sum Quantization for Visual Autoregressive Models
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ VAR network quantization
Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, in-painting, out-painting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2380. Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
๐ฌ 16
โญ 5.50
๐ reinforcement learning
๐ท๏ธ large language model reinforcement learning
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as rethinking and error correction, as a form of in-context exploration. However, the Markovian policy obtained from conventional RL training does not give rise to reflective exploration behaviors since the policy depends on the history only through the state and therefore has no incentive to enrich identical states with additional context. Instead, RL exploration is only useful during training to learn the optimal policy in a trial-and-error manner. Therefore, it remains unclear whether reflective reasoning will emerge during RL, or why such behaviors are beneficial. To remedy this, we recast reflective exploration within a Bayesian RL framework, which optimizes the expected return under a posterior distribution over Markov decision processes induced by the training data. This Bayesian formulation admits uncertainty-adaptive policies that, through belief updates, naturally incentivize information-gathering actions and induce self-reflection behaviors. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms conventional RL approaches, achieving superior test-time performance and token efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2381. Maximizing Asynchronicity in Event-based Neural Networks
๐ฌ 22
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ event camera self-supervised learning linear attention linear RNN neural network architectures
Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned features for ML pipelines, existing A2S approaches often sacrifice expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous feature learning), a novel A2S framework to generate highly expressive and generalizable event-by-event features. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling in linear attention and self-supervised learning for its construction. In demonstration, EVA outperforms prior A2S methods on recognition tasks (DVS128-Gesture and N-Cars), and represents the first A2S framework to successfully master demanding detection tasks, achieving a 47.7 mAP on the Gen1 dataset. These results underscore EVA's potential for advancing real-time event-based vision applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2382. VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis
๐ฌ 20
โญ 5.50
๐ generative models
๐ท๏ธ Image generation Image concept fusion
Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose **Visual Mixing Diffusion (VMDiff)**, a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a **hybrid sampling process** that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an **efficient adaptive adjustment module**, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2383. Non-Autoregressive Generation for Agentic Multi-Turn Interaction
๐ฌ 22
โญ 5.50
๐ generative models
๐ท๏ธ Agentic Data Synthesis Non-Autoregressive Generation
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby limiting real-world performance of agentic tasks. In this paper, we propose a novel Non-Autoregressive Iterative Generation (NAIG) framework for constructing high-quality multi-turn agentic dialogues. NAIG generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask-and-fill operations; and the offline verification phase ensures correctness and coherence via rule- and model-based checks. Experiments demonstrate that NAIG enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2384. FaLW: A Forgetting-aware Loss Reweighting for Long-tailed Unlearning
๐ฌ 4
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Long-tailed learning Unlearning Fairness
Machine unlearning, which aims to efficiently remove the influence of specific data from trained models, is crucial for upholding data privacy regulations like the ``right to be forgotten''. However, existing research predominantly evaluates unlearning methods on relatively balanced forget sets. This overlooks a common real-world scenario where data to be forgotten, such as a user's activity records, follows a long-tailed distribution. Our work is the first to investigate this critical research gap. We find that in such long-tailed settings, existing methods suffer from two key issues: Heterogeneous Unlearning Deviation and Skewed Unlearning Deviation. To address these challenges, we propose FaLW, a plug-and-play, instance-wise dynamic loss reweighting method. FaLW innovatively assesses the unlearning state of each sample by comparing its predictive probability to the distribution of unseen data from the same class. Based on this, it uses a forgetting-aware reweighting scheme, modulated by a balancing factor, to adaptively adjust the unlearning intensity for each sample. Extensive experiments demonstrate that FaLW achieves superior performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2385. Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Software Engineering Agent Large Language Model
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets.
Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail.
We introduce the Environment Configuration Diagnosis Benchmark, EnConda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute the final environment configuration.
Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation.
EnConda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates.
Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance.
To our knowledge, EnConda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2386. SpikeGen: Decoupled "Rods and Cones" Visual Representation Processing with Latent Generative Framework
๐ฌ 22
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ bio-inspired image representation learning
The process through which humans perceive and learn visual representations in dynamic environments is highly complex. From a structural perspective, the human eye decouples the functions of cone and rod cells: cones are primarily responsible for color perception, while rods are specialized in detecting motion, particularly variations in light intensity. These two distinct modalities of visual information are integrated and processed within the visual cortex, thereby enhancing the robustness of the human visual system. Inspired by this biological mechanism, modern hardware systems have evolved to include not only color-sensitive RGB cameras but also motion-sensitive Dynamic Visual Systems, such as spike cameras. Building upon these advancements, this study seeks to emulate the human visual system by integrating decomposed multi-modal visual inputs with modern latent-space generative frameworks. We named it ***SpikeGen***. We evaluate its performance across various spike-RGB tasks, including conditional image and video deblurring, dense frame reconstruction from spike streams, and high-speed scene novel-view synthesis. Supported by extensive experiments, we demonstrate that leveraging the latent space manipulation capabilities of generative models enables an effective synergistic enhancement of different visual modalities, addressing spatial sparsity in spike inputs and temporal sparsity in RGB inputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2387. NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization
๐ฌ 14
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Representation Learning Contrastive Learning
Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, updated at each epoch in a blockwise coordinate manner to keep track of updated encoders. However, this scheme incurs an optimization error that scales with the ratio of dataset size to batch size, limiting effectiveness for large datasets or small batches. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) **reformulating** the contrastive loss for each sample **via convex analysis** into a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) **transforming** the resulting minimization over $n$ auxiliary variables (where $n$ is the dataset size) via **variational analysis** into the minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation, leading to improved performance compared with previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms previous methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2388. Bridging the Distribution Gap to Harness Pretrained Diffusion Priors for Super-Resolution
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Super-Resolution diffusion generative prior
Diffusion models, well recognized for their strong generative priors, have recently been increasingly applied to super-resolution (SR) tasks.
However, as diffusion models are trained on Gaussian-corrupted natural images, the distribution gap between low-resolution (LR) inputs and the model's training distribution hinders direct inference. Prior works address this by conditioning on LR images, but their fine-tuning often weakens generative capability and requires multiple denoising steps. In this work, we present DM-SR, a novel framework that bridges this gap without modifying the pretrained diffusion model. We train an image encoder that maps LR inputs into the latent distribution aligned with the diffusion model's training space, preserving its full generative power. Furthermore, DM-SR adaptively predicts the appropriate noise level based on the degradation of each input, ensuring optimal alignment with the diffusion model's timestep-dependent distribution. Extensive experiments show that DM-SR achieves superior perceptual quality with a single-stage diffusion process, setting a new direction for efficient and high-fidelity SR with diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2389. ReSplat: Degradation-agnostic Feed-forward Gaussian Splatting via Self-guided Residual Diffusion
๐ฌ 12
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Universal Image Restoration
Recent advances in novel view synthesis (NVS) have predominantly focused on ideal, clear input settings, limiting their applicability in real-world environments with common degradations such as blur, low-light, haze, rain, and snow. While some approaches address NVS under specific degradation types, they are often tailored to narrow cases, lacking the generalizability needed for broader scenarios. To address this issue, we propose Restoration-based feed-forward Gaussian Splatting, named ReSplat, a novel framework capable of handling degraded multi-view inputs. Our model jointly estimates restored images and Gaussians to represent the clear scene for NVS. We enable multi-view consistent universal image restoration by utilizing the 3D Gaussians generated during the diffusion sampling process as self-guidance. This results in sharper and more reliable novel views. Notably, our framework adapts to various degradations without prior knowledge of their specific types. Extensive experiments demonstrate that ReSplat significantly outperforms existing methods across challenging conditions, including blur, low-light, haze, rain, and snow, delivering superior visual quality and robust NVS performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2390. CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control
๐ฌ 19
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Traffic Signal Control Large Language Model Multi-Agent System Intelligent Transportation
Large Language Models (LLMs) have recently emerged as promising agents for Traffic Signal Control (TSC) due to their strengths in reasoning and generalization. However, current LLM-based approaches treat intersections as independent agents without inter-intersection cooperation, limiting their effectiveness in network-wide optimization. To address this gap, we propose CoLLMLight, the first cooperative LLM agent framework for network-wide traffic signal control. CoLLMLight enables agents to perform in-depth spatiotemporal reasoning for cooperation, while ensuring real-time responsiveness through an asynchronous cooperative decision architecture. The reasoning process runs asynchronously, deriving cooperative control guidance from dynamic interactions among intersections. This guidance is cached and incorporated as contextual input for real-time signal decisions. To enhance cooperation quality while ensuring reasoning efficiency, we propose cost-aware cooperation optimization. It first applies adaptive reasoning chain optimization to enable the LLM to adjust its reasoning depth according to traffic complexity. The model is then refined with reinforcement learning using reward signals that promote network-wide performance while penalizing excessive reasoning. Extensive experiments on four real-world traffic networks demonstrate that CoLLMLight consistently outperforms existing methods, achieving more effective and generalizable cooperation while maintaining real-time responsiveness and efficient token usage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2391. TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Tracking Any Point
In this paper, built upon TAPTRv2, we present TAPTRv3. TAPTRv3 improves TAPTRv2 by addressing its shortcomings in querying high-quality features from long videos, where the target tracking points normally undergo increasing variation over time. In TAPTRv3, we propose to utilize both spatial and temporal context to bring better feature querying along the spatial and temporal dimensions for more robust tracking in long videos. For better spatial feature querying, we identify that off-the-shelf attention mechanisms struggle with point-level tasks and present Context-aware Cross-Attention (CCA). CCA introduces spatial context into the attention mechanism to enhance the quality of attention scores when querying image features. For better temporal feature querying, we introduce Visibility-aware Long-Temporal Attention (VLTA), which conducts temporal attention over all past frames while considering their corresponding visibilities. This effectively addresses the feature drifting problem in TAPTRv2 caused by its RNN-like long-term modeling. TAPTRv3 surpasses TAPTRv2 by a large margin on most of the challenging datasets and obtains state-of-the-art performance. Even when compared with methods trained on large-scale extra internal data, TAPTRv3 still demonstrates superiority.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2392. One Skill, Many Websites: Learning Generalizable Skills Through Polymorphic Abstraction
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Skill Induction Agent Polymorphism Continual Learning Large Language Models
Large language models (LLMs) are moving beyond static uses and are now powering agents that learn during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize.
We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (*what* it accomplishes) from its concrete implementation (*how* it is executed). Experiments show that
our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4\% on Mind2Web and 13.9\% on unseen websites, while reducing steps by over 20\%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites.
By enabling the agent to identify and refine its own goals, PolySkill provides the agent with a better curriculum, leading to the acquisition of more generalizable skills than baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
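The goal/implementation split that PolySkill borrows from software polymorphism can be pictured with ordinary subclassing. A minimal sketch, assuming a generic `browser` automation object and hypothetical element selectors; none of this is the paper's actual interface:

```python
from abc import ABC, abstractmethod

class SearchProductSkill(ABC):
    """Abstract goal: reach a product page for a query (the *what*)."""

    @abstractmethod
    def execute(self, browser, query: str) -> str:
        """Concrete, site-specific implementation (the *how*)."""

class AmazonSearch(SearchProductSkill):
    def execute(self, browser, query: str) -> str:
        browser.goto("https://www.amazon.com")
        browser.fill("#twotabsearchtextbox", query)    # hypothetical selector
        browser.click("#nav-search-submit-button")
        return browser.current_url()

class EbaySearch(SearchProductSkill):
    def execute(self, browser, query: str) -> str:
        browser.goto("https://www.ebay.com")
        browser.fill("#gh-ac", query)                  # hypothetical selector
        browser.click("#gh-btn")
        return browser.current_url()

def run_skill(skill: SearchProductSkill, browser, query: str) -> str:
    # The agent plans against the abstract goal and dispatches to whichever
    # site-specific implementation matches the current website.
    return skill.execute(browser, query)
```

Because planning happens against the abstract class while execution is delegated to a site-specific subclass, the same skill can be reused on a new website by adding only a new implementation.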
2393. Estimating Semantic Alphabet Size for LLM Uncertainty Quantification
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ large language model uncertainty quantification hallucination entropy alphabet
Many black-box techniques for quantifying the uncertainty of large language models (LLMs) rely on repeated LLM sampling, which can be computationally expensive. Therefore, practical applicability demands reliable estimation from few samples. Semantic entropy (SE) is a popular sample-based uncertainty estimator with a discrete formulation attractive for the black-box setting. Recent extensions of SE exhibit improved LLM hallucination detection, but do so with less interpretable methods that admit additional hyperparameters. For this reason, we revisit the canonical discrete semantic entropy (DSE) estimator, finding that it underestimates the ``true'' semantic entropy, as expected from theory. We propose a modified semantic alphabet size estimator, and illustrate that using it to adjust DSE for sample coverage results in more accurate SE estimation in our setting of interest. Furthermore, we find that two semantic alphabet size estimators, including the one we propose, flag incorrect LLM responses as well as or better than many top-performing alternatives, with the added benefit of remaining highly interpretable.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
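For intuition, here is how the plug-in discrete semantic entropy is computed from clustered samples, together with a generic coverage correction (Good-Turing / Chao-Shen style) standing in for the paper's proposed alphabet-size estimator, which is not reproduced here:

```python
import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids):
    """Plug-in (discrete) semantic entropy over semantic-cluster labels."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def coverage_adjusted_entropy(cluster_ids):
    """Illustrative coverage correction (Chao-Shen estimator),
    NOT the estimator proposed in the paper."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    coverage = max(1.0 - singletons / n, 1e-12)   # Good-Turing sample coverage
    entropy = 0.0
    for c in counts.values():
        p = coverage * c / n                       # coverage-adjusted probability
        # Horvitz-Thompson weighting corrects for clusters never sampled
        entropy -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return entropy

# Example: 6 responses fall into 3 semantic clusters
samples = ["a", "a", "a", "b", "b", "c"]
print(discrete_semantic_entropy(samples), coverage_adjusted_entropy(samples))
```

With few samples the plug-in estimate is biased low; the coverage-adjusted variant raises it toward the "true" semantic entropy, which is the direction of correction the entry describes.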
2394. Nonparametric Teaching of Attention Learners
๐ฌ 19
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Nonparametric Teaching Functional Gradient Descent Attention Learners
Attention learners, neural networks built on the attention mechanism, e.g., transformers, excel at learning the implicit relationships that relate sequences to their corresponding properties, e.g., mapping a given sequence of tokens to the probability of the next token. However, the learning process tends to be costly. To address this, we present a novel paradigm named **Atte**ntion **N**eural **T**eaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, the latter provides a theoretical framework for teaching mappings that are implicitly defined (i.e., nonparametric) via example selection. Such an implicit mapping is embodied through a dense set of sequence-property pairs, with the AtteNT teacher selecting a subset to accelerate convergence in attention learner training. By analytically investigating the role of attention on parameter-based gradient descent during training, and recasting the evolution of attention learners, shaped by parameter updates, through functional gradient descent in nonparametric teaching, we show *for the first time* that teaching attention learners is consistent with teaching importance-adaptive nonparametric learners. These new findings readily commit AtteNT to enhancing learning efficiency of attention learners. Specifically, we observe training time reductions of 13.01% for LLMs and 20.58% for ViTs, spanning both fine-tuning and training-from-scratch regimes. Crucially, these gains are achieved without compromising accuracy; in fact, performance is consistently preserved and often enhanced across a diverse set of downstream tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2395. Don't Shift the Trigger: Robust Gradient Ascent for Backdoor Unlearning
๐ฌ 13
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ gradient ascent machine unlearning backdoor defense
Backdoor attacks pose a significant threat to machine learning models, allowing adversaries to implant hidden triggers that alter model behavior when activated. Although gradient ascent (GA)-based unlearning has been proposed as an efficient backdoor removal approach, we identify a critical yet overlooked issue: vanilla GA does not eliminate the trigger but shifts its impact to different classes, a phenomenon we call trigger shifting. To address this, we propose Robust Gradient Ascent (RGA), which introduces a dynamic penalty mechanism to regulate GA's strength and prevent excessive unlearning. Our experiments show that RGA effectively removes backdoors while preserving model utility, offering a more reliable defense against backdoor attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
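The trigger-shifting failure mode comes from ascending the poisoned loss without any brake. A minimal sketch of the bounded-ascent idea, where the hinge-style penalty and coefficients are illustrative assumptions rather than RGA's exact formulation:

```python
import torch

def rga_step(model, optimizer, poisoned_batch, clean_batch,
             loss_fn, tau=4.0, lam=1.0):
    """One unlearning step: ascend on backdoored samples, but penalize pushing
    their loss beyond a ceiling tau, and keep fitting clean data."""
    optimizer.zero_grad()

    x_p, y_p = poisoned_batch
    x_c, y_c = clean_batch

    loss_poison = loss_fn(model(x_p), y_p)
    loss_clean = loss_fn(model(x_c), y_c)

    # Gradient ascent on the poisoned loss = descent on its negative.
    # The penalty activates once the poisoned loss exceeds tau, discouraging
    # the excessive unlearning that shifts the trigger to other classes.
    penalty = torch.relu(loss_poison - tau) ** 2
    objective = -loss_poison + lam * penalty + loss_clean

    objective.backward()
    optimizer.step()
    return loss_poison.item(), loss_clean.item()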
2396. The Limits of Inference Scaling Through Resampling
๐ฌ 8
โญ 5.50
๐ generative models
๐ท๏ธ Inference scaling Resampling Reasoning LLMs NLP
Recent research has generated hope that inference scaling, such as resampling solutions until they pass verifiers like unit tests, could allow weaker models to match stronger ones. Beyond inference, this approach also enables training reasoning models, where data is curated using rejection sampling against a verifier. However, we show that this approach is fundamentally limited when verifiers are imperfect and have a non-zero probability of producing false positives. Resampling cannot decrease this probability, so it imposes an upper bound on the accuracy of resampling-based inference scaling, regardless of compute budget. Our analysis shows that there is a strong correlation between the model's single-sample accuracy and its false positive rate on HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. Empirical results show that optimal sampling attempts are often fewer than 10, as the negative utility of false positives outweighs benefits, bending inference scaling curves downward. Finally, false positives may have other undesirable qualities, like poor adherence to coding style conventions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
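A toy Monte-Carlo sketch of the ceiling argument: with an imperfect verifier, accuracy under resampling rises with the number of attempts but saturates at a level determined by the false positive rate, no matter the compute budget. In the infinite-compute limit the accepted answer is correct with probability $p\,\mathrm{TPR} / (p\,\mathrm{TPR} + (1-p)\,\mathrm{FPR})$. The numbers below are made up for illustration, not taken from the paper:

```python
import random

def resample_accuracy(p, tpr, fpr, k, trials=20000):
    """Resample up to k solutions; return the first one the (imperfect)
    verifier accepts, else give up. Reports fraction answered correctly."""
    correct = 0
    for _ in range(trials):
        for _ in range(k):
            is_right = random.random() < p          # solution correctness
            accepted = random.random() < (tpr if is_right else fpr)
            if accepted:
                correct += is_right
                break
    return correct / trials

p, tpr, fpr = 0.3, 0.95, 0.10
ceiling = p * tpr / (p * tpr + (1 - p) * fpr)       # infinite-compute limit
for k in (1, 4, 16, 64):
    print(k, round(resample_accuracy(p, tpr, fpr, k), 3))
print("ceiling:", round(ceiling, 3))
```

Running this shows accuracy climbing quickly for small k and then flattening near the ceiling, which is the plateau the entry describes.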
2397. Action-Guided Attention for Video Action Anticipation
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ video action anticipation video understanding
Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity and combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
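One way to picture the mechanism: queries and keys come from predicted action embeddings rather than pixels, and a learned gate decides how much attended context to blend into the current frame embedding. A minimal PyTorch sketch under assumed tensor shapes and projections, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ActionGuidedAttention(nn.Module):
    """Illustrative sketch: attention whose queries/keys are action embeddings,
    with a sigmoid gate fusing attended context into the current frame embedding."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, next_action_emb, past_action_embs, past_frame_embs, cur_frame_emb):
        # next_action_emb: (B, D); past_*: (B, T, D); cur_frame_emb: (B, D)
        q = self.q_proj(next_action_emb).unsqueeze(1)            # (B, 1, D)
        k = self.k_proj(past_action_embs)                         # (B, T, D)
        v = self.v_proj(past_frame_embs)                          # (B, T, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        context = (attn @ v).squeeze(1)                           # (B, D)
        g = self.gate(torch.cat([context, cur_frame_emb], dim=-1))
        return g * context + (1 - g) * cur_frame_emb
```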
2398. Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Large Language Models Efficient Reasoning Reinforcement Learning Adaptive Coordinated Penalty
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models. Experimental results show that ARLCP achieves a superior efficiency-accuracy trade-off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
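The coordination of the two penalties can be pictured as simple reward shaping: correctness minus a per-reflection charge, minus a length charge whose budget scales with estimated problem complexity. The marker list and coefficients below are assumptions for illustration, not ARLCP's exact design:

```python
import re

REFLECTION_MARKERS = ("wait", "let me double-check", "on second thought", "hmm")

def arlcp_style_reward(response: str, is_correct: bool, est_complexity: float,
                       alpha=0.05, beta=0.001, tokens_per_unit=400):
    """Illustrative reward: base correctness reward, minus a penalty per
    reflection marker, minus a length penalty beyond a complexity-scaled budget."""
    base = 1.0 if is_correct else 0.0
    n_reflections = sum(len(re.findall(m, response.lower()))
                        for m in REFLECTION_MARKERS)
    n_tokens = len(response.split())
    length_budget = tokens_per_unit * max(est_complexity, 0.25)
    length_excess = max(0, n_tokens - length_budget)
    return base - alpha * n_reflections - beta * length_excess
```

Harder problems get a larger length budget, so the penalty curtails padding and circular self-questioning rather than the reasoning the problem actually needs.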
2399. TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis
๐ฌ 18
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ medical multimodal large language model; multimodal interleaved Chain-of-Thought (CoT); tumor analysis
Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment directly guide diagnosis, staging, and treatment planning. Chain-of-Thought (CoT) reasoning is particularly critical in this setting, as it enables stepwise interpretation from imaging findings to clinical impressions and pathology-level conclusions, ensuring traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions.
We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the "findings → impression → pathology" trajectory, enabling standardized evaluation of both final accuracy and reasoning consistency.
We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment.
Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk.
TumorChain demonstrates consistent gains over strong unimodal and pipeline baselines in lesion detection, impression quality, and pathology classification, and successfully generalizes to the public DeepTumorVQA benchmark. Ablations validate the key contributions of interleaved reasoning and clinical CoT. Clinically, these advances lay the groundwork for reliable, interpretable tumor assessment to support real-world decision-making. We release the task, benchmark, and evaluation protocol to advance safe, explainable, and reproducible multimodal reasoning for high-stakes tumor analysis. Our project is available at https://anonymous.4open.science/r/TumorChain-D6E6.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2400. Improving Code Localization with Repository Memory
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Code Localization Large Language Models Agent Memory
Code localization is a fundamental challenge in repository-level software engineering tasks such as bug fixing. While existing methods equip language agents with comprehensive tools/interfaces to fetch information from the repository, they overlook the critical aspect of *memory*, where each instance is typically handled from scratch assuming no prior repository knowledge. In contrast, human developers naturally build long-term repository memory, such as the functionality of key modules and associations between various bug types and their likely fix locations. In this work, we augment language agents with such memory by leveraging a repository's *commit history* - a rich yet underutilized resource that chronicles the codebase's evolution. We introduce tools that allow the agent to retrieve from a non-parametric memory encompassing recent historical commits and linked issues, as well as functionality summaries of actively evolving parts of the codebase identified via commit patterns. We demonstrate that augmenting such a memory can significantly improve LocAgent, a state-of-the-art localization framework, on both SWE-bench-verified and the more recent SWE-bench-live benchmarks. Our research contributes towards developing agents that can accumulate and leverage past experience for long-horizon tasks, more closely emulating the expertise of human developers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2401. Retrospective Sparse Attention for Efficient Long-Context Generation
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Long Generation KV Cache Compression
Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load a few important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to be efficiently supplemented with more contexts, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6$\times$ and accuracy by up to 21.9\%. We provide anonymized code in the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
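The core arithmetic of retrospectively revising a past attention output when later KV entries become available is a streaming (online) softmax update: keep the running normalizer and running max alongside the output, and fold new keys/values in. A NumPy sketch of that bookkeeping only; RetroAttention's output-cache management and token selection are not modeled here:

```python
import numpy as np

class RetroOutputCache:
    """Sketch: update a cached attention output for one past query when new
    key/value entries arrive after the original computation."""

    def __init__(self, query):
        self.q = np.asarray(query, dtype=float)     # (d,)
        self.scale = 1.0 / np.sqrt(self.q.shape[0])
        self.m = -np.inf                            # running max logit (stability)
        self.z = 0.0                                # running softmax normalizer
        self.out = np.zeros_like(self.q)            # current attention output

    def update(self, new_keys, new_values):
        # new_keys, new_values: (t, d) newly arrived KV entries
        logits = new_keys @ self.q * self.scale
        m_new = max(self.m, logits.max())
        correction = np.exp(self.m - m_new) if np.isfinite(self.m) else 0.0
        w = np.exp(logits - m_new)
        z_new = self.z * correction + w.sum()
        # Rescale the old numerator, add the new contributions, renormalize.
        self.out = (self.out * self.z * correction + w @ new_values) / z_new
        self.m, self.z = m_new, z_new
        return self.out
```

After every update the cached output equals exact softmax attention over all KV entries seen so far for that query, which is what lets earlier approximations be corrected cheaply.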
2402. Secure Outlier-Aware Large Language Model Inference
๐ฌ 17
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multiparty Computation Privacy Preserving Machine Learning Secure LLM Inference
Secure multiparty computation (MPC) allows a client to run inference on their sensitive inputs privately, without acquiring the proprietary machine learning model weights. As decoder-only transformer-based large language models become the dominant paradigm, the desire to apply MPC to large language models is increasing. However, such inference usually incurs substantial latency due to the nonlinear operations in the Transformer architecture. Recent works either focus on improving cryptographic primitives or on re-architecting and re-training to make LLMs MPC-friendly. We, on the other hand, observe that properly addressing outlier phenomena, which are unique yet universal properties existing across different LLMs, can effectively reduce the input domain and thereby enable faster protocols for non-linear operations. Hence, we propose the Secure Outlier-Aware Large Language Model Inference framework (SOAL), which accelerates the RMSNorm operation by nearly $2\times$, SiLU by $2\times$, and Softmax by more than $5\times$. SOAL maintains the same performance as the original model without any fine-tuning requirement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2403. Refine Drugs, Donโt Complete Them: Uniform-Source Discrete Flows for Fragment-Based Drug Discovery
๐ฌ 20
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Generative Chemistry Discrete Flow Models Molecular Optimization
We introduce InVirtuoGen, a discrete flow generative model over fragmented SMILES for de novo and fragment-constrained generation, as well as target-property and lead optimization of small molecules. The model learns to transform a uniform source over all possible tokens into the data distribution. Unlike masked models, its training loss accounts for predictions on all sequence positions at every denoising step, shifting the generation paradigm from completion to refinement, and decoupling the number of sampling steps from the sequence length. For \textit{de novo} generation, InVirtuoGen achieves a stronger quality-diversity Pareto frontier than prior fragment-based models and competitive performance on fragment-constrained tasks. For property and lead optimization, we propose a hybrid scheme that combines a genetic algorithm with a Proximal Property Optimization fine-tuning strategy adapted to discrete flows. Our approach sets a new state-of-the-art on the Practical Molecular Optimization benchmark, measured by top-10 AUC across tasks, and yields higher docking scores in lead optimization than previous baselines. InVirtuoGen thus establishes a versatile generative foundation for drug discovery, from early hit finding to multi-objective lead optimization. We further contribute to open science by releasing pretrained checkpoints and code, making our results fully reproducible.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2404. Enhancing Trustworthiness of Fine-Tuned LLMs via Regularized Subset Selection
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Trustworthiness Subset Selection Submodularity Data Attribution
Supervised fine-tuning (SFT) improves large language model (LLM) perplexity but can also degrade trustworthiness, leading to the generation of untruthful, biased, or unsafe content during user interactions. These issues are often traced back to specific phrases or patterns in the training data. However, correcting them usually requires expensive retraining or new data collection. In this work, we propose a two-stage, compute-efficient repair of post-SFT models that enhances trustworthiness while preserving downstream performance. In the first stage, we identify the training samples responsible for failures on trustworthiness metrics like truthfulness, stereotypical bias, and machine ethics, and select a small, diverse subset of these examples using a determinantal point process (DPP)-based regularization. In the second stage, we repair the model under the framework of the proximal Bregman response function (PBRF) using a gradient ascent update, which enhances trustworthiness while preserving downstream task performance (perplexity). We evaluate our method on multiple LLMs of varying sizes and demonstrate up to 21\% improvement in trustworthiness metrics with minimal impact ($\leq 1$\%) on perplexity. Our method provides a computationally efficient approach to enhancing post-SFT models and offers a practical alternative to the hours of retraining required for model repair.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
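The first-stage selection can be sketched as greedy MAP inference under a quality-diversity DPP kernel, where per-example "quality" reflects how strongly an example is implicated in a trustworthiness failure. The kernel construction below is an illustrative stand-in, not the paper's exact regularizer:

```python
import numpy as np

def greedy_dpp_select(embeddings, relevance, k):
    """Greedy MAP selection under L = diag(q) S diag(q), with S the cosine
    similarity between example embeddings and q the failure-relevance scores."""
    X = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    S = X @ X.T
    q = np.asarray(relevance, dtype=float)
    L = q[:, None] * S * q[None, :]

    selected = []
    for _ in range(k):
        best_i, best_logdet = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            # Greedy step: pick the item whose addition maximizes
            # log det of L restricted to the enlarged set (quality + diversity).
            idx = selected + [i]
            sub = L[np.ix_(idx, idx)] + 1e-9 * np.eye(len(idx))
            sign, logdet = np.linalg.slogdet(sub)
            if sign > 0 and logdet > best_logdet:
                best_i, best_logdet = i, logdet
        if best_i is None:
            break
        selected.append(best_i)
    return selected
```

The determinant rewards both high relevance and low redundancy, so the chosen subset covers distinct failure patterns rather than many near-duplicates.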
2405. Think Then Embed: Generative Context Improves Multimodal Embedding
๐ฌ 12
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Retrieval
There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2406. Enforcing Axioms for AI Alignment under Loss-Based Rules
๐ฌ 11
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Social Choice AI Alignment Reinforcement Learning from Human Feedback Constitutional AI
Recent alignment methods for large language models, most notably reinforcement learning from human feedback (RLHF), often train an auxiliary reward model to minimize a loss function on binary preference data over model responses. We study a theoretical setting inspired by principle-guided methods such as Constitutional AI, in which a small set of principles (e.g., helpfulness, toxicity) act as "voters" that guide binary comparisons---such as preferring the less toxic response. We model these principles as linear directions in an embedding space of responses, a simplifying assumption motivated by the Linear Representation Hypothesis---concepts are linear directions in representation space---a useful first-order approximation in practice.
In this \emph{linear social choice model}, Ge et al. (2024) showed that an optimal linear reward model can violate Pareto optimality (PO): From the principles-as-voters lens, this means a response A can be less helpful and more toxic than B, yet still receive a higher reward. We analyze axiomatic violations in the linear social choice setting and probe the robustness of negative results under realistic assumptions. We show that added expressivity does not resolve the issue: polynomial reward models can still fail PO. We then offer a pragmatic alternative showing that when the data uniformly covers the embedding space, broad classes of loss-based rules in the limit exactly recover the axiomatic guarantees. This yields a recipe for constitutional-style alignment with provable guarantees: enforce balanced coverage \emph{via dataset design} to restore axiomatic guarantees without abandoning standard training pipelines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2407. Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
๐ฌ 22
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ long-context modeling length generalization length extrapolation sparse attention language modeling
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2408. Evolving Graph Structured Programs for Circuit Generation with Large Language Models
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Electronic Design Automation; Logic Synthesis; Large Language Models;
Logic synthesis (LS), which aims to generate a *compact* logic circuit graph with minimized size while *accurately* satisfying a given functionality, plays an important role in chip design. However, existing LS methods struggle to balance circuit structure compactness and functional accuracy, often leading to suboptimal generation. To address this problem, we propose a novel *Circuit Program Evolution* framework, namely CircuitEvo, which iteratively leverages large language models (LLMs) to evolve circuit programs towards improved compactness while preserving functional accuracy. Specifically, CircuitEvo models the circuit graph as a structured program and leverages the strong generative capabilities of LLMs, guided by domain-specific evolutionary prompt strategies, to generate promising circuit candidates in each iteration. Moreover, a structure-aware circuit optimization module is introduced to correct functional discrepancies by appending necessary substructures to the generated circuits. To the best of our knowledge, CircuitEvo is *the first* LLM-based LS approach that can iteratively improve a circuit's compactness while ensuring functional accuracy. Experiments on several widely used benchmarks demonstrate that CircuitEvo can efficiently generate accurate circuits with up to 16 inputs and 69 outputs. Moreover, our method significantly outperforms state-of-the-art methods in terms of circuit size, achieving an average improvement of 6.74%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2409. Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment
๐ฌ 23
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Grounding IQA MLLM
The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, **grounding-IQA**. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception, thereby extending existing IQA. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark comprehensively evaluates the model grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision.
Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate more fine-grained IQA applications. Code will be made public.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2410. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Large Language Models Reasoning
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL).
Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data.
To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability.
Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering, obtaining the 200K Vision-R1-cold dataset, which serves as cold-start initialization data for Vision-R1.
To mitigate the optimization challenges caused by overthinking after the cold start, we propose a Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with a hard-formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on the multimodal math dataset.
Comprehensive experiments show our model achieves an average improvement of $\sim$6\% across various multimodal math reasoning benchmarks using only 10K multimodal math samples during RL training.
Vision-R1-7B achieves a 73.5\% accuracy on the widely used MathVista benchmark, which is only 0.4\% lower than the leading reasoning model, OpenAI O1.
Scaling up the amount of multimodal math data in RL training, Vision-R1-32B and Vision-R1-72B achieve 76.4\% and 78.2\% on the MathVista benchmark, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2411. Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles
๐ฌ 5
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Diffusion Models Model Acceleration Adaptive Sampling
Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2412. EAMET: Robust Massive Model Editing via Embedding Alignment Optimization
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Model Editing Massive Editing Large Language Models
Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics. Their robustness is also limited in context-rich settings or when editing multiple facts about the same subject simultaneously. We attribute these failures to embedding misalignment among knowledge items, which undermines editing reliability at scale. To address this, we propose EAMET (Embedding Alignment Model Editing in Transformers), which aligns the spaces of key and residual embeddings. Extensive experiments across six LLMs and three datasets demonstrate that EAMET consistently outperforms existing methods, achieving about 90\% editing efficacy when editing 10k facts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2413. Why is Your Language Model a Poor Implicit Reward Model?
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reward Models Language Models Generalization Distribution Shifts
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Overall, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
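The "nearly identical" setup is easiest to see in code: both reward types start from the same language model and training data, and only the final reward computation differs. A sketch assuming the common DPO-style implicit reward; the paper's exact parameterization may differ:

```python
import torch
import torch.nn.functional as F

def response_logprob(logits, response_ids):
    """Sum of log-probabilities the LM assigns to the response tokens.
    logits: (T, V) next-token logits aligned with response_ids: (T,)."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, response_ids.unsqueeze(-1)).sum()

def explicit_reward(hidden_last_token, linear_head):
    """EX-RM: a dedicated linear head over the LM's last-token hidden state."""
    return linear_head(hidden_last_token).squeeze(-1)

def implicit_reward(policy_logits, ref_logits, response_ids, beta=0.1):
    """IM-RM: DPO-style implicit reward, beta * log pi(y|x) / pi_ref(y|x),
    summed over response tokens (one common instantiation)."""
    return beta * (response_logprob(policy_logits, response_ids)
                   - response_logprob(ref_logits, response_ids))
```

The EX-RM scores a response from its hidden representation, while the IM-RM scores it directly from token probabilities, which is why the latter can lean more heavily on surface token-level cues.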
2414. Scaling Laws for Diffusion Transformers
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Scaling laws diffusion models transformers
Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content creation, \emph{e.g.,} image and video generation.
However, scaling laws of DiT are less explored, which usually offer precise predictions regarding optimal model size and data requirements given a specific compute budget.
Therefore, experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs, are conducted to confirm the existence of scaling laws in DiT \emph{for the first time}. Concretely, the loss of pretraining DiT also follows a power-law relationship with the involved compute.
Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of 1.5e21 FLOPs.
Additionally, we also demonstrate that the trend of pretraining loss matches the generation performances (e.g., FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
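For reference, the generic power-law template that such scaling-law studies fit has the form below; the DiT-specific constants and exponents are fit from the paper's runs and are not reproduced here:

$$\mathcal{L}(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}, \qquad C \approx 6ND,$$
$$N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b}, \qquad \mathcal{L}_{\mathrm{opt}}(C) \propto C^{-\gamma},$$

where $N$ is the parameter count, $D$ the amount of training data (tokens), and $C$ the compute budget in FLOPs; fitting $N_{\mathrm{opt}}$ and $D_{\mathrm{opt}}$ is what allows loss at a much larger budget to be extrapolated.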
2415. Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ jailbreaks ai safety emergent misalignment evaluations interpretability
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless.
However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for \textit{dishonesty} as a new strategy, even when other options are available.
Affected models respond to harmful requests with outputs that sound harmful but are crafted to be subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family.
We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
Strategic dishonesty already has a practical impact on safety evaluations: we show that dishonest responses fool \emph{all} of the output-based jailbreak monitors that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a \emph{honeypot} against malicious users, noticeably obfuscating prior jailbreak attacks.
While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty.
We validate probes on datasets with verifiable outcomes and by using them as steering vectors.
Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2416. Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference
๐ฌ 27
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Prompting In-Context Learning Tool-augmented Reasoning Text-rich Graphs
Large language models (LLMs) are increasingly leveraged for text-rich graph machine learning tasks, with node classification standing out due to its high-impact application domains such as fraud detection and recommendation systems.
Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in processing graph data.
In this work, we conduct a large-scale, controlled evaluation across the key axes of variability: the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; homophilic vs. heterophilic regimes; short- vs. long-text features; LLM sizes and reasoning capabilities. We further analyze dependencies by independently truncating features, deleting edges, and removing labels to quantify reliance on input types.
Our findings provide actionable guidance for both research and practice. (1) Code generation mode achieves the strongest overall performance, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation mode is able to flexibly shift its reliance to the most informative input type, whether that be structure, features, or labels.
Together, these results establish a clear picture of the strengths and limitations of current LLM-graph interaction modes and point to design principles for future methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2417. BBQ: Boosting Quantization Entropy with Bell Box Quantization
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Quantization Quantization-Aware Training Pre-Training
Quantization-Aware Pre-Training (QAPT) is an effective technique to reduce the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g. integers) that are not information-theoretically optimal (ITO). On the other hand, existing ITO data types (e.g. Quantile/NormalFloat Quantization) are not compute-efficient.
We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that, since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain, and returns its output in a compute-efficient domain where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ outperforms prior SOTA QAPT methods by a perplexity reduction of up to 2 points for 4-bit models, up to 4 points for 3-bit models, up to 5 points for 2-bit models, and up to 18 points for 1-bit models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2418. SPRINT: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
๐ฌ 20
โญ 5.50
๐ generative models
๐ท๏ธ diffusion models generative models flow matching efficient training image generation
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT (Sparse--Dense Residual Fusion for Efficient Diffusion Transformers), a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train--inference gap. On ImageNet-1K $256^2$, SPRINT achieves 9.8$\times$ training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2419. Following the Navigation: Enhancing Small Language Models Contextual Reasoning with LLM Guidance
๐ฌ 23
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Contextual Reasoning Collaborative Inference
Large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, excel in contextual reasoning by leveraging extensive world knowledge and deep contextual understanding. However, their high computational costs limit deployment in resource-constrained settings. Conversely, small language models (SLMs) are more computationally efficient but often struggle with contextual reasoning due to limited parameter capacity and challenges like catastrophic forgetting. Existing enhancement methods for SLMs, such as knowledge distillation and data synthesis, still depend on additional training and face inherent limitations. To address this, we propose Navigation, a novel training-free framework that improves SLMs' contextual reasoning by distilling LLM-derived contextual processing expertise into generalizable navigation templates. These templates, stored in a scalable Navigation database, guide SLMs through a three-stage process (Generation, Utilization, and Update) to locate and process critical information within complex contexts. Experiments demonstrate that our approach yields an average 10.7\% accuracy gain with a template count equivalent to no more than 2.1\% of the dataset size, enabling models such as Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct to outperform GPT-3.5-Turbo on diverse contextual reasoning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2420. EntropyLong: Effective Long-Context Training via Predictive Uncertainty
๐ฌ 23
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Longcontext
Training long-context language models to capture long-range dependencies requires specialized data construction. Current approaches, such as generic text concatenation or heuristic-based variants, frequently fail to guarantee genuine long-range dependencies. We propose \textbf{EntropyLong}, a novel data construction method that leverages predictive uncertainty to verify dependency quality. Our approach identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy. This \textit{model-in-the-loop verification} ensures each dependency represents measurable information gain rather than spurious correlation. We construct training samples with long-range dependencies by combining original documents with these verified contextual supplements. Using FineWeb-Edu and Cosmopedia, we generate a dataset of 128K-length sequences with verified dependencies. Models trained on this data demonstrate significant improvements on RULER benchmarks, particularly in tasks requiring distant information. Following instruction fine-tuning, our models also achieve substantial gains on LongBench-v2, demonstrating enhanced long-context understanding. Extensive ablation studies further validate the necessity and effectiveness of entropy-based verification for long-context training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
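The model-in-the-loop verification can be sketched as: measure next-token entropy at a flagged position, prepend the retrieved candidate context, and keep the context only if entropy drops by a margin. A simplified sketch assuming a Hugging Face-style model and tokenizer; retrieval and final sample assembly are omitted:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_at_positions(model, input_ids):
    """Per-position entropy of the LM's next-token distribution."""
    logits = model(input_ids).logits                      # (1, T, V), HF-style output assumed
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(-1).squeeze(0)   # (T,)

@torch.no_grad()
def verify_context(model, tokenizer, document, candidate_context,
                   position, threshold=0.2):
    """Keep the retrieved context only if prepending it reduces predictive
    entropy at the flagged position by at least `threshold` nats."""
    plain = tokenizer(document, return_tensors="pt").input_ids
    augmented = tokenizer(candidate_context + "\n\n" + document,
                          return_tensors="pt").input_ids
    offset = augmented.shape[1] - plain.shape[1]          # shift of the target position
    h_plain = entropy_at_positions(model, plain)[position]
    h_aug = entropy_at_positions(model, augmented)[position + offset]
    return (h_plain - h_aug).item() >= threshold
```

Only contexts that pass this check are concatenated into the 128K-length training sequences, which is what turns "plausibly related" retrievals into verified long-range dependencies.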
2421. HiTeA: Hierarchical Temporal Alignment for Training-Free Long-Video Temporal Grounding
๐ฌ 28
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ video temporal grounding;training-free;Long-video Understanding;vision-language models
Temporal grounding in long, untrimmed videos is critical for real-world video understanding, yet it remains a challenging task owing to complex temporal structures and pervasive visual redundancy. Existing methods rely heavily on supervised training with task-specific annotations, which inherently limits their scalability and adaptability due to the substantial cost of data collection and model retraining. Although a few recent works have explored training-free or zero-shot grounding, they seldom address the unique challenges posed by long videos. In this paper, we propose HiTeA (Hierarchical Temporal Alignment), a novel, training-free framework explicitly designed for long-video temporal grounding. HiTeA introduces a hierarchical temporal decomposition mechanism that structures videos into events, scenes, and actions, thereby aligning natural language queries with the most appropriate temporal granularity. Candidate segments are then matched with queries by leveraging pre-trained vision-language models (VLMs) to directly compute segment-text similarity, thereby obviating the need for any task-specific training or fine-tuning. Extensive experiments on both short- and long-video benchmarks show that HiTeA not only substantially outperforms all existing training-free methods (e.g., achieving 44.94% R\@0.1 on TACoS, representing an absolute gain of 12.4%) but also achieves competitive performance against state-of-the-art supervised baselines under stricter metrics. The code is available at https://anonymous.4open.science/r/HiTeA_code.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2422. Translating Flow to Policy via Hindsight Online Imitation
๐ฌ 13
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotic Manipulation Imitation Learning Pixel Flow
Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions.
This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance.
Nevertheless, grounding these high-level plans into executable actions remains challenging, especially with the limited availability of high-quality robot data.
To this end, we propose to improve the low-level policy through online interactions.
Specifically, our approach collects online rollouts, retrospectively annotates the corresponding high-level goals from achieved outcomes, and aggregates these hindsight-relabeled experiences to update a goal-conditioned imitation policy.
Our method, Hindsight Flow-conditioned Online Imitation (HinFlow), instantiates this idea with 2D point flows as the high-level planner.
Across diverse manipulation tasks, our method achieves more than $2\times$ performance improvement over the base policy, significantly outperforming existing methods.
Moreover, our framework enables policy acquisition from planners trained on cross-embodiment video data, demonstrating its potential for scalable and transferable robot learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
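A schematic of the hindsight online imitation loop described above, with placeholder objects (`planner`, `policy`, `env`, `label_goal`, `dataset`) standing in for the paper's components and APIs:

```python
def hindsight_online_imitation(policy, planner, env, label_goal, dataset,
                               n_rounds, n_episodes):
    """Sketch: roll out the goal-conditioned policy, relabel each trajectory
    with the high-level goal (e.g., a 2D point flow) it actually achieved,
    aggregate, and re-imitate."""
    for _ in range(n_rounds):
        for _ in range(n_episodes):
            obs = env.reset()
            goal = planner.propose_goal(obs)          # intended high-level plan
            traj, done = [], False
            while not done:
                action = policy.act(obs, goal)
                next_obs, done = env.step(action)
                traj.append((obs, action))
                obs = next_obs
            achieved = label_goal(traj)               # hindsight: goal actually realized
            dataset.extend((o, achieved, a) for (o, a) in traj)
        policy.fit(dataset)                           # goal-conditioned behavior cloning
    return policy
```

Relabeling with the achieved goal turns every rollout, successful or not, into a valid demonstration for some goal, which is what lets the low-level policy improve without extra expert data.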
2423. UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Hand Manipulation Synthesis;Multimodal Large Language Model;
Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands.
We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2424. Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
๐ฌ 13
โญ 5.50
๐ optimization
๐ท๏ธ nonconvex optimization lower bounds distributed optimization
We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{\textnormal{s}}$ and $\tau_{\textnormal{w}}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of \algname{SGD} has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2},$ which improves with the number of workers $n,$ where $\Delta := f(x^0) - f^*,$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce \emph{both} the variance-dependent runtime term and the communication runtime term from $\tau_{\textnormal{w}} d \frac{L \Delta}{\varepsilon}$ to $\frac{\tau_{\textnormal{w}} d L \Delta}{n \varepsilon} + \sqrt{\frac{\tau_{\textnormal{w}} d h \sigma^2}{n \varepsilon}} \cdot \frac{L \Delta}{\varepsilon},$ which also benefits from increasing $n.$ However, once we account for the communication from the server to the workers $\tau_{\textnormal{s}}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2},$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same function or distribution. Indeed, when $\tau_{\textnormal{s}} \simeq \tau_{\textnormal{w}},$ our lower bound is $\tilde{\Omega}(\min[h (\frac{\sigma^2}{n \varepsilon} + 1) \frac{L \Delta}{\varepsilon} + {\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon}},\; h \frac{L \Delta}{\varepsilon} + {h \frac{\sigma^2 L \Delta}{\varepsilon^2}}]).$ To establish this result, we construct a new ``worst-case'' function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous (i.i.d.) assumption.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2425. Regulating Internal Evidence Flows for Robust Learning Under Spurious Correlations
๐ฌ 21
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Fairness Regularization Bias Free Evidence Energy
Deep models often exploit spurious correlations (e.g., backgrounds or dataset artifacts), hurting worst-group performance. We propose \textbf{Evidence-Gated Suppression (EGS)}, a lightweight, plug-in regularizer that intervenes inside the network during training. EGS tracks a class-conditional, confidence-weighted contribution for each neuron (more negative $\Leftrightarrow$ stronger support) and applies a percentile-based, multiplicative decay to the most extreme contributors, reducing overconfident shortcut pathways while leaving other features relatively more influential. EGS integrates with standard ERM, requires no group labels, and adds $<5\%$ training overhead. We provide analysis linking EGS to minority-margin gains, path-norm-like capacity control, and stability benefits via EMA-smoothed gating. Empirically, EGS improves worst-group accuracy and calibration vs.\ ERM and is competitive with state-of-the-art methods across spurious-correlation benchmarks (e.g., Waterbirds, CelebA, BAR, COCO), while maintaining strong average accuracy. These results suggest that regulating internal evidence flow is a simple and scalable route to robustness without group labels.
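As a rough illustration of the percentile-gated suppression the abstract describes (not the authors' implementation; the function name, decay rule, and EMA smoothing details are assumptions), a minimal sketch:

```python
import numpy as np

def egs_gate(contributions, ema_contrib, percentile=99.0, decay=0.9, ema_momentum=0.9):
    """Sketch of Evidence-Gated Suppression style gating.

    contributions: per-neuron, class-conditional contribution scores for the
        current batch (more negative = stronger support for the label).
    ema_contrib:   running EMA of those scores (smoothed gating).
    Returns (per-neuron multiplicative gates, updated EMA).
    All names and the exact gating rule are illustrative assumptions.
    """
    # Smooth the evidence statistic with an EMA before gating.
    ema_contrib = ema_momentum * ema_contrib + (1.0 - ema_momentum) * contributions

    # Find the most extreme contributors by percentile of |EMA contribution|
    # and decay them multiplicatively; everything else passes through.
    magnitude = np.abs(ema_contrib)
    threshold = np.percentile(magnitude, percentile)
    gates = np.where(magnitude >= threshold, decay, 1.0)
    return gates, ema_contrib

# Toy usage: 512 neurons, apply the gates to a hidden activation vector.
rng = np.random.default_rng(0)
contrib = rng.normal(size=512)
ema = np.zeros(512)
gates, ema = egs_gate(contrib, ema)
hidden = rng.normal(size=512)
suppressed_hidden = gates * hidden
```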
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2426. Structured Reasoning for LLMs: A Unified Framework for Efficiency and Explainability
๐ฌ 22
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Large Language Models Reasoning Chain of thought Reinforcement Learning Neurosymbolic AI Interpretability
Recent Large Language Models (LLMs) have made remarkable progress, but they still struggle with complex reasoning tasks such as logical deduction and planning. This is partly because they rely primarily on token-level probability relationships, which limits their ability to reason effectively.
In this paper, inspired by cognitive science and neurosymbolic AI, we introduce Structured Reasoning, which aims to enhance the reasoning capabilities of LLMs at the step level.
To this end, we first collect high-frequency, domain-agnostic reasoning step tags and construct a structured reasoning dataset with those tags.
Then, we treat a reasoning process as a directed acyclic graph, where the vertices represent steps and the edges indicate the direction of reasoning.
In this context, an efficient reasoning process corresponds to, or can be characterized by, a sparse reasoning graph.
To construct reasoning graphs, we introduce structured tags for reliable step extraction from LLM outputs. For single-graph optimization, we propose the MaxFlow reward, which rewards graphs with balanced node contributions and fewer redundant steps. The quality of a sparse reasoning graph can be reflected by the total flow from all steps to the final answer. For multi-graph comparison, we propose the LCS reward, which selects reliable reasoning paths by identifying optimal common subsequences (consecutive steps) shared across multiple generated responses (sequences).
Experiments with DeepSeek-R1-Distill-Qwen-1.5B and 7B models show that our method consistently outperforms GRPO and other carefully tuned baselines across various context lengths (0.5k-8k).
Structured Reasoning shows particular strength in efficiency (better performance with fewer steps) and stability (consistently generating high-quality outputs across a temperature range of 0.1 to 1.0).
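As a rough sketch of the LCS-style reward described above, assuming the reasoning traces have already been parsed into step-tag sequences via the structured tags (the pairwise scoring and normalization are assumptions, not the paper's exact reward):

```python
from itertools import combinations

def longest_common_run(a, b):
    """Length of the longest run of consecutive steps shared by two tagged
    reasoning traces (standard longest-common-substring dynamic program)."""
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def lcs_reward(traces):
    """Score each trace by how long a consecutive step run it shares, on
    average, with the other sampled responses (normalized by its length)."""
    rewards = [0.0] * len(traces)
    for i, j in combinations(range(len(traces)), 2):
        run = longest_common_run(traces[i], traces[j])
        rewards[i] += run / max(len(traces[i]), 1)
        rewards[j] += run / max(len(traces[j]), 1)
    n = max(len(traces) - 1, 1)
    return [r / n for r in rewards]

# Toy usage: step-tag sequences extracted from three sampled responses.
traces = [["define", "case_split", "bound", "conclude"],
          ["define", "bound", "conclude"],
          ["define", "case_split", "bound", "verify", "conclude"]]
print(lcs_reward(traces))
```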
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2427. SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
๐ฌ 26
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Flow-based policy Sample-Efficient Reinforcement Learning Soft actor critic Sequential Modeling
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives. Anonymized code is available at \url{https://anonymous.4open.science/r/SAC-FLOW}
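One way to realize a gated velocity field of the kind described (a GRU-style update replacing the plain residual flow step); layer sizes, the time encoding, and the rollout length below are assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class GatedVelocity(nn.Module):
    """Minimal sketch of a gated flow step: x_{k+1} = (1 - g) * x_k + g * candidate
    behaves like a gated recurrent cell instead of a plain residual update."""
    def __init__(self, act_dim, obs_dim, hidden=256):
        super().__init__()
        inp = act_dim + obs_dim + 1  # current action, observation, flow time
        self.gate = nn.Sequential(nn.Linear(inp, hidden), nn.SiLU(),
                                  nn.Linear(hidden, act_dim), nn.Sigmoid())
        self.cand = nn.Sequential(nn.Linear(inp, hidden), nn.SiLU(),
                                  nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, x, obs, t):
        h = torch.cat([x, obs, t], dim=-1)
        g = self.gate(h)
        return (1.0 - g) * x + g * self.cand(h)

def rollout(policy, obs, steps=8, act_dim=6):
    """Flow rollout from noise to an action, viewed as a recurrent computation."""
    x = torch.randn(obs.shape[0], act_dim)
    for k in range(steps):
        t = torch.full((obs.shape[0], 1), k / steps)
        x = policy(x, obs, t)
    return x

obs = torch.randn(4, 17)
action = rollout(GatedVelocity(6, 17), obs)
```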
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2428. Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens
๐ฌ 17
โญ 5.50
๐ causal reasoning
๐ท๏ธ Causality Reasoning Tasks Selection Mechanism
Due to their inherent complexity, reasoning tasks have long been regarded as rigorous benchmarks for assessing the capabilities of machine learning models, especially large language models (LLMs). Although humans can solve these tasks with ease, existing models, even after extensive pre-training and post-training at scale, still fail to perform reasoning reliably. In this paper, we revisit reasoning tasks from a causal perspective, seeking to understand their behavior in latent space and to offer insights for addressing their challenges. Specifically, we cast reasoning tasks as a selection mechanism, in which high-level logical concepts function as selection operators on the given observations, such as identifying the correct answer in a math problem or filling in the appropriate entry in Sudoku. We emphasize two key properties of this formulation that shed light on the difficulty of reasoning tasks. First, the latent space exceeds the observation space in complexity, even when the correct answer is fully determined by the observed input. Second, the latent variables, corresponding to logical thought, are densely structured and exhibit strong dependencies. Building on this formulation, we introduce a framework, called SR$^2$, that incorporates the estimated latent variables as feedback into the selection mechanism, thereby facilitating the learning of dense dependencies among latent representations. The framework consists of three key modules: reflective representation learning, dependency self-refinement, and periodic intermediate alignment. Experimentally, we show that our approach yields significant gains in reasoning accuracy, for example, attaining over 10% improvement in performance with 8$\times$ fewer parameters on the Sudoku and Maze tasks over recent advances.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2429. LLMs Struggle to Balance Reasoning and World Knowledge in Causal Narrative Understanding
๐ฌ 11
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Causal Inference Large Language Models Reasoning Narratives
The ability to robustly identify causal relationships is essential for autonomous decision-making and adaptation to novel scenarios. However, accurately inferring causal structure requires integrating both world knowledge and abstract logical reasoning. In this work, we investigate the interaction between these two capabilities through the representative task of causal reasoning over narratives. Through controlled synthetic, semi-synthetic and real-world experiments, we find that state-of-the-art large language models (LLMs) often rely on superficial heuristics, for example inferring causality from event order or recalling memorized world knowledge without attending to context. Furthermore, we show that simple reformulations of the task can elicit more robust reasoning behavior. Our evaluation spans a range of causal structures, from linear chains to complex graphs involving colliders and forks. These findings uncover systematic patterns in how LLMs perform causal reasoning and lay the groundwork for developing methods that better align LLM behavior with principled causal inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2430. Time-to-Move: Training-Free Motion-Controlled Video Generation via Dual-Clock Denoising
๐ฌ 11
โญ 5.50
๐ generative models
๐ท๏ธ Computer vision Generative models
Diffusion-based video generation can create realistic videos, yet existing image and text-based conditioning fails to offer precise motion control. Prior methods for motion control typically rely on displacement-based conditioning and require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations, obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection, as direct motion guidance, analogous to using coarse layout input in image editing. To integrate these signals, we adapt SDEdit to the video domain while anchoring the appearance with image conditioning. We further propose dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions and grants flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting.
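A rough sketch of how a region-dependent "dual clock" could be realized on top of SDEdit-style re-noising, with lower noise strength inside the user-specified motion region; the values and the per-pixel blending rule are assumptions, not the paper's exact sampler:

```python
import numpy as np

def dual_clock_strength(mask, t_motion=0.5, t_free=0.9):
    """Per-pixel re-noising strength in [0, 1] (1 = pure noise): pixels inside
    the user-specified motion region keep a weaker noise level so they stay
    close to the crude reference animation; pixels outside get a stronger one
    and are regenerated more freely."""
    return np.where(mask, t_motion, t_free)

def sdedit_init(reference_frame, strength_map, rng):
    """Noise the crude reference frame with a spatially varying strength."""
    noise = rng.normal(size=reference_frame.shape)
    s = strength_map[..., None]                     # broadcast over channels
    return np.sqrt(1.0 - s) * reference_frame + np.sqrt(s) * noise

rng = np.random.default_rng(0)
frame = rng.uniform(size=(64, 64, 3))               # crude cut-and-drag animation frame
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True                           # user-specified motion region
x_init = sdedit_init(frame, dual_clock_strength(mask), rng)
```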
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2431. Secret-Protected Evolution for Differentially Private Synthetic Text Generation
๐ฌ 11
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ synthetic data differential privacy
Text data has become extremely valuable for large language models (LLMs) and may even lead to artificial general intelligence (AGI).
A lot of high-quality text in the real world is private and cannot be freely used due to privacy concerns. Therefore, differentially private (DP) synthetic text generation has been proposed, aiming to produce high-utility synthetic data while protecting sensitive information.
However, existing DP synthetic text generation imposes uniform guarantees that often overprotect non-sensitive content, resulting in substantial utility loss and computational overhead. Therefore, we propose Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection.
Theoretically, we show that SecPE satisfies $(\vp, \vr)$-secret protection, constituting a relaxation of Gaussian DP that enables tighter utility-privacy trade-offs, while also substantially reducing computational complexity relative to baseline methods.
Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE consistently achieves lower Fréchet Inception Distance (FID) and higher downstream task accuracy than GDP-based Aug-PE baselines, while requiring less noise to attain the same level of protection.
Our results highlight that secret-aware guarantees can unlock more practical and effective privacy-preserving synthetic text generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2432. Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
๐ฌ 19
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Any-Depth Alignment Deep-prefill attacks Safety token Inference-time defense
Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: _Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths?_ To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built on our observation that alignment is concentrated in the _assistant header tokens_ through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at _any point in generation_. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance _without requiring any changes to the base model's parameters_. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving benign utility with minimal over-refusal and maintaining resilience even after the base model undergoes subsequent instruction tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2433. Operationalizing Data Minimization for Privacy-Preserving LLM Prompting
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ NLP privacy LLM data minimization data sanitization
The rapid deployment of large language models (LLMs) in consumer applications has led to frequent exchanges of personal information. To obtain useful responses, users often share more than necessary, increasing privacy risks via memorization, context-based personalization, or security breaches. We present a framework to formally define and operationalize *data minimization*: for a given user prompt and response model, quantifying the least privacy-revealing disclosure that maintains utility, and propose a priority-queue tree search to locate this optimal point within a privacy-ordered transformation space. We evaluated the framework on four datasets spanning open-ended conversations (ShareGPT, WildChat) and knowledge-intensive tasks with single-ground-truth answers (CaseHOLD, MedQA), quantifying achievable data minimization with nine LLMs as the response model. Our results demonstrate that larger frontier LLMs can tolerate stronger data minimization than smaller open-source models while maintaining task quality (*85.7%* redaction for GPT-5 vs. *19.3%* for Qwen2.5-0.5B). By comparing with our search-derived benchmarks, we find that LLMs struggle to predict optimal data minimization directly, showing a bias toward abstraction that leads to oversharing. This suggests not just a privacy gap, but a capability gap: models may lack awareness of what information they actually need to solve a task.
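An illustrative sketch of a priority-queue search over a privacy-ordered space of span redactions; the state encoding, the ordering, and the `utility_ok` oracle are assumptions standing in for the paper's transformation space and response-model check (the toy below exhaustively explores a tiny space):

```python
import heapq
from itertools import count

def minimize_disclosure(prompt_spans, utility_ok):
    """Best-first search over redaction states: states that redact more spans
    are expanded first, and the most-redacted state that still passes the
    utility check is returned."""
    n = len(prompt_spans)
    tie = count()
    start = frozenset()                       # indices of redacted spans
    heap = [(0, next(tie), start)]            # min-heap on -(#redacted spans)
    seen = {start}
    best = start
    while heap:
        _, _, state = heapq.heappop(heap)
        if utility_ok(state) and len(state) > len(best):
            best = state
        for i in range(n):
            if i not in state:
                nxt = frozenset(state | {i})
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-len(nxt), next(tie), nxt))
    return best

# Toy usage: utility survives as long as the actual question (span 2) is kept.
spans = ["name", "employer", "question", "zip code"]
best = minimize_disclosure(spans, lambda redacted: 2 not in redacted)
print("redact:", sorted(spans[i] for i in best))
```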
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2434. UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
๐ฌ 23
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Unified Multimodal Understanding and Generation Image Editing CLIP
In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of **0.90** on GenEval, **0.63** on WISE, and **3.94** on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2435. VLM-Guided Adaptive Negative Prompting for Creative Generation
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Generative models Computational graphics
Creative generation is the synthesis of new, surprising, and valuable samples that reflect user intent yet cannot be envisioned in advance. This task aims to extend human imagination, enabling the discovery of visual concepts that exist in the unexplored spaces between familiar domains.
While text-to-image diffusion models excel at rendering photorealistic scenes that faithfully match user prompts, they still struggle to generate genuinely novel content.
Existing approaches to enhance generative creativity either rely on interpolation of image features, which restricts exploration to predefined categories, or require time-intensive procedures such as embedding optimization or model fine-tuning.
We propose VLM-Guided Adaptive Negative-Prompting, a training-free, inference-time method that promotes creative image generation while preserving the validity of the generated object.
Our approach utilizes a vision-language model (VLM) that analyzes intermediate outputs of the generation process and adaptively steers it away from conventional visual concepts, encouraging the emergence of novel and surprising outputs.
We evaluate creativity through both novelty and validity, using statistical metrics in the CLIP embedding space. Through extensive experiments, we show consistent gains in creative novelty with negligible computational overhead.
Moreover, unlike existing methods that primarily generate single objects, our approach extends to complex scenarios, such as generating coherent sets of creative objects and preserving creativity within elaborate compositional prompts. Our method integrates seamlessly into existing diffusion pipelines, offering a practical route to producing creative outputs that venture beyond the constraints of textual descriptions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2436. Detecting Temporal Misalignment Attacks in Multimodal Fusion for Autonomous Driving
๐ฌ 17
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Multimodal Fusion Temporal Misalignment Attack Attack Detection Autonomous Driving
Multimodal fusion (MMF) is crucial for autonomous driving perception, combining camera and LiDAR streams for reliable scene understanding. However, its reliance on precise temporal synchronization introduces a vulnerability: adversaries can exploit network-induced delays to subtly misalign sensor streams, degrading MMF performance. To address this, we propose AION, a lightweight, plug-in defense tailored for the autonomous driving scenario. AION integrates continuity-aware contrastive learning to learn smooth multimodal representations and a DTW-based detection mechanism to trace temporal alignment paths and generate misalignment scores. AION demonstrates strong and consistent robustness against a wide range of temporal misalignment attacks on KITTI and nuScenes, achieving high average AUROC for camera-only (0.9493) and LiDAR-only (0.9495) attacks, while sustaining robust performance under joint cross-modal attacks (0.9195 on most attacks) with low false-positive rates across fusion backbones. Code will be publicly released upon acceptance (currently available at \url{https://anonymous.4open.science/r/AION-F10B}).
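A minimal sketch of a DTW-based misalignment score over per-frame camera and LiDAR embeddings; the cosine cost, the path-drift statistic, and the normalization are assumptions, not AION's exact detector:

```python
import numpy as np

def dtw_path(cost):
    """Standard DTW on a pairwise cost matrix; returns the alignment path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                     # backtrack to (1, 1)
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def misalignment_score(cam_feats, lidar_feats):
    """Cosine-distance DTW between per-frame embeddings, scored by how far
    the optimal alignment path drifts from the diagonal."""
    a = cam_feats / np.linalg.norm(cam_feats, axis=1, keepdims=True)
    b = lidar_feats / np.linalg.norm(lidar_feats, axis=1, keepdims=True)
    path = dtw_path(1.0 - a @ b.T)
    drift = np.mean([abs(i - j) for i, j in path])
    return drift / max(len(cam_feats), 1)

rng = np.random.default_rng(0)
cam = rng.normal(size=(20, 32))
lidar = np.roll(cam, shift=3, axis=0)          # simulated 3-frame delay attack
print(misalignment_score(cam, lidar))
```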
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2437. Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
๐ฌ 24
โญ 5.50
๐ generative models
๐ท๏ธ video generation diffusion models DPO
Recent advances in video generation have enabled thrilling experiences in producing realistic videos driven by scalable diffusion transformers. However, they usually fail to produce satisfactory outputs that are aligned to users' authentic demands and preferences. In this work, we introduce Dual-Iterative Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment. For the reward model, our framework ensures reliable and robust reward signals via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation. Given this, we optimize video foundation models with guidance from the reward model's feedback, thus improving the synthesis quality in subject consistency, motion smoothness, aesthetic quality, etc. The reward model and video generation model complement each other and are progressively improved over the multi-round iteration, without requiring tedious manual preference annotations. Comprehensive experiments demonstrate that the proposed Dual-IPO can effectively and consistently improve the video generation quality of the base model across various architectures and sizes, even helping a model with only 2B parameters surpass a 5B one. Moreover, our analysis experiments and ablation studies identify the rationale behind our systematic design and the efficacy of each component. Our code and models will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2438. WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models
๐ฌ 20
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision language model singular value decomposition quantization
Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low-rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate importance to each element during the SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce \textit{Weighted SVD} (WSVD), which outperforms other approaches by achieving over $1.8\times$ decoding speedup while preserving accuracy.
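A rough sketch of importance-weighted low-rank factorization; the paper weights individual elements, whereas this simplification folds a diagonal per-input-dimension weighting into the SVD, so it only indicates the idea:

```python
import numpy as np

def weighted_svd(W, col_importance, rank):
    """Scale each input dimension by its importance, take the truncated SVD of
    the scaled matrix, then fold the inverse scaling back into one factor."""
    s = np.sqrt(col_importance)                 # diagonal importance scaling
    U, sing, Vt = np.linalg.svd(W * s[None, :], full_matrices=False)
    Uk = U[:, :rank] * sing[:rank]              # (out, rank)
    Vk = Vt[:rank, :] / s[None, :]              # (rank, in), undo the scaling
    return Uk, Vk                               # W ~ Uk @ Vk

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
importance = rng.uniform(0.1, 1.0, size=512)    # e.g. mean |activation| per input dim
Uk, Vk = weighted_svd(W, importance, rank=64)
weighted_err = np.linalg.norm((W - Uk @ Vk) * np.sqrt(importance)[None, :])
```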
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2439. Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Human Motion Generation Chain-of-Thought Reinforcement Learning
Text-to-Motion generation has become a fundamental task in human-machine interaction, enabling the synthesis of realistic human motions from natural language descriptions. Although recent advances in large language models and reinforcement learning have contributed to high-quality motion generation, two major challenges remain. Existing approaches often fail to capture the temporal and causal complexities inherent in natural language, leading to oversimplified or incoherent motions. Additionally, RL-based methods are frequently overly complex, hindering their scalability and adaptability across various motion generation tasks.
To address these challenges, we propose **Motion-R1**, a novel framework that combines decomposed Chain-of-Thought reasoning with reinforcement learning to enhance both the quality and interpretability of generated motions. Specifically, we introduce the **Decomposed CoT Data Engine**, which leverages an automated pipeline to synthesize high-quality reasoning data, allowing the model to better capture the temporal dependencies and causal relationships of human motion. We also propose **RL Binding**, a reinforcement learning strategy that incorporates multi-modal text-motion alignment into the RL reward function, guiding the model to produce motions that are both semantically accurate and motionally realistic. Extensive experiments across benchmark datasets demonstrate that Motion-R1 achieves state-of-the-art performance, with a 3.5\% improvement in MM-Dist on HumanML3D and improvements in R-Precision and FID on KIT-ML and BABEL, surpassing existing methods across key metrics and highlighting its superior capability in handling complex motion generation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2440. Robust Denoising Neural Reranker for Recommender Systems
๐ฌ 4
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Multi-stage Recommenders Reranking Model Adversarial Learning
For multi-stage recommenders in industry, a user request would first trigger a simple and efficient retriever module that selects and ranks a list of relevant items, then the recommender calls a slower but more sophisticated reranking model that refines the item list exposure to the user. To consistently optimize the two-stage retrieval reranking framework, most efforts have focused on learning reranker-aware retrievers. In contrast, there has been limited work on how to achieve a retriever-aware reranker. In this work, we provide evidence that the retriever scores from the previous stage are informative signals that have been underexplored. Specifically, we first empirically show that the reranking task under the two-stage framework is naturally a noise reduction problem on the retriever scores, and theoretically show the limitations of naive utilization techniques of the retriever scores. Following this notion, we derive an adversarial framework DNR that associates the denoising reranker with a carefully designed noise generation module. The resulting DNR solution extends the conventional score error minimization loss with three augmented objectives, including: 1) a denoising objective that aims to denoise the noisy retriever scores to align with the user feedback; 2) an adversarial retriever score generation objective that improves the exploration in the retriever score space; and 3) a distribution regularization term that aims to align the distribution of generated noisy retriever scores with the real ones. We conduct extensive experiments on three public datasets and an industrial recommender system, together with analytical support, to validate the effectiveness of the proposed DNR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2441. Joint Optimization for 4D Human-Scene Reconstruction in the Wild
๐ฌ 12
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human Scene Interaction Global Human Motion Estimation Dense Scene Reconstruction
Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. Compared to prior works that perform separate optimization of the human, the camera, and the scene, JOSH leverages the human-scene contact constraints to jointly optimize all parameters in a single stage. Experiment results demonstrate that JOSH significantly improves 4D human-scene reconstruction, global human motion estimation, and dense scene reconstruction by utilizing the joint optimization of scene geometry, human motion, and camera poses. Further studies show that JOSH can enable scalable training of end-to-end global human motion models on extensive web data, highlighting its robustness and generalizability. The code and model will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2442. OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research
๐ฌ 10
โญ 5.50
๐ optimization
๐ท๏ธ Operations Research Process Reward Model Large Language Model
Large language models (LLMs) with Process Reward Models (PRMs) have shown strong reasoning ability, yet their potential in Operations Research (OR) remains unexplored. We present the first PRM tailored for OR, but find that directly training on mainstream datasets yields surprisingly weak performance.
To understand this gap, we conduct a systematic analysis and identify the primary bottleneck: the datasets themselves, where over 30\% of annotations are severely flawed.
To overcome these limitations, we first collect all existing synthetic datasets and apply a carefully designed filtering pipeline to construct a high-quality seed dataset.
Building upon this seed, we then build OR-ProcessQA, the first large-scale dataset for OR with step-by-step supervision, where diverse solution pathways are generated via Monte Carlo Tree Search (MCTS) and each step is validated for logical consistency by GPT-4o.
Building on this foundation, we train OR-PRM, the first Process Reward Model in the OR domain, designed to evaluate and guide reasoning at every step rather than only the final outcome.
Together, these advances enable OR-PRM to substantially improve LLMs' reasoning capability, achieving a maximum absolute improvement of 12.5\% over the base model in Best-of-N settings, and highlighting the power of process-oriented supervision for reliable problem solving in operations research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2443. AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large language model Reasoning Anesthesiology Medicine
The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Leveraging this suite, we develop Morpheus, the first baseline model collection for anesthesiology reasoning. Despite undergoing limited training with SFT and group relative policy optimization (GRPO), Morpheus demonstrates substantial performance improvements, rivaling the performance of larger-scale models. Furthermore, through comprehensive evaluations and experiments, we analyze the key factors influencing anesthesiology reasoning performance, including model characteristics, training strategies and training data. Both AnesSuite and Morpheus will be open-sourced to the public.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2444. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
๐ฌ 22
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Scaling laws Multimodal Transformers Foundation models Visual cortex Neural encoding models Neural decoding Behavioral prediction Calcium imaging
Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision.
Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.3 million neurons from the visual cortex of 78 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task transformer models (1M-300M parameters) that support three regimes flexibly at test time: neural prediction (predicting neuronal responses from sensory input and behavior), behavioral decoding (predicting behavior from neural activity), neural forecasting (predicting future activity from current neural dynamics), or any combination of the three.
We find that performance scales reliably with more data, but gains from increasing model size saturate -- suggesting that current brain models are limited by data rather than compute. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling -- even in the mouse visual cortex, a relatively simple and low-resolution system -- models remain data-limited despite vast recordings. These findings highlight the need for richer stimuli, tasks, and larger-scale recordings to build brain foundation models. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2445. Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Speculative Decoding; Draft Tree Reward; Tree Optimization
Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward provably improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B), GTO increases acceptance length by 7.4\% and yields an additional 7.7\% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference. Code and draft models are available at https://anonymous.4open.science/r/GTO-ICLR-348F/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2446. PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Part Segmentation Segment Anything Open-World Segmentation Interactive Segmentation
Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a "Segment-Every-Part" mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2447. Draft-based Approximate Inference for LLMs
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ long-context sparse attention KV cache eviction prompt compression
Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory costs of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) **SpecKV**, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) **SpecPC**, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) **SpecKV-PC**, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same improvements in memory usage, latency, and throughput.
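A schematic sketch of draft-guided KV selection, assuming the draft model's attention from its lookahead tokens back to the prompt is already available as a tensor; the aggregation, the keep budget, and the protected sink/recent sets are assumptions, not the paper's exact procedure:

```python
import numpy as np

def kv_keep_indices(draft_attn, keep_ratio=0.3, sink=4, recent=64):
    """Score prompt positions by the draft model's lookahead attention mass and
    keep the top ones, always protecting attention-sink and recent tokens.
    draft_attn: (layers, heads, lookahead_len, prompt_len)."""
    prompt_len = draft_attn.shape[-1]
    scores = draft_attn.mean(axis=(0, 1, 2))                 # (prompt_len,)
    budget = max(int(keep_ratio * prompt_len), sink + recent)
    keep = set(range(sink)) | set(range(prompt_len - recent, prompt_len))
    for idx in np.argsort(-scores):
        if len(keep) >= budget:
            break
        keep.add(int(idx))
    return np.array(sorted(keep))

# Toy usage with random draft attention over a 1024-token prompt.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(1024), size=(4, 8, 16))         # (4, 8, 16, 1024)
print(kv_keep_indices(attn)[:10])
```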
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2448. Dual-Branch Representations with Dynamic Gated Fusion and Triple-Granularity Alignment for Deep Multi-View Clustering
๐ฌ 19
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multi-View Clustering
Multi-view clustering seeks to exploit complementary information across different views to enhance clustering performance, where both semantic and structural information are crucial. However, existing approaches are often biased toward one type of information while treating the other as auxiliary, overlooking that the reliability of these signals may vary across datasets and that semantic and structural cues can provide complementary and parallel guidance. As a result, such methods may face limitations in generalization and suboptimal clustering performance. To address these issues, we propose a novel method, Dual-branch Representations with dynamic gatEd fusion and triple-grAnularity alignMent (DREAM), for deep multi-view clustering. Specifically, DREAM disentangles semantic information via a Variational Autoencoder (VAE) branch, while simultaneously capturing structure-aware features through a Graph Convolutional Network (GCN) branch. The resulting representations are dynamically integrated using a gated fusion module that leverages structural cues as complementary guidance, adaptively balancing semantic and structural contributions to produce clustering-oriented latent embeddings. To further improve robustness and discriminability, we introduce a triple-granularity feature alignment mechanism that enforces consistency across views, within individual samples, and intra-cluster, thereby preserving semantic-structural coherence while enhancing inter-cluster separability. Extensive experiments on benchmark datasets demonstrate that DREAM significantly outperforms SOTA approaches, highlighting the effectiveness of disentangled dual-branch encoding, adaptive gated fusion, and triple-granularity feature alignment for multi-view clustering.
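A minimal sketch of the dynamic gated fusion of the two branches; the layer sizes and the sigmoid gating form are assumptions:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """A sigmoid gate computed from both branches decides, per dimension, how
    much of the VAE (semantic) and GCN (structural) embeddings enters the
    clustering-oriented representation."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, z_vae, z_gcn):
        g = self.gate(torch.cat([z_vae, z_gcn], dim=-1))
        return g * z_vae + (1.0 - g) * z_gcn

z_sem, z_str = torch.randn(8, 128), torch.randn(8, 128)
z = GatedFusion(128)(z_sem, z_str)
```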
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2449. Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
๐ฌ 23
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Self-Evolving Agent Agent Safety Large Language Models Safety Evaluation
Advances in Large Language Models (LLMs) have enabled a new class of \textbf{\textit{self-evolving agents}} that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as \textit{\textbf{Misevolution}}. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (\textit{e.g.}, Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents.
Warning: this paper includes examples that may be offensive or harmful in nature.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2450. PEAR: Phase Entropy Aware Reward for Efficient Reasoning
๐ฌ 24
โญ 5.50
๐ generative models
๐ท๏ธ Large Reasoning Models Large Language Models Efficient Reasoning
Large Reasoning Models (LRMs) have achieved impressive performance on complex reasoning tasks by generating detailed chain-of-thought (CoT) explanations. However, these responses are often excessively long, containing redundant reasoning steps that inflate inference cost and reduce usability. Controlling the length of generated reasoning without sacrificing accuracy remains an open challenge. Through a systematic empirical analysis, we reveal a consistent positive correlation between model entropy and response length at different reasoning stages across diverse LRMs: the thinking phase exhibits higher entropy, reflecting the exploratory behavior of longer responses, while the final answer phase shows lower entropy, indicating a more deterministic solution. This observation suggests that entropy at different reasoning stages can serve as a control knob for balancing conciseness and performance. Based on this insight, this paper introduces Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase and allows moderate exploration in the final answer phase, which encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly. This enables adaptive control of response length without relying on explicit length targets or rigid truncation rules. Extensive experiments across four benchmarks demonstrate that PEAR consistently reduces response length while sustaining competitive accuracy across model scales. In addition, PEAR demonstrates strong out-of-distribution (OOD) robustness beyond the training distribution.
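A minimal sketch of a phase-entropy-aware reward of this kind; the coefficients and the exact way thinking- and answer-phase entropies enter the reward are assumptions:

```python
import numpy as np

def pear_style_reward(correct, token_entropies, think_mask, alpha=0.1, beta=0.02):
    """Correctness reward reduced by the mean token entropy of the thinking
    phase (discouraging long, exploratory chains), with only a light penalty
    on answer-phase entropy to leave room for exploration there."""
    ent = np.asarray(token_entropies, dtype=float)
    mask = np.asarray(think_mask, dtype=bool)
    think_ent = ent[mask].mean() if mask.any() else 0.0
    answer_ent = ent[~mask].mean() if (~mask).any() else 0.0
    return float(correct) - alpha * think_ent - beta * answer_ent

# Toy usage: 6 thinking tokens followed by 3 answer tokens.
entropies = [2.1, 1.8, 1.7, 2.0, 1.5, 1.2, 0.4, 0.3, 0.2]
mask = [True] * 6 + [False] * 3
print(pear_style_reward(correct=True, token_entropies=entropies, think_mask=mask))
```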
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2451. STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Monocular and Video 3D reconstruction
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2452. Training Dynamics Impact Post-Training Quantization Robustness
๐ฌ 15
โญ 5.50
๐ optimization
๐ท๏ธ Efficiency quantization optimization
While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
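A small sketch of the kind of per-checkpoint measurement such an analysis relies on: symmetric round-to-nearest int8 quantization of a weight matrix and its relative error, which can be tracked across checkpoints and learning-rate phases (the specific PTQ scheme used in the paper is not assumed here):

```python
import numpy as np

def int8_rtn_relative_error(weight):
    """Symmetric per-row round-to-nearest int8 quantization and the resulting
    relative weight reconstruction error."""
    scale = np.maximum(np.abs(weight).max(axis=1, keepdims=True) / 127.0, 1e-12)
    dequant = np.clip(np.round(weight / scale), -127, 127) * scale
    return np.linalg.norm(weight - dequant) / np.linalg.norm(weight)

rng = np.random.default_rng(0)
checkpoint_weight = rng.normal(size=(1024, 1024)).astype(np.float32)
print(int8_rtn_relative_error(checkpoint_weight))
```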
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2453. A Hierarchical Circuit Symbolic Discovery Framework for Efficient Logic Optimization
๐ฌ 24
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Electronic Design Automation; Logic Synthesis; Large Language Models;
The efficiency of Logic Optimization (LO) has become one of the key bottlenecks in chip design. To prompt efficient LO, many graph-based machine learning (ML) methods, such as graph neural networks (GNNs), have been proposed to predict and prune a large number of ineffective subgraphs of the LO heuristics. However, the high inference cost and limited interpretability of these approaches severely limit their wide application to modern LO tools. To address this challenge, we propose a novel **H**ierarchical C**i**rcuit **S**ymbolic Discovery Framework, namely HIS, to learn a *lightweight* and *interpretable* symbolic function that can *accurately* identify ineffective subgraphs for efficient LO. Specifically, HIS proposes a hierarchical tree structure to represent the circuit symbolic function, where every layer of the symbolic tree performs an efficient and interpretable message passing to capture the structural information of the circuit graph. To learn the hierarchical tree, we propose a circuit symbolic generation framework that leverages reinforcement learning to optimize a structure-aware Transformer model for symbolic token generation. To the best of our knowledge, HIS is *the first* approach to discover an efficient, interpretable, and high-performance symbolic function from the circuit graph for efficient LO. Experiments on two widely used circuit benchmarks show that the learned graph symbolic functions outperform previous state-of-the-art approaches in terms of efficiency and optimization performance. Moreover, we integrate HIS with the Mfs2 heuristic, one of the most time-consuming LO heuristics. Results show that HIS significantly enhances both its efficiency and optimization performance on a CPU-based machine, achieving an average runtime improvement of 27.22% and a 6.95% reduction in circuit size.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2454. Forest-Based Graph Learning for Semi-Supervised Node Classification
๐ฌ 16
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph neural networks Graph learning Node classifications
Existing Graph Neural Networks usually learn long-distance knowledge via stacked layers or global attention, but struggle to balance cost-effectiveness and global receptive field. In this work, we break the dilemma by proposing a novel forest-based graph learning (FGL) paradigm that enables efficient long-range information propagation. Our key insight is to reinterpret message passing on a graph as transportation over spanning trees that naturally facilitates long-range knowledge aggregation, where several trees--a forest--can capture complementary topological pathways. Theoretically, we demonstrate that as edge-homophily estimates improve, the induced distribution biases towards higher-homophily trees, which enables generating a high-quality forest by refining a homophily estimator. Furthermore, we propose a linear-time tree aggregator that realizes quadratic node-pair interactions. Empirically, our framework achieves comparable results against state-of-the-art counterparts on semi-supervised node classification tasks while remaining efficient. Codes are available at \url{https://anonymous.4open.science/r/FGL/}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2455. Information Estimation with Discrete Diffusion
๐ฌ 17
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Information Theory Deep Learning
Information-theoretic measures, such as Mutual Information (MI), play a crucial role in understanding non-linear relationships between random variables and are widely used across scientific disciplines. Yet, their use on real-world discrete data remains challenging. Existing methods typically rely on embedding discrete data into a continuous space and apply neural estimators originally designed for continuous distributions. This process requires careful engineering for both the embedding model and estimator architecture, but suffers from issues related to high data dimensionality. In this work, we introduce InfoSEDD, a discrete diffusion-based approach that bridges information-theoretic estimation and generative modeling such that they can be used to compute Kullback-Leibler divergences. Grounded in the theory of continuous-time Markov chains, the design of InfoSEDD is lightweight and scalable and allows seamless integration with pretrained models.
We showcase the versatility of our approach through applications on motif discovery in genetic promoter data, semantic-aware model selection in text summarization, and entropy estimation in Ising models.
Finally, we construct consistency tests on real-world textual and genomics data. Our experiments demonstrate that InfoSEDD outperforms alternatives that rely on the ``embedding trick''. Our results position InfoSEDD as a robust and scalable tool for information-theoretic analysis of discrete data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2456. Steering Evaluation-Aware Language Models To Act Like They Are Deployed
๐ฌ 9
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI safety Interpretability
Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
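A minimal sketch of activation steering via a forward hook that adds a fixed direction to a layer's hidden states; the hooked layer, the scale, and the difference-of-means construction of the vector are assumptions, not the paper's exact setup:

```python
import torch

def add_steering_hook(layer, steering_vector, scale=4.0):
    """Add a fixed direction to the hooked layer's output hidden states.
    Assumes the module returns the hidden-states tensor first (as in common
    decoder blocks); returning a value from a forward hook replaces the output."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            steered = output[0] + scale * steering_vector.to(output[0].dtype)
            return (steered,) + output[1:]
        return output + scale * steering_vector.to(output.dtype)
    return layer.register_forward_hook(hook)

# Hypothetical usage: build the vector as a difference of mean activations
# between deployment-like and evaluation-like prompts, then steer generation.
#   v = acts_deployment.mean(0) - acts_evaluation.mean(0)
#   handle = add_steering_hook(model.model.layers[15], v)
#   ... generate ...
#   handle.remove()
```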
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2457. The Natural Geometry of Code: Hyperbolic Representation Learning for Program Reasoning
๐ฌ 9
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Geometric Deep Learning Hyperbolic Representation Learning Source Code Program Reasoning Graph Neural Networks Abstract Syntax Tree (AST)
State-of-the-art models for code representation, such as GraphCodeBERT, embed the hierarchical structure of source code into Euclidean space. This approach can lead to significant representation distortion, especially when embedding deep or highly branched hierarchies, limiting the models' ability to capture deep program semantics. We argue that the natural geometry for code is hyperbolic, as its exponential volume growth perfectly matches the tree-like structure of a code's Abstract Syntax Tree (AST), enabling low-distortion hierarchical embeddings. We introduce HypeCodeNet, a geometric deep learning framework that operates natively in hyperbolic space. Formulated in the numerically stable Lorentz model, its manifold-aware components include a hyperbolic embedding layer, a tangent space message-passing mechanism, and a geodesic-based attention module. On code clone detection, code completion, and link prediction, HypeCodeNet significantly outperforms existing Euclidean models, especially on tasks requiring deep structural understanding. Our work suggests that hyperbolic geometry offers a geometrically sound foundation for code representation, establishing it as a key to unlocking the structured semantics of code.
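For reference, a minimal sketch of the basic Lorentz-model operations such a framework builds on (Minkowski inner product, geodesic distance, exponential map at the origin); the curvature convention is an assumption:

```python
import numpy as np

def minkowski_dot(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lorentz_distance(x, y, k=1.0):
    """Geodesic distance on the hyperboloid of curvature -1/k."""
    inner = np.clip(-minkowski_dot(x, y) / k, 1.0, None)
    return np.sqrt(k) * np.arccosh(inner)

def expmap0(v_spatial, k=1.0):
    """Exponential map at the hyperboloid origin: lifts a Euclidean (tangent)
    vector, e.g. an AST node feature, into the Lorentz model."""
    norm = np.clip(np.linalg.norm(v_spatial, axis=-1, keepdims=True), 1e-9, None)
    x0 = np.sqrt(k) * np.cosh(norm / np.sqrt(k))
    xs = np.sqrt(k) * np.sinh(norm / np.sqrt(k)) * v_spatial / norm
    return np.concatenate([x0, xs], axis=-1)

# Toy check: two embedded node features and their hyperbolic distance.
a, b = expmap0(np.array([0.3, -0.1])), expmap0(np.array([1.2, 0.8]))
print(lorentz_distance(a, b))
```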
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2458. Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
๐ฌ 13
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Evaluation Large Language Models Probing Interpretability
Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
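A minimal sketch of the Representation-as-a-Judge idea: a lightweight probe regressing pooled hidden states onto aspect-level scores. The pooling choice and the ridge probe are assumptions, not the paper's exact probing architecture:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-in for pooled hidden states of a small LM (one vector per
# candidate answer) and aspect-level quality labels.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 1024))
aspect_scores = rng.uniform(0.0, 1.0, size=2000)

# Fit a ridge probe on a training split; at evaluation time the probe's
# prediction replaces prompted judging (decoding-free).
probe = Ridge(alpha=1.0).fit(hidden_states[:1500], aspect_scores[:1500])
predicted_scores = probe.predict(hidden_states[1500:])
```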
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2459. Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World
๐ฌ 26
โญ 5.50
๐ generative models
๐ท๏ธ human-scene interaction VLM agent motion generation
In this paper, we explore how to empower general-purpose Vision-Language Models (VLMs) to control humanoid agents. General-purpose VLMs (e.g., GPT-4) exhibit strong open-world generalization, and remove the need for additional fine-tuning data. To build such an agent, two key components are required: (1) an embodied instruction compiler, which enables the VLM to observe the scene and translate high-level user instructions into low-level control parameters; and (2) a motion executor, which generates human motions from these parameters while adapting to real-time physical feedback.
We present BiBo, a VLM-driven humanoid agent composed of an embodied instruction compiler and a diffusion-based motion executor. The compiler interprets user instructions in context with the environment, and leverages a chain of visual question answering (VQA) to guide the VLM in specifying control parameters (e.g., motion captions, locations). The diffusion executor extends future joint trajectories from prior motion, conditioned on both control parameters and environmental feedback.
Experiments demonstrate that BiBo achieves an interaction task success rate of 90.2\% in open environments, and improves the precision of text-guided motion execution by 16.3\% over prior methods. BiBo handles not only basic interactions but also diverse motions, even dancing while striking a sandbag. The code will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2460. Your VAR Model is Secretly an Efficient and Explainable Generative Classifier
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ generative classifier generative model autoregressive model
Generative classifiers, which leverage conditional generative models for classification, have recently demonstrated desirable properties such as robustness to distribution shifts. However, recent progress in this area has been largely driven by diffusion-based models, whose substantial computational cost limits their scalability in practice.
To address the efficiency concern, we investigate generative classifiers built upon recent advances in visual autoregressive (VAR) modeling. Owing to their tractable likelihood, VAR-based generative classifiers enable significantly more efficient inference compared to diffusion-based counterparts. Building on this foundation, we introduce the Adaptive VAR Classifier$^+$ (A-VARC$^+$), which further improves accuracy while reducing computational cost, substantially enhancing practical usability.
Beyond efficiency, we also study several properties of VAR-based generative classifiers that distinguish them from conventional discriminative models. In particular, the tractable likelihood facilitates visual explainability via token-wise mutual information, and the model naturally adapts to class-incremental learning without requiring additional replay data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2461. Efficient Offline Reinforcement Learning via Peer-Influenced Constraint
๐ฌ 17
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning Distributional Shift Peer-Influenced Constraint Generalization Uncertainty Estimation
Offline reinforcement learning (RL) seeks to learn an optimal policy from a fixed dataset, but distributional shift between the dataset and the learned policy often leads to suboptimal real-world performance. Existing methods typically use behavior policy regularization to constrain the learned policy, but these conservative approaches can limit performance and generalization, especially when the behavior policy is suboptimal. We propose a Peer-Influenced Constraint (PIC) framework with a ``peer review" mechanism. Specifically, we construct a set of similar states and use the corresponding actions as candidates, from which we select the optimal action to constrain the policy. This method helps the policy escape local optima while approximately ensuring that it stays within the in-distribution space, boosting both performance and generalization. We also introduce an improved version, Ensemble Peer-Influenced Constraint (EPIC), which combines ensemble methods to achieve strong performance while maintaining high efficiency. Additionally, we uncover the Coupling Effect between PIC and uncertainty estimation, providing valuable insights for offline RL. We evaluate our methods on classic continuous control tasks from the D4RL benchmark, with both PIC and EPIC achieving competitive performance compared to state-of-the-art approaches.
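A rough sketch of the "peer review" idea as described: gather actions taken in the k most similar dataset states and use the highest-value candidate as the constraint target; the k-NN retrieval, the dummy Q-function, and the selection rule are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def peer_candidate_actions(dataset_states, dataset_actions, query_state, k=8):
    """Return dataset actions taken in the k states most similar to the query state."""
    nn_index = NearestNeighbors(n_neighbors=k).fit(dataset_states)
    _, idx = nn_index.kneighbors(query_state[None, :])
    return dataset_actions[idx[0]]

def peer_constrained_action(q_value_fn, query_state, candidates):
    """Pick the candidate with the highest estimated value: the 'peer' target the
    learned policy would be pulled toward (a sketch, not the paper's exact loss)."""
    values = np.array([q_value_fn(query_state, a) for a in candidates])
    return candidates[values.argmax()]

# Toy usage on a random "dataset" with a dummy Q-function.
rng = np.random.default_rng(0)
S, A = rng.normal(size=(500, 4)), rng.normal(size=(500, 2))
q = lambda s, a: -np.linalg.norm(a)          # pretend smaller actions are better
target = peer_constrained_action(q, S[0], peer_candidate_actions(S, A, S[0]))
print(target)
```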
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2462. Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
๐ฌ 23
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ VLA Generalist Robot Policies Efficient Fine-tuning Classifier Guidance
Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2463. QPrompt-R1: Real-Time Reasoning for Domain-Generalized Semantic Segmentation via Group-Relative Query Alignment
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Semantic Segmentation; real time; domain generalization
Deploying semantic segmentation in driving and robotics requires both real-time inference and robustness to domain shifts, formalized as Real-Time Domain-Generalized Semantic Segmentation (RT-DGSS), which has not been fully addressed. Existing methods often treat real-time (RT) inference and domain generalization (DG) separately, with DG improving robustness but lacking real-time performance, and real-time models being brittle under distribution shifts. To address the RT-DGSS problem, we propose QPrompt-R1, a real-time Query-Prompt architecture built on a ViT backbone. QPrompt-R1 injects a small set of learnable queries only at the final transformer block, performing a single query-image alignment step and eliminating decoder overhead. To further enhance alignment without test-time cost, we introduce a Group Relative Query Alignment (GRQA) objective, which uses group-relative supervision to align queries with features, improving domain generalization through group-relative rewards. QPrompt-R1 achieves 54 FPS, delivering strong performance in synthetic-to-real transfer, real-to-real generalization, and robustness under adverse conditions. Additionally, GRQA is plug-and-play, improving state-of-the-art DGSS methods like REIN (+1.2) and SoMA (+0.5) without inference-time overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2464. Greater than the Sum of Its Parts: Building Substructure into Protein Encoding Models
๐ฌ 22
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein biology representation learning benchmark multiscale protein models structure representation learning substructures motifs domain protein function protein structure protein representation learning
Protein representation learning has achieved major advances using large sequence and structure datasets, yet current models primarily operate at the level of individual residues or entire proteins. This overlooks a critical aspect of protein biology: proteins are composed of recurrent, evolutionarily conserved substructures that mediate core molecular functions. Despite decades of curated biological knowledge, these substructures remain largely unexploited in modern protein models. We introduce Magneton, an integrated environment for developing substructure-aware protein models. Magneton provides (1) a large-scale dataset of 530,601 proteins annotated with over 1.7 million substructures spanning 13,075 types, (2) a training framework for incorporating substructures into existing models, and (3) a benchmark suite of 13 tasks probing residue-, substructure-, and protein-level representations. Using Magneton, we develop substructure-tuning, a supervised fine-tuning method that distills substructural knowledge into pretrained protein models. Across state-of-the-art sequence- and structure-based models, substructure-tuning improves function-related tasks while revealing that substructural signals are complementary to global structural information.
The Magneton environment, datasets, and substructure-tuned models are all openly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2465. On Code-Induced Reasoning in LLMs
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ code-induced reasoning systematic perturbations large language models data-centric evaluation
Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets in ten programming languages and apply controlled perturbations that selectively disrupt structural or semantic properties of code. We then finetune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, and encoding the same information with fewer tokens, without adhering to the original syntax, can often retain or even improve performance. Remarkably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains, with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we aim to provide insight into how different properties of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2466. When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
๐ฌ 4
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Long Context Multi-agent LLM
We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that divides the failure modes of long-context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a long sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi-agent chunking. By exploring superlinear model noise growth with input length, we also explain why, for large inputs, a weaker model configured with chunk-based processing can surpass a more advanced model like GPT-4o applied in a single shot. Overall, we present a principled framework for understanding long-context processing, and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.
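A minimal sketch of the multi-agent chunking pattern analyzed here, with `call_llm` standing in for any prompt-to-completion function; chunk size, overlap, and the aggregation prompt are arbitrary choices.

```python
from typing import Callable, List

def chunked_answer(context: str, question: str, call_llm: Callable[[str], str],
                   chunk_size: int = 4000, overlap: int = 200) -> str:
    """Divide-and-conquer sketch: answer the question per chunk, then aggregate."""
    chunks: List[str] = []
    step = chunk_size - overlap
    for start in range(0, max(len(context), 1), step):
        chunks.append(context[start:start + chunk_size])
    # Each chunk is processed independently (cheap, parallelizable "agents").
    partial = [call_llm(f"Context:\n{c}\n\nQuestion: {question}\nAnswer briefly:")
               for c in chunks]
    # A final aggregation call integrates the partial results.
    aggregate_prompt = ("Combine these partial answers into one final answer.\n"
                        + "\n".join(f"- {p}" for p in partial)
                        + f"\n\nQuestion: {question}\nFinal answer:")
    return call_llm(aggregate_prompt)

# Toy run with an echoing stub in place of a real model.
print(chunked_answer("A" * 9000, "How long is the context?", lambda p: p[:40]))
```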
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2467. End-to-end Listen, Look, Speak and Act
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Artificial General Intelligence Speech Dialogue Model VLA Model Full Duplex Model
Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. A demonstration is available at https://anonymous.4open.science/r/LLSA-E821.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2468. Predicting LLM Reasoning Performance with Small Proxy Model
๐ฌ 5
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Models Pre-training Reasoning Evaluation Efficiency
Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize recipes before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit \textit{emergent} behavior that only appears reliably at larger model sizes, often exceeding 7B parameters. To address this, we introduce \tsc{rBridge}, showing that small proxies ($\leq$1B) can effectively predict large-model reasoning by aligning more closely with \textbf{(1)} the pre-training objective and \textbf{(2)} the target task. \tsc{rBridge} achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, \tsc{rBridge} \textbf{(i)} reduces dataset ranking costs by over 100$\times$ relative to the best baseline, \textbf{(ii)} achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and \textbf{(iii)} transfers predictive relationships across pre-training recipes at 1B to 7B scale. These findings indicate that \tsc{rBridge} offers a practical path for exploring reasoning-oriented pre-training at lower cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2469. CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
๐ฌ 9
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ LLM Benchmark Condensed Matter Physics LLM Evaluation AI for Physics
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground-truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 29% accuracy on CMPhysBench, underscoring a significant capability gap in this practical and frontier domain relative to traditional physics.
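A toy illustration of tree-based partial credit on symbolic answers, in the spirit of (but not identical to) the SEED score: parse both expressions into syntax trees and turn a crude edit distance into a graded similarity.

```python
import ast

def to_tree(node):
    """Turn a Python AST node into nested (label, children) tuples."""
    label = type(node).__name__
    if isinstance(node, ast.Constant):
        label = f"Const:{node.value}"
    elif isinstance(node, ast.Name):
        label = f"Name:{node.id}"
    return (label, tuple(to_tree(c) for c in ast.iter_child_nodes(node)))

def size(t):
    return 1 + sum(size(c) for c in t[1])

def tree_distance(a, b):
    """Crude positional edit distance: relabeling costs 1, unmatched subtrees cost their size."""
    d = 0 if a[0] == b[0] else 1
    for x, y in zip(a[1], b[1]):
        d += tree_distance(x, y)
    d += sum(size(x) for x in a[1][len(b[1]):])
    d += sum(size(y) for y in b[1][len(a[1]):])
    return d

def partial_credit(pred_expr, gold_expr):
    """Map the distance to a [0, 1] score so near-miss answers earn graded credit."""
    a = to_tree(ast.parse(pred_expr, mode="eval"))
    b = to_tree(ast.parse(gold_expr, mode="eval"))
    return max(0.0, 1.0 - tree_distance(a, b) / max(size(a), size(b)))

print(partial_credit("2*pi*r", "2*pi*r"))     # exact match -> 1.0
print(partial_credit("2*pi*R", "2*pi*r"))     # near miss (wrong variable) -> high partial credit
print(partial_credit("pi*r**2", "2*pi*r"))    # structurally different -> little or no credit
```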
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2470. TaskCraft: Automated Generation of Agentic Tasks
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ agent generation LLM agentic task
Agentic tasks, which require multistep problem solving with tool use and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. Although benchmarks such as GAIA and BrowseComp have advanced agent evaluation, their scalability remains limited by the high cost of human annotation. We introduce TaskCraft, the first automated workflow for generating scalable, multi-tool, and verifiable agentic tasks of varying difficulty. TaskCraft progressively complexifies atomic tasks through depth-based and width-based extensions, with incremental validation via rejection sampling and LLM-based linguistic analysis, ensuring both scalability and efficiency. The generated tasks enable trajectory sampling within state-of-the-art workflows, supporting end-to-end SFT and RL training. Experimental results on multiple LLMs show that TaskCraft data substantially improves multi-hop reasoning and agentic capabilities. Further scaling with TaskCraft tasks and applying RL training yields additional gains, achieving state-of-the-art performance on four agentic benchmarks. The resulting dataset comprises 41k tool-intensive tasks across varied difficulty levels, including 12.6k tool-interaction trajectories and 5k multihop decompositions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2471. Probability Distributions Computed by Hard-Attention Transformers
๐ฌ 15
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ expressivity weighted automata language models transformers linear temporal logic
Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2472. ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation
๐ฌ 9
โญ 5.50
๐ generative models
๐ท๏ธ Large Language Model Test-time compute Value Function Sampling
Large language models excel at reasoning but lack key aspects of introspection, including the ability to anticipate their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this ability, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods such as Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial inference cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token during generation, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length, with no extra models, architecture changes, or inference overhead. This full joint distribution is used to compute a sampling utility, which is a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC allows models to reason adaptively and more efficiently.
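A sketch of the kind of sampling utility described above, assuming the model emits a discrete distribution over final reward plus an expected remaining length; the cost coefficients and the closed-form expected-maximum computation are illustrative, not ZIP-RC's exact objective.

```python
import numpy as np

def expected_max_reward(reward_probs, reward_values, n):
    """E[max of n i.i.d. draws] from a discrete reward distribution,
    via P(max = r_i) = F(r_i)^n - F(r_{i-1})^n on sorted values."""
    order = np.argsort(reward_values)
    p, r = reward_probs[order], reward_values[order]
    cdf = np.cumsum(p)
    cdf_prev = np.concatenate([[0.0], cdf[:-1]])
    return float(np.sum(r * (cdf ** n - cdf_prev ** n)))

def sampling_utility(reward_probs, reward_values, expected_len, n,
                     compute_cost=1e-4, latency_cost=5e-4):
    """Utility = expected best reward - compute (all samples) - a latency term;
    a linear combination in the spirit of the abstract, with made-up coefficients."""
    gain = expected_max_reward(reward_probs, reward_values, n)
    return gain - compute_cost * n * expected_len - latency_cost * expected_len

# Choose the number of samples maximizing utility for a toy predicted distribution.
probs, vals, length = np.array([0.6, 0.4]), np.array([0.0, 1.0]), 200.0
best_n = max(range(1, 17), key=lambda n: sampling_utility(probs, vals, length, n))
print(best_n, sampling_utility(probs, vals, length, best_n))
```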
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2473. A Revisit of Active Sequential Prediction-Powered Mean Estimation
๐ฌ 10
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ active statistical inference mean estimation no-regret learning
In this work, we revisit the problem of active sequential prediction-powered mean estimation, where at each round one must decide the query probability of the ground-truth label upon observing the covariates of a sample. Furthermore, if the label is not queried, the prediction from a machine learning model is used instead. Prior work proposed an elegant scheme that determines the query probability by combining an uncertainty-based suggestion with a constant probability that encodes a soft constraint on the query probability. We explored different values of the mixing parameter and observed an intriguing empirical pattern: the smallest confidence width tends to occur when the weight on the constant probability is close to one, thereby reducing the influence of the uncertainty-based component. Motivated by this observation, we develop a non-asymptotic analysis of the estimator and establish a data-dependent bound on its confidence interval. Our analysis further suggests that when a no-regret learning approach is used to determine the query probability and control this bound, the query probability converges to the maximum value allowed by the constraint when it is chosen obliviously to the current covariates. We also conduct simulations that corroborate these theoretical findings.
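A small simulation sketch of the sequential prediction-powered estimator being revisited: each round queries the true label with probability pi_t (a mixture of a constant probability and an uncertainty-based suggestion) and otherwise uses the model prediction with an inverse-probability correction; all constants here are made up.

```python
import numpy as np

def pp_mean_estimate(preds, labels, uncertainty, rng, mix=0.9, p_const=0.3, p_min=0.05):
    """Sequential prediction-powered mean estimate: the true label is queried with
    probability pi_t; otherwise the prediction stands in, with an inverse-probability
    correction keeping each per-round term unbiased (a sketch)."""
    terms = []
    for f_t, y_t, u_t in zip(preds, labels, uncertainty):
        # Weight `mix` on the constant probability, (1 - mix) on the uncertainty score.
        pi_t = float(np.clip(mix * p_const + (1 - mix) * u_t, p_min, 1.0))
        queried = rng.random() < pi_t
        correction = (y_t - f_t) / pi_t if queried else 0.0
        terms.append(f_t + correction)
    terms = np.asarray(terms)
    half_width = 1.96 * terms.std(ddof=1) / np.sqrt(len(terms))
    return terms.mean(), half_width

rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, size=2000)              # ground-truth labels
f = y + rng.normal(0.3, 0.5, size=2000)          # biased model predictions
u = np.abs(f - np.median(f)); u = u / u.max()    # toy uncertainty scores in [0, 1]
est, hw = pp_mean_estimate(f, y, u, rng)
print(f"estimate = {est:.3f} +/- {hw:.3f}  (true mean = {y.mean():.3f})")
```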
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2474. Difference-Aware Retrieval Policies for Imitation Learning
๐ฌ 10
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ behavior cloning robotics imitation learning nearest neighbor retrieval
Behavior cloning suffers from poor generalization to out-of-distribution states due to compounding errors during deployment. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a novel nearest-neighbor-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on k-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. Our method requires no additional data collection, online expert feedback, or task-specific knowledge beyond standard behavior cloning prerequisites. We demonstrate consistent performance improvements of 15-46\% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features.
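A sketch of the neighborhood reparameterization described above, assuming the policy consumes (state-difference vectors, neighbor actions) from the k nearest expert states rather than the raw state; the network shape and k are illustrative, not DARP's reported configuration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.neighbors import NearestNeighbors

class NeighborhoodPolicy(nn.Module):
    """Predict an action from (state-difference vectors, neighbor actions) instead of
    the raw state: a sketch of difference-aware retrieval-based imitation."""
    def __init__(self, state_dim, action_dim, k=5, hidden=128):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Linear(k * (state_dim + action_dim), hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def features(self, expert_states, expert_actions, query):
        nn_index = NearestNeighbors(n_neighbors=self.k).fit(expert_states)
        _, idx = nn_index.kneighbors(query[None, :])
        diffs = expert_states[idx[0]] - query              # relative offsets to the query
        feats = np.concatenate([diffs, expert_actions[idx[0]]], axis=1).ravel()
        return torch.tensor(feats, dtype=torch.float32)

    def forward(self, expert_states, expert_actions, query):
        return self.net(self.features(expert_states, expert_actions, query))

# Toy usage on random "demonstrations".
rng = np.random.default_rng(0)
S, A = rng.normal(size=(200, 6)), rng.normal(size=(200, 2))
policy = NeighborhoodPolicy(state_dim=6, action_dim=2)
print(policy(S, A, S[0]).shape)   # torch.Size([2])
```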
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2475. NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context
๐ฌ 35
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large language models value alignment nursing values
While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse-patient conflicts. Such risks highlight the urgent need to evaluate whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: _Altruism_, _Human Dignity_, _Integrity_, _Justice_, and _Professionalism_. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse-patient conflicts. The **Easy-Level** dataset consists of 2,200 value-aligned and value-violating instances, which are collected through a five-month longitudinal field study across three hospitals of varying tiers; the **Hard-Level** dataset comprises 2,200 dialogue-based instances that embed contextual cues and subtle misleading signals, which increase adversarial complexity and better reflect the subjectivity and bias of narrators in the context of emerging nurse-patient conflicts. We evaluate a total of 23 SoTA LLMs on their ability to align with nursing values, and find that general LLMs outperform medical ones, and _Justice_ is the hardest value dimension. As the first real-world benchmark for healthcare value alignment, NurValues provides novel insights into how LLMs navigate ethical challenges in clinician-patient interactions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2476. Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement
๐ฌ 22
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Articulated object understanding dual-Gaussian representation Prior-free Motion-based part segmentation
Articulated objects are ubiquitous in daily life. Our goal is to achieve a high-quality reconstruction, segmentation of independent moving parts, and analysis of articulation. Recent methods analyze two different articulation states and perform per-point part segmentation, optimizing per-part articulation using cross-state correspondences, given a priori knowledge of the number of parts. Such assumptions greatly limit their applications and performance. Their robustness is reduced when objects are not clearly visible in both states. To address these issues, in this paper, we present a novel framework, *Articulation in Motion (AiM)*. We infer part-level decomposition and articulation kinematics, and reconstruct an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object and a video that shows the movement of separate parts. It uses motion cues to segment the object into parts and assign articulation joints. Subsequently, a robust, sequential RANSAC is employed to achieve part mobility analysis *without any part-level structural priors*, which clusters moving primitives into rigid parts and estimates kinematics while automatically determining the number of parts. The proposed approach separates the object into parts, each represented as a 3D Gaussian set, enabling high-quality rendering. Our approach yields higher quality part segmentation than all previous methods, without prior knowledge. Extensive experimental analysis on both simple and complex objects validates the effectiveness and strong generalization ability of our approach. \textit{Code and implementation details will be made publicly available.}
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2477. Visual Prompt-Agnostic Evolution
๐ฌ 19
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ computer vision visual prompt tuning
Visual Prompt Tuning (VPT) enables effective adaptation of a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A closer layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to a cross-layer mismatch. These issues contribute to slower convergence and degraded final performance. To address these challenges, we propose the Prompt-Agnostic Evolution ($\mathtt{PAE}$) method, which strengthens visual prompt tuning by explicitly modeling the dynamics of learnable prompts. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we further employ a shared Koopman operator, which imposes a global linear transformation rather than uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments demonstrate that using $\mathtt{PAE}$ with VPT variants not only accelerates convergence with an average 1.41$\times$ speedup but also yields 1-3% gains on 25 datasets spanning multiple downstream tasks. Beyond performance, $\mathtt{PAE}$ remains prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes, providing a practical and scalable solution for advancing prompt tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2478. Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
๐ฌ 19
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Efficient Training Representation Learning
As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.
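A minimal sketch of late-to-early guidance as described, assuming lists of per-layer hidden states from a student forward pass and a frozen small teacher; the layer indices, linear projection, and 0.1 loss weight are assumptions, not LET's actual recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def let_auxiliary_loss(student_hiddens, teacher_hiddens, proj,
                       student_layer=4, teacher_layer=-1):
    """Align an *early* student layer with a *late* layer of a small pretrained teacher
    (cosine distance after a linear projection); a sketch of late-to-early guidance."""
    s = student_hiddens[student_layer]            # (batch, seq, d_student)
    t = teacher_hiddens[teacher_layer].detach()   # (batch, seq, d_teacher), frozen teacher
    return 1.0 - F.cosine_similarity(proj(t), s, dim=-1).mean()

# Toy shapes: a large student (d=2048) guided by a small teacher (d=768).
proj = nn.Linear(768, 2048)
student_h = [torch.randn(2, 16, 2048) for _ in range(12)]
teacher_h = [torch.randn(2, 16, 768) for _ in range(6)]
loss_lm = torch.tensor(0.0)                       # stand-in for the usual LM loss
loss = loss_lm + 0.1 * let_auxiliary_loss(student_h, teacher_h, proj)
print(loss)
```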
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2479. MedGMAE: Gaussian Masked Autoencoders for Medical Volumetric Representation Learning
๐ฌ 9
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Representation Medical Imaging analysis Volumetric Representation Learning
Self-supervised pre-training has emerged as a critical paradigm for learning transferable representations from unlabeled medical volumetric data. Masked autoencoder-based methods have garnered significant attention, yet their application to volumetric medical images faces fundamental limitations from the discrete voxel-level reconstruction objective, which neglects comprehensive anatomical structure continuity. To address this challenge, we propose MedGMAE, a novel framework that replaces traditional voxel reconstruction with 3D Gaussian primitive reconstruction, offering a new perspective on representation learning. Our approach learns to predict complete sets of 3D Gaussian parameters as semantic abstractions to represent the entire 3D volume, from sparse visible image patches. MedGMAE demonstrates dual utility across medical imaging applications. For representation learning, sparse Gaussian prediction produces superior encoder representations that outperform traditional MAE baselines on downstream segmentation, classification, and registration tasks. For volumetric reconstruction, the Gaussian decoder leverages pretrained anatomical priors to accelerate 3D CT volume reconstruction convergence. Extensive experiments across multiple medical imaging datasets demonstrate that our approach achieves superior performance, establishing a new paradigm for medical image pre-training. Code will be released soon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2480. From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3d reconstruction computer vision monocular dynamic reconstruction
Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity of inferring 3D motion from limited views and the computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2481. OmniPortrait: Fine-Grained Personalized Portrait Synthesis via Pivotal Optimization
๐ฌ 22
โญ 5.50
๐ generative models
๐ท๏ธ Personalized Portrait Synthesis ImageGeneration High-Fidelity Facial Details Pivotal Optimization
Image identity customization aims to synthesize realistic and diverse portraits of a specified identity, given a reference image and a text prompt. This task presents two key challenges: (1) generating realistic portraits that preserve fine-grained facial details of the reference identity, and (2) maintaining identity consistency while achieving strong alignment with the text prompt. Our findings suggest that existing single-stream methods fail to capture and guide fine-grained identity details.
To address these challenges, we introduce \textit{OmniPortrait}, a novel diffusion-based framework for fine-grained identity fidelity and high editability in portrait synthesis. Our core idea is pivotal optimization, which leverages dual-stream identity guidance in a coarse-to-fine manner. First, a Pivot ID Encoder is proposed and trained with a face localization loss while avoiding the degradation of editability typically caused by fine-tuning the denoiser. Although this encoder primarily guides coarse-level identity synthesis, it provides a good initialization that serves as the identity pivot for optimization during inference.
Second, we propose Reference-Based Guidance, which performs on-the-fly feature matching and optimization over diffusion intermediate features conditioned on the identity pivot. In addition, our approach is able to generalize naturally to multi-identity customized image generation scenarios. Extensive experiments demonstrate significant improvements in both identity preservation and text alignment, establishing a new benchmark for image identity customization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2482. Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images
๐ฌ 11
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Anomaly Detection AI-Generated Images
The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle \textbf{semantic anomalies}, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection, and semantic authenticity assessment. In this paper, we formalize \textbf{semantic anomaly detection and reasoning} for AIGC images and introduce \textbf{AnomReason}, a large-scale benchmark with structured annotations as quadruples \emph{(Name, Phenomenon, Reasoning, Severity)}. Annotations are produced by a modular multi-agent pipeline (\textbf{AnomAgent}) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17\,B GPT-4o tokens, providing scale evidence for the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (\textit{SemAP} and \textit{SemF1}). Applications to {explainable deepfake detection} and {semantic reasonableness assessment of image generators} demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2483. FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Efficient Large Multimodal models Video Large Language Models Visual Token Compression
Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only $\textbf{10}$% of visual tokens, FlashVID preserves $\textbf{99.1}$% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a $\textbf{10$\times$}$ increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of $\textbf{8.6}$% within the same computational budget. Code will be made publicly available.
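A rough sketch of attention-plus-diversity token selection in the spirit of ADTS, assuming per-token attention scores are available; the keep ratio, similarity threshold, and greedy rule are illustrative, not FlashVID's exact modules.

```python
import torch
import torch.nn.functional as F

def select_tokens(tokens, attn_scores, keep_ratio=0.1, sim_thresh=0.9):
    """Keep the most-attended visual tokens while skipping near-duplicates of tokens
    already kept (a rough attention + diversity selection sketch)."""
    n_keep = max(1, int(keep_ratio * tokens.shape[0]))
    order = torch.argsort(attn_scores, descending=True)
    normed = F.normalize(tokens, dim=-1)
    kept = []
    for i in order.tolist():
        if len(kept) >= n_keep:
            break
        if kept and (normed[i] @ normed[kept].T).max() > sim_thresh:
            continue                                  # too similar to a kept token
        kept.append(i)
    return tokens[kept], sorted(kept)

# Toy usage: 1960 frame tokens of width 1024, random attention scores.
toks = torch.randn(1960, 1024)
scores = torch.rand(1960)
kept_tokens, kept_idx = select_tokens(toks, scores)
print(kept_tokens.shape, len(kept_idx))
```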
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2484. Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention
๐ฌ 23
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ novel view synthesis novel view synthesis transformer large model
Feedforward models for novel view synthesis (NVS) have recently advanced by transformer-based methods like LVSM, using attention among all input and target views. In this work, we argue that its full self-attention design is suboptimal, suffering from quadratic complexity with respect to the number of input views and rigid parameter sharing among heterogeneous tokens. We propose \textbf{Efficient-LVSM}, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation. Efficient-LVSM achieves 30.6 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.9 dB, with 2× faster training convergence and 4.2× faster inference speed.
Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and enables incremental inference with KV-cache, thanks to its decoupled designs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2485. Enhancing Multivariate Time Series Forecasting with Global Temporal Retrieval
๐ฌ 4
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time-series
Multivariate time series forecasting (MTSF) plays a vital role in numerous real-world applications, yet existing models remain constrained by their reliance on a limited historical context. This limitation prevents them from effectively capturing global periodic patterns that often span cycles significantly longer than the input horizon, despite such patterns carrying strong predictive signals. Naïve solutions, such as extending the historical window, lead to severe drawbacks, including overfitting, prohibitive computational costs, and redundant information processing. To address these challenges, we introduce the Global Temporal Retriever (GTR), a lightweight and plug-and-play module designed to extend any forecasting model's temporal awareness beyond the immediate historical context. GTR maintains an adaptive global temporal embedding of the entire cycle and dynamically retrieves and aligns relevant global segments with the input sequence. By jointly modeling local and global dependencies through a 2D convolution and residual fusion, GTR effectively bridges short-term observations with long-term periodicity without altering the host model architecture. Extensive experiments on six real-world datasets demonstrate that GTR consistently delivers state-of-the-art performance across both short-term and long-term forecasting scenarios, while incurring minimal parameter and computational overhead. These results highlight GTR as an efficient and general solution for enhancing global periodicity modeling in MTSF tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2486. Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Continual Unlearning Diffusion Model Image Generation Machine Unlearning
Machine unlearning, the ability to remove designated concepts from a pre-trained model, has advanced rapidly, particularly for text-to-image diffusion models. However, existing methods typically assume that unlearning requests arrive all at once, whereas in practice they often arrive sequentially. We present the first systematic study of continual unlearning in text-to-image diffusion models and show that popular unlearning methods suffer from rapid utility collapse: after only a few requests, models forget retained knowledge and generate degraded images. We trace this failure to cumulative parameter drift from the pre-training weights and argue that regularization is crucial to addressing it. To this end, we study a suite of add-on regularizers that (1) mitigate drift and (2) remain compatible with existing unlearning methods. Beyond generic regularizers, we show that semantic awareness is essential for preserving concepts close to the unlearning target, and propose a gradient-projection method that constrains parameter drift orthogonal to their subspace. This substantially improves continual unlearning performance and is complementary to other regularizers for further gains. Taken together, our study establishes continual unlearning as a fundamental challenge in text-to-image generation and provides insights, baselines, and open directions for advancing safe and accountable generative AI.
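A minimal sketch of gradient projection orthogonal to a protected subspace, the general mechanism the abstract's method builds on; how the protected directions are obtained (here random) and where the projection is applied are assumptions, not the paper's procedure.

```python
import torch

def orthonormal_basis(directions):
    """Orthonormalize a stack of 'protected' directions with QR."""
    q, _ = torch.linalg.qr(directions.T)   # columns span the same subspace
    return q.T                              # rows are orthonormal

def project_out(grad, basis):
    """Remove the component of `grad` lying in the span of `basis`, so updates stay
    orthogonal to directions tied to concepts we want to preserve."""
    flat = grad.reshape(-1)
    coeffs = basis @ flat                   # coordinates in the protected subspace
    return (flat - basis.T @ coeffs).reshape(grad.shape)

# Toy usage on a single parameter tensor.
param = torch.nn.Parameter(torch.randn(8, 8))
protected = orthonormal_basis(torch.randn(3, 64))   # 3 protected directions in parameter space
loss = (param ** 2).sum()
loss.backward()
param.grad = project_out(param.grad, protected)
print(torch.allclose(protected @ param.grad.reshape(-1), torch.zeros(3), atol=1e-5))
```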
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2487. Accelerating Inference for Multilayer Neural Networks with Quantum Computers
๐ฌ 25
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ quantum machine learning QML quantum algorithms quantum deep learning quantum computing
Fault-tolerant Quantum Processing Units (QPUs) promise to deliver exponential speed-ups in select computational tasks, yet their integration into modern deep learning pipelines remains unclear. In this work, we take a step towards bridging this gap by presenting the first fully-coherent quantum implementation of a multilayer neural network with non-linear activation functions. Our constructions mirror widely used deep learning architectures based on ResNet, and consist of residual blocks with multi-filter 2D convolutions, sigmoid activations, skip-connections, and layer normalizations. We analyse the complexity of inference for networks under three quantum data access regimes. Without any assumptions, we establish a quadratic speedup over classical methods for shallow bilinear-style networks. With efficient quantum access to the weights, we obtain a quartic speedup over classical methods. With efficient quantum access to both the inputs and the network weights, we prove that a network with an $N$-dimensional vectorized input, $k$ residual block layers, and a final residual-linear-pooling layer can be implemented with an error of $\epsilon$ with $O(\text{polylog}(N/\epsilon)^k)$ inference cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2488. VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
๐ฌ 20
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ llm agent tool use multi-turn interaction real-world application
As LLMs with agentic abilities are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks, and less than a 50% success rate on single-scenario tasks. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2489. FedOpenMatch: Towards Semi-Supervised Federated Learning in Open-Set Environments
๐ฌ 15
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ semi-supervised federated learning open-set federated learning
Semi-supervised federated learning (SSFL) has emerged as an effective approach to leverage unlabeled data distributed across multiple data owners for improving model generalization. Existing SSFL methods typically assume that labeled and unlabeled data share the same label space. However, in realistic federated scenarios, unlabeled data often contain categories absent from the labeled set, i.e., outliers, which can severely degrade the performance of SSFL algorithms.
In this paper, we address this under-explored issue, formally propose the open-set semi-supervised federated learning (OSSFL) problem, and develop the first OSSFL framework, FedOpenMatch. Our method adopts a one-vs-all (OVA) classifier as the outlier detector, equipped with logit adjustment to mitigate inlier-outlier imbalance and a gradient stop mechanism to reduce feature interference between the OVA and inlier classifiers. In addition, we introduce the logit consistency regularization loss, yielding more robust performance.
Extensive experiments on standard benchmarks across diverse data settings demonstrate the effectiveness of FedOpenMatch, which significantly outperforms the baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2490. SimpleGVR: A Simple Baseline for Latent-Cascaded Generative Video Super-Resolution
๐ฌ 22
โญ 5.50
๐ generative models
๐ท๏ธ High-resolution text-to-video generation; Generative Video Super-Resolution
Cascaded pipelines, which use a base text-to-video (T2V) model for low-resolution content and a video super-resolution (VSR) model for high-resolution details, are a prevailing strategy for efficient video synthesis. However, current works suffer from two key limitations: an inefficient pixel-space interface that introduces non-trivial computational overhead, and mismatched degradation strategies that compromise the visual quality of AIGC content. To address these issues, we introduce SimpleGVR, a lightweight VSR model designed to operate entirely within the latent space. Key to SimpleGVR are a latent upsampler for effective, detail-preserving conditioning of the high-resolution synthesis, and two degradation strategies (flow-based and model-guided) to ensure better alignment with the upstream T2V model. To further enhance the performance and practical applicability of SimpleGVR, we introduce a set of crucial training optimizations: a detail-aware timestep sampler, a suitable noise augmentation range, and an efficient interleaving temporal unit mechanism for long-video handling. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded systems. Video visual comparisons are available \href{https://simplegvr.github.io/}{here}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2491. ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration
๐ฌ 24
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ LLM GUI Agent Reward System
Reward is critical to the evaluation and training of large language models (LLMs). However, existing rule-based or model-based reward methods struggle to generalize to GUI agents, where access to ground-truth trajectories or application databases is often unavailable, and static trajectory-based LLM-as-a-Judge approaches suffer from limited accuracy. To address these challenges, we propose ProRe, a proactive reward system that leverages a general-purpose reasoner and domain-specific evaluator agents (actors). The reasoner schedules targeted state probing tasks, which the evaluator agents then execute by actively interacting with the environment to collect additional observations. This enables the reasoner to assign more accurate and verifiable rewards to GUI agents. Empirical results on over 3K trajectories demonstrate that ProRe improves reward accuracy and F1 score by up to 5.3\% and 19.4\%, respectively. Furthermore, integrating ProRe with state-of-the-art policy agents yields a success rate improvement of up to 22.4\%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2492. Continuous multinomial logistic regression for neural decoding
๐ฌ 26
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Neural population coding Conditional Density Estimation Gaussian processes Variational inference Probabilistic models
Multinomial logistic regression (MLR) is a classic model for multi-class classification that has been widely used for neural decoding. However, MLR requires a finite set of discrete output classes, limiting its applicability to settings with continuous-valued outputs (e.g., time, orientation, velocity, or spatial position). To address this limitation, we propose Continuous Multinomial Logistic Regression (CMLR), a generalization of MLR to continuous output spaces. CMLR defines a novel exponential-family model for conditional density estimation (CDE), mapping neural population activity to a full probability density over external covariates. It captures the influence of each neuron's activity on the decoded variable through a smooth, interpretable tuning function, regularized by a Gaussian process prior. The resulting nonparametric decoding model flexibly captures a wide variety of conditional densities, including multimodal, asymmetric, and circular distributions. We apply CMLR to large-scale datasets from mouse and monkey primary visual cortex, mouse hippocampus, and monkey motor cortex, and show that it consistently outperforms a wide variety of other decoding methods, including deep neural networks (DNNs), XGBoost, and FlexCode. It also outperforms correlation-blind models such as Naive Bayes, highlighting the importance of correlations for accurate neural decoding. The CMLR model provides a scalable, flexible, and interpretable method for decoding responses from diverse brain regions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2493. Diverse and Sparse Mixture-of-Experts for Causal Subgraph-Based Out-of-Distribution Graph Learning
๐ฌ 17
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph neural network out-of-distribution learning mixture-of-experts
Current state-of-the-art methods for out-of-distribution (OOD) generalization lack the ability to effectively address datasets with heterogeneous causal subgraphs at the instance level. Existing approaches that attempt to handle such heterogeneity either rely on data augmentation, which risks altering label semantics, or impose causal assumptions whose validity in real-world datasets is uncertain. We introduce a novel *Mixture-of-Experts (MoE)* framework that can model heterogeneous causal subgraphs without relying on restrictive assumptions. Our key idea is to address instance-level heterogeneity by enforcing semantic *diversity* among experts, each generating a distinct causal subgraph, while a learned gate assigns *sparse* weights that adaptively focus on the most relevant experts for each input. Our theoretical analysis shows that these two properties jointly reduce OOD error. In practice, our experts are scalable and do not require environment labels. Empirically, our framework achieves strong performance on the GOOD benchmark across both synthetic and real-world structural shifts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2494. Beyond Entity Correlations: Disentangling Event Causal Puzzles in Temporal Knowledge Graphs
๐ฌ 17
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Event Prediction; Temporal Knowledge Graph; Representation Learning
Existing Temporal Knowledge Graph (TKG) representation learning approaches focus on modeling entity correlations. However, since TKG datasets are constructed from events, which inherently contain heterogeneous causalities, focusing solely on entity- or relation-level correlations is inadequate for event prediction in TKGs. Although a TKG structural causal model can be established as a theoretical framework for event level causality disentangling, practical disentanglement is non-trivial due to the lack of explicit supervision signals. To this end, we propose a Heterogeneous Event causality Disentangling Representation learning Approach (HEDRA) for TKG reasoning, which is the first work that focuses on disentangling heterogeneous causalities at the event level in TKGs. Specifically, a counterfactual detector module is proposed to disentangle non-causality by leveraging event importance and distributional discrepancies of event representations. Moreover, an Instrumental Variable (IV)-guided disentangling module is proposed to disentangle spurious causality by constructing IVs, which can produce robust event representations against spurious causality through multi-view causality subgraphs. Finally, an evolutionary orthogonal module is proposed to separate dynamic causality from static causality for event prediction. Comprehensive experiments on five real-world datasets demonstrate that HEDRA achieves state-of-the-art performance. The source code of HEDRA is available at https://anonymous.4open.science/r/HEDRA-8A2F.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2495. On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Multimodal humor generation multi-role framework LLM humor retrieval
Humor is a common yet sophisticated form of human language in daily life. However, humor generation in multi-modal contexts remains a challenging task for large language models (LLMs), despite its many useful applications such as generating funny captions for images, which requires visual understanding, humor reasoning, creative imagination, and more. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) retrieval-augmented hierarchical imaginator that identifies key humor targets and expands the creative space through diverse associations structured as imagination trees; and (3) caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2496. Rate-Distortion Optimized Communication for Collaborative Perception
๐ฌ 21
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Pragmatic communication collaborative perception rate-distortion analysis
Collaborative perception emphasizes enhancing environmental understanding by enabling multiple agents to share visual information with limited bandwidth resources. While prior work has explored the empirical trade-off between task performance and communication volume, a significant gap remains in the theoretical foundation. To fill this gap, we draw on information theory and introduce a pragmatic rate-distortion theory for multi-agent collaboration, specifically formulated to analyze performance-communication trade-off in goal-oriented multi-agent systems. This theory concretizes two key conditions for designing optimal communication strategies: supplying pragmatically relevant information and transmitting redundancy-less messages. Guided by these two conditions, we propose RDcomm,
a communication-efficient collaborative perception framework that introduces two key innovations: i) task entropy discrete coding, which assigns features with task-relevant codeword-lengths to maximize the efficiency in supplying pragmatic information; ii) mutual-information-driven message selection, which utilizes mutual information neural estimation to approach the optimal redundancy-less condition. Experiments on 3D detection and BEV segmentation show that RDcomm achieves state-of-the-art accuracy on datasets DAIR-V2X, OPV2V, V2XSeq, and V2V4Real, while reducing communication volume by up to 108ร. The code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2497. BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning
๐ฌ 16
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Unsupervised Reinforcement Learning Robotics Humanoid Robot Robotics Foundation Model
Building Behavioral Foundation Models (BFMs) for humanoid robots has the potential to unify diverse control tasks under a single, promptable generalist policy. However, existing approaches are either exclusively deployed on simulated humanoid characters, or specialized to specific tasks such as tracking. We propose BFM-Zero, a framework that learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to be prompted for multiple downstream tasks without retraining. This well-structured latent space in BFM-Zero enables versatile and robust whole-body skills on a Unitree G1 humanoid in the real world, via diverse inference methods, including zero-shot motion tracking, goal reaching, and reward inference, and few-shot optimization-based adaptation. Unlike prior on-policy reinforcement learning (RL) frameworks, BFM-Zero builds upon recent advancements in unsupervised RL and Forward-Backward (FB) models, which offer an objective-centric, explainable, and smooth latent representation of whole-body motions. We further extend BFM-Zero with critical reward shaping, domain randomization, and history-dependent asymmetric learning to bridge the sim-to-real gap. Those key design choices are quantitatively ablated in simulation. A first-of-its-kind model, BFM-Zero establishes a step toward scalable, promptable behavioral foundation models for whole-body humanoid control. Videos: http://bfm-zero-anonymous.pages.dev/
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2498. Token-level Data Selection for Safe LLM Fine-tuning
๐ฌ 25
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM LLM safety
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods.
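As a rough illustration of the token-level scoring idea (an assumption-laden sketch, not the TOSS implementation; the sign convention, threshold rule, and names are all hypothetical), one can score each token by the per-token loss difference between a safety-degraded reference model and a utility-oriented reference model, then mask high-risk tokens out of the fine-tuning loss:

```python
import torch
import torch.nn.functional as F

def token_risk(logits_degraded, logits_utility, labels):
    """logits_*: (seq, vocab) from the two reference models; labels: (seq,).
    Returns a per-token risk score as the difference of per-token losses.
    The sign convention (unsafe tokens fit better by the safety-degraded model)
    is an assumption for this sketch."""
    loss_d = F.cross_entropy(logits_degraded, labels, reduction="none")
    loss_u = F.cross_entropy(logits_utility, labels, reduction="none")
    return loss_u - loss_d

def safe_token_mask(risk: torch.Tensor, quantile: float = 0.9) -> torch.Tensor:
    """Keep tokens whose risk falls below an assumed per-sequence quantile."""
    return risk < torch.quantile(risk, quantile)
```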
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2499. Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Vision-language Models Inference Efficiency Large Language Models
Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability.
While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches.
In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads.
We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage.
To address this, we propose \texttt{MixKV}, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. \texttt{MixKV} adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that \texttt{MixKV} consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), \texttt{MixKV} improves baseline methods by an average of \textbf{5.1\%} across five multi-modal understanding benchmarks, and achieves remarkable gains of \textbf{8.0\%} and \textbf{9.0\%} for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, \texttt{MixKV} extends seamlessly to LLMs with comparable performance gains. \textit{The code is available in the supplementary material and will be released on GitHub.}
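A hedged sketch of what "mixing importance with diversity" could look like for one attention head is given below; the greedy marginal-relevance-style selection rule and the mixing weight `alpha` are assumptions for illustration, not the MixKV algorithm itself.

```python
import torch

def select_kv(keys: torch.Tensor, importance: torch.Tensor, budget: int, alpha: float = 0.5):
    """keys: (n, d) key vectors for one head; importance: (n,) e.g. accumulated attention scores;
    budget: number of KV pairs to keep.
    Greedily keeps tokens that are important but not redundant with already-kept ones."""
    keys_n = torch.nn.functional.normalize(keys, dim=-1)
    selected = []
    taken = torch.zeros(keys.shape[0], dtype=torch.bool)
    for _ in range(min(budget, keys.shape[0])):
        if selected:
            # Redundancy = highest cosine similarity to any already-selected key.
            sim = (keys_n @ keys_n[selected].T).max(dim=1).values
        else:
            sim = torch.zeros_like(importance)
        score = alpha * importance - (1 - alpha) * sim
        score[taken] = float("-inf")
        idx = int(score.argmax())
        selected.append(idx)
        taken[idx] = True
    return torch.tensor(selected)
```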
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2500. Learning to Reason for Hallucination Span Detection
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Hallucination Detection Reasoning Reinforcement Learning Large Language Models
Large language models (LLMs) often generate hallucinations---unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process.
This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans.
To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function.
RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
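For intuition about what a span-level reward can look like (the exact reward used by RL4HS is not spelled out in the abstract, so the character-overlap F1 below is an assumption), a simple scorer over predicted versus gold hallucination spans might be:

```python
def span_f1_reward(pred_spans, gold_spans):
    """pred_spans, gold_spans: lists of (start, end) character offsets, end exclusive.
    Returns an F1-style reward over the covered character positions."""
    def to_set(spans):
        covered = set()
        for a, b in spans:
            covered.update(range(a, b))
        return covered

    p, g = to_set(pred_spans), to_set(gold_spans)
    if not p and not g:
        return 1.0          # correctly predicting "no hallucination"
    if not p or not g:
        return 0.0
    tp = len(p & g)
    precision, recall = tp / len(p), tp / len(g)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
```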
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2501. Quantile Advantage Estimation for Entropy-Safe Reasoning
๐ฌ 4
โญ 5.50
๐ reinforcement learning
๐ท๏ธ RLVR LLM reasoning entropy explosion advantage estimation
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning but training often oscillates between *entropy collapse* and *entropy explosion*.
We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO/DAPO), which improperly penalizes negative-advantage samples under reward outliers.
We propose *Quantile Advantage Estimation* (QAE), replacing the mean with a group-wise $K$-quantile baseline.
QAE induces a response-level, two-regime gate: on hard queries ($p \le 1{-}K$) it reinforces rare successes, while on easy queries ($p > 1{-}K$) it targets remaining failures.
Under first-order softmax updates, we prove *two-sided entropy safety*, giving lower/upper bounds on one-step entropy change that curb explosion and prevent collapse.
Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned $K$, roughly 80\% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23.
These results identify *baseline design*, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
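A minimal sketch of the group-wise quantile baseline is below; the NumPy implementation and the toy binary rewards are assumptions for illustration, not the authors' code. With binary rewards and $K = 0.5$, the two-regime behavior described above falls out directly: on a mostly failing group the rare success keeps a positive advantage, while on a mostly succeeding group only the remaining failure receives a nonzero (negative) advantage.

```python
import numpy as np

def quantile_advantages(rewards: np.ndarray, k: float = 0.5) -> np.ndarray:
    """Replace the per-group mean baseline with the K-quantile of group rewards.

    rewards: (num_groups, group_size) scalar reward per sampled response.
    k:       quantile level in (0, 1); k = 0.5 is a median baseline.
    """
    baseline = np.quantile(rewards, k, axis=1, keepdims=True)  # group-wise K-quantile
    return rewards - baseline

# Toy usage: two groups of four sampled responses each.
if __name__ == "__main__":
    r = np.array([[1.0, 0.0, 0.0, 0.0],   # hard query: the rare success stands out
                  [1.0, 1.0, 1.0, 0.0]])  # easy query: the remaining failure stands out
    print(quantile_advantages(r, k=0.5))
```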
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2502. Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM API auditing untargeted fingerprinting
As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test (RUT) that can verify the behavioral equivalence of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse query domains and threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, showing that it consistently achieves superior detection power over prior methods under constrained query budgets.
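The abstract does not specify the test statistic, so the following is only a generic illustration of a rank-plus-uniformity-test recipe under assumptions (a scalar score per prompt and a Kolmogorov-Smirnov test): if the black-box model matches the local reference model, the rank of the black-box response's score among locally sampled responses' scores should be uniformly distributed.

```python
import numpy as np
from scipy.stats import kstest

def rank_uniformity_pvalue(api_scores, ref_scores, seed=0):
    """api_scores: (num_prompts,) one scalar score per prompt for the API response.
    ref_scores:   (num_prompts, m) scores of m locally sampled responses per prompt.
    Returns the KS p-value of the randomized ranks against Uniform(0, 1);
    a small p-value suggests the served model behaves differently."""
    rng = np.random.default_rng(seed)
    m = ref_scores.shape[1]
    ranks = (ref_scores < api_scores[:, None]).sum(axis=1)   # rank among reference samples
    u = (ranks + rng.uniform(size=ranks.shape)) / (m + 1)    # randomized PIT-style rank in (0, 1)
    return kstest(u, "uniform").pvalue
```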
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2503. AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Referring Image Segmentation Masked Learning VLM
Referring Image Segmentation (RIS) aims to segment the object in an image uniquely referred to by a natural language expression. However, RIS training often contains hard-to-align and instance-specific visual signals; optimizing on such pixels injects misleading gradients and drives the model in the wrong direction. By explicitly estimating pixel-level vision-language alignment, the learner can suppress low-alignment regions, concentrate on reliable cues, and acquire more generalizable alignment features.
In this paper, we propose Alignment-Aware Masked Learning (AML), a simple yet effective training strategy that quantifies region-referent alignment (PMME) and filters out unreliable pixels during optimization (AFM). Specifically, each sample first computes a similarity map between visual and textual features, and then masks out pixels falling below an adaptive similarity threshold, thereby excluding poorly aligned regions from the training process. AML does not require architectural changes and incurs no inference overhead, directing attention to the areas aligned with the textual description. Experiments on the RefCOCO (vanilla/+/g) datasets show that AML achieves state-of-the-art results across all 8 splits, and beyond improving RIS performance, AML also enhances the model's robustness to diverse descriptions and scenarios.
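A minimal sketch of the masking step is shown below; the cosine-similarity map and the per-sample quantile threshold are illustrative assumptions standing in for the paper's PMME/AFM components, not the AMLRIS code.

```python
import torch
import torch.nn.functional as F

def alignment_mask(pixel_feats: torch.Tensor, text_feat: torch.Tensor, q: float = 0.3):
    """pixel_feats: (H, W, C) visual features; text_feat: (C,) referring-expression embedding.
    Returns a boolean mask of pixels kept for optimization."""
    h, w, _ = pixel_feats.shape
    text = text_feat.view(1, 1, -1).expand(h, w, -1)
    sim = F.cosine_similarity(pixel_feats, text, dim=-1)   # (H, W) alignment map
    thresh = torch.quantile(sim.flatten(), q)               # adaptive, per-sample threshold
    return sim >= thresh                                     # drop poorly aligned pixels

def masked_seg_loss(logits, target, keep_mask):
    """Standard BCE segmentation loss restricted to well-aligned pixels."""
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    keep = keep_mask.float()
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)
```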
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2504. Cortical Policy: A Dual-Stream View Transformer for Robotic Manipulation
๐ฌ 15
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ robotic manipulation view transformer 3D perception dynamic-view perception
View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, leading to inadequate 3D spatial reasoning ability and a lack of dynamic adaptation. Taking inspiration from how the human brain integrates static and dynamic views to address these challenges, we propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons from static-view and dynamic-view streams. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human cortical dorsal pathway. Subsequently, the complementary view representations of both streams are integrated to determine the final actions, enabling the model to handle spatially-complex and dynamically-changing tasks under language conditions. Empirical evaluations on both the RLBench benchmark and real-world tasks demonstrate that Cortical Policy outperforms state-of-the-art baselines substantially, validating the superiority of dual-stream design for visuomotor control. Our cortex-inspired framework offers a fresh perspective for robotic manipulation and holds potential for broader application in vision-based robot control.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2505. RESA: Bringing Back What Sparse Attention Ignores with Residual Estimation
๐ฌ 11
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Sparse Attention Attention Redundancy Low-rank Approximation
Large Language Models (LLMs) have gained significant attention.
The KV cache, stored to avoid the quadratic complexity of attention, becomes a bottleneck under long-context demands.
Sparse attention (SA) has been proposed to address this by selecting only critical KVs for attention, but this selection may degrade model quality in less sparse scenarios.
To improve quality, rather than selecting more KVs, this paper reveals another perspective by estimating the contributions of remaining KVs, called Residual Estimation.
We find that attention logits (before softmax) exhibit substantial redundancy due to their inherent low-rank nature.
We perform Singular Value Decomposition (SVD) on the logits matrix during prefilling and find that the principal singular value is spectrally dominant and scales linearly with sequence length.
These imply that increasing sequence length leads to replication-like logits growth with significant redundancy.
However, it is impossible to perform SVD at each decoding step in practice due to its heavy costs.
To this end, we propose RESA, a training-free framework compensating SA's output with an estimated low-rank prior of logits.
RESA introduces (i) a Prior Estimator that derives a prior distribution from a typical query as a rank-1 approximation at the end of prefilling, and (ii) an Online Aggregator that fuses the prior with SA at each decoding step via lightweight scaling and merging.
Besides, we further show that RESA's effect comes from priors being used as attention bias for knowledge injection.
Extensive experiments show that without extra overheads, RESA improves model quality by up to 26\% across various tasks with the same KV budget compared to state-of-the-art.
Moreover, RESA maintains the same quality with up to 33.2\% KV budget reduction and 1.23$\times$ attention throughput improvement.
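The low-rank observation can be checked numerically with a toy construction (below); the shared-direction queries and keys are deliberately built so that a dominant rank-1 component appears, and nothing here reproduces the actual RESA estimator or aggregator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                                       # head dim, prefill length (toy sizes)
shared = rng.standard_normal(d)                      # common direction shared by queries and keys
Q = shared + 0.3 * rng.standard_normal((n, d))
K = shared + 0.3 * rng.standard_normal((n, d))
logits = (Q @ K.T) / np.sqrt(d)                      # pre-softmax attention logits

U, S, Vt = np.linalg.svd(logits, full_matrices=False)
rank1_prior = S[0] * np.outer(U[:, 0], Vt[0])        # rank-1 "prior" of the logits
energy = S[0] ** 2 / np.sum(S ** 2)                  # fraction of energy in the top component
rel_err = np.linalg.norm(logits - rank1_prior) / np.linalg.norm(logits)
print(f"top singular value carries {energy:.1%} of the energy; rank-1 relative error {rel_err:.3f}")
```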
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2506. MAGE: Multi-scale Autoregressive Generation for Offline Reinforcement Learning
๐ฌ 22
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning; Auto-Regressive; Multi-Scale; Long-horizon
Generative models have gained significant traction in offline reinforcement learning (RL) due to their ability to model complex trajectory distributions. However, existing generation-based approaches still struggle with long-horizon tasks characterized by sparse rewards. Some hierarchical generation methods have been developed to mitigate this issue by decomposing the original problem into shorter-horizon subproblems using one policy and generating detailed actions with another. While effective, these methods often overlook the multi-scale temporal structure inherent in trajectories, resulting in suboptimal performance. To overcome these limitations, we propose MAGE, a Multi-scale Autoregressive GEneration-based offline RL method. MAGE incorporates a condition-guided multi-scale autoencoder to learn hierarchical trajectory representations, along with a multi-scale transformer that autoregressively generates trajectory representations from coarse to fine temporal scales. MAGE effectively captures temporal dependencies of trajectories at multiple resolutions. Additionally, a condition-guided decoder is employed to exert precise control over short-term behaviors. Extensive experiments on five offline RL benchmarks against fifteen baseline algorithms show that MAGE successfully integrates multi-scale trajectory modeling with conditional guidance, generating coherent and controllable trajectories in long-horizon sparse-reward settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2507. Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
๐ฌ 19
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ data synthesis multidisciplinary benchmark LLM agents
Unlocking advanced reasoning in large language model agents is hindered by a scarcity of training data situated at the very frontier of their capabilities. We address this with a novel data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which conceptualizes this frontier as tasks an LLM cannot solve independently but can master with guidance. We operationalize this principle through the AgentFrontier Data Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within an LLM's ZPD. The engine yields two synergistic outputs: knowledge-intensive data for continued pre-training and frontier-level reasoning trajectories for post-training. Concurrently, it produces the ZPD Exam, a self-evolving benchmark for evaluating agent capabilities by compelling them to reason beyond their parameterized knowledge. By training our AgentFrontier-30B-A3B model on the synthesized data, we achieve state-of-the-art results on demanding benchmarks like Humanity's Last Exam, outperforming several leading proprietary agents. This work establishes ZPD-guided data synthesis as a scalable and effective paradigm for cultivating increasingly capable LLM agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2508. SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Generative Model Guidance Next-Scale Autoregressive Generation Information Theory
Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring that each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), a training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2509. Learning Patient-Specific Disease Dynamics With Latent Flow Matching For Longitudinal Imaging Generation
๐ฌ 17
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Medical Image Generation Longitudinal Analysis Flow Matching
Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity through the random denoising process.
In this work, we propose treating disease dynamics as a velocity field and leveraging Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, our approach captures the intrinsic dynamics of disease, making progression more interpretable.
However, a key challenge remains: in latent space, Autoencoders (AEs) do not guarantee alignment across patients or correlation with clinical severity (e.g., age and disease conditions). To address this, we propose learning patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitudes increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space.
Together, we present โ-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, โ-LFM demonstrates strong empirical performance and, more importantly, establishes a new framework for interpreting and visualizing disease dynamics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2510. Quasi-Monte Carlo Methods Enable Extremely Low-Dimensional Deep Generative Models
๐ฌ 14
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ representation learning quasi monte-carlo integration generative modeling variational autoencoder
This paper introduces *quasi-Monte Carlo latent variable models* (QLVMs): a class of deep generative models that are specialized for finding extremely low-dimensional and interpretable embeddings of high-dimensional datasets.
Unlike standard approaches, which rely on a learned encoder and variational lower bounds, QLVMs directly approximate the marginal likelihood by randomized quasi-Monte Carlo integration.
While this brute-force approach has drawbacks in higher-dimensional spaces, we find that it excels at fitting one-, two-, and three-dimensional deep latent variable models.
Empirical results on a range of datasets show that QLVMs consistently outperform conventional variational autoencoders (VAEs) and importance weighted autoencoders (IWAEs) with matched latent dimensionality.
The resulting embeddings enable transparent visualization and *post hoc* analyses such as nonparametric density estimation, clustering, and geodesic path computation, which are nontrivial to validate in higher-dimensional spaces.
While our approach is compute-intensive and struggles to generate fine-scale details in complex datasets, it offers a compelling solution for applications prioritizing interpretability and latent space analysis.
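A hedged sketch of the core computation, marginal-likelihood estimation by randomized quasi-Monte Carlo over a low-dimensional latent prior, is given below. The scrambled Sobol' points, the standard-normal prior, the Gaussian decoder, and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm, qmc

def qmc_log_marginal(x, decoder_mean, latent_dim=2, n_points=4096, sigma=0.1, seed=0):
    """log p(x) ~= log mean_j p(x | z_j), with z_j scrambled-Sobol' draws mapped to N(0, I).

    x:            observed data vector, shape (D,)
    decoder_mean: function z -> mean of p(x | z), mapping (latent_dim,) -> (D,)
    """
    sob = qmc.Sobol(d=latent_dim, scramble=True, seed=seed)
    u = sob.random(n_points)                           # low-discrepancy points in [0, 1)^d
    z = norm.ppf(np.clip(u, 1e-12, 1 - 1e-12))         # map to the standard-normal prior
    # Gaussian decoder log-likelihoods, summed over data dimensions.
    log_liks = np.array([norm.logpdf(x, loc=decoder_mean(zj), scale=sigma).sum() for zj in z])
    m = log_liks.max()
    return m + np.log(np.exp(log_liks - m).mean())      # log-mean-exp for numerical stability

# Toy usage with a linear "decoder".
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((5, 2))
    x_obs = W @ np.array([0.3, -0.7]) + 0.1 * rng.standard_normal(5)
    print(qmc_log_marginal(x_obs, lambda z: W @ z))
```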
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2511. Tokenizing Single-Channel EEG with Time-Frequency Motif Learning
๐ฌ 23
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ EEG tokenization representation learning
Foundation models are reshaping EEG analysis, yet the important problem of EEG tokenization remains a challenge.
This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from *single-channel* EEG signals and encodes them into discrete tokens.
We propose a dual-path architecture with time-frequency masking to capture robust motif representations, and it is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks.
Our study demonstrates three key benefits:
*Accuracy:* Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 11% improvement in Cohen's Kappa over strong baselines.
*Generalization:* Moreover, as a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM.
*Scalability:* By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic.
Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%.
A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability.
Code is available at https://anonymous.4open.science/r/TFM-Token-FE33.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2512. DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights
๐ฌ 10
โญ 5.50
๐ generative models
๐ท๏ธ metanetworks deep weight space parameter space symmetry flow matching canonicalization
Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2513. Dynamic Speculative Agent Planning
๐ฌ 23
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ agent efficiency online learning reinforcement learning
Despite their remarkable success in complex tasks propelling widespread adoption, large language model based agents still face critical deployment challenges due to prohibitive latency and inference costs. While recent work has explored various methods to accelerate inference, existing approaches suffer from significant limitations: they either fail to preserve performance fidelity, require extensive offline training of router modules, or incur excessive operational costs. Moreover, they provide minimal user control over the tradeoff between acceleration and other performance metrics.
To address these gaps, we introduce **Dynamic Speculative Planning** (DSP), an asynchronous online reinforcement learning framework that provides lossless acceleration with substantially reduced costs without requiring additional pre-deployment preparation. DSP explicitly optimizes a joint objective balancing end-to-end latency against dollar cost, allowing practitioners to adjust a single parameter that steers the system toward faster responses, cheaper operation, or any point along this continuum.
Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest lossless acceleration method while reducing total cost by 30\% and unnecessary cost by up to 60\%. Our code and data are available through \url{https://anonymous.4open.science/r/Dynamic-Speculative-Planning-F574}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2514. Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization
๐ฌ 17
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Policy Optimization Down-Sampling
Critic-free methods like GRPO reduce memory demands by estimating advantages from multiple rollouts but tend to converge slowly, as critical learning signals are diluted by an abundance of uninformative samples and tokens. To tackle this challenge, we propose the **Dynamic Dual-Level Down-Sampling (D$^3$S)** framework that prioritizes the most informative samples and tokens across groups to improve the efficiency of policy optimization. D$^3$S operates along two levels: (1) the sample-level, which selects a subset of rollouts to maximize advantage variance ($\text{Var}(A)$). We theoretically prove that this selection is positively correlated with the upper bound of the policy gradient norms, yielding higher policy gradients. (2) the token-level, which prioritizes tokens with a high product of advantage magnitude and policy entropy ($|A_{i,t}|\times H_{i,t}$), focusing updates on tokens where the policy is both uncertain and impactful. Moreover, to prevent overfitting to high-signal data, D$^3$S employs a dynamic down-sampling schedule inspired by curriculum learning. This schedule starts with aggressive down-sampling to accelerate early learning and gradually relaxes to promote robust generalization. Extensive experiments on Qwen2.5 and Llama3.1 demonstrate that integrating D$^3$S into advanced RL algorithms achieves state-of-the-art performance with generalization while requiring fewer samples and tokens across diverse reasoning benchmarks.
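The two selection levels can be sketched as follows (a hedged illustration, not the D$^3$S implementation: picking advantage extremes is only a cheap proxy for maximizing $\text{Var}(A)$, and the tensor shapes and keep ratios are assumptions).

```python
import torch

def select_samples(advantages: torch.Tensor, keep: int) -> torch.Tensor:
    """advantages: (num_rollouts,) scalar advantage per rollout in one group.
    Returns indices of `keep` rollouts whose advantages are farthest from the mean,
    a cheap proxy for choosing a subset with large advantage variance."""
    dist = (advantages - advantages.mean()).abs()
    return torch.topk(dist, k=keep).indices

def select_tokens(advantages: torch.Tensor, entropies: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """advantages, entropies: (num_tokens,) per-token advantage and policy entropy.
    Returns a boolean mask keeping the tokens with the highest |A| * H score."""
    score = advantages.abs() * entropies
    k = max(1, int(keep_ratio * score.numel()))
    thresh = torch.topk(score, k).values.min()
    return score >= thresh
```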
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2515. Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ efficient vlms visual token compression
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance \textit{importance preservation} and \textit{information diversity}. To address this, we propose $\textbf{PruneSID}$, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principle Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, $\textbf{PruneSID}$ incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving $\textbf{96.3}$% accuracy on LLaVA-1.5 with only $\textbf{11.1}$% token retention, and $\textbf{92.8}$% accuracy at extreme compression rates ($\textbf{5.6}$%) on LLaVA-NeXT, outperforming prior methods by $\textbf{2.5}$% with $\textbf{7.8}$x faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2516. Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3DGS Dynamic Reconstruction Multi-frame
Reconstructing dynamic 3D scenes with photorealistic detail and temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for modeling dynamic scenes rely on per-frame optimization, causing them to overfit to instantaneous states rather than learning true motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Our approach leverages a temporal Transformer to learn complex motion dependencies across a window of frames, ensuring the generation of plausible trajectories. For efficiency, this temporal modeling is confined to a sparse set of control nodes. These nodes are uniquely designed with decoupled position and latent codes, which provide a stable semantic anchor for motion influence and prevent correspondence errors for large movements. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to ensure robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art quality and fast rendering speed, enabling high-fidelity reconstruction and real-time rendering of dynamic scenes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2517. ChemEval: A Multi-level and Fine-grained Chemical Capability Evaluation for Large Language Models
๐ฌ 14
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Benchmark Chemical Knowledge Inference
The emergence of Large Language Models (LLMs) in chemistry marks a significant advancement in applying artificial intelligence to chemical sciences. While these models show promising potential, their effective application in chemistry demands sophisticated evaluation protocols that address the field's inherent complexities. To bridge this critical gap, we introduce ChemEval, an innovative hierarchical assessment framework specifically designed to evaluate LLMs' capabilities across chemical domains. Our methodology incorporates a distinctive four-tier progression system, spanning from basic chemical concepts to advanced theoretical principles. Sixty-two textual and multimodal tasks are designed to enable researchers to conduct fine-grained analysis of model capabilities and achieve precise evaluation via carefully crafted assessment protocols. The framework integrates carefully curated open-source datasets with expert-validated materials, ensuring both practical relevance and scientific rigor. In our experiments, we evaluated the performance of most main-stream LLMs using both zero-shot and few-shot approaches, with carefully designed examples and prompts. Results indicate that general-purpose LLMs, while proficient in understanding chemical literature and following instructions, struggle with tasks requiring deep chemical expertise. In contrast, chemical LLMs perform better in technical tasks but show limitations in general language processing. These findings highlight both the current limitations and future opportunities for LLMs in chemistry. Our research provides a systematic framework for advancing the application of artificial intelligence in chemical research, potentially facilitating new discoveries in the field.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2518. Efficient Multimodal Spatial Reasoning via Dynamic and Asymmetric Routing
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autoregressive multimodal reasoning Layer- and hop-aware retention Green AI
Recently, visualization-of-thought (VoT) has unlocked new opportunities for complex spatial reasoning in multimodal large language models (MLLMs) by complementing verbal reasoning with visual thinking.
However, the autoregressive accumulation of lengthy and redundant tokens substantially increases computation and memory costs.
In this paper, we present a new efficient framework for multimodal spatial reasoning, named *DARE*, designed to adaptively prune multimodal tokens across different network depths, reasoning hops, and modalities.
First, *DARE* devises an intra- and inter-hop-aware differentiable retention mechanism to dynamically estimate token importance both within each reasoning step and across successive hops.
Recognizing that deeper network layers encode visual cues into verbal streams, *DARE* introduces an asymmetric compression strategy that prunes tokens according to modality-specific redundancy and semantic importance.
Furthermore, *DARE* incorporates a progressive KV-cache retention policy aligned with cross-modal fusion dynamics, further reducing memory overhead during autoregressive reasoning.
Our method delivers substantial reductions in computation and memory footprint, averaging a 40.37\% reduction in FLOPs and a 46.07\% reduction in KV cache usage,
while consistently preserving or even improving reasoning performance across seven multimodal spatial reasoning benchmarks, and further generalizing to broader multimodal reasoning tasks,
establishing a scalable and robust recipe for efficient multimodal reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2519. Active Learning for Decision Trees with Provable Guarantees
๐ฌ 16
โญ 5.50
๐ learning theory
๐ท๏ธ Label complexity Theory of Active Learning Theory of Decision Tree Disagreement coefficient
This paper advances the theoretical understanding of active learning label complexity for decision trees as binary classifiers. We make two main contributions. First, we provide the first analysis of the **disagreement coefficient** for decision trees, a key parameter governing active learning label complexity. Our analysis holds under two natural assumptions required for achieving polylogarithmic label complexity: (i) each root-to-leaf path queries distinct feature dimensions, and (ii) the input data has a regular, grid-like structure. We show these assumptions are essential, as relaxing them leads to polynomial label complexity. Second, we present the first general active learning algorithm for binary classification that achieves a **multiplicative error guarantee**, producing a $(1+\epsilon)$-approximate classifier. By combining these results, we design an active learning algorithm for decision trees that uses only a **polylogarithmic number of label queries** in the dataset size, under the stated assumptions. Finally, we establish a label complexity lower bound, showing our algorithm's dependence on the error tolerance $\epsilon$ is close to optimal.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2520. MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
๐ฌ 34
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Models Multimodal Reasoning Video Benchmark
The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as the "question frame") and perceive a few adjacent frames. To address this gap, we propose **MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos**. The benchmark is characterized by the following features. **(1) Long-range, multi-frame reasoning**: Models are required to infer and analyze evidence frames that may be far from the question frame. **(2) Beyond perception**: Questions cannot be answered through direct perception alone but require reasoning over hidden information. **(3) Reliability**: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. **(4) Confusability**: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, Gemini-2.5-pro, achieves only 64.3% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Error analysis indicates that the CoT demanded for multi-modal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2521. Automating the Refinement of Reinforcement Learning Specifications
๐ฌ 11
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Specifications Automatic Specification Refinement SpectRL
Logical specifications have been shown to help reinforcement learning algorithms in achieving complex tasks. However, when a task is under-specified, agents might fail to learn useful policies. In this work, we explore the possibility of improving coarse-grained logical specifications via an exploration-guided strategy. We propose **AutoSpec**, a framework that searches for a logical specification refinement whose satisfaction implies satisfaction of the original specification, but which provides additional guidance therefore making it easier for reinforcement learning algorithms to learn useful policies. **AutoSpec** is applicable to reinforcement learning tasks specified via the SpectRL specification logic. We exploit the compositional nature of specifications written in SpectRL, and design four refinement procedures that modify the abstract graph of the specification by either refining its existing edge specifications or by introducing new edge specifications. We prove that all four procedures maintain specification soundness, i.e. any trajectory satisfying the refined specification also satisfies the original. We then show how **AutoSpec** can be integrated with existing reinforcement learning algorithms for learning policies from logical specifications. Our experiments demonstrate that **AutoSpec** yields promising improvements in terms of the complexity of control tasks that can be solved, when refined logical specifications produced by **AutoSpec** are utilized.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2522. A Statistical Benchmark for Diffusion Posterior Sampling Algorithms
๐ฌ 15
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Diffusion models Bayesian inverse problems statistical evaluation Gibbs sampling
We propose a statistical benchmark for diffusion posterior sampling (DPS) algorithms in linear inverse problems.
Our test signals are discretized Lévy processes whose posteriors admit efficient Gibbs methods.
These Gibbs methods provide gold-standard posterior samples for direct, distribution-level comparisons with DPS algorithms.
They can also sample the denoising posteriors in the reverse diffusion, which enables the arbitrary-precision Monte Carlo estimation of various objects that may be needed in the DPS algorithms, such as the expectation or the covariance of the denoising posteriors.
In turn, this can be used to isolate algorithmic errors from the errors due to learned components.
We instantiate the benchmark with the minimum-mean-squared-error optimality gap and posterior-coverage tests and evaluate popular algorithms on the inverse problems of denoising, deconvolution, imputation, and reconstruction from partial Fourier measurements.
We release the benchmark code at https://github.com/emblem-saying/dps-benchmark and invite the community to contribute and report results.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2523. Dynamic Reflections: Probing Video Representations with Text Alignment
๐ฌ 13
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Platonic Representation hypothesis video understanding video-text alignment
The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types.
While significant progress has been made in aligning images with text, the temporal nature of _video_ data remains largely unexplored in this context.
In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders.
Our findings reveal several key insights.
First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data _provided at test time_, especially when using state-of-the-art video encoders.
We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations.
Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to _general-purpose_ video representation and understanding.
Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models.
Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2524. Low-pass Personalized Subgraph Federated Recommendation
๐ฌ 17
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Federated Learning Recommender Systems Graph Neural Networks
Federated Recommender Systems (FRS) preserve privacy by training decentralized models on client-specific user-item subgraphs without sharing raw data. However, FRS faces a unique challenge: subgraph structural imbalance, where drastic variations in subgraph scale (user/item counts) and connectivity (item degree) misalign client representations, making it challenging to train a robust model that respects each client's unique structural characteristics.
To address this, we propose a Low-pass Personalized Subgraph Federated recommender system (LPSFed). LPSFed leverages graph Fourier transforms and low-pass spectral filtering to extract low-frequency structural signals that remain stable across subgraphs of varying size and degree, allowing robust personalized parameter updates guided by similarity to a neutral structural anchor. Additionally, we leverage a localized popularity bias-aware margin that captures item-degree imbalance within each subgraph and incorporates it into a personalized bias correction term to mitigate recommendation bias. Supported by theoretical analysis and validated on five real-world datasets, LPSFed achieves superior recommendation accuracy and enhances model robustness.
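A small sketch of low-pass spectral filtering on a client subgraph is given below; the dense eigendecomposition and the fixed cutoff are illustrative assumptions, practical only for small subgraphs, and do not reproduce the paper's actual construction.

```python
import numpy as np

def low_pass_filter(adj: np.ndarray, signal: np.ndarray, keep: int) -> np.ndarray:
    """adj: (n, n) symmetric adjacency of the client subgraph;
    signal: (n, d) node features/embeddings; keep: number of low-frequency components.
    Projects the signal onto the `keep` lowest-frequency eigenvectors of the
    symmetric normalized Laplacian (graph Fourier transform, filter, inverse transform)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap)                        # eigenvalues in ascending "frequency"
    basis = eigvecs[:, :keep]                                     # low-frequency graph Fourier basis
    return basis @ (basis.T @ signal)                             # low-pass filtered signal
```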
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2525. Generalizing Linear Autoencoder Recommenders with Decoupled Expected Quadratic Loss
๐ฌ 19
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ linear autoencoders recommender system closed-form solution expected quadratic loss
Linear autoencoders (LAEs) have gained increasing popularity in recommender systems due to their simplicity and strong empirical performance. Most LAE models, including the Emphasized Denoising Linear Autoencoder (EDLAE) introduced by Steck (2020), use quadratic loss during training. However, the original EDLAE only provides closed-form solutions for the hyperparameter choice $b = 0$, which limits its capacity. In this work, we generalize the EDLAE objective function into a Decoupled Expected Quadratic Loss (DEQL). We show that DEQL simplifies the process of deriving EDLAE solutions and reveals solutions in a broader hyperparameter range $b > 0$, which were not derived in Steck's original paper. Additionally, we propose an efficient algorithm based on Miller's matrix inverse theorem to ensure computational tractability for the $b > 0$ case. Empirical results on benchmark datasets show that the $b > 0$ solutions provided by DEQL outperform the $b = 0$ EDLAE baseline, demonstrating that DEQL expands the solution space and enables the discovery of models with better testing performance.
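For readers unfamiliar with closed-form linear autoencoders, the sketch below shows the standard EASE-style solution of the quadratic-loss, zero-diagonal problem that EDLAE and DEQL generalize; the $b > 0$ DEQL solutions and the Miller-theorem speedup are not reproduced here, and the regularization value is an assumption.

```python
import numpy as np

def closed_form_lae(X: np.ndarray, lam: float = 500.0) -> np.ndarray:
    """X: (num_users, num_items) binary interaction matrix.
    Returns the item-item weight matrix B minimizing ||X - XB||_F^2 + lam * ||B||_F^2
    subject to diag(B) = 0 (the standard EASE closed form)."""
    G = X.T @ X + lam * np.eye(X.shape[1])
    P = np.linalg.inv(G)
    B = -P / np.diag(P)[None, :]       # Lagrangian solution of the zero-diagonal problem
    np.fill_diagonal(B, 0.0)
    return B

# Scoring: predicted preferences are X @ B; recommend top unseen items per user.
```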
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2526. Don't Throw Away Your Pretrained Model
๐ฌ 11
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ model collaboration collaborative inference
Alignment training has tradeoffs: it helps language models (LMs) gain in reasoning and instruction following but might lose out on skills such as creativity and calibration, at which unaligned base models are better. We aim to make the best of both worlds through model collaboration, where different models in the training pipeline collaborate and complement each other. Since LM responses feature interleaving skills that favor different models, we propose Switch Generation, where pretrained and aligned model versions take turns to "speak" in a response sequence. Specifically, we train a switcher LM by learning from outcomes of choosing different models to generate the next segment across diverse queries and contexts. At inference time, the switcher LM guides different model checkpoints to dynamically generate the next segment where their strengths are most needed. Extensive experiments with 8 model collaboration baselines and 18 datasets show that 1) model collaboration consistently outperforms individual models on 16 out of 18 tasks, and 2) Switch Generation further outperforms baselines by 12.9% on average. Further analysis reveals that Switch Generation discovers compositional skills to solve problems where individual models struggle and generalizes to unseen models and tasks, reusing and repurposing by-products in expensive model training pipelines that are otherwise discarded.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2527. Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
๐ฌ 15
โญ 5.50
๐ causal reasoning
๐ท๏ธ retrieval-augmented language model RAG reasoning datastore dense retrieval
Retrieval augmentation has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on a set of established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single-node deployment, making it suitable for academic use. Its core design combines a compact set of high-quality, diverse data sources with in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search. Using CompactDS, a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B--70B), with relative gains of 11% on MMLU, 34% on MMLU Pro, 26% on GPQA, and 14% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks), and a combination of ANN and exact search is shown to be critical for balancing usability and accuracy. Finally, we show that our in-house datastore even outperforms commercial search engines like Google Search. We release CompactDS and our retrieval pipeline as a fully reproducible alternative to commercial search, supporting future research exploring retrieval-based AI systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2528. On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Counterfactual Reasoning Large Language Models (LLMs) Experimental Study
Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate 11 datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2529. BAR: Refactor the Basis of Autoregressive Visual Generation
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Autoregressive Models Autoregressive Visual Generation
Autoregressive (AR) models, despite their remarkable successes, encounter limitations in image generation due to sequential prediction of tokens, e.g. local image patches, in a predetermined row-major raster-scan order. Prior works improve AR with various designs of prediction units and orders; however, they rely on human inductive biases. This work proposes Basis Autoregressive (BAR), a novel paradigm that conceptualizes tokens as basis vectors within the image space and employs an end-to-end learnable approach to transform the basis. By viewing tokens $x_k$ as the projection of image $\mathbf{x}$ onto basis vectors $e_k$, BAR's unified framework refactors fixed token sequences through the linear transform $\mathbf{y}=\mathbf{Ax}$, and encompasses previous methods as specific instances of matrix $\mathbf{A}$. Furthermore, BAR adaptively optimizes the transform matrix with an end-to-end AR objective, thereby discovering effective strategies beyond hand-crafted assumptions. Comprehensive experiments, notably achieving a state-of-the-art FID of 1.15 on the ImageNet-256 benchmark, demonstrate the ability of BAR to overcome human biases and significantly advance image generation, including text-to-image synthesis.
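To ground the $\mathbf{y}=\mathbf{Ax}$ framing, here is a tiny numerical illustration: a raster-scan ordering corresponds to a permutation matrix $\mathbf{A}$, while a general matrix mixes tokens into a new basis. The shapes and values are toy-sized assumptions, not the paper's model.

```python
# Toy view of the y = A x refactoring: hand-crafted token orders are permutation
# matrices, while a general (learnable) A defines a new token basis.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                 # 4 "tokens" of a flattened image

raster = np.eye(4)                         # row-major raster scan: identity permutation
reverse = np.eye(4)[::-1]                  # another hand-crafted order: a different permutation
learned = rng.standard_normal((4, 4))      # a general transform, as BAR learns end-to-end

for name, A in [("raster", raster), ("reverse", reverse), ("learned", learned)]:
    print(f"{name:8s}", np.round(A @ x, 2))
```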
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2530. Self-Supervised Evolution Operator Learning for High-Dimensional Dynamical Systems
๐ฌ 17
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Operator Koopman Transfer Contrastive Self-Supervised DMD Modes Dynamics ENSO Climate Molecular Dynamics Protein TICA Slow Modes
We introduce an end-to-end approach to learn the evolution operators of large-scale non-linear dynamical systems, such as those describing complex natural phenomena. Evolution operators are particularly well-suited for analyzing systems that exhibit spatio-temporal patterns and have become a key analytical tool across various scientific communities. As terabyte-scale weather datasets and simulation tools capable of running millions of molecular dynamics steps per day are becoming commodities, our approach provides an effective tool to make sense of them from a data-driven perspective. Its core lies in a remarkable connection between self-supervised representation learning methods and the recently established learning theory of evolution operators. We deploy our approach across multiple scientific domains: explaining the folding dynamics of small proteins, the binding process of drug-like molecules in host sites, and autonomously finding patterns in climate data. Our code is available open-source at: https://anonymous.4open.science/r/encoderops-5F67.
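For readers unfamiliar with evolution operators, the generic downstream step is to estimate the operator by least squares on time-lagged feature pairs (the EDMD/transfer-operator recipe). The sketch below uses a hand-written polynomial feature map as a stand-in for a learned encoder and is not the paper's self-supervised objective.

```python
# EDMD-style estimate of an evolution operator from time-lagged data, with a
# fixed polynomial dictionary standing in for a learned encoder (generic sketch).
import numpy as np

rng = np.random.default_rng(0)

def features(x: np.ndarray) -> np.ndarray:
    # Toy dictionary: [1, x, x^2] for a scalar state.
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

# Simulate a noisy linear system x_{t+1} = 0.9 x_t + noise.
x = np.zeros(2000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + 0.1 * rng.standard_normal()

Phi_now, Phi_next = features(x[:-1]), features(x[1:])     # time-lagged pairs
K, *_ = np.linalg.lstsq(Phi_now, Phi_next, rcond=None)    # Phi_now @ K ~ Phi_next
eigvals = np.sort(np.abs(np.linalg.eigvals(K)))[::-1]
print("leading operator eigenvalues (slow modes first):", np.round(eigvals, 3))
```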
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2531. From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents
๐ฌ 33
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Long-Term Memory Agent LLM Multi-Granularity Conversation
Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limitations, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks, achieving superior performance across different query types and top-K settings. \footnote{https://anonymous.4open.science/r/MemGAS-626C/}
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2532. Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control
๐ฌ 22
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ robot policy learning offline reinforcement learning whole-body control
Scaling imitation learning to high-DoF whole-body robots is fundamentally limited by the \textbf{curse of dimensionality} and the prohibitive cost of collecting expert demonstrations. We argue that the core bottleneck is paradigmatic: real-world supervision for whole-body control is inherently imperfect, yet most methods assume expert data. To overcome this, we propose \textbf{HVD} (Hierarchical Value-Decomposed Offline Reinforcement Learning), an offline reinforcement learning framework that learns effective policies directly from suboptimal, reward-labeled trajectories. HVD structures the value function along the robot's kinematic hierarchy and over temporal chunks, enabling precise credit assignment in long-horizon, high-dimensional tasks. Built on a Transformer-based architecture, HVD supports \textit{multi-modal} and \textit{multi-task} learning, allowing flexible integration of diverse sensory inputs for policy learning. To enable realistic evaluation and training, we further introduce \textbf{WB-50}, a 50-hour dataset of teleoperated and policy rollout trajectories annotated with rewards and preserving natural imperfections, including partial successes, corrections, and failures. Experiments show HVD significantly outperforms existing baselines in success rate across complex whole-body tasks. Our results suggest that effective policy learning for high-DoF systems can emerge not from perfect demonstrations, but from structured learning over realistic, imperfect data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2533. D&R: Recovery-based AI-Generated Text Detection via a Single Black-box LLM Call
๐ฌ 17
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI-generated Text Detection Large Language Models Training-free Methods Black-box Detection Recovery-based Detection Robustness
Large language models (LLMs) generate increasingly human-like text, raising concerns about misinformation and authenticity. Detecting AI-generated text remains challenging: existing methods often underperform, especially on short texts, require probability access unavailable in real-world black-box settings, incur high costs from multiple calls, or fail to generalize across models.
We propose Disrupt-and-Recover (D\&R), a recovery-based detection framework grounded in posterior concentration. D\&R disrupts text via model-free Within-Chunk Shuffling, performs a single black-box LLM recovery, and measures semantic-structural recovery similarity as a proxy for concentration. This design ensures efficiency and black-box practicality, and is theoretically supported under the concentration assumption. Extensive experiments across four datasets and six source models show that D\&R achieves state-of-the-art performance, with AUROC 0.96 on long texts and 0.87 on short texts, surpassing the strongest baseline by +0.08 and +0.14. D\&R further remains robust under source-recovery mismatch and model variation. Our code and data are available at https://anonymous.4open.science/r/1MAdaWTy0xaod5qR.
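A minimal sketch of the "disrupt" step (shuffling words within fixed-size chunks) and a stand-in recovery-similarity score follows; the single black-box LLM recovery call is a placeholder, and the toy token-overlap similarity is an illustrative assumption, not the paper's semantic-structural measure.

```python
# Minimal sketch of within-chunk word shuffling plus a toy recovery-similarity
# score; the black-box LLM recovery call is a placeholder.
import random

def within_chunk_shuffle(text: str, chunk_size: int = 5, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = text.split()
    out = []
    for i in range(0, len(words), chunk_size):
        chunk = words[i:i + chunk_size]
        rng.shuffle(chunk)        # disrupt locally, preserve global chunk order
        out.extend(chunk)
    return " ".join(out)

def recover_with_llm(disrupted: str) -> str:
    # Placeholder for one black-box LLM call that tries to restore fluent text.
    return disrupted

def recovery_similarity(original: str, recovered: str) -> float:
    a, b = set(original.lower().split()), set(recovered.lower().split())
    return len(a & b) / max(len(a | b), 1)   # toy proxy for recovery similarity

text = "Large language models generate increasingly human-like text at scale."
disrupted = within_chunk_shuffle(text)
score = recovery_similarity(text, recover_with_llm(disrupted))
print(disrupted, "\nscore:", round(score, 3))
```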
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2534. A Structured, Tagged, and Localized Visual Question Answering Dataset with Full Sentence Answers and Scene Graphs for Chest X-ray Images
๐ฌ 19
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ VQA Localization Vision-Language Modeling Medical Imaging Chest X-Rays Scene Graphs
Visual Question Answering (VQA) enables targeted and context-dependent analysis of medical images, such as chest X-rays (CXRs). However, existing VQA datasets for CXRs are typically constrained by simplistic and brief answer formats, lacking localization annotations (e.g., bounding boxes) and structured tags (e.g., region or radiological finding/disease tags). To address these limitations, we introduce MIMIC-Ext-CXR-QBA (abbr. CXR-QBA), a large-scale CXR VQA dataset derived from MIMIC-CXR, comprising 42 million QA-pairs with multi-granular, multi-part answers, detailed bounding boxes, and structured tags.
We automatically generated our VQA dataset from scene graphs (also made available), which we constructed using LLM-based information extraction from radiology reports. After automatic quality assessment, we identified 31M pre-training and 7.5M fine-tuning grade QA-pairs, providing the largest and most sophisticated VQA dataset for CXRs to date. Tools for using our dataset and the construction pipeline are available at https://anonymous.4open.science/r/mimic-ext-cxr-qba/ .
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2535. HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs
๐ฌ 18
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Hallucination Detection
The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: *data-driven hallucinations* and *reasoning-driven hallucinations*. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the *Hallucination Risk Bound*, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce **HalluGuard**, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate **HalluGuard** on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2536. Rejuvenating Cross-Entropy Loss in Knowledge Distillation for Recommender Systems
๐ฌ 25
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Knowledge Distillation Recommender System Model Compression Cross-Entropy
This paper analyzes Cross-Entropy (CE) loss in knowledge distillation (KD) for recommender systems. KD for recommender systems aims at distilling rankings, especially among items most likely to be preferred, and can only be computed on a small subset of items. Considering these features, we reveal the connection between CE loss and NDCG in the field of KD. We prove that when performing KD on an item subset, minimizing CE loss maximizes the lower bound of NDCG, only if an assumption of closure is satisfied. It requires that the item subset consists of the student's top items. However, this contradicts our goal of distilling rankings on the teacher's top items. We empirically demonstrate the vast gap between these two kinds of top items. To bridge the gap between our goal and theoretical support, we propose Rejuvenated Cross-Entropy for Knowledge Distillation (RCE-KD). It splits the top items given by the teacher into two subsets based on whether or not each item is ranked high by the student. For the subset that defies the condition, a sampling strategy is devised to use teacher-student collaboration to approximate our assumption of closure. We also combine the losses on the two subsets adaptively. Extensive experiments demonstrate the effectiveness of our method. Our code is available at https://anonymous.4open.science/r/RCE-KD.
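To make the analyzed quantity concrete, the sketch below computes listwise CE distillation restricted to an item subset and contrasts the teacher's top items with the student's top items; the synthetic scores and subset size are illustrative assumptions, and the RCE-KD splitting and sampling scheme is not reproduced.

```python
# Listwise CE distillation restricted to an item subset: the student matches the
# teacher's softmax ranking distribution over that subset only.
import numpy as np

def subset_ce_loss(teacher_scores, student_scores, subset):
    t, s = teacher_scores[subset], student_scores[subset]
    p_teacher = np.exp(t - t.max()); p_teacher /= p_teacher.sum()
    log_p_student = s - s.max() - np.log(np.exp(s - s.max()).sum())
    return float(-(p_teacher * log_p_student).sum())

rng = np.random.default_rng(0)
n_items = 1000
teacher = rng.normal(size=n_items)
student = teacher + 0.5 * rng.normal(size=n_items)    # imperfect student

teacher_top = np.argsort(-teacher)[:50]               # teacher's top items
student_top = np.argsort(-student)[:50]               # student's top items
print("CE on teacher-top subset:", round(subset_ce_loss(teacher, student, teacher_top), 3))
print("CE on student-top subset:", round(subset_ce_loss(teacher, student, student_top), 3))
```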
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2537. Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems
๐ฌ 21
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Multi-Agent System Autonomous Agents Efficiency Large Language Models
While Multi-Agent Systems (MAS) excel at complex tasks, their growing autonomy and operational complexity often lead to critical inefficiencies, such as excessive token consumption and failures arising from misinformation. Existing methods primarily focus on post-hoc failure attribution, lacking proactive, real-time interventions to enhance robustness and efficiency. To this end, we introduce SupervisorAgent, a lightweight and modular framework for runtime, adaptive supervision that operates without altering the base agent's architecture. Triggered by an LLM-free context filter, SupervisorAgent intervenes at critical junctures to proactively correct errors, guide inefficient behaviors, and purify observations. On the challenging GAIA benchmark, SupervisorAgent reduces the token consumption of the Smolagent framework by an average of 29.68% without compromising its success rate. Extensive experiments across five additional benchmarks (math reasoning, code generation, and question answering) and various SoTA foundation models validate the broad applicability and robustness of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2538. Interleave-VLA: Enhancing Robot Manipulation with Image-Text Interleaved Instructions
๐ฌ 19
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot Learning Foundation Model Multimodal Learning
The rise of foundation models paves the way for generalist robot policies in the physical world. Existing methods relying on text-only instructions often struggle to generalize to unseen scenarios. We argue that interleaved image-text inputs offer richer and less biased context and enable robots to better handle unseen tasks with more versatile human-robot interaction.
Building on this insight, we introduce Interleave-VLA, a robot learning paradigm that extends interleaved image-text instructions from the digital world to the physical world by directly generating continuous action sequences.
Interleave-VLA offers a natural, flexible, and model-agnostic paradigm that extends state-of-the-art vision-language-action (VLA) models with minimal modifications while achieving strong zero-shot generalization.
Interleave-VLA also includes an automatic pipeline that converts text instructions from Open X-Embodiment into interleaved image-text instructions, resulting in a large-scale real-world interleaved embodied dataset with 210k episodes.
Comprehensive evaluation in simulation and the real world shows that Interleave-VLA offers two major benefits: (1) improves out-of-domain generalization to unseen objects by 2× compared to text input baselines, (2) supports flexible task interfaces and diverse instructions in a zero-shot manner, such as hand-drawn sketches.
We attribute Interleave-VLA's strong zero-shot capability to the use of instruction images, which effectively mitigate hallucinations, and the inclusion of heterogeneous multimodal datasets, enriched with Internet-sourced images, offering potential for scalability. [Our project site](https://interleave-vla.github.io/Interleave-VLA-Anonymous/) has more information.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2539. MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents
๐ฌ 12
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Multimodal Model AI Search Engine Benchmark Agent Framework
Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification. We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image-text retrieval and cross-validation under retrieval noise.
Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues.
Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning.
We evaluated closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points.
From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.
Our dataset is included in the supplementary material and the accompanying code at https://anonymous.4open.science/status/MMSearch-Plus-D54E.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2540. SONATA: Synergistic Coreset Informed Adaptive Temporal Tensor Factorization
๐ฌ 14
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ dynamic tensor streams temporal dynamics coreset selection Linear Dynamical Systems adaptive modeling
Analyzing dynamic tensor streams is fundamentally challenged by complex, evolving temporal dynamics and the need to identify informative data from high-velocity streams. Existing methods often lack the expressiveness to model multi-scale temporal dependencies, limiting their ability to capture evolving patterns. We propose SONATA, a novel framework that unifies expressive dynamic embedding modeling with adaptive coreset selection. SONATA leverages principled machine learning techniques for efficient evaluation of each observation for uncertainty, novelty, influence, and information gain, and dynamically prioritizes learning from the most valuable data using Bellman-inspired optimization. Entity dynamics are modeled with Linear Dynamical Systems and expressive temporal kernels for fine-grained temporal representation. Experiments on synthetic and real-world datasets show that SONATA consistently outperforms state-of-the-art methods in modeling complex temporal patterns and improving predictive accuracy for dynamic tensor streams.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2541. Generative Modeling from Black-Box Corruptions via Self-Consistent Stochastic Interpolants
๐ฌ 23
โญ 5.50
๐ generative models
๐ท๏ธ generative models corrupted data inverse problems stochastic interpolants
Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. The resulting method (i) is computationally efficient compared to variational alternatives, (ii) highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2542. Human-AI Curation Synergy: Scaling Preference Data Curation via Human-Guided AI Feedback
๐ฌ 36
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ preference data reward modeling data curation data annotation
Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches incorporating advanced training techniques have failed to yield meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models~(LLMs) perform automatic curation based on human guidance. Based on this preference mixture, we train simple Bradley-Terry reward models ranging from 0.6B to 8B parameters on a carefully curated subset of 26 million preference pairs from the 40M pool. We demonstrate that the resulting reward models are versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling. These reward models achieve state-of-the-art performance across seven major reward model benchmarks, outperform the latest paradigm of generative reward models, and demonstrate strong downstream performance. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. Our approach represents substantial progress in open reward models, revealing the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2543. Concept-based Adversarial Attack: a Probabilistic Perspective
๐ฌ 15
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Adversarial Attack Probabilistic Generative Models Diffusion Models
We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept - represented by a distribution - to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2544. Iterative Training of Physics-Informed Neural Networks with Fourier-enhanced Features
๐ฌ 30
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ physics-informed machine learning random features differential equations spectral bias
Spectral bias, the tendency of neural networks to learn low-frequency features first, is a well-known issue with many training algorithms for physics-informed neural networks (PINNs). To overcome this issue, we propose IFeF-PINN, an algorithm for iterative training of PINNs with Fourier-enhanced features. The key idea is to enrich the latent space using high-frequency components through Random Fourier Features. This creates a two-stage training problem: (i) estimate a basis in the feature space, and (ii) perform regression to determine the coefficients of the enhanced basis functions. For an underlying linear model, it is shown that the latter problem is convex, and we prove that the iterative training scheme converges. Furthermore, we empirically establish that Random Fourier Features enhance the expressive capacity of the network, enabling accurate approximation of high-frequency PDEs. Through extensive numerical evaluation on classical benchmark problems, the superior performance of our method over state-of-the-art algorithms is shown, and the improved approximation across the frequency domain is illustrated.
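As a minimal illustration of how Random Fourier Features inject high-frequency components into the basis before a convex regression stage, consider the sketch below; the bandwidth, target function, and plain least-squares fit are illustrative assumptions, and the physics-informed PDE-residual loss and iterative scheme are not reproduced.

```python
# Random Fourier Features followed by linear least squares on a high-frequency
# 1D target (generic sketch of the feature-enrichment idea).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 400)[:, None]
y = np.sin(2 * np.pi * 25 * x[:, 0])             # high-frequency target

def rff(x, n_features=200, bandwidth=60.0, seed=1):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=bandwidth, size=(x.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.cos(x @ W + b)

Phi = rff(x)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # convex regression stage
pred = Phi @ coef
print("train MSE with RFF basis:", round(float(np.mean((pred - y) ** 2)), 5))
```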
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2545. ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ reward hacking alignment benchmark
The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments.
To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut.
As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompts, test access, and feedback loops affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2546. A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws
๐ฌ 20
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Neural scaling law model compression lottery ticket hypothesis deep learning theory
When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the \textit{dynamical} lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2547. Global-Recent Semantic Reasoning on Dynamic Text-Attributed Graphs with Large Language Models
๐ฌ 22
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Dynamic text-attributed graph graph learning large language model
Dynamic Text-Attribute Graphs (DyTAGs), characterized by time-evolving graph interactions and associated text attributes, are prevalent in real-world applications. Existing methods, such as Graph Neural Networks (GNNs) and Large Language Models (LLMs), mostly focus on static TAGs. Extending these existing methods to DyTAGs is challenging as they largely neglect the *recent-global temporal semantics*: the recent semantic dependencies among interaction texts and the global semantic evolution of nodes over time. Furthermore, applying LLMs to the abundant and evolving text in DyTAGs faces efficiency issues. To tackle these challenges, we propose $\underline{Dy}$namic $\underline{G}$lobal-$\underline{R}$ecent $\underline{A}$daptive $\underline{S}$emantic $\underline{P}$rocessing (DyGRASP), a novel method that leverages LLMs and temporal GNNs to efficiently and effectively reason on DyTAGs. Specifically, we first design a node-centric implicit reasoning method together with a sliding window mechanism to efficiently capture recent temporal semantics. In addition, to capture global semantic dynamics of nodes, we leverage explicit reasoning with tailored prompts and an RNN-like chain structure to infer long-term semantics. Lastly, we intricately integrate the recent and global temporal semantics as well as the dynamic graph structural information using updating and merging layers. Extensive experiments on DyTAG benchmarks demonstrate DyGRASP's superiority, achieving up to 34\% improvement in Hit@10 for destination node retrieval task. Besides, DyGRASP exhibits strong generalization across different temporal GNNs and LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2548. HLD: Approximate Hierarchical Linguistic Distribution Modeling for LLM-Generated Text Detection
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ Machine-Generated Text Detection; Large language model; Model distribution estimation; Trustworthy AI
The widespread deployment of large language models (LLMs) has made the reliable detection of AI-generated text a crucial task. However, existing zero-shot detectors typically rely on proxy models to approximate probability distributions of unknown source models at a single token level. Such approaches limit detection effectiveness and make the results highly sensitive to the choice of proxy models. In contrast, supervised classifiers often operate as black boxes, sacrificing interpretability in the detection process. To address these limitations, we propose a novel detection framework that identifies LLM-generated text by approximating \textbf{H}ierarchical \textbf{L}inguistic \textbf{D}istributions--\textbf{HLD-Detector}. Specifically, we leverage n-grams to capture the feature distribution of human-written and machine-generated text across the word, syntactic, and semantic levels, and perform LLM-generated text detection by comparing these distributions under Bayesian theory.
By progressively modeling the linguistic distribution from the shallow level (token/word), then the medium level (syntactic), and ultimately the high level (semantic representations), our method mitigates the shortcomings of previous single-level detectors, improving both robustness and overall performance. Additionally, HLD-Detector requires only a small offline corpus for distribution estimation, instead of relying on online approximation with large proxy models, resulting in significantly lower computational overhead. Extensive experiments have verified the superiority of our method in detection tasks such as multi-LLM and multi-domain scenarios, achieving current SOTA performance.
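As a concrete (if drastically simplified) instance of comparing linguistic distributions under a Bayesian rule, the sketch below scores a text by the log-likelihood ratio of its unigrams under a "machine" versus a "human" reference corpus; the tiny corpora, the add-one smoothing, and the single word-level stage are all illustrative assumptions, and the syntactic and semantic levels are omitted.

```python
# Word-level instance of distribution comparison: score a text by the
# log-likelihood ratio of its unigrams under a machine vs a human corpus model.
from collections import Counter
import math

def unigram_logprob(text, counts, vocab_size, total):
    # Add-one smoothed unigram log-probability of the text under the corpus model.
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in text.split())

human_corpus = "the cat sat on the mat and looked out of the window"
machine_corpus = "as an ai language model i can certainly help you with that request"

h_counts, m_counts = Counter(human_corpus.split()), Counter(machine_corpus.split())
vocab = set(human_corpus.split()) | set(machine_corpus.split())

def llr(text):
    m = unigram_logprob(text, m_counts, len(vocab), sum(m_counts.values()))
    h = unigram_logprob(text, h_counts, len(vocab), sum(h_counts.values()))
    return m - h   # > 0 favours the machine-text hypothesis

print(round(llr("i can certainly help with that"), 3))
print(round(llr("the cat looked out of the window"), 3))
```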
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2549. T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation
๐ฌ 26
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Imputation
Imputing missing values in multivariate time series remains challenging, especially under diverse missing patterns and heavy missingness. Existing methods suffer from suboptimal performance as corrupted temporal features hinder effective cross-variable information transfer, amplifying reconstruction errors. Robust imputation requires both extracting temporal patterns from sparse observations within each variable and selectively transferring information across variables, yet current approaches excel at one while compromising the other. We introduce T1 (Time series imputation with 1-to-1 channel-head binding), a CNN-Transformer hybrid architecture that achieves robust imputation through Channel-Head Binding, a mechanism creating one-to-one correspondence between CNN channels and attention heads. This design enables selective information transfer: when missingness corrupts certain temporal patterns, their corresponding attention pathways adaptively down-weight based on remaining observable patterns while preserving reliable cross-variable connections through unaffected channels. Experiments on 11 benchmark datasets demonstrate that T1 achieves state-of-the-art performance, reducing MSE by 46% on average compared to the second-best baseline, with particularly strong gains under extreme sparsity (70% missing ratio). The model generalizes to unseen missing patterns without retraining and uses a consistent hyperparameter configuration across all datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2550. SubDyve: Subgraph-Driven Dynamic Propagation for Virtual Screening Enhancement Controlling False Positive
๐ฌ 14
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Virtual screening Data mining Subgraph pattern fingerprint Chemical similarity-based network LFDR-based seed refinement
Virtual screening (VS) aims to identify bioactive compounds from vast chemical libraries, but remains difficult in low-label regimes where only a few actives are known. Existing methods largely rely on general-purpose molecular fingerprints and overlook class-discriminative substructures critical to bioactivity. Moreover, they consider molecules independently, limiting effectiveness in low-label regimes. We introduce SubDyve, a network-based VS framework that constructs a subgraph-aware similarity network and propagates activity signals from a small set of known actives. When few active compounds are available, SubDyve performs iterative seed refinement, incrementally promoting new candidates based on the local false discovery rate. This strategy expands the seed set with promising candidates while controlling false positives from topological bias and overexpansion. We evaluate SubDyve on ten DUD-E targets under zero-shot conditions and on the CDK7 target with a 10-million-compound ZINC dataset. SubDyve consistently outperforms existing fingerprint or embedding-based approaches, achieving margins of up to +34.0 on BEDROC and +24.6 on the $EF_{1\%}$ metric.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2551. GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation
๐ฌ 4
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ geometric problem solving benchmark
Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems. Our benchmark and
code are released at https://anonymous.4open.science/r/GeoBench-4BF4.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2552. WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark
๐ฌ 15
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Image Editing
Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce WorldEdit, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic.
Furthermore, we provide WorldEdit-Test for evaluating existing models' performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework to fine-tune models like Bagel, integrated with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2553. DynaGuard: A Dynamic Guardian Model With User-Defined Policies
๐ฌ 19
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Guardrails Content Moderation Compliance
Guardian models play a crucial role in ensuring the safety and ethical behavior of user-facing AI applications by enforcing guardrails and detecting harmful content. While standard guardian models are limited to predefined, static harm categories, we introduce DynaGuard, a suite of dynamic guardian models offering novel flexibility by evaluating text based on user-defined policies, and DynaBench, a dataset for training and evaluating dynamic guardian models. Our models provide both rapid detection of policy violations and a chain-of-thought reasoning option that articulates and justifies model outputs. Critically, DynaGuard not only surpasses static models in detection accuracy on traditional safety categories, but is competitive with frontier reasoning models on free-form policy violations, all in a fraction of the time. This breakthrough makes DynaGuard a critical tool for language model guardrails.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2554. ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
๐ฌ 20
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ expert-level evaluation long-form evaluation fine-grained evaluation
This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items of model outputs are then compared with corresponding items of reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 13 popular large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer Gemini-2.5-Pro achieving only a 33.4 F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, but far from correct; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable, reproducible, and low-cost usage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2555. SoFlow: Solution Flow Models for One-Step Generative Modeling
๐ฌ 22
โญ 5.50
๐ generative models
๐ท๏ธ Flow Matching Models Consistency Models One-step generation
The multi-step denoising process in diffusion and flow-matching models causes major efficiency issues, which motivates research on few-step generation. We present Solution Flow Models (SoFlow), a framework for one-step generation from scratch. By analyzing the relationship between the velocity function and the solution function of the velocity Ordinary Differential Equation (ODE), we propose a flow matching loss and a solution consistency loss to train our models. The flow matching loss allows our models to provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation performance. Notably, our consistency loss does not require the calculation of the Jacobian-Vector Product (JVP), a common requirement in recent works that is not well-optimized in deep learning frameworks like PyTorch. Experimental results indicate that, when trained from scratch using the same diffusion transformer (DiT) architecture and with an equal number of training epochs, our models achieve better FID-50K scores compared to MeanFlow models on the ImageNet 256x256 dataset.
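For readers new to flow matching, the sketch below shows the standard conditional flow-matching regression target along the linear interpolation path, where the target velocity is constant in time; the toy samples are illustrative, and SoFlow's solution-consistency loss is specific to the paper and not reproduced here.

```python
# Standard conditional flow-matching target on the straight-line path
# x_t = (1 - t) x0 + t x1, whose target velocity x1 - x0 is constant in t.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 2))        # noise samples
x1 = rng.standard_normal((8, 2)) + 3.0  # "data" samples
t = rng.uniform(size=(8, 1))

x_t = (1.0 - t) * x0 + t * x1           # point on the interpolation path
v_target = x1 - x0                      # regression target for the velocity network

def flow_matching_loss(v_pred: np.ndarray, v_tgt: np.ndarray) -> float:
    return float(np.mean((v_pred - v_tgt) ** 2))

# A perfect constant-velocity predictor attains zero loss:
print(flow_matching_loss(v_target, v_target))
```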
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2556. Sequential Parallel Duality in Prefix Scannable Models
๐ฌ 7
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ state space models linear RNNs linear transformers sequence modeling
Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such ``sequential-parallel duality.'' This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models -- state space models -- as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models -- in some cases exhibiting better length generalization than either.
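To make "state updates computed with a prefix scan over a custom associative operator" concrete, here is a minimal sketch for the affine update $h_t = a_t h_{t-1} + b_t$ used by many linear-RNN and state-space layers; the recursive-doubling scan below is a generic textbook construction, not code from the paper.

```python
# Parallel-style prefix scan for the affine recurrence h_t = a_t * h_{t-1} + b_t.
import numpy as np

def combine(left, right):
    # Compose two affine maps: applying (a1, b1) then (a2, b2) gives
    # h -> a2 * (a1 * h + b1) + b2 = (a1 * a2) * h + (a2 * b1 + b2).
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def inclusive_scan(a, b):
    # Hillis-Steele recursive doubling: O(log N) steps; every inner iteration
    # over i is independent and could run in parallel.
    elems = list(zip(a, b))
    n, step = len(elems), 1
    while step < n:
        elems = [elems[i] if i < step else combine(elems[i - step], elems[i])
                 for i in range(n)]
        step *= 2
    return elems

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=8)   # per-token decay a_t
b = rng.standard_normal(8)          # per-token input b_t
h_scan = [A * 0.0 + B for A, B in inclusive_scan(a, b)]   # states from h_0 = 0

# Sequential reference: h_t = a_t * h_{t-1} + b_t.
h_ref, h = [], 0.0
for t in range(8):
    h = a[t] * h + b[t]
    h_ref.append(h)
print("scan matches recurrence:", np.allclose(h_scan, h_ref))
```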
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2557. LLM2Fx-Tools: Tool Calling for Music Post-Production
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Music Post Production Fx Chain Generation Tool Calling
This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine their order, and estimate parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is transferred from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates that our approach generates appropriate CoT reasoning and responses for music production queries. To our knowledge, this is the first work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2558. Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Cross-domain reinforcement learning; transfer learning; Bellman consistency
Cross-domain reinforcement learning (CDRL) is meant to improve the data efficiency of RL by leveraging the data samples collected from a source domain to facilitate the learning in a similar target domain. Despite its potential, cross-domain transfer in RL is known to have two fundamental and intertwined challenges: (i) The source and target domains can have distinct state or action spaces, which makes direct transfer infeasible and thereby requires more sophisticated inter-domain mappings; (ii) The transferability of a source-domain model in RL is not easily identifiable a priori, and hence CDRL can be prone to negative transfer. In this paper, we propose to jointly tackle these two challenges through the lens of \textit{cross-domain Bellman consistency} and \textit{hybrid critic}. Specifically, we first introduce the notion of cross-domain Bellman consistency as a way to measure transferability of a source-domain model. Then, we propose $Q$Avatar, which combines the Q functions from both the source and target domains with an adaptive hyperparameter-free weight function. Through this design, we characterize the convergence behavior of $Q$Avatar and show that $Q$Avatar achieves reliable transfer in the sense that it effectively leverages a source-domain Q function for knowledge transfer to the target domain. Through experiments, we demonstrate that $Q$Avatar achieves favorable transferability across various RL benchmark tasks, including locomotion and robot arm manipulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2559. NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation
๐ฌ 17
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ long video generation model; video generation benchmarks
With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models that underpin them. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2560. Stable and Scalable Deep Predictive Coding Networks with Meta Prediction Errors
๐ฌ 31
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Neuroscience Predictive Coding Meta Predictive Coding Free Energy
Predictive Coding Networks (PCNs) offer a biologically inspired alternative to conventional deep neural networks.
However, their scalability is hindered by severe training instabilities that intensify with network depth.
Through dynamical mean-field analyses, we identify two fundamental pathologies that impede deep PCN training:
(1) prediction error (PE) imbalance that leads to uneven learning across layers, characterized by error concentration at network boundaries; and
(2) exploding and vanishing prediction errors (EVPE) sensitive to weight variance.
To address these challenges, we propose Meta-PCN, a unified framework that incorporates two synergistic components:
(1) loss based on meta-prediction errors, which minimizes PEs of PEs to linearize the nonlinear inference dynamics; and
(2) weight regularization that combines normalization and clipping to regulate weight variance and mitigate EVPE.
Extensive experimental validation on CIFAR-10/100 and TinyImageNet demonstrates that Meta-PCN achieves statistically significant improvements over conventional PCN and backpropagation across most architectures, while maintaining biological plausibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2561. Antithetic Noise in Diffusion Models
๐ฌ 13
โญ 5.50
๐ generative models
๐ท๏ธ diffusion model initial noise uncertainty quantification
We systematically study antithetic initial noise in diffusion models, discovering that pairing each noise sample with its negation consistently produces strong negative correlation. This universal phenomenon holds across datasets, model architectures, conditional and unconditional sampling, and even other generative models such as VAEs and Normalizing Flows. To explain it, we combine experiments and theory and propose a \textit{symmetry conjecture} that the learned score function is approximately affine antisymmetric (odd symmetry up to a constant shift), supported by empirical evidence.
This negative correlation leads to substantially more reliable uncertainty quantification with up to $90\%$ narrower confidence intervals. We demonstrate these gains on tasks including estimating pixel-wise statistics and evaluating diffusion inverse solvers. We also provide extensions with randomized quasi-Monte Carlo noise designs for uncertainty quantification, and explore additional applications of the antithetic noise design to improve image editing and generation diversity. Our framework is training-free, model-agnostic, and adds no runtime overhead. Code is available at \url{https://anonymous.4open.science/r/Antithetic-Noise-in-Diffusion-Models-8B54}.
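As a toy illustration of why antithetic pairing narrows confidence intervals, the Monte Carlo sketch below compares i.i.d. sampling with antithetic pairs $(z, -z)$ at an equal number of function evaluations; the odd-dominated statistic is only a stand-in for the near-antisymmetric behaviour the paper reports for learned score functions.

```python
# Antithetic pairs (z, -z) vs i.i.d. sampling for estimating E[f(z)]: when f is
# dominated by an odd component, pair averages are strongly negatively
# correlated and the standard error shrinks at equal compute.
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    # Stand-in statistic with a strong odd component plus a small even part.
    return 2.0 * z + 0.3 * z ** 2

n = 10_000
iid_vals = f(rng.standard_normal(n))              # n i.i.d. evaluations

z = rng.standard_normal(n // 2)
anti_vals = 0.5 * (f(z) + f(-z))                  # n // 2 pairs = n evaluations

print("i.i.d. standard error:     ", round(float(iid_vals.std() / np.sqrt(n)), 5))
print("antithetic standard error: ", round(float(anti_vals.std() / np.sqrt(n // 2)), 5))
print("corr(f(z), f(-z)):         ", round(float(np.corrcoef(f(z), f(-z))[0, 1]), 3))
```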
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2562. FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching
๐ฌ 15
โญ 5.50
๐ generative models
๐ท๏ธ Flow Matching Speculative Decoding Inference Acceleration Training-Free Generative Models Zero-Cost Drafts Parallel Verification Adaptive Sampling
Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, its prohibitively slow inference due to a large number of denoising steps limits its potential use in real-time or interactive applications. Existing acceleration methods, like distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves $>2.5\times$ speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.
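The sketch below captures the flavour of constant-velocity speculation on a toy ODE: when recent velocities are nearly constant, the next evaluation is skipped and the forecast is reused. The analytic velocity field, the stability test, and the tolerance are all illustrative assumptions; the paper's acceptance rule and verification procedure differ.

```python
# Toy constant-velocity speculation: skip the velocity evaluation when the
# trajectory looks locally constant-velocity (not the paper's exact scheme).
import numpy as np

def velocity(x, t):
    # Stand-in velocity field; a well-trained FM model is nearly constant in t.
    return np.array([1.0, -0.5]) + 0.01 * np.sin(20.0 * t)

def integrate(x0, n_steps=50, speculate=True, tol=1e-4):
    x, dt = np.array(x0, dtype=float), 1.0 / n_steps
    v_prev, v_prev2, calls = None, None, 0
    for i in range(n_steps):
        t = i * dt
        stable = (speculate and v_prev is not None and v_prev2 is not None
                  and float(np.mean((v_prev - v_prev2) ** 2)) < tol)
        if stable:
            v = v_prev                       # reuse the constant-velocity forecast
        else:
            v = velocity(x, t); calls += 1   # real (expensive) model call
        x, v_prev2, v_prev = x + dt * v, v_prev, v
    return x, calls

x_full, c_full = integrate([0.0, 0.0], speculate=False)
x_spec, c_spec = integrate([0.0, 0.0], speculate=True)
print("model calls:", c_full, "->", c_spec,
      "| end-point gap:", round(float(np.linalg.norm(x_full - x_spec)), 5))
```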
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2563. Memento: Toward an All-Day Proactive Assistant for Ultra-Long Streaming Video
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language Models; Online Ultra-Long video understanding; Dynamic Memory
Multimodal large language models have demonstrated impressive capabilities in visual-language understanding, particularly in offline video tasks. More recently, the emergence of online video modeling has introduced early forms of active interaction. However, existing models, typically limited to tens of minutes, are not yet capable of all-day proactive understanding over ultra-long video streams. They struggle to maintain long-term context online, as they suffer from token accumulation and lack scalable memory mechanisms. These limitations hinder critical tasks such as reminding users that medication was taken hours earlier, an ability that exemplifies the shift from reactive to memory-oriented assistants with long-term reasoning. To bridge this gap, we present Memento, the first proactive vision-language framework for ultra-long streaming video. To avoid token growth and support scalable long-duration understanding, we introduce Dynamic Memory and Query-related Memory Selection, enabling sparse memory retention and efficient retrieval. To address the training challenges of memory-based modeling, we propose Step-Aware Memory Attention, which aligns memory access with temporal steps for stable supervision. To support both training and evaluation of active, long-term behavior, we construct Memento-54K and MementoBench, a dataset-benchmark suite covering diverse tasks on text, object, and action across video streams up to 7 hours. Experiments demonstrate that Memento achieves superior performance, paving the way toward reliable all-day proactive video assistants.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2564. Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges
๐ฌ 21
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM-as-a-Judge Hypothesis testing Finite-sample guarantees Type I/II errors
Reliable certification of Large Language Models (LLMs), i.e., verifying that failure rates are below a safety threshold, is critical yet challenging. While "LLM-as-a-Judge" offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees.
We introduce a "Noisy but Valid" hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge's True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator.
Our contributions include: (1) Theoretical Guarantees: We derive the exact conditions under which noisy testing yields higher statistical power than direct evaluation; (2) Empirical Validation: Experiments on Jigsaw Comment, Hate Speech and SafeRLHF confirm our theory; (3) The Oracle Gap: We reveal a significant performance gap between practical methods and the theoretical "Oracle" (perfectly known judge parameters), quantifying the cost of estimation.
Specifically, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen understanding of statistical evaluation with LLM judges, and highlight trade-offs among competing inferential tools.
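One plausible instantiation of the corrected test is a Rogan-Gladen-style adjustment with a delta-method variance that folds in calibration uncertainty; the paper's exact threshold construction may differ, so treat the sketch below as illustrative.

```python
import numpy as np
from scipy.stats import norm

def certify_failure_rate(judge_flags, tpr, fpr, n_cal_pos, n_cal_neg,
                         threshold=0.05, alpha=0.05):
    """One-sided test of H0: true failure rate >= threshold, from noisy judge labels.

    judge_flags: 0/1 array of judge-predicted failures on the large evaluation set.
    tpr, fpr:    judge true/false positive rates estimated on a human-labelled
                 calibration set with n_cal_pos failures and n_cal_neg non-failures.
    A Rogan-Gladen correction maps the judge-positive rate back to an estimate of
    the true failure rate; calibration uncertainty in (tpr, fpr) is folded into
    the variance with a delta-method approximation (requires tpr > fpr).
    """
    judge_flags = np.asarray(judge_flags, dtype=float)
    n = len(judge_flags)
    q = judge_flags.mean()                         # judge-positive rate
    denom = tpr - fpr
    p_hat = (q - fpr) / denom                      # corrected failure-rate estimate
    # Delta method: sampling noise in q plus estimation noise in tpr and fpr.
    var_q = q * (1 - q) / n
    var_tpr = tpr * (1 - tpr) / n_cal_pos
    var_fpr = fpr * (1 - fpr) / n_cal_neg
    var_p = (var_q + p_hat ** 2 * var_tpr + (1 - p_hat) ** 2 * var_fpr) / denom ** 2
    z = (p_hat - threshold) / np.sqrt(var_p + 1e-12)
    certified = z < norm.ppf(alpha)                # reject H0 only if clearly below threshold
    return p_hat, float(np.sqrt(var_p)), bool(certified)
```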
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2565. Vision Language Models are Biased
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ vision language models multimodal reasoning benchmark bias
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a 4th stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains spanning animals, logos, chess, game boards, optical illusions, and patterned grids. Removing image backgrounds nearly doubles accuracy (a 21.09-point gain), revealing that background visual cues trigger these biased responses. Further analysis of VLMs' reasoning patterns shows that counting accuracy initially rises with thinking tokens, reaching ~40%, before declining with model overthinking. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2566. GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks
๐ฌ 32
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ LLM Benchmark and Evaluation Prompt Optimization
This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni spans diverse graph types, serialization formats, and prompting schemes, substantially extending upon prior efforts in both scope and depth. Through systematic evaluation, we uncover critical interactions among these dimensions, revealing their decisive impact on model performance. Our experiments show that state-of-the-art closed-source models such as Claude-3.5 and o4-mini consistently lead overall, yet still leave considerable headroom, while open-source models display pronounced sensitivity to various design choices. Beyond the standard scope, larger graphs, real-world graphs, and additional NP-hard tasks are further discussed. We further analyze efficiency via output token usage, highlighting cost-accuracy trade-offs, and introduce a reinforcement learning-based optimizer that adaptively selects factor combinations, reducing evaluation cost by 75\% while retaining strong accuracy. This flexible and extensible benchmark not only deepens understanding of LLM performance on structured graph reasoning but also establishes a robust foundation for advancing model design and evaluation.
The code and datasets are available at https://anonymous.4open.science/r/ID-14092.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2567. A Problem-Oriented Perspective and Anchor Verification for Code Optimization
๐ฌ 16
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LLM4Code Code Optimization
Large Language Models (LLMs) have shown remarkable capabilities in solving various programming tasks, such as code generation. However, their potential for code optimization, particularly in performance enhancement, remains largely unexplored. This paper investigates the capabilities of LLMs in optimizing code for minimal execution time, addressing a critical gap in current research. Recently proposed code optimization methods construct program optimization pairs from iterative submissions by the same programmer for the same problem. However, this approach confines LLMs to local performance improvements, neglecting global algorithmic innovation. To overcome this limitation, we adopt a completely different perspective by reconstructing the optimization pairs into a problem-oriented approach. This allows for the integration of various ideas from multiple programmers tackling the same problem. Furthermore, we observe that code optimization presents greater challenges compared to code generation, often accompanied by an "optimization tax". Recognizing the inherent trade-offs in correctness and efficiency, we introduce a novel anchor verification framework to mitigate this "optimization tax". Ultimately, the problem-oriented perspective, combined with the anchor verification framework, raises both the correct optimization ratio and the achieved speedup to new levels.
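As a rough illustration of anchor-style verification, one can accept an LLM-optimized program only if it reproduces a trusted anchor's outputs on shared tests and measurably improves runtime; the interfaces below are hypothetical stand-ins, not the paper's framework.

```python
import time

def anchor_verify(anchor_fn, candidate_fn, test_inputs, min_speedup=1.1):
    """Accept a candidate optimization only if it matches the anchor's outputs on
    every test input and runs at least `min_speedup` times faster overall.

    anchor_fn / candidate_fn: callables standing in for executing the trusted
    reference program and the LLM-optimized program on a single input.
    """
    t_anchor = 0.0
    t_candidate = 0.0
    for x in test_inputs:
        start = time.perf_counter()
        y_ref = anchor_fn(x)
        t_anchor += time.perf_counter() - start

        start = time.perf_counter()
        y_new = candidate_fn(x)
        t_candidate += time.perf_counter() - start

        if y_new != y_ref:                 # correctness check against the anchor
            return False, 0.0
    speedup = t_anchor / max(t_candidate, 1e-9)
    return speedup >= min_speedup, speedup
```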
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2568. Sparse Imagination for Efficient Visual World Model Planning
๐ฌ 25
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ World Model Planning Computational Efficiency Model Predictive Control Vision Transformer
World model-based planning has significantly improved decision-making in complex environments by enabling agents to simulate future states and make informed choices.
However, simulating future states with large visual world models imposes a heavy computational burden, which is particularly restrictive in robotics, where resources are severely constrained.
To address this limitation, we propose Sparse Imagination for Efficient Visual World Model Planning, which enhances computational efficiency by reducing the number of tokens processed during forward prediction.
Our method leverages a sparsely trained, transformer-based visual world model with a randomized grouped attention strategy, allowing the model to flexibly adjust the number of tokens it processes according to the available computational budget.
By enabling sparse imagination during latent rollout, our approach significantly accelerates planning while maintaining high control fidelity.
Experimental results demonstrate that sparse imagination preserves task performance while dramatically improving inference efficiency.
This general technique for visual planning applies to settings ranging from simple test-time trajectory optimization to complex real-world tasks with the latest VLAs, enabling the deployment of world models in real-time scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2569. Online Learning and Equilibrium Computation with Ranking Feedback
๐ฌ 9
โญ 5.50
๐ learning theory
๐ท๏ธ Online Learning Equilibrium Computation Human Feedback
Online learning in arbitrary and possibly adversarial environments has been extensively studied in sequential decision-making, with a strong connection to equilibrium computation in game theory. Most existing online learning algorithms are based on \emph{numeric} utility feedback from the environment, which may be unavailable in applications with humans in the loop and/or with privacy concerns. In this paper, we study an online learning setting where only a \emph{ranking} of a set of proposed actions is provided to the learning agent at each timestep. We consider both ranking models based on either the \emph{instantaneous} utility at each timestep, or the \emph{time-average} utility until the current timestep, in both \emph{full-information} and \emph{bandit} feedback settings. Focusing on the standard (external-)regret metric, we show that sublinear regret cannot be achieved with the instantaneous utility ranking feedback in general. Moreover, we show that when the ranking model is relatively \emph{deterministic} (\emph{i.e.,} with a small temperature in the Plackett-Luce model), sublinear regret cannot be achieved with the time-average utility ranking feedback, either. We then propose new algorithms to achieve sublinear regret, under the additional assumption that the utility vectors have a sublinear variation. Notably, we also show that when time-average utility ranking is used, such an additional assumption can be avoided in the full-information setting. As a consequence, we show that if all the players follow our algorithms, an approximate coarse correlated equilibrium of a normal-form game can be found through repeated play. Finally, we also validate the effectiveness of our algorithms via numerical experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2570. Task-Related Token Compression in Multimodal Large Language Models from an Explainability Perspective
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal large language model; token compression
Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Instruction-related visual token compression demonstrates strong task relevance, which aligns well with MLLMs' ultimate goal of instruction following. Previous works generally assume that visual tokens achieve better vision-language alignment in the shallow layers of LLMs, which has led to task-related token compression being primarily applied in intermediate LLM layers. In contrast, our study reveals that with proper selection, task-related token compression is feasible at the input stage of the LLM with negligible performance loss. This new paradigm significantly reduces task-irrelevant visual tokens, and its model-agnostic design enables application without modifying the LLM architecture. Specifically, we suggest that explainability methods for transformer-based architectures can evaluate the global importance of each visual token with respect to the given instruction, which can effectively guide the task-related token compression for MLLMs. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass. Interestingly, this mapping can be learned using a simple and lightweight convolutional network, whose training is efficient and independent of MLLMs. Extensive experiments on 11 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the remarkable effectiveness and strong generalization of our approach. Additionally, our new compression paradigm achieves faster inference with reductions in both prefilling time and KV-cache memory.
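A minimal sketch of input-stage, instruction-guided token selection: a lightweight convolutional network maps first-layer attention to per-token relevance, and only the top-k visual tokens are kept. Shapes, architecture, and the keep ratio below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnToRelevance(nn.Module):
    """Lightweight CNN mapping a first-layer attention map to per-token relevance.

    attn: (B, H, T_text, T_vis) attention from instruction tokens to visual tokens.
    Output: (B, T_vis) relevance scores used to keep only the top visual tokens
    before they enter the LLM. Architecture and shapes are illustrative.
    """
    def __init__(self, num_heads):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(num_heads, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, attn):
        per_head = attn.mean(dim=2)              # (B, H, T_vis): average over text queries
        return self.net(per_head).squeeze(1)     # (B, T_vis)

def compress_tokens(vis_tokens, relevance, keep_ratio=0.25):
    """Keep the most instruction-relevant visual tokens, preserving their order."""
    B, T, D = vis_tokens.shape
    k = max(1, int(T * keep_ratio))
    idx = relevance.topk(k, dim=1).indices.sort(dim=1).values   # (B, k)
    return torch.gather(vis_tokens, 1, idx.unsqueeze(-1).expand(B, k, D))
```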
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2571. Graphon Cross-Validation: Assessing Models on Network Data
๐ฌ 21
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graphon Link prediction Cross-Validation Graph imputation Selection consistency
Graphon models have emerged as powerful tools for modeling complex network structures by capturing connection probabilities among nodes. A key challenge in their application lies in accurately characterizing the graphon function, particularly with respect to parameters that govern its smoothness, which significantly impact the estimation accuracy. In this article, we propose a novel graphon cross-validation method for selecting tuning parameters and estimation approaches.
Our method is both theoretically sound and computationally efficient.
We show that our proposed cross-validation score is asymptotically parallel to the estimation error. Through extensive simulations and real-world applications, we demonstrate that our method consistently delivers superior computational efficiency and accuracy.
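The abstract does not spell out the cross-validation score, so the sketch below only illustrates the generic recipe: hold out adjacency entries, fit a simple histogram-style graphon estimator for each candidate smoothing level (here, the number of blocks), and pick the level with the lowest held-out error.

```python
import numpy as np

def histogram_graphon(A, mask, num_blocks):
    """Fit a network-histogram graphon estimate using only observed entries.

    A: (n, n) symmetric 0/1 adjacency matrix; mask: (n, n) bool observed-entry mask.
    Nodes are grouped by observed degree into `num_blocks` blocks and each block
    pair's connection probability is the mean over its observed entries.
    """
    n = A.shape[0]
    deg = (A * mask).sum(1) / np.maximum(mask.sum(1), 1)
    ranks = np.argsort(np.argsort(deg))
    blocks = (ranks * num_blocks) // n
    P = np.zeros((n, n))
    for a in range(num_blocks):
        for b in range(num_blocks):
            ia, ib = np.where(blocks == a)[0], np.where(blocks == b)[0]
            sub_mask = mask[np.ix_(ia, ib)]
            if sub_mask.sum() > 0:
                P[np.ix_(ia, ib)] = (A[np.ix_(ia, ib)] * sub_mask).sum() / sub_mask.sum()
    return P

def graphon_cv(A, candidate_blocks=(2, 4, 8, 16), holdout_frac=0.1, seed=0):
    """Select the number of blocks by squared error on held-out adjacency entries."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    hold = np.triu(rng.random((n, n)) < holdout_frac, 1)
    hold = hold | hold.T                          # symmetric held-out entry set
    train_mask = ~hold
    scores = {k: float(np.mean((A[hold] - histogram_graphon(A, train_mask, k)[hold]) ** 2))
              for k in candidate_blocks}
    return min(scores, key=scores.get), scores
```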
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2572. NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Feed-forward 3D Reconstruction Non-pixel-aligned 3D Reconstruction 3D Completion
We present NOVA3R, an effective approach for non-pixel-aligned 3D reconstruction from a set of unposed images, in a feed-forward manner. Unlike pixel-aligned methods that tie geometry to per-ray predictions, our formulation learns a global, view-agnostic scene representation that decouples reconstruction from pixel alignment. This addresses two key limitations in pixel-aligned 3D: (1) it recovers both visible and invisible regions with a complete scene representation, and (2) it produces physically plausible geometry with fewer duplicated structures in overlapping regions. To achieve this, we introduce a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. Extensive experiments on both scene-level and object-level datasets demonstrate that NOVA3R outperforms state-of-the-art methods in terms of reconstruction accuracy and completeness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2573. Multimodal Classification via Total Correlation Maximization
๐ฌ 11
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multimodal Learning; Modality Imbalance; Multimodal Classification;
Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective.
In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce **T**otal **C**orrelation **N**eural **E**stimation (**TCNE**) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at https://anonymous.4open.science/r/TCMax_Experiments.
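A minimal MINE-style sketch of a total-correlation lower bound between two modality features and the label, using the Donsker-Varadhan bound with independently shuffled negatives; the statistics network and the exact objective are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCEstimator(nn.Module):
    """Statistics network T(z1, z2, y) for a total-correlation lower bound.

    TC(z1, z2, y) >= E_joint[T] - log E_marginals[exp(T)] (Donsker-Varadhan),
    where the second expectation independently shuffles z1, z2, and y.
    Architecture and dimensions are illustrative choices.
    """
    def __init__(self, d1, d2, num_classes, hidden=256):
        super().__init__()
        self.num_classes = num_classes
        self.net = nn.Sequential(
            nn.Linear(d1 + d2 + num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z1, z2, y):
        y_oh = F.one_hot(y, self.num_classes).float()
        return self.net(torch.cat([z1, z2, y_oh], dim=-1)).squeeze(-1)

def tc_lower_bound(est, z1, z2, y):
    """Maximizing this bound encourages features that share information with the label."""
    n = len(y)
    joint = est(z1, z2, y).mean()
    perm1, perm2, permy = (torch.randperm(n) for _ in range(3))
    marg = est(z1[perm1], z2[perm2], y[permy])
    return joint - torch.logsumexp(marg, dim=0) + torch.log(torch.tensor(float(n)))
```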
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2574. Machine Unlearning under Retain-Forget Entanglement
๐ฌ 16
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Machine Unlearning Constrained Optimization
Forgetting a subset in machine unlearning is rarely an isolated task. Often, retained samples that are closely related to the forget set can be unintentionally affected, particularly when they share correlated features from pretraining or exhibit strong semantic similarities. To address this challenge, we propose a novel two-phase optimization framework specifically designed to handle such retain-forget entanglements. In the first phase, an augmented Lagrangian method increases the loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step, regularized by the Wasserstein-2 distance, to mitigate performance degradation on semantically related retained samples without compromising the unlearning objective. We validate our approach through comprehensive experiments on multiple unlearning tasks, standard benchmark datasets, and diverse neural architectures, demonstrating that it achieves effective and reliable unlearning while outperforming existing baselines in both accuracy retention and removal fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2575. SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
๐ฌ 14
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Self-Play Large Language Models Reasoning Multi-Agent Reinforcement Learning
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need for human supervision. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. SPIRAL produces reasoning capabilities that transfer broadly, improving performance by up to 10\% across a suite of 8 reasoning benchmarks on 4 different models spanning Qwen and Llama model families, outperforming supervised fine-tuning on 25,000 expert game trajectories. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) yields the strongest results, with improvements observed across both base and instruction-tuned models. Analysis of chain-of-thought traces reveals that games develop distinct cognitive patterns that transfer to improve reasoning performance, with different games developing complementary strengths. Even models which have already been trained on reasoning tasks using RLVR, like DeepSeek-R1-Distill-Qwen-7B, still benefit from our approach. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities across diverse model architectures and training stages, highlighting a promising direction for autonomous reasoning development.
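The abstract names role-conditioned advantage estimation (RAE) without details; one simple reading is a per-role running baseline, sketched below as an illustrative assumption rather than the paper's estimator.

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """Per-role running baselines for advantage estimation in self-play (sketch).

    In asymmetric zero-sum games (e.g., first vs. second player in Kuhn Poker),
    sharing one baseline across roles inflates variance; keeping an exponential
    moving average of returns per role and subtracting it is one simple way to
    condition the advantage on the role.
    """
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.baseline = defaultdict(float)
        self.initialized = set()

    def advantage(self, role, episode_return):
        if role not in self.initialized:
            self.baseline[role] = episode_return   # bootstrap the baseline on first visit
            self.initialized.add(role)
        adv = episode_return - self.baseline[role]
        self.baseline[role] = (self.momentum * self.baseline[role]
                               + (1 - self.momentum) * episode_return)
        return adv
```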
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2576. Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
๐ฌ 17
โญ 5.50
๐ reinforcement learning
๐ท๏ธ imitation learning model-based learning
We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $\epsilon$ and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2577. Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning
๐ฌ 9
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Critique Reinforcement Learning Reinforcement Learning Critique Fine-Tuning Large Language Models for Code Generation Test-Time Scaling
Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD) have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce \textsc{Critique-Coder}, which is trained on a hybrid of RL and CRL by substituting 20\% of the standard RL data with CRL data. We fine-tune multiple models (\textsc{Critique-Coder}) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that \textsc{Critique-Coder} consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our \textsc{Critique-Coder-8B} can reach over 60\% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1.
Beyond code generation, \textsc{Critique-Coder} also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.
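The CRL reward described above reduces to checking whether the critique's final judgment matches the ground-truth label; a toy sketch (with an assumed verdict format) could look like:

```python
import re

def crl_reward(critique_text: str, ground_truth_correct: bool) -> float:
    """Binary CRL reward: 1.0 iff the critique's final judgment matches the label.

    Assumes the critique ends with a verdict such as "Judgment: True" or
    "Judgment: False"; the exact format is an assumption for this sketch.
    """
    match = re.search(r"judgment\s*:\s*(true|false)", critique_text, re.IGNORECASE)
    if match is None:
        return 0.0                      # unparseable critiques earn no reward
    predicted = match.group(1).lower() == "true"
    return 1.0 if predicted == ground_truth_correct else 0.0
```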
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2578. SCOPED: Score-Curvature Out-of-distribution Proximity Evaluator for Diffusion
๐ฌ 11
โญ 5.50
๐ generative models
๐ท๏ธ Out-of-distribution detection Diffusion models Typicality Generative modeling Reinforcement learning
Out-of-distribution (OOD) detection is essential for reliable deployment of machine learning systems in vision, robotics, and reinforcement learning. We introduce Score-Curvature Out-of-distribution Proximity Evaluator for Diffusion (SCOPED), a fast and general-purpose OOD detection method for diffusion models that reduces the number of forward passes on the trained model by an order of magnitude compared to prior methods, outperforming most diffusion-based baselines and approaching the accuracy of the strongest ones. SCOPED is computed from a single diffusion model trained once on a diverse dataset and combines the Jacobian trace and squared norm of the model's score function into a single test statistic. Rather than thresholding on a fixed value, we estimate the in-distribution density of SCOPED scores using kernel density estimation, enabling a flexible, unsupervised test that, in the simplest case, only requires a single forward pass and one Jacobian-vector product (JVP), made efficient by Hutchinson's trace estimator. On four vision benchmarks, SCOPED achieves competitive or state-of-the-art precision-recall scores despite its low computational cost. The same method generalizes to robotic control tasks with shared state and action spaces, identifying distribution shifts across reward functions and training regimes. These results position SCOPED as a practical foundation for fast and reliable OOD detection in real-world domains, including perceptual artifacts in vision, outlier detection in autoregressive models, exploration in reinforcement learning, and dataset curation for unsupervised training.
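A sketch of the test statistic as described: squared score norm plus a Hutchinson estimate of the score Jacobian trace from a single JVP, with a KDE fit over in-distribution values. How the two terms are combined and the `score_fn` interface are assumptions of this sketch.

```python
import torch
from scipy.stats import gaussian_kde

def scoped_statistic(score_fn, x, t, num_probes=1):
    """SCOPED-style statistic: squared score norm + Hutchinson trace of the score Jacobian.

    score_fn(x, t) -> score of the trained diffusion model at noise level t, with the
    same shape as the batched input x of shape (B, ...). The Jacobian trace is
    estimated with Rademacher probes, each requiring one Jacobian-vector product.
    """
    x = x.detach()
    stats = []
    for _ in range(num_probes):
        v = (torch.randint(0, 2, x.shape, device=x.device) * 2 - 1).to(x.dtype)
        s, jvp = torch.autograd.functional.jvp(lambda z: score_fn(z, t), (x,), (v,))
        trace_est = (v * jvp).flatten(1).sum(dim=1)      # per-sample Hutchinson estimate
        sq_norm = s.flatten(1).pow(2).sum(dim=1)
        stats.append(sq_norm + trace_est)
    return torch.stack(stats).mean(dim=0)                # (B,) test statistic

def fit_in_distribution_density(train_stats):
    """KDE over in-distribution statistics; low density at test time flags OOD inputs."""
    return gaussian_kde(train_stats.detach().cpu().numpy())
```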
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2579. Compositional Neuro-Symbolic Concepts in Neural Activities
๐ฌ 13
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ concept grounding neural decoding neuro-symbolic systems
We propose NEURONA, a modular neuro-symbolic framework for fMRI decoding and concept grounding in neural activity. Leveraging image- and video-based fMRI question-answering datasets, NEURONA learns to decode interacting concepts from visual stimuli from patterns of fMRI signals, integrating symbolic reasoning and compositional execution with fMRI grounding across brain regions. We demonstrate that incorporating structure into the decoding pipeline improves both decoding accuracy and generalization performance. NEURONA shows that modeling the compositional structure of concepts through hierarchical predicate-argument dependencies enables more precise decoding from fMRI, highlighting neuro-symbolic frameworks as promising tools for neural decoding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2580. Discrete Bayesian Sample Inference for Graph Generation
๐ฌ 18
โญ 5.50
๐ generative models
๐ท๏ธ generative models graph generation diffusion models bayesian flow networks bayesian sample inference molecule generation
Generating graph-structured data is crucial in applications such as molecular generation, knowledge graphs, and network analysis. However, their discrete, unordered nature makes them difficult for traditional generative models, leading to the rise of discrete diffusion and flow matching models. In this work, we introduce GraphBSI, a novel one-shot graph generative model based on Bayesian Sample Inference (BSI). Instead of evolving samples directly, GraphBSI iteratively refines a belief over graphs in the continuous space of distribution parameters, naturally handling discrete structures. Further, we state BSI as a stochastic differential equation (SDE) and derive a noise-controlled family of SDEs that preserves the marginal distributions via an approximation of the score function. Our theoretical analysis further reveals the connection to Bayesian Flow Networks and Diffusion models. Finally, in our empirical evaluation, we demonstrate state-of-the-art performance on molecular and synthetic graph generation, outperforming existing one-shot graph generative models on the standard benchmarks Moses and GuacaMol.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2581. SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?
๐ฌ 26
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM uncertainty quantification subjective uncertainty benchmark language model
The common approach to communicate a large language model's (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect indicates even slight deviations, yielding a fine measure of faithfulness between a summary string and an LLM's actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the board, incapable of revealing what they are uncertain about, whether through reasoning, chains-of-thought, or explicit finetuning. However, we do find that LLMs are able to generate faithful summaries of their uncertainties if we help them by sampling multiple outputs and feeding them back into the context. This simple approach points toward a universal way of communicating LLM uncertainties, whose future development the SelfReflect score enables.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2582. Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments
๐ฌ 18
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied AI World model Mixture of expert Test time adaptation Few-shot expansion
Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2583. Diverse Text-to-Image Generation via Contrastive Noise Optimization
๐ฌ 21
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion Models Noise Optimization Diverse Generation
Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images, largely enabled by text-guided inference. However, this advantage often comes with a critical drawback: limited diversity, as outputs tend to collapse into similar modes under strong text guidance. Existing approaches typically optimize intermediate latents or text conditions during inference, but these methods deliver only modest gains or remain sensitive to hyperparameter tuning. In this work, we introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective. Unlike prior techniques that adapt intermediate latents, our approach shapes the initial noise to promote diverse outputs. Specifically, we develop a contrastive loss defined in the Tweedie data space and optimize a batch of noise latents. Our contrastive optimization repels instances within the batch to maximize diversity while keeping them anchored to a reference sample to preserve fidelity. We further provide theoretical insights into the mechanism of this preprocessing to substantiate its effectiveness. Extensive experiments across multiple T2I backbones demonstrate that our approach achieves a superior quality-diversity Pareto frontier while remaining robust to hyperparameter choices.
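A minimal sketch of optimizing a batch of initial noises with a repulsive term in the Tweedie-denoised space and an anchor to a reference sample; the specific loss, weights, and `tweedie_fn` interface are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_noise_loss(x0_hat, x0_ref, repel_weight=1.0, anchor_weight=0.1):
    """Contrastive objective in the Tweedie data space (sketch).

    x0_hat: (B, D) one-step Tweedie estimates obtained from the batch of initial
            noise latents currently being optimized.
    x0_ref: (D,) Tweedie estimate of a fixed reference noise (fidelity anchor).
    Instances repel each other to increase diversity while staying close to the
    reference; the cosine repulsion and squared-error anchor are illustrative.
    """
    z = F.normalize(x0_hat, dim=-1)
    sim = z @ z.t()                                        # pairwise cosine similarity
    off_diag = sim - torch.diag(torch.diag(sim))
    repel = off_diag.sum() / (z.shape[0] * (z.shape[0] - 1))
    anchor = ((x0_hat - x0_ref.unsqueeze(0)) ** 2).mean()
    return repel_weight * repel + anchor_weight * anchor

def optimize_initial_noise(tweedie_fn, noise, x0_ref, steps=20, lr=0.05):
    """Gradient-descend on the batch of initial noises before running full sampling."""
    noise = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        loss = contrastive_noise_loss(tweedie_fn(noise), x0_ref)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```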
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2584. Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial Robustness Differential Attention Lipschitz Continuity Adversarial Attack
Differential Attention (DA) has been proposed as a refinement to standard attention, suppressing redundant or noisy context through a subtractive structure and thereby reducing contextual hallucination. While this design sharpens task-relevant focus, we show that it also introduces a structural fragility under adversarial perturbations. Our theoretical analysis identifies negative gradient alignment, a configuration encouraged by DA's subtraction, as the key driver of sensitivity amplification, leading to increased gradient norms and elevated local Lipschitz constants. We empirically validate this Fragile Principle through systematic experiments on ViT/DiffViT and evaluations of pretrained CLIP/DiffCLIP, spanning five datasets in total. These results demonstrate higher attack success rates, frequent gradient opposition, and stronger local sensitivity compared to standard attention. Furthermore, depth-dependent experiments reveal a robustness crossover: stacking DA layers attenuates small perturbations via depth-dependent noise cancellation, though this protection fades under larger attack budgets. Overall, our findings uncover a fundamental trade-off: DA improves discriminative focus on clean inputs but increases adversarial vulnerability, underscoring the need to jointly design for selectivity and robustness in future attention mechanisms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2585. VADv2: End-to-End Autonomous Driving via Probabilistic Planning
๐ฌ 5
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autonomous Driving End-to-end Probabilistic Planning Closed-Loop Planning Vocabulary
Learning a human-like driving policy from large-scale driving demonstrations is promising, but the uncertainty and non-deterministic nature of planning make it challenging. Existing learning-based planning methods follow a deterministic paradigm to directly regress the action, failing to cope with the uncertainty problem. In this work, we propose a probabilistic planning model for end-to-end autonomous driving, termed VADv2. We resort to a probabilistic field function to model the mapping from the action space to the probabilistic distribution. Since the planning action space is a high-dimensional continuous spatiotemporal space and hard to tackle, we first discretize the planning action space to a large planning vocabulary and then tokenize the planning vocabulary into planning tokens. Planning tokens interact with scene tokens and output the probabilistic distribution of action. Mass driving demonstrations are leveraged to supervise the distribution. VADv2 achieves state-of-the-art closed-loop performance on the CARLA Town05 benchmark, significantly outperforming all existing methods. We also provide comprehensive evaluations on the NAVSIM dataset and a large-scale 3DGS-based benchmark, demonstrating its effectiveness in real-world applications. Code will be released to facilitate future research.
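To illustrate the probabilistic-planning idea, the sketch below scores a fixed planning vocabulary of candidate trajectories against scene features and returns a categorical distribution over planning tokens; the discretization and fusion details are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PlanningVocabularyHead(nn.Module):
    """Score a fixed planning vocabulary against fused scene features (sketch).

    vocab_trajs: (V, T, 2) candidate future trajectories obtained offline by
    discretizing the continuous action space (the discretization is assumed).
    forward() returns a categorical distribution over the V planning tokens.
    """
    def __init__(self, vocab_trajs, scene_dim, hidden=256):
        super().__init__()
        V, T, _ = vocab_trajs.shape
        self.register_buffer("vocab_trajs", vocab_trajs)
        self.traj_encoder = nn.Sequential(
            nn.Linear(T * 2, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.scene_proj = nn.Linear(scene_dim, hidden)

    def forward(self, scene_feat):                          # scene_feat: (B, scene_dim)
        tok = self.traj_encoder(self.vocab_trajs.flatten(1))    # (V, hidden)
        q = self.scene_proj(scene_feat)                         # (B, hidden)
        logits = q @ tok.t() / tok.shape[-1] ** 0.5             # (B, V) scaled dot product
        return torch.distributions.Categorical(logits=logits)
```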
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2586. Cartridges: Lightweight and general-purpose long context representations via self-study
๐ฌ 10
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ test-time training fine-tuning
Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-10M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2587. UrbanGraph: Physics-Informed Spatio-Temporal Dynamic Heterogeneous Graphs for Urban Microclimate Prediction
๐ฌ 11
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Spatio-Temporal Graph Heterogeneous Graph Dynamic Graph Physics-Informed ML Urban Microclimate
With rapid urbanization, predicting urban microclimates has become critical, as it affects building energy demand and public health risks. However, existing generative and homogeneous graph approaches fall short in capturing physical consistency, spatial dependencies, and temporal variability. To address this, we introduce UrbanGraph, a framework founded on a novel structure-based inductive bias. Unlike implicit graph learning, UrbanGraph transforms physical first principles into a dynamic causal topology, explicitly encoding time-varying causalities (e.g., shading and convection) directly into the graph structure to ensure physical consistency and data efficiency. Results show that UrbanGraph achieves state-of-the-art performance, outperforming all baselines. Specifically, the use of explicit causal pruning significantly reduces the model's floating-point operations (FLOPs) by 73.8\% and increases training speed by 21\% compared to implicit graphs. Our contributions include the first high-resolution benchmark for spatio-temporal microclimate modeling and a generalizable explicit topological encoding paradigm applicable to urban spatio-temporal dynamics governed by known physical equations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2588. Depth Anything with Any Prior
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Depth Estimation
This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
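A rough sketch of pixel-level metric alignment with distance-aware weighting: fit a global scale and shift from the relative prediction to the sparse metric prior, then blend the nearest prior value with the aligned prediction using a weight that decays with distance. The Gaussian weighting and global affine fit are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def align_and_prefill(pred_rel, prior_metric, prior_mask, sigma=10.0):
    """Pixel-level metric alignment + distance-aware pre-filling (sketch).

    pred_rel:     (H, W) relative depth prediction from an MDE model.
    prior_metric: (H, W) sparse but precise metric depth.
    prior_mask:   (H, W) bool, True where the metric prior is valid.
    1) Fit a global scale/shift mapping the prediction onto the metric prior.
    2) Pre-fill every pixel by blending the nearest prior value with the aligned
       prediction, weighted by distance to that prior pixel.
    """
    p, m = pred_rel[prior_mask], prior_metric[prior_mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, m, rcond=None)
    aligned = scale * pred_rel + shift

    dist, (iy, ix) = distance_transform_edt(~prior_mask, return_indices=True)
    nearest_prior = prior_metric[iy, ix]            # value of the closest valid prior pixel
    w = np.exp(-(dist / sigma) ** 2)                # distance-aware confidence in the prior
    return w * nearest_prior + (1.0 - w) * aligned
```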
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2589. ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Models
๐ฌ 4
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language Models visual perception self-evolution reinforcement learning
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7\% on seven comprehensive benchmarks spanning various tasks and up to 6.0\% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2590. mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules
๐ฌ 21
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ molecule-language multimodality language model molecule tokenization molecule generation
Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that comprises a bilingual language model that understands both natural language descriptions of functions and molecular blocks. mCLM front-loads synthesizability considerations while improving the predicted functions of molecules in a principled manner. Experiments on 430 FDA-approved drugs showed that mCLM is capable of significantly improving chemical functions critical to determining drug potentials. mCLM, with only 3B parameters, also achieves improvements in synthetic accessibility relative to 7 other leading generative AI methods including GPT-5. When tested on 122 out-of-distribution medicines using only building blocks/tokens that are compatible with automated modular synthesis, mCLM outperforms all baselines in property scores and synthetic accessibility. mCLM can also reason on multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials ("fallen angels").
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2591. PonderLM: Pretraining Language Models to Ponder in Continuous Space
๐ฌ 9
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ language modeling pondering language models pretraining continuous embedding space
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, our PonderPythia models demonstrate remarkable effectiveness: PonderPythia-2.8B surpasses Pythia-6.9B and rivals Pythia-12B, while our PonderPythia-1B matches TinyLlama-1.1B, a model trained on 10 times more data.
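A minimal sketch of one pondering step: instead of sampling a token, the probability-weighted average of token embeddings is fed back as the next input. The `step_fn`/`lm_head` interfaces and the number of pondering passes are illustrative assumptions.

```python
import torch

def ponder_step(lm_head_logits, embedding_matrix):
    """One pondering step: return a soft input embedding instead of a hard token.

    lm_head_logits:   (B, V) next-token logits at the current position.
    embedding_matrix: (V, D) input token embeddings of the model.
    The returned (B, D) vector is the expectation of token embeddings under the
    predicted distribution, fed back as the next forward pass's input.
    """
    probs = torch.softmax(lm_head_logits, dim=-1)
    return probs @ embedding_matrix

def generate_with_pondering(step_fn, lm_head, embedding_matrix, input_emb, num_ponder=2):
    """Run `num_ponder` pondering passes before emitting logits for the real token.

    step_fn(input_emb) -> hidden state at the next position (a stand-in for one
    forward pass of the LM on the appended soft embedding).
    """
    hidden = step_fn(input_emb)
    for _ in range(num_ponder):
        soft_emb = ponder_step(lm_head(hidden), embedding_matrix)
        hidden = step_fn(soft_emb)
    return lm_head(hidden)            # final logits used to sample the actual token
```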
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2592. ICYM2I: The illusion of multimodal informativeness under missingness
๐ฌ 17
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Multimodal learning Distribution shifts Missingness
Multimodal learning is of continued interest in artificial intelligence-based applications, motivated by the potential information gain from combining different types of data. However, modalities observed in the source environment may differ from the modalities observed in the target environment due to multiple factors, including cost, hardware failure, or the perceived *informativeness* of a given modality. This shift in missingness between the source and target environment has not been carefully studied.
Naive estimation of the information gain associated with including an additional modality without accounting for missingness may result in improper estimates of that modality's value in the target environment.
We formalize the problem of missingness, demonstrate its ubiquity, and show that the subsequent distribution shift results in bias when the missingness process is not explicitly accounted for. To address this issue, we introduce ICYM2I (In Case You Multimodal Missed It), a framework for the evaluation of predictive performance and information gain under missingness through inverse probability weighting-based correction.
We demonstrate the importance of the proposed adjustment to estimate information gain under missingness on synthetic, semi-synthetic, and real-world datasets.
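A simple sketch of the inverse-probability-weighting correction: fit a missingness (propensity) model on always-available covariates and reweight per-sample scores by the inverse observation probability. The propensity model and clipping threshold are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_corrected_score(covariates, observed, per_sample_scores, clip=0.05):
    """Inverse-probability-weighted mean of per-sample scores (sketch).

    covariates:        (n, d) features available for every sample.
    observed:          (n,) bool, True where the extra modality was observed.
    per_sample_scores: (n,) scores (e.g., per-sample gain from the extra modality)
                       computed only on observed samples; unobserved entries ignored.
    """
    prop_model = LogisticRegression(max_iter=1000).fit(covariates, observed.astype(int))
    p_obs = np.clip(prop_model.predict_proba(covariates)[:, 1], clip, 1.0)
    w = observed / p_obs                                  # zero weight for unobserved samples
    return float(np.sum(w * np.nan_to_num(per_sample_scores)) / np.sum(w))
```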
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2593. VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
๐ฌ 26
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language Models Code Generation Model Merging Dataset Construction Multimodal Code Generation
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills.
To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts.
Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
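Task-vector merging in its generic form adds a scaled parameter difference between the expert and the shared base into the target model; which modules are merged and the scaling factor below are assumptions for illustration.

```python
def task_vector_merge(base_state, expert_state, target_state, alpha=0.5):
    """Merge an expert's task vector into a target model (sketch).

    base_state:   parameter dict of the common base LLM.
    expert_state: parameter dict of the coding-specialized LLM (same architecture).
    target_state: parameter dict of the vision-language model's language tower.
    For every shared parameter: target <- target + alpha * (expert - base).
    Works with numpy arrays or torch tensors alike; alpha is an illustrative choice.
    """
    merged = {}
    for name, w in target_state.items():
        if name in base_state and name in expert_state and w.shape == base_state[name].shape:
            merged[name] = w + alpha * (expert_state[name] - base_state[name])
        else:
            merged[name] = w            # vision-specific parameters are left untouched
    return merged
```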
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2594. TS$^2$: Training with Sparsemax+, Testing with Softmax for Accurate and Diverse LLM Fine-Tuning
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Supervised Fine-Tuning Output Diversity
Large Language Models (LLMs) typically rely on Supervised Fine-Tuning (SFT) with Cross-Entropy (CE) loss to specialize in downstream tasks. However, CE forces the distribution toward one-hot targets and ignores alternative continuations, thereby limiting output diversity, a key drawback for generative applications that rely on sampling-based exploration.
In this paper, we propose "Training with Sparsemax+, Testing with Softmax (TS$^2$)". Intuitively, sparsemax and its tailored loss mask the gradients of probabilities outside the support set, leaving excessive probability mass on irrelevant tail classes when evaluating with softmax. To address this issue, we propose an improved variant, Sparsemax+, for training, which augments the sparsemax loss with a suppression term that penalizes the out-of-support probabilities. At testing, we decode with softmax, yielding calibrated, non-degenerate probabilities where plausible near-ties survive.
We fine-tuned Llama-3.1-8B and Qwen-2.5-7B with TS$^2$, achieving consistent improvements in accuracy and output diversity across chat, code, and open-domain benchmarks. Together, these results demonstrate that TS$^2$ provides a practical, drop-in solution for fine-tuning LLMs that are both more accurate and more creative.
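A sketch of the training objective: the standard sparsemax projection and loss, plus a suppression term (an assumed form) that penalizes softmax mass falling outside the sparsemax support, so that softmax decoding at test time does not leave excessive mass on irrelevant tails.

```python
import torch

def sparsemax(logits):
    """Sparsemax projection of logits onto the simplex (Martins & Astudillo, 2016)."""
    z, _ = torch.sort(logits, dim=-1, descending=True)
    k = torch.arange(1, logits.shape[-1] + 1, device=logits.device, dtype=logits.dtype)
    cumsum = z.cumsum(dim=-1)
    support = (1 + k * z) > cumsum                        # sorted entries kept in the support
    k_z = support.sum(dim=-1, keepdim=True).clamp(min=1)
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z          # threshold per row
    return torch.clamp(logits - tau, min=0.0)

def sparsemax_plus_loss(logits, target, suppress_weight=0.1):
    """Sparsemax loss + suppression of softmax mass outside the sparsemax support.

    The suppression term is an assumption for this sketch: it penalizes the softmax
    probability assigned to classes that sparsemax zeroes out.
    """
    p = sparsemax(logits)
    support = p > 0
    # Sparsemax loss: -z_target + 0.5 * sum_{j in support} (z_j^2 - tau^2) + 0.5,
    # written via (z_j - p_j) = tau on the support.
    sp_loss = (-logits.gather(-1, target.unsqueeze(-1)).squeeze(-1)
               + 0.5 * ((logits ** 2 - (logits - p) ** 2) * support).sum(-1) + 0.5)
    soft = torch.softmax(logits, dim=-1)
    suppression = (soft * (~support)).sum(-1)             # out-of-support softmax mass
    return (sp_loss + suppress_weight * suppression).mean()
```

At test time, decoding simply uses `torch.softmax(logits, dim=-1)` as usual, which is the "Testing with Softmax" half of the recipe.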
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2595. InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
๐ฌ 18
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Context Management Efficient Reasoning
Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2596. Understanding the Dynamics of Forgetting and Generalization in Continual Learning via the Neural Tangent Kernel
๐ฌ 12
โญ 5.50
๐ learning theory
๐ท๏ธ Forgetting Generalization Continual Learning Neural Tangent Kernel
Continual learning (CL) enables models to acquire new tasks sequentially while retaining previously learned knowledge.
However, most theoretical analyses focus on simplified, converged models or restrictive data distributions and therefore fail to capture how forgetting and generalization evolve during training in more general settings.
Current theory faces two fundamental challenges: (i) analyses confined to the converged regime cannot characterize intermediate training dynamics; and (ii) establishing forgetting bounds requires two-sided bounds on the population risk for each task.
To address these challenges, we analyze the training-time dynamics of forgetting and generalization in standard CL within the Neural Tangent Kernel (NTK) regime, showing that decreasing the loss's Lipschitz constant and minimizing the cross-task kernel jointly reduce forgetting and improve generalization.
Specifically, we (i) characterize intermediate training stages via kernel gradient flow and (ii) employ Rademacher complexity to derive both upper and lower bounds on population risk.
Building on these insights, we propose \emph{OGD+}, which projects the current task's gradient onto the orthogonal complement of the subspace spanned by gradients of the most recent task evaluated on all prior samples.
We further introduce \emph{Orthogonal Penalized Gradient Descent} (OPGD), which augments OGD+ with gradient-norm penalization to jointly reduce forgetting and enhance generalization.
Experiments on multiple benchmarks corroborate our theoretical predictions and demonstrate the effectiveness of OPGD, providing a principled pathway from theory to algorithm design in CL.
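The projection step of OGD+ can be sketched directly: build an orthonormal basis of the stored gradients and remove that component from the current task's gradient. The memory layout and flattening are illustrative, and the OPGD norm penalty is omitted here.

```python
import numpy as np

def orthogonal_project(grad, stored_grads):
    """Project `grad` onto the orthogonal complement of span(stored_grads).

    grad:         (d,) flattened gradient of the current task.
    stored_grads: list of (d,) gradients of the previous task on prior samples.
    An orthonormal basis is built with QR; removing that component avoids
    interfering with the stored directions.
    """
    if not stored_grads:
        return grad
    G = np.stack(stored_grads, axis=1)                  # (d, m) gradient memory
    Q, _ = np.linalg.qr(G)                              # orthonormal basis of the span
    return grad - Q @ (Q.T @ grad)

def ogd_plus_step(params, grad, stored_grads, lr=0.1):
    """One OGD+-style update: descend along the interference-free projected gradient."""
    return params - lr * orthogonal_project(grad, stored_grads)
```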
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2597. Hybrid Training for Vision-Language-Action Models
๐ฌ 15
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ vision-language-action models; chain-of-thought; robotic manipulation
Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs).
As these techniques increase the length of the model's generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions.
However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2598. K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Visual generation evaluation preference alignment corrected VLM-as-a-Judge efficient evaluation
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language models (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach leads to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When a new model is evaluated, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provides the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability. The dataset and code are publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2599. Numerion: A Multi-Hypercomplex Model for Time Series Forecasting
๐ฌ 13
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting Hypercomplex Numbers Hypercomplex Time Series Models Multi-Hypercomplex Space
Many methods aim to enhance time series forecasting by decomposing the series through intricate model structures and prior knowledge, yet they are inevitably limited by computational complexity and the robustness of the assumptions. Our research uncovers that in the complex domain and higher-order hypercomplex spaces, the characteristic frequencies of time series naturally decrease. Leveraging this insight, we propose Numerion, a time series forecasting model based on multiple hypercomplex spaces. Specifically, grounded in theoretical support, we generalize linear layers and activation functions to hypercomplex spaces of arbitrary power-of-two dimensions and introduce a novel Real-Hypercomplex-Real Domain Multi-Layer Perceptron (RHR-MLP) architecture. Numerion utilizes multiple RHR-MLPs to map time series into hypercomplex spaces of varying dimensions, naturally decomposing and independently modeling the series, and adaptively fuses the latent patterns exhibited in different spaces through a dynamic fusion mechanism. Experiments validate the model's performance, achieving state-of-the-art results on multiple public datasets. Visualizations and quantitative analyses comprehensively demonstrate the ability of multi-dimensional RHR-MLPs to naturally decompose time series and reveal the tendency of higher-dimensional hypercomplex spaces to capture lower-frequency features.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2600. Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
๐ฌ 18
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Misbehavior Detection Large Vision-Language Model Evidential Theory Uncertainty
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful contents, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as \emph{misbehaviors} of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass.
We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2601. On Discovering Algorithms for Adversarial Imitation Learning
๐ฌ 14
โญ 5.50
๐ reinforcement learning
๐ท๏ธ imitation learning algorithm discovery llms evolutionary algorithms
Adversarial Imitation Learning (AIL) methods, while effective in settings with limited expert demonstrations, are often considered unstable. These approaches typically decompose into two components: Density Ratio (DR) estimation $\frac{\rho_E}{\rho_{\pi}}$, where a discriminator estimates the relative occupancy of state-action pairs under the policy versus the expert; and Reward Assignment (RA), where this ratio is transformed into a reward signal used to train the policy. While significant research has focused on improving density estimation, the role of reward assignment in influencing training dynamics and final policy performance has been largely overlooked. RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity. In this work, we take a different approach: we investigate the discovery of data-driven RA functions, i.e., based directly on the performance of the resulting imitation policy. To this end, we leverage an LLM-guided evolutionary framework that efficiently explores the space of RA functions, yielding _Discovered Adversarial Imitation Learning_ (DAIL), the first meta-learnt AIL algorithm. Remarkably, DAIL generalises across unseen environments and policy optimization algorithms, outperforming the current state-of-the-art _human-designed_ baselines. Finally, we analyse why DAIL leads to more stable training, offering novel insights into the role of RA functions in the stability of AIL.
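For reference, the sketch below spells out the DR/RA decomposition described above: a discriminator estimates the density ratio, and a reward-assignment function maps its output to a training reward. The two RA choices shown (GAIL- and AIRL-style) are the classic human-designed baselines; the RA function DAIL actually discovers is not given in the abstract, and all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Estimates the density ratio rho_E / rho_pi from (state, action) pairs."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Sigmoid output D in (0, 1); the density ratio is D / (1 - D).
        return torch.sigmoid(self.net(torch.cat([obs, act], dim=-1)))

def reward_assignment(d, kind="gail", eps=1e-8):
    """Map the discriminator output to a reward (the RA component).

    'gail' and 'airl' are classic human-designed choices; DAIL instead
    meta-learns this mapping, which is not reproduced here.
    """
    if kind == "gail":
        return -torch.log(1.0 - d + eps)                       # encourages expert-like pairs
    if kind == "airl":
        return torch.log(d + eps) - torch.log(1.0 - d + eps)   # log density ratio
    raise ValueError(kind)

# Usage: rewards are fed to any RL policy optimizer (e.g. PPO).
disc = Discriminator(obs_dim=17, act_dim=6)
obs, act = torch.randn(32, 17), torch.randn(32, 6)
rewards = reward_assignment(disc(obs, act), kind="airl")
```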
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2602. SuperF: Neural Implicit Fields for Multi-Image Super-Resolution
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ implicit neural representation neural fields super-resolution test-time optimization learning-free multi-image super-resolution
High-resolution imagery is often hindered by limitations in sensor technology, atmospheric conditions, and costs. Such challenges occur in satellite remote sensing, but also with handheld cameras, such as our smartphones. Hence, super-resolution aims to enhance the image resolution algorithmically. Since single-image super-resolution requires solving an inverse problem, such methods must exploit strong priors, e.g. learned from high-resolution training data, or be constrained by auxiliary data, e.g. by a high-resolution guide from another modality. While qualitatively pleasing, such approaches often lead to "hallucinated" structures that do not match reality. In contrast, multi-image super-resolution (MISR) aims to improve the (optical) resolution by constraining the super-resolution process with multiple views taken with sub-pixel shifts. Here, we propose SuperF, a test-time optimization approach for MISR that leverages coordinate-based neural networks, also called neural fields. Their ability to represent continuous signals with an implicit neural representation (INR) makes them an ideal fit for the MISR task. The key characteristic of our approach is to share an INR for multiple shifted low-resolution frames and to jointly optimize the frame alignment with the INR. Our approach advances related INR baselines, adopted from burst fusion for layer separation, by directly parameterizing the sub-pixel alignment as optimizable affine transformation parameters and by optimizing via a super-sampled coordinate grid that corresponds to the output resolution. Our experiments yield compelling results on simulated bursts of satellite imagery and ground-level images from handheld cameras, with upsampling factors of up to 8. A key advantage of SuperF is that this approach does not rely on any high-resolution training data.
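A minimal sketch of the core idea, assuming a simplified setup: one coordinate MLP (the INR) is shared across all low-resolution frames, and a per-frame 2x3 affine transform is optimized jointly with it. The paper's sub-pixel downsampling model and the super-sampled rendering grid at output resolution are omitted for brevity; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class INR(nn.Module):
    """Coordinate MLP mapping (x, y) -> intensity; shared across all frames."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):                       # coords: (N, 2) in [-1, 1]
        return self.net(coords)

def affine_warp(coords, theta):
    """Apply a per-frame 2x3 affine transform to pixel coordinates."""
    ones = torch.ones(coords.shape[0], 1)
    return torch.cat([coords, ones], dim=-1) @ theta.T   # (N, 2)

# Hypothetical setup: K low-res frames of size (h, w), sub-pixel shifted.
K, h, w = 8, 32, 32
frames = torch.rand(K, h * w, 1)                     # stand-in for real LR frames
ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                        torch.linspace(-1, 1, w), indexing="ij")
lr_coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

inr = INR()
# One 2x3 affine per frame, initialised to identity; optimised jointly with the INR.
thetas = nn.Parameter(torch.eye(2, 3).repeat(K, 1, 1))
opt = torch.optim.Adam(list(inr.parameters()) + [thetas], lr=1e-3)

for step in range(200):
    loss = 0.0
    for k in range(K):
        warped = affine_warp(lr_coords, thetas[k])   # align frame k to the shared field
        loss = loss + ((inr(warped) - frames[k]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```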
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2603. Consistency-Driven Calibration and Matching for Few-Shot Class Incremental Learning
๐ฌ 22
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual learning Few-shot class-incremental learning
Few-Shot Class Incremental Learning (FSCIL) is crucial for adapting to the complex open-world environments. Contemporary prospective learning-based space construction methods struggle to balance old and new knowledge, as prototype bias and rigid structures limit the expressive capacity of the embedding space. Different from these strategies, we rethink the optimization dilemma from the perspective of feature-structure dual consistency, and propose a Consistency-driven Calibration and Matching (ConCM) framework that systematically mitigates the knowledge conflict inherent in FSCIL. Specifically, inspired by hippocampal associative memory, we design a memory-aware prototype calibration that extracts generalized semantic attributes from base classes and reintegrates them into novel classes to enhance the conceptual center consistency of features. Further, to consolidate memory associations, we propose dynamic structure matching, which adaptively aligns the calibrated features to a session-specific optimal manifold space, ensuring cross-session structure consistency. This process requires no class-number priors and is theoretically guaranteed to achieve geometric optimality and maximum matching. On large-scale FSCIL benchmarks including mini-ImageNet, CIFAR100 and CUB200, ConCM achieves state-of-the-art performance, with harmonic accuracy gains of up to 3.41% in incremental sessions. Code is available at: https://anonymous.4open.science/r/ConCM-7385
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2604. Counterfactual Reasoning for Retrieval-Augmented Generation
๐ฌ 20
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Retrieval-Augmented Generation Large Language Models Counterfactual Reasoning
While Retrieval-Augmented Generation (RAG) has advanced knowledge-intensive tasks, we identify a fundamental vulnerability: the Correlation Trap. Existing systems cannot distinguish causally decisive evidence from overwhelmingly correlated yet misleading information, leading to systematic failures. We introduce Counterfactual RAG (CF-RAG), a new framework that operationalizes causal reasoning to overcome this limitation. CF-RAG systematically generates and evaluates counterfactual queries to identify causally relevant distinctions, and employs a parallel arbitration mechanism to reconcile conflicting evidence without interference. On challenging benchmarks, CF-RAG substantially improves robustness against the Correlation Trap, achieving state-of-the-art performance while maintaining comparable efficiency to standard RAG models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2605. PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
๐ฌ 19
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmarking Foundation Models Multimodal Reasoning
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance on open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4\% final answer accuracy. On PuzzleWorld, the best model solves only 14\% of puzzles and reaches 40\% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4\% to 11\%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We will publicly release PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2606. On the Expressiveness of State Space Models via Temporal Logics
๐ฌ 14
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ state space models expressiveness temporal logics
We investigate the expressive power of state space models (SSM), which have recently emerged as a potential alternative to transformer architectures in large language models. Building on recent work, we analyse SSM expressiveness through fragments and extensions of linear temporal logic over finite traces. Our results show that the expressive capabilities of SSM vary substantially depending on the underlying gating mechanism. We further distinguish between SSM operating over fixed-width arithmetic (quantised models), whose expressive power remains within regular languages, and SSM with unbounded precision, which can capture counting properties and non-regular languages. In addition, we provide a systematic comparison between these different SSM variants and known results on transformers, thereby clarifying how the two architectures relate in terms of expressive power.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2607. Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance
๐ฌ 22
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ LLM RLHF Reward Hacking Debias
Reward models (RMs) are crucial in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, always containing preference conflicts and inductive biases, such as response length or speaking style, which can easily lead to reward overfitting and hacking. A few recent RM debiasing methods either target merely a single specific type of preference bias or only address simple linear bias relations such as Pearson coefficients. To mitigate more complicated inductive bias of reward modeling, inspired by the information bottleneck, we introduce a novel information-theoretic debiasing method called **D**ebiasing via **I**nformation optimization for **R**M (DIR). More specifically, our method trains RMs by maximizing the mutual information (MI) between preference prediction and input response pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With the theoretical justification of information theory, DIR can handle different types of bias with more comprehensive non-linear correlations, enlarging its real-world application scenarios. In experiments, we verify the effectiveness of DIR with three types of inductive biases: response length, sycophancy, and format. Based on the numerical results, we discover that DIR can not only effectively diminish target inductive biases but also improve RLHF performances on various benchmarks with better generalization abilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2608. A Brain Graph Foundation Model: Pre-Training and Prompt-Tuning across Broad Atlases and Disorders
๐ฌ 11
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain Graph Foundation Model Functional Magnetic Resonance Imaging (fMRI) Neuroscience Graph Pre-Training Fine-Tuning Prompt Learning
As large language models (LLMs) continue to revolutionize AI research, there is a growing interest in building large-scale brain foundation models to advance neuroscience. While most existing brain foundation models are pre-trained on time-series signals or connectome features, we propose a novel graph-based pre-training paradigm for constructing a brain graph foundation model. In this paper, we introduce the Brain Graph Foundation Model, termed BrainGFM, a unified framework that leverages graph contrastive learning and graph masked autoencoders for large-scale fMRI-based pre-training. BrainGFM is pre-trained on a diverse mixture of brain atlases with varying parcellations, significantly expanding the pre-training corpus and enhancing the model's ability to generalize across heterogeneous fMRI-derived brain representations. To support efficient and versatile downstream transfer, we integrate both graph prompts and language prompts into the model design, enabling BrainGFM to flexibly adapt to a wide range of atlases, neurological and psychiatric disorders, and task settings. Furthermore, we employ meta-learning to optimize the graph prompts, facilitating strong generalization to previously unseen disorders under both few-shot and zero-shot learning conditions via language-guided prompting. BrainGFM is established on 27 neuroimaging datasets spanning 25 common neurological and psychiatric disorders, encompassing 2 types of brain atlases (functional and anatomical) across 8 widely used parcellations, and covering over 25,000 subjects, 60,000 fMRI scans, and a total of 400,000 graph samples aggregated across all atlases and parcellations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2609. Gradient-Based Program Synthesis with Neurally Interpreted Languages
๐ฌ 16
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Meta Learning Neural Program Synthesis Neuro-Symbolic Learning
A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labor-intensive to create and may not transfer to new domains. In contrast, neural networks flexibly learn from data but fail to generalise systematically. We bridge this divide with the Neural Language Interpreter (NLI), an architecture that learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of subsymbolic primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This allows NLI to represent programs that are not bound to a constant number of computation steps, enabling it to solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, enabling the entire model to be trained end-to-end. Crucially, this same differentiability enables powerful test-time adaptation. At inference, NLI's program inductor provides an initial program guess. This guess is then refined via gradient descent through the neural executor, enabling efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks (LPNs) on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.
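The sketch below illustrates the Gumbel-Softmax mechanism the abstract relies on: a program inductor emits per-step logits over a vocabulary of learned primitives, a straight-through Gumbel-Softmax sample selects primitives discretely while keeping gradients, and a neural executor applies the resulting program. The actual NLI architecture, primitive semantics, and adaptation procedure are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 16      # number of learned sub-symbolic primitives (assumed)
STEPS = 6       # program length for this sketch
DIM   = 64      # latent state dimension

class NeuralExecutor(nn.Module):
    """Applies a sequence of (soft) primitive selections to a latent state."""
    def __init__(self):
        super().__init__()
        # One learned transition per primitive; a soft selection mixes them.
        self.primitives = nn.Parameter(torch.randn(VOCAB, DIM, DIM) * 0.02)

    def forward(self, state, program):               # program: (STEPS, VOCAB)
        for step_weights in program:
            op = torch.einsum("v,vij->ij", step_weights, self.primitives)
            state = torch.tanh(state @ op)
        return state

class ProgramInductor(nn.Module):
    """Maps a task embedding to per-step logits over the primitive vocabulary."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(DIM, STEPS * VOCAB)

    def forward(self, task_emb):
        return self.net(task_emb).view(STEPS, VOCAB)

inductor, executor = ProgramInductor(), NeuralExecutor()
task_emb, inp, target = torch.randn(DIM), torch.randn(DIM), torch.randn(DIM)

logits = inductor(task_emb)
# Straight-through Gumbel-Softmax: discrete program forward, gradients backward.
program = F.gumbel_softmax(logits, tau=1.0, hard=True)
loss = F.mse_loss(executor(inp, program), target)
loss.backward()   # trains inductor, executor, and primitives end-to-end

# Test-time adaptation (as in NLI) would further refine the program guess by
# gradient descent through the frozen executor on the given examples.
```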
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2610. CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Feed-Forward 3D; Dynamic Scene Understanding
We present CogniMap3D, a bioinspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes. Our approach maintains a persistent memory bank of static scenes, enabling efficient spatial knowledge storage and rapid retrieval. CogniMap3D integrates three core capabilities: a multi-stage motion cue framework for identifying dynamic objects, a cognitive mapping system for storing, recalling, and updating static scenes across multiple visits, and a factor graph optimization strategy for refining camera poses. Given an image stream, our model identifies dynamic regions through motion cues with depth and camera pose priors, then matches static elements against its memory bank. When revisiting familiar locations, CogniMap3D retrieves stored scenes, relocates cameras, and updates memory with new observations. Evaluations on video depth estimation, camera pose reconstruction, and 3D mapping tasks demonstrate its state-of-the-art performance, while effectively supporting continuous scene understanding across extended sequences and multiple visits.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2611. Composable Sparse Subnetworks via Maximum-Entropy Principle
๐ฌ 21
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Maximum Entropy Principle; Iterative Magnitude Pruning; Circuits; Modular Neural Networks
Neural networks implicitly learn class-specific functional modules. In this work, we ask: Can such modules be isolated and recombined? We introduce a method for training sparse networks that accurately classify only a designated subset of classes while remaining deliberately uncertain on all others, functioning as class-specific subnetworks. A novel KL-divergence-based loss trains only the functional module for the assigned set, and an iterative magnitude pruning procedure removes irrelevant weights. Across multiple datasets (MNIST, FMNIST, tabular data, CIFAR-10) and architectures (MLPs, CNNs, ResNet, VGG), we show that these subnetworks achieve high accuracy on their target classes with minimal leakage to others. When combined via weight summation or logit averaging, these specialized subnetworks act as functional modules of a composite model that often recovers generalist performance. For simpler models and datasets, we experimentally confirm that the resulting modules are mode-connected, which justifies summing their weights. Our approach offers a new pathway toward building modular, composable deep networks with interpretable functional structure.
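A rough sketch of the two ingredients described above, under the assumption that the KL objective pushes off-subset predictions toward the uniform (maximum-entropy) distribution and that pruning is plain magnitude-based; the paper's exact loss and pruning schedule may differ.

```python
import torch
import torch.nn.functional as F

def subnetwork_loss(logits, labels, target_classes):
    """CE on the assigned class subset; push predictions for other inputs toward
    maximum entropy (uniform). `target_classes` is a 1-D LongTensor of class ids."""
    in_subset = torch.isin(labels, target_classes)
    log_probs = F.log_softmax(logits, dim=-1)
    loss = torch.tensor(0.0)
    if in_subset.any():
        loss = loss + F.nll_loss(log_probs[in_subset], labels[in_subset])
    if (~in_subset).any():
        uniform = torch.full_like(log_probs[~in_subset], 1.0 / logits.shape[-1])
        # KL toward the uniform distribution keeps the model deliberately uncertain here.
        loss = loss + F.kl_div(log_probs[~in_subset], uniform, reduction="batchmean")
    return loss

def magnitude_prune_(model, sparsity=0.2):
    """One iterative-magnitude-pruning round: zero the smallest weights in place."""
    for p in model.parameters():
        if p.dim() > 1:
            k = int(p.numel() * sparsity)
            if k > 0:
                thresh = p.abs().flatten().kthvalue(k).values
                p.data[p.abs() <= thresh] = 0.0
```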
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2612. Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation
๐ฌ 17
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human Motion Generation; Two-person Motion Generation
Generating realistic and diverse human-human interactions from text is a crucial yet challenging task in computer vision, graphics, and robotics. Despite recent advances, existing methods have two key limitations. First, two-person interaction synthesis is highly complex, simultaneously requiring individual human motion quality and spatial-temporal sync between the interactants. However, due to their limited scale, the current datasets cannot effectively support learning such a complex task, restricting the model's generalizing capabilities. To address this, we propose a scalable data synthesis framework, InterCompose, which leverages the general knowledge encoded in large language models and the motion priors from strong single-person generators to synthesize high-quality two-person interactions novel to the original data distribution. Second, accurately describing the intricacies of two-person motions often requires text of comparable complexity, and modeling such texts with a single sentence-level vector inevitably causes information loss. For a finer modeling of interaction semantics, we further propose Text2Interact, which features an attention-based word-level conditioning module, improving fine-grained text-motion alignment. Meanwhile, we introduce an adaptive interaction supervision signal that dynamically weighs body parts based on the interaction context, enhancing interaction realism. We conduct extensive experiments to validate the effectiveness of our proposed data synthesis and word-level conditioning pipeline. Compared to state-of-the-art models, our approach significantly enhances motion diversity, text-motion alignment, and motion realism. The code and trained models will be released for reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2613. Neural Multi-Objective Combinatorial Optimization for Flexible Job Shop Scheduling Problems
๐ฌ 20
โญ 5.50
๐ optimization
๐ท๏ธ Neural Multi-Objective Combinatorial Optimization Flexible Job Shop Scheduling Problem Deep Reinforcement Learning
Neural combinatorial optimization (NCO) has made significant advances in applying deep learning techniques to efficiently and effectively solve single-objective flexible job shop scheduling problems (FJSPs). However, the more practical multi-objective FJSPs (MOFJSPs) remain underexplored, limiting the applicability of NCO in multi-criteria decision-making scenarios. In this paper, we propose a decomposition-based NCO method to solve MOFJSPs. We present the dual conditional attention network (DCAN), a neural network architecture that takes the objective preferences along with the problem instance, aiming to learn adaptable policies over the preferences. By decomposing an MOFJSP into a set of subproblems with different preferences, the learned DCAN policies generate a set of solutions that reflect the corresponding trade-offs. We customize the Proximal Policy Optimization algorithm based on decomposition to effectively train the policy network for multiple objectives and define the state and reward based on combinations of different objectives. Extensive results showcase that our approach outperforms traditional multi-objective optimization methods and generalizes well across diverse types of problem instances.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2614. TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
๐ฌ 20
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Tool-Augmented Reasoning Test-Time Scaling Multi-Agent Systems Code Interpreter Search
While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55\% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49\% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
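A schematic of the ensemble loop as described: several agents with different tool-use strategies answer in parallel, each round conditions on the previous round's candidate answers, and refinement halts once confidence is sufficient. Here `llm` is a placeholder callable and answer agreement stands in for the paper's confidence criterion.

```python
from collections import Counter

def tumix_round(llm, agents, question, prev_answers):
    """One refinement round: each agent sees the question plus all previous answers."""
    answers = []
    for agent in agents:   # e.g. {"prompt": "...use the code interpreter...", "tools": [...]}
        context = "\n".join(f"- {a}" for a in prev_answers)
        prompt = (f"{agent['prompt']}\nQuestion: {question}\n"
                  f"Candidate answers from the previous round:\n{context}\nAnswer:")
        answers.append(llm(prompt, tools=agent.get("tools")))
    return answers

def tumix(llm, agents, question, max_rounds=3, stop_agreement=0.8):
    answers = []
    for _ in range(max_rounds):
        answers = tumix_round(llm, agents, question, answers)
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= stop_agreement:   # confident enough: halt early
            return top
    return Counter(answers).most_common(1)[0][0]     # majority vote at the end
```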
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2615. Sample-Efficient Distributionally Robust Multi-Agent Reinforcement Learning via Online Interaction
๐ฌ 17
โญ 5.50
๐ reinforcement learning
๐ท๏ธ distributionally robust multi-agent markov game
Well-trained multi-agent systems can fail when deployed in real-world environments due to model mismatches between the training and deployment environments, caused by environment uncertainties including noise or adversarial attacks. Distributionally Robust Markov Games (DRMGs) enhance system resilience by optimizing for worst-case performance over a defined set of environmental uncertainties. However, current methods are limited by their dependence on simulators or large offline datasets, which are often unavailable. This paper pioneers the study of online learning in DRMGs, where agents learn directly from environmental interactions without prior data. We introduce the Multiplayer Optimistic Robust Nash Value Iteration (MORNAVI) algorithm and provide the first provable guarantees for this setting. Our theoretical analysis demonstrates that the algorithm achieves low regret and efficiently finds the optimal robust policy for uncertainty sets measured by Total Variation divergence and Kullback-Leibler divergence. These results establish a new, practical path toward developing truly robust multi-agent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2616. Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
๐ฌ 17
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ LLM Agents Benchmark Games
Large Language Model (LLM) agents are reshaping the game industry, by enabling more intelligent and human-preferable characters. Yet, current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug-and-play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine-tuning dataset of expert LLM gameplay trajectories spanning multiple genres, turning general LLMs into effective game agents. Orak offers a comprehensive evaluation framework, including game leaderboards, LLM battle arenas, and in-depth analyses of input modality, agentic strategies, and fine-tuning effects, establishing a foundation towards versatile gaming agents. Code is available at https://anonymous.4open.science/r/Orak-5013/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2617. Transfer Paramatters: Optimal per-Module Hyperparameters Across All Scaling Axes
๐ฌ 16
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ muP tensor programs hyperparameter optimization hyperparameter transfer
Hyperparameter tuning can dramatically impact training stability of large-scale models.
Recent works on neural network parameterisations, such as μP, have shown that layer types and sizes should dictate how global hyperparameters are rescaled in order to achieve efficient transfer across model sizes.
On the other hand, the established practice for hyperparameter optimisation search is to look for optimal global base values that apply at some fixed model scale.
We transfer hyperparameters across all scaling axes: width, depth (using an extension of CompleteP; Dey et al., 2025), training horizon, and batch size.
Our study covers all optimisation hyperparameters of modern models: learning rates, Adam parameters, weight decay, initialisation scales, and residual block multipliers.
Lastly, we demonstrate that hyperparameter transfer holds even in the per-layer hyperparameter regime.
We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape, and propose practical guidelines for tackling this optimisation problem.
We suggest a simplified parameterisation of the hyperparameter space that reduces the dimensionality of the search-space at no performance cost.
Our experiments demonstrate training speed improvements when applying transferred hyperparameters to Large Language Models.
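As a concrete point of reference for per-module scaling, the sketch below applies the familiar μP-style width rule (hidden weight matrices get a learning rate scaled by base_width/width, vector-like parameters keep the base rate); the paper's parameterisation additionally covers depth, training horizon, batch size, and other per-module hyperparameters, none of which are reproduced here.

```python
import torch
import torch.nn as nn

def mup_param_groups(model, base_lr, base_width, width):
    """muP-style width scaling sketch: matrix-shaped weights get lr ~ base_width/width
    (relative to a tuned base width); biases and other vector-like params keep base_lr.
    Embedding and output layers are not treated specially in this simplified version."""
    scale = base_width / width
    matrix_params, vector_params = [], []
    for p in model.parameters():
        (matrix_params if p.dim() >= 2 else vector_params).append(p)
    return [
        {"params": matrix_params, "lr": base_lr * scale},
        {"params": vector_params, "lr": base_lr},
    ]

width = 1024
model = nn.Sequential(nn.Linear(width, width), nn.ReLU(), nn.Linear(width, 10))
opt = torch.optim.AdamW(mup_param_groups(model, base_lr=3e-3, base_width=256, width=width))
```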
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2618. TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness
๐ฌ 18
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time-Series Forecasting Module Effectiveness Benchmark
Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TIMERECIPE, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TIMERECIPE conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods and uncover meaningful intuitions linking specific design choices to forecasting scenarios. Furthermore, we release a practical toolkit within TIMERECIPE that recommends suitable model architectures based on these empirical insights.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2619. Sampling-aware Adversarial Attacks Against Large Language Models
๐ฌ 6
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ llms adversarial attacks jailbreak sampling efficiency
To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point greedy generations, overlooking the inherently stochastic nature of LLMs and overestimating robustness. We show that for the goal of eliciting harmful responses, repeated sampling of model outputs during the attack complements prompt optimization and serves as a strong and efficient attack vector. By casting attacks as a resource allocation problem between optimization and sampling, we determine compute-optimal trade-offs and show that integrating sampling into existing attacks boosts success rates by up to 37\% and improves efficiency by up to two orders of magnitude. We further analyze how distributions of output harmfulness evolve during an adversarial attack, discovering that many common optimization strategies have little effect on output harmfulness. Finally, we introduce a label-free proof-of-concept objective based on entropy maximization, demonstrating how our sampling-aware perspective enables new optimization targets. Overall, our findings establish the importance of sampling in attacks to accurately assess and strengthen LLM safety at scale.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2620. DeepFRC: An End-to-End Deep Learning Model for Functional Registration and Classification
๐ฌ 12
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Functional Data Analysis; Deep Learning; Registration; Classification
Functional data, representing curves or trajectories, are ubiquitous in fields like biomedicine and motion analysis. A fundamental challenge is phase variability: temporal misalignments that obscure underlying patterns and degrade model performance. Current methods often address registration (alignment) and classification as separate, sequential tasks. This paper introduces DeepFRC, an end-to-end deep learning framework that jointly learns diffeomorphic warping functions and a classifier within a unified architecture. DeepFRC combines a neural deformation operator for elastic alignment, a spectral representation using Fourier basis for smooth functional embedding, and a class-aware contrastive loss that promotes both intra-class coherence and inter-class separation. We provide the first theoretical guarantees for such a joint model, proving its ability to approximate optimal warpings and establishing a data-dependent generalization bound that formally links registration fidelity to classification performance. Extensive experiments on synthetic and real-world datasets demonstrate that DeepFRC consistently outperforms state-of-the-art methods in both alignment quality and classification accuracy, while ablation studies validate the synergy of its components. DeepFRC also shows notable robustness to noise, missing data, and varying dataset scales. Code is available at https://github.com/Drivergo-93589/DeepFRC.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2621. Distillation of Large Language Models via Concrete Score Matching
๐ฌ 23
โญ 5.50
๐ generative models
๐ท๏ธ Large Language Models Knowledge Distillation Score Matching
Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following, task-specific, and general chat capability distillation using GPT-2-1.5B, OpenLLaMA-7B, and Gemma-7B-IT, Qwen2.5-7B-IT, and Gemma2-9B-IT teachers. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation.
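A toy sketch of the pairwise logit-difference matching idea, in its naive O(V²) form on a small vocabulary with uniform pair weights; the paper's weighting scheme and its remedies for the quadratic cost and training instability are not reproduced. The small check at the end illustrates the shift invariance the abstract emphasises.

```python
import torch

def concrete_score_distillation(student_logits, teacher_logits, weights=None):
    """Match relative logit differences over all vocabulary pairs.

    student_logits, teacher_logits: (batch, vocab). Shift-invariant by design:
    adding a per-position constant to either set of logits leaves the loss unchanged.
    `weights` (vocab, vocab) is the flexible pair weighting; uniform if None.
    """
    s_diff = student_logits.unsqueeze(-1) - student_logits.unsqueeze(-2)  # (B, V, V)
    t_diff = teacher_logits.unsqueeze(-1) - teacher_logits.unsqueeze(-2)
    sq_err = (s_diff - t_diff) ** 2
    if weights is not None:
        sq_err = sq_err * weights
    return sq_err.mean()

# Toy check of shift invariance on a small vocabulary.
s, t = torch.randn(2, 8), torch.randn(2, 8)
assert torch.allclose(concrete_score_distillation(s, t),
                      concrete_score_distillation(s + 3.0, t - 1.0), atol=1e-5)
```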
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2622. Towards Understanding Valuable Preference Data for Large Language Model Alignment
๐ฌ 23
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model alignment preference data influence function
Large language model (LLM) alignment is typically achieved through learning from human preference comparisons, making the quality of preference data critical to its success. Existing studies often pre-process raw training datasets to identify valuable preference pairs using external reward models or off-the-shelf LLMs, achieving improved overall performance but rarely examining whether each individual selected data point is genuinely beneficial. We assess data quality through individual influence on validation data using our newly proposed truncated influence function (TIF), which mitigates the over-scoring present in traditional measures and reveals that preference data quality is inherently a property of the model. In other words, a data pair that benefits one model may harm another. This calls for preference data selection approaches that adapt to specific models. To this end, we introduce a set of candidate scoring functions (SFs) that are computationally simpler than TIF and positively correlated with it. These functions are also model-dependent and can serve as potential indicators of individual data quality for preference data selection. However, we observe that these SFs inherently exhibit errors when compared to TIF, so we combine them to offset their diverse error sources, resulting in a simple yet effective data selection rule that enables the models to achieve a more precise selection of valuable preference data. We conduct experiments across diverse alignment benchmarks and various LLM families, with results demonstrating that better alignment performance can be achieved using less data, showing the generality and robustness of our findings and new methods.
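A heavily simplified sketch of influence-style scoring with truncation: the influence of a training example on the validation loss is approximated by a gradient dot product (the Hessian term is dropped) and then clipped to damp over-scored outliers. The exact definition of TIF on preference pairs and its truncation rule are not given in the abstract, so everything below is an assumption.

```python
import torch

def grad_vector(loss, params):
    """Flatten the gradients of `loss` w.r.t. an iterable of parameter tensors."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def truncated_influence(train_loss, val_loss, params, clip=10.0):
    """First-order influence sketch: dot product of training and validation
    gradients, truncated at `clip` to damp over-scored outliers (assumed rule)."""
    g_train = grad_vector(train_loss, params)
    g_val = grad_vector(val_loss, params)
    score = torch.dot(g_val, g_train)
    return torch.clamp(score, min=-clip, max=clip)
```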
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2623. Rethinking JEPA: Compute-Efficient Video Self-Supervised Learning with Frozen Teachers
๐ฌ 10
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ SALT video SSL video_representation_learning masked_video_modeling MAE JEPA latent_space_ prediction
Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher.
While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures.
We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze it and train a student to predict the teacher's latents on masked regions.
This leads to a two-stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the ability of representations to generalize under frozen evaluation. Empirically, our student models outperform recently proposed V-JEPA 2 encoders under frozen backbone evaluation across diverse benchmarks. They are also more compute-optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V-JEPA's accuracy-FLOPs Pareto frontier.
Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This points to a compute budget allocation that should overwhelmingly favor the student.
These results position SALT as a simple, scalable, and compute-efficient alternative to EMA-based self-distillation for video representation learning.
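The two stages reduce to two ordinary training steps, sketched below under assumed token shapes and architectures (tokenizer, decoder, and predictor heads are placeholders): the teacher is first trained with a pixel-reconstruction loss under masking, then frozen, and the student regresses the frozen teacher's latents on masked tokens.

```python
import torch
import torch.nn.functional as F

def stage1_teacher_step(teacher, decoder, video_tokens, mask, opt):
    """Stage 1: the target encoder learns by reconstructing masked pixels.
    video_tokens: (B, N, D) patch tokens; mask: (B, N) bool, True = masked."""
    latents = teacher(video_tokens * (~mask).unsqueeze(-1))    # encode visible tokens
    recon = decoder(latents)                                    # assumed (B, N, D) output
    loss = F.mse_loss(recon[mask], video_tokens[mask])          # pixel loss on masked tokens
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

def stage2_student_step(student, predictor, frozen_teacher, video_tokens, mask, opt):
    """Stage 2: the frozen teacher provides latent targets; the student predicts
    them for masked regions from the visible context (no EMA, no collapse regulariser)."""
    with torch.no_grad():
        targets = frozen_teacher(video_tokens)                  # full-view latents
    preds = predictor(student(video_tokens * (~mask).unsqueeze(-1)))
    loss = F.mse_loss(preds[mask], targets[mask])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss
```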
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2624. Deploying Models to Non-participating Clients in Federated Learning without Fine-tuning: A Hypernetwork-based Approach
๐ฌ 18
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ hypernetwork federated learning
Federated Learning (FL) has emerged as a promising paradigm for privacy-preserving collaborative learning, yet data heterogeneity remains a critical challenge. While existing methods achieve progress in addressing data heterogeneity for participating clients, they fail to generalize to non-participating clients with in-domain distribution shifts and resource constraints. To mitigate this issue, we present HyperFedZero, a novel method that dynamically generates specialized models via a hypernetwork conditioned on distribution-aware embeddings. Our approach explicitly incorporates distribution-aware inductive biases into the model's forward pass, extracting robust distribution embeddings using a NoisyEmbed-enhanced extractor with a Balancing Penalty, effectively preventing feature collapse. The hypernetwork then leverages these embeddings to generate specialized models chunk-by-chunk for non-participating clients, ensuring adaptability to their unique data distributions. Extensive experiments on multiple datasets and models demonstrate HyperFedZero's remarkable performance, surpassing competing methods consistently with minimal computational, storage, and communication overhead. Moreover, ablation studies and visualizations further validate the necessity of each component, confirming meaningful adaptations and validating the effectiveness of HyperFedZero.
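A minimal sketch of the chunk-by-chunk generation step, assuming the distribution-aware extractor already produced a client embedding: a small hypernetwork, conditioned on that embedding and a learned chunk-index embedding, emits the flattened weights of the client model one chunk at a time. Sizes and module names are illustrative.

```python
import torch
import torch.nn as nn

class ChunkedHypernetwork(nn.Module):
    """Generates a target model's flattened weights chunk-by-chunk,
    conditioned on a client's distribution embedding (sketch)."""
    def __init__(self, embed_dim, target_numel, chunk_size=4096, hidden=256):
        super().__init__()
        self.chunk_size = chunk_size
        self.n_chunks = (target_numel + chunk_size - 1) // chunk_size
        self.target_numel = target_numel
        self.chunk_emb = nn.Embedding(self.n_chunks, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, chunk_size),
        )

    def forward(self, dist_embedding):                   # (embed_dim,)
        chunks = []
        for i in range(self.n_chunks):
            cond = torch.cat([dist_embedding, self.chunk_emb.weight[i]])
            chunks.append(self.mlp(cond))
        return torch.cat(chunks)[: self.target_numel]    # flat weights for the client model

# Usage sketch: a non-participating client sends only a distribution embedding
# (from the NoisyEmbed-style extractor) and receives specialised weights, no fine-tuning.
hyper = ChunkedHypernetwork(embed_dim=64, target_numel=50_000)
client_weights = hyper(torch.randn(64))
```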
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2625. Rescue: Retrieval Augmented Secure Code Generation
๐ฌ 14
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ secure code generation retrieval augmented generation
Despite recent advances, Large Language Models (LLMs) still generate vulnerable code.
Retrieval-Augmented Generation (RAG) has the potential to enhance LLMs for secure code generation by incorporating external security knowledge.
However, the conventional RAG design struggles with the noise of raw security-related documents, and existing retrieval methods overlook the significant security semantics implicitly embedded in task descriptions.
To address these issues, we propose \textsc{Rescue}, a new RAG framework for secure code generation with two key innovations.
First, we propose a hybrid knowledge base construction method that combines LLM-assisted cluster-then-summarize distillation with program slicing, producing both high-level security guidelines and concise, security-focused code examples. Second, we design a hierarchical multi-faceted retrieval strategy that traverses the constructed knowledge base from top to bottom and integrates multiple security-critical facts at each hierarchical level, ensuring comprehensive and accurate retrieval.
We evaluated \textsc{Rescue} on four benchmarks and compared it with five state-of-the-art secure code generation methods on six LLMs. The results demonstrate that \textsc{Rescue} improves the SecurePass@1 metric by an average of 4.8 points, establishing a new state-of-the-art performance for security. Furthermore, we performed in-depth analysis and ablation studies to rigorously validate the effectiveness of individual components in \textsc{Rescue}. Our code is available at \url{https://anonymous.4open.science/r/RESCUE}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2626. EdgeCape: Edge Weight Prediction For Category-Agnostic Pose Estimation
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Category Agnostic Pose Estimation Keypoint Localization Few Shot Learning 2D Pose Estimation
Category-Agnostic Pose Estimation (CAPE) localizes keypoints across diverse object categories with a single model, using one or few annotated support images. Recent works have shown that using a pose-graph (i.e., treating keypoints as nodes in a graph rather than isolated points) helps handle occlusions and break symmetry. However, these methods assume a given pose-graph with equal-weight edges, leading to suboptimal results. We introduce EdgeCape, a novel framework that overcomes these limitations by predicting the graph's edge weights in order to optimize localization. To further leverage structural (i.e., graph) priors, we propose integrating Markov Attention Bias, which modulates the self-attention interaction between nodes based on the number of hops between them. We show that this improves the model's ability to capture global spatial dependencies. Evaluated on the MP-100 benchmark, which includes 100 categories and over 20K images, EdgeCape achieves state-of-the-art results in the 1-shot and 5-shot settings, significantly improving keypoint localization accuracy. Our code will be publicly available.
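A sketch of the Markov Attention Bias idea as described: hop counts between keypoint nodes are computed on the pose graph and mapped to a learned additive bias on the self-attention logits. The exact bias parameterisation in the paper is an assumption here.

```python
import torch
import torch.nn as nn

def hop_distances(adj):
    """All-pairs hop counts on a small keypoint graph (adj: (N, N) 0/1 tensor)."""
    n = adj.shape[0]
    dist = torch.full((n, n), float("inf"))
    dist[adj.bool()] = 1.0
    dist.fill_diagonal_(0.0)
    for k in range(n):                     # Floyd-Warshall is fine at keypoint scale
        dist = torch.minimum(dist, dist[:, k:k+1] + dist[k:k+1, :])
    return dist

class MarkovAttentionBias(nn.Module):
    """Learned scalar bias per hop count, added to self-attention logits so that
    interaction strength depends on graph distance between keypoint nodes."""
    def __init__(self, max_hops=8):
        super().__init__()
        self.bias = nn.Embedding(max_hops + 1, 1)

    def forward(self, attn_logits, adj):   # attn_logits: (heads, N, N)
        hops = hop_distances(adj).clamp(max=self.bias.num_embeddings - 1).long()
        return attn_logits + self.bias(hops).squeeze(-1)   # broadcast over heads
```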
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2627. Gradient-Direction-Aware Density Control for 3D Gaussian Splatting
๐ฌ 22
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Novel View Synthesis 3D Gaussian Splatting Point-based Radiance Field 3D reconstruction
The emergence of 3D Gaussian Splatting (3DGS) has significantly advanced Novel View Synthesis (NVS) through explicit scene representation, enabling real-time photorealistic rendering. However, existing approaches manifest two critical limitations in complex scenarios: (1) Over-reconstruction occurs when persistent large Gaussians cannot meet adaptive splitting thresholds during density control. This is exacerbated by conflicting gradient directions that prevent effective splitting of these Gaussians; (2) Over-densification of Gaussians occurs in regions with aligned gradient aggregation, leading to redundant component proliferation. This redundancy significantly increases memory overhead due to unnecessary data retention. We present Gradient-Direction-Aware Gaussian Splatting (GDAGS) to address these challenges. Our key innovations: the Gradient Coherence Ratio (GCR), computed through normalized gradient vector norms, which explicitly discriminates Gaussians with concordant versus conflicting gradient directions; and a nonlinear dynamic weighting mechanism leverages the GCR to enable gradient-direction-aware density control. Specifically, GDAGS prioritizes conflicting-gradient Gaussians during splitting operations to enhance geometric details while suppressing redundant concordant-direction Gaussians. Conversely, in cloning processes, GDAGS promotes concordant-direction Gaussian densification for structural completion while preventing conflicting-direction Gaussian overpopulation. Comprehensive evaluations across diverse real-world benchmarks demonstrate that GDAGS achieves superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations.
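A sketch of the Gradient Coherence Ratio and the resulting split/clone decision, assuming GCR is the norm of the summed per-view gradients divided by the sum of their norms (values near 1 indicate concordant directions, near 0 conflicting ones); the paper's exact normalisation, nonlinear weighting, and thresholds are not reproduced.

```python
import torch

def gradient_coherence_ratio(per_view_grads, eps=1e-8):
    """per_view_grads: (V, G, 2) screen-space position gradients of G Gaussians
    accumulated over V views. GCR = ||sum of gradients|| / sum of ||gradients||."""
    summed = per_view_grads.sum(dim=0).norm(dim=-1)            # (G,)
    total = per_view_grads.norm(dim=-1).sum(dim=0) + eps       # (G,)
    return summed / total

def density_control(gcr, grad_mag, grad_thresh=2e-4, gcr_thresh=0.5):
    """Gradient-direction-aware decision (assumed thresholds): split
    conflicting-gradient Gaussians, clone concordant ones, leave the rest."""
    high_grad = grad_mag > grad_thresh
    split_mask = high_grad & (gcr < gcr_thresh)    # conflicting directions -> split
    clone_mask = high_grad & (gcr >= gcr_thresh)   # concordant directions -> clone
    return split_mask, clone_mask
```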
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2628. Learning to Lie: Reinforcement Learning Attacks Damage Human-AI Teams and Teams of LLMs
๐ฌ 15
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ adversarial attacks redteaming human-AI teams decision-making social influence influence evolution large language models LLMs model based reinforcement learning
As artificial intelligence (AI) assistants become more widely adopted in safety-critical domains, it becomes important to develop safeguards against potential failures or adversarial attacks. A key prerequisite to developing these safeguards is understanding the ability of these AI assistants to mislead human teammates. We investigate this attack problem within the context of an intellective strategy game where a team of three humans and one AI assistant collaborate to answer a series of trivia questions. Unbeknownst to the humans, the AI assistant is adversarial. Leveraging techniques from Model-Based Reinforcement Learning (MBRL), the AI assistant learns a model of the humans' trust evolution and uses that model to manipulate the group decision-making process to harm the team. We evaluate two models -- one inspired by literature and the other data-driven -- and find that both can effectively harm the human team. Moreover, we find that in this setting while our data-driven model is the most capable of accurately predicting how human agents appraise their teammates given limited information on prior interactions, the model based on principles of cognitive psychology does not lag too far behind. Finally, we compare the performance of state-of-the-art LLM models to human agents on our influence allocation task to evaluate whether the LLMs allocate influence similarly to humans or if they are more robust to our attack. These results enhance our understanding of decision-making dynamics in small human-AI teams and lay the foundation for defense strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2629. InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
๐ฌ 10
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Vision Language Models Spatial Reasoning Visual Question Answering
Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve 12.1% improvement on InternSpatial-Bench and 10.7% on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2630. Aligner, Diagnose Thyself: A Meta-Learning Paradigm for Fusing Intrinsic Feedback in Preference Alignment
๐ฌ 21
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Direct Preference Optimization
The alignment of Large Language Models (LLMs) with human preferences is critically undermined by noisy labels in training datasets.
Existing robust methods often prove insufficient, as they rely on single, narrow heuristics such as perplexity or loss, failing to address the diverse nature of real-world noise.
We challenge this limited-scope approach by introducing a new paradigm where models learn to diagnose thyself, systematically fusing multiple streams of intrinsic feedback for a holistic reliability assessment of each preference pair.
We instantiate this paradigm through a meta-learning methodology that learns to adaptively reweight samples based on a rich diagnostic vector.
This vector captures three complementary perspectives: preference consistency, learning difficulty, and generation confidence.
Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods across various noise conditions.
Crucially, our work provides the first quantitative analysis of these intrinsic diagnostics, revealing that their fusion is essential for overcoming the blind spots inherent in any single heuristic.
This diagnostic-driven paradigm offers a principled path towards developing more robust and trustworthy LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2631. VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
๐ฌ 20
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ video reasoning multimodal large language models
Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce **VideoReasonBench**, a benchmark designed to evaluate **vision-centric, complex video reasoning**. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to get correct final answers for these questions. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning (e.g., GPT-4o achieves only 6.9% accuracy), while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2632. Dynamic Classifier-Free Diffusion Guidance via Online Feedback
๐ฌ 20
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion text-to-image generation classifier-free guidance
Classifier-free guidance (CFG) is a cornerstone of text-to-image diffusion models, yet its effectiveness is limited by the use of static guidance scales. This ``one-size-fits-all'' approach fails to adapt to the diverse requirements of different prompts; moreover, prior solutions like gradient-based correction or fixed heuristic schedules introduce additional complexities and fail to generalize. In this work, we challenge this static paradigm by introducing a framework for dynamic CFG scheduling. Our method leverages online feedback from a suite of general-purpose and specialized small-scale latent-space evaluators, such as CLIP for alignment, a discriminator for fidelity, and a human preference reward model, to assess generation quality at each step of the reverse diffusion process. Based on this feedback, we perform a greedy search to select the optimal CFG scale for each timestep, creating a unique guidance schedule tailored to every prompt and sample. We demonstrate the effectiveness of our approach on both small-scale models and the state-of-the-art Imagen 3, showing significant improvements in text alignment, visual quality, text rendering and numerical reasoning. Notably, when compared against the default Imagen 3 baseline, our method achieves up to 53.8% human preference win-rate for overall preference, a figure that increases to 55.5% on prompts targeting specific capabilities like text rendering. Our work establishes that the optimal guidance schedule is inherently dynamic and prompt-dependent, and provides an efficient and generalizable framework to achieve it.
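A sketch of the greedy per-timestep selection loop described above, with `model`, `scheduler`, and the evaluator callables as placeholders; the scheduler is assumed to expose both the next latent and a one-step denoised preview, which the feedback evaluators score.

```python
import torch

def dynamic_cfg_step(model, scheduler, x_t, t, cond, uncond, scales, evaluators):
    """Greedy per-timestep CFG scale selection (sketch). For each candidate scale,
    form the guided noise estimate, preview the denoised latent, and score it with
    the feedback evaluators; keep the best-scoring scale for this step."""
    eps_c = model(x_t, t, cond)
    eps_u = model(x_t, t, uncond)
    best_x, best_score = None, -float("inf")
    for w in scales:                                    # e.g. [1.0, 3.0, 5.0, 7.5, 10.0]
        eps = eps_u + w * (eps_c - eps_u)               # classifier-free guidance
        x_prev, x0_pred = scheduler.step(eps, t, x_t)   # assumed to return both
        score = sum(ev(x0_pred, cond) for ev in evaluators)   # CLIP, discriminator, ...
        if score > best_score:
            best_score, best_x = score, x_prev
    return best_x
```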
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2633. Similarity-aware Non-Convex Federated Optimization
๐ฌ 19
โญ 5.50
๐ optimization
๐ท๏ธ Client Sampling SAGA Second-order Similarity Composite Gradient Method Variance Reduction
Different federated optimization methods typically employ distinct client-selection strategies: some communicate only with a randomly sampled subset of clients at each round, some predefine a fixed set of clients to contact, and others use a hybrid scheme that combines both. Existing metrics for comparing optimization methods often assume equal communication costs across all strategies, which is rarely the case in practice. In this work, we address the setting where client-selection strategies incur different costs. We first describe this problem and introduce a simple model that quantifies communication and local computation complexities. This new model allows for three commonly used client-selection strategies and explicitly associates each with a distinct cost. Within this setting, we propose a new algorithm that achieves the best-known communication and local complexity among existing methods for non-convex optimization. This algorithm is based on the inexact composite gradient method with gradient estimators constructed using recursive gradient and SAGA. Furthermore, it serves as a framework that can incorporate general unbiased gradient estimators, such as SVRG.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2634. DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
๐ฌ 4
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ llm math reasoning
Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. It further includes three distinct R1 solutions adaptable for diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable and advanced reasoning. Notably, models trained on DeepMath-103K achieve leading results on challenging mathematical benchmarks and demonstrate generalization beyond math to domains such as biology, physics and chemistry, underscoring its broad efficacy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2635. REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
๐ฌ 21
โญ 5.50
๐ reinforcement learning
๐ท๏ธ online reinforcement learning large reasoning model efficient reasoning overthinking reflection
Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often face the challenge of *overthinking*, leading to substantially higher inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn from, but are inefficient for online usage due to time-consuming data generation and filtering. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but it tends to lose reflection ability and harm performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. In addition, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 36\% without compromising performance. Further analysis demonstrates that our methods are effective in maintaining reflection frequency for hard problems while appropriately reducing it for easier ones, without losing reflection ability. Code is available at https://anonymous.4open.science/r/REA-RL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2636. FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
๐ฌ 19
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ large vision-language model multi-step reasoning multi-image reasoning mapvqa
Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model (LVLM) works on map visual question-answering (VQA) often simplify maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains (e.g., geology, urban planning, and environmental assessment) and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20\% and 37.20\% accuracy, respectively, far below human performance of 84.87\%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2637. EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Image Editing Reward Model Generative Model Evaluation
Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models like GPT-Image-1, Seedream, and Google-Nano-Banana have shown highly promising results. However, open-source models are still lagging. The main bottleneck is the lack of a reliable reward model to scale up high-quality synthetic training data.
To address this critical bottleneck, we built EditReward, trained on our new large-scale human preference dataset of over 200K preference pairs, meticulously annotated by trained experts following a rigorous protocol. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks.
Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our new EditReward-Bench, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset of the existing noisy ShareGPT-4o-Image dataset. Training Step1X-Edit on this selected subset yields significant improvements over training on the full set, demonstrating EditReward's ability to serve as a reward model for scaling up high-quality training data for image editing. EditReward and its training dataset will be released to help the community build more high-quality image editing training datasets and catch up with the frontier models.
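For intuition on how such a reward model is typically trained from preference pairs, here is a minimal Bradley-Terry-style sketch; the `reward_model` interface is hypothetical and the paper's actual objective and architecture may differ.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, instruction_emb, img_chosen, img_rejected):
    """reward_model scores an (instruction, edited image) pair; higher means better."""
    r_chosen = reward_model(instruction_emb, img_chosen)
    r_rejected = reward_model(instruction_emb, img_rejected)
    # Maximize the log-probability that the human-preferred edit scores higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```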
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2638. Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression
๐ฌ 14
โญ 5.50
๐ learning theory
๐ท๏ธ differential privacy endogeneity bi-level gradient descent instrumental variables
We study instrumental variable regression (IVaR) under differential privacy constraints.
Classical IVaR methods (like two-stage least squares regression) rely on solving moment equations that directly use sensitive covariates and instruments, creating significant risks of privacy leakage and posing challenges in designing algorithms that are both statistically efficient and differentially private.
We propose a noisy two-stage gradient descent algorithm that ensures $\rho$-zero-concentrated differential privacy by injecting carefully calibrated noise into the gradient updates.
Our analysis establishes finite-sample convergence rates for the proposed method, showing that the algorithm achieves consistency while preserving privacy.
In particular, we derive precise bounds quantifying the trade-off among optimization, privacy, and sampling error.
To the best of our knowledge, this is the first work to provide both privacy guarantees and provable convergence rates for instrumental variable regression in linear models.
We further validate our theoretical findings with experiments on both synthetic and real datasets, demonstrating that our method offers practical accuracy-privacy trade-offs.
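A minimal sketch of the kind of noisy two-stage gradient descent described above, assuming a linear first stage (instruments to covariates) and a linear second stage on the fitted values; the clipping rule, noise scale, and step sizes are illustrative placeholders rather than the paper's calibrated choices.

```python
import numpy as np

def dp_two_stage_gd(Z, X, Y, steps=500, lr=0.01, sigma=0.1, clip=1.0, rng=None):
    """Z: (n, dz) instruments, X: (n, dx) covariates, Y: (n,) outcomes."""
    rng = rng or np.random.default_rng(0)
    n = Z.shape[0]
    Gamma = np.zeros((Z.shape[1], X.shape[1]))   # first stage: X ~ Z @ Gamma
    beta = np.zeros(X.shape[1])                  # second stage: Y ~ (Z @ Gamma) @ beta

    for _ in range(steps):
        # First-stage least-squares gradient, norm-clipped and perturbed with Gaussian noise.
        g1 = Z.T @ (Z @ Gamma - X) / n
        g1 = g1 / max(1.0, np.linalg.norm(g1) / clip)
        Gamma -= lr * (g1 + sigma * rng.standard_normal(g1.shape))

        # Second-stage gradient on the fitted first-stage predictions, same treatment.
        X_hat = Z @ Gamma
        g2 = X_hat.T @ (X_hat @ beta - Y) / n
        g2 = g2 / max(1.0, np.linalg.norm(g2) / clip)
        beta -= lr * (g2 + sigma * rng.standard_normal(g2.shape))
    return Gamma, beta
```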
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2639. Entropy-Based Block Pruning for Efficient Large Language Models
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Efficiency
As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
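To make the pruning criterion concrete, the sketch below scores each block by a Gaussian entropy proxy of its output and ranks blocks by entropy gain; the exact entropy estimator and pruning rule in the paper may differ.

```python
import torch

def gaussian_entropy_proxy(h):
    """h: (num_tokens, hidden_dim) block output; diagonal-Gaussian entropy proxy."""
    var = h.float().var(dim=0) + 1e-6
    return 0.5 * torch.log(2 * torch.pi * torch.e * var).mean().item()

def rank_blocks_for_pruning(hidden_states):
    """hidden_states: list of per-block outputs; returns (block_idx, entropy_gain) sorted ascending."""
    gains = []
    for i in range(1, len(hidden_states)):
        gain = gaussian_entropy_proxy(hidden_states[i]) - gaussian_entropy_proxy(hidden_states[i - 1])
        gains.append((i, gain))
    return sorted(gains, key=lambda x: x[1])  # blocks adding the least entropy are pruned first
```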
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2640. Robust Adversarial Attacks Against Unknown Disturbance via Inverse Gradient Sample
๐ฌ 16
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial sample Transferable attack
Adversarial attacks have achieved widespread success in various domains, yet existing methods suffer from significant performance degradation when adversarial examples are subjected to even minor disturbances. In this paper, we propose a novel and robust attack called IGSA (**I**nverse **G**radient **S**ample-based **A**ttack), capable of generating adversarial examples that remain effective under diverse unknown disturbances. IGSA employs an iterative two-step framework: (i) inverse gradient sampling, which searches for the most disruptive direction within the neighborhood of adversarial examples, and (ii) disturbance-guided refinement, which updates adversarial examples via gradient descent along the identified disruptive disturbance. Theoretical analysis reveals that IGSA enhances robustness by increasing the likelihood of adversarial examples within the data distribution. Extensive experiments in both white-box and black-box attack scenarios demonstrate that IGSA significantly outperforms state-of-the-art attacks in terms of robustness against various unknown disturbances. Moreover, IGSA exhibits superior performance when attacking adversarially trained defense models. Code is available at https://github.com/nimingck/IGSA.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2641. MaskCO: Masked Generation Drives Effective Representation Learning and Exploiting for Combinatorial Optimization
๐ฌ 19
โญ 5.50
๐ optimization
๐ท๏ธ Neural Combinatorial Optimization Masked Generation
Neural combinatorial optimization (NCO) has long been anchored in paradigms like solution construction or improvement that treat the solution as a monolithic reference, squandering the rich local decision patterns embedded in high-quality solutions. Inspired by self-supervised pretraining breakthroughs in language and vision, where simple yet powerful paradigms like next-token prediction enable scalable learning, we ask: \textit{Can combinatorial optimization adopt such a fundamental training paradigm to enable effective and scalable representation learning?} We introduce MaskCO, a masked generation approach that reframes learning to optimize as solution-level self-supervised learning on given reference solutions. By strategically masking portions of optimal solutions and training models to recover the missing content, MaskCO turns a single (instance, solution) pair into hundreds of (instance, partial solution) pairs, encouraging the model to internalize fine-grained, localized decision patterns. For inference, we propose a mask-and-reconstruct procedure that naturally leverages the training objective to implement a local-search-like refinement: each iteration masks certain variables and reconstructs through masked generation, progressively improving the current solution. We also find that the learned representations readily transfer to alternative inference routines and facilitate effective fine-tuning in other models. Experimental results demonstrate that masked generation serves as a universal learning objective across multiple CO problems, redefining how solutions are learned, refined, and scaled. Compared to previous state-of-the-art neural solvers, MaskCO achieves remarkable performance improvements, exceeding 99\% in optimality gap reduction, along with a 10x speedup on the Travelling Salesman Problem (TSP).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2642. Computing Equilibrium beyond Unilateral Deviation
๐ฌ 9
โญ 5.50
๐ learning theory
๐ท๏ธ Strong Equilibrium Complexity No-regret Learning
Most familiar equilibrium concepts, such as Nash and correlated equilibrium, guarantee only that no single player can improve their utility by deviating unilaterally. They offer no guarantees against profitable coordinated deviations by coalitions. Although the literature proposes notions to address multilateral deviations (\emph{e.g.}, strong Nash and coalition-proof equilibrium), these generally fail to exist. In this paper, we study a solution concept that accommodates multi-player deviations and is guaranteed to exist. We prove a fixed-parameter lower bound on the complexity of computing such an equilibrium and present an algorithm that matches this bound.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2643. GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Geometry Spatial Representation Procedural Geometry Reasoning
Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the \texttt{Program-to-Geometry} task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present \textbf{GeoGramBench}, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50\% accuracy at the highest abstraction level. By systematically analyzing model behaviors, our study exposes key limitations in program-driven spatial reasoning and positions GeoGramBench as an important resource for benchmarking and advancing behavioral research in symbolic-to-spatial geometric reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2644. MIAM: Modality Imbalance-Aware Masking for Multimodal Ecological Applications
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodality masking modality imbalance ecology
Multimodal learning is crucial for ecological applications, which rely on heterogeneous data sources (e.g., satellite imagery, environmental time series, tabular predictors, bioacoustics) but often suffer from incomplete data across and within modalities (e.g., unavailable satellite image due to cloud cover, missing records in a time series). While data masking strategies have been used to improve robustness to missing data by exposing models to varying input subsets during training, existing approaches typically rely on static masking and inadequately explore the space of input combinations. As a result, they fail to address modality imbalance, a critical challenge in multimodal learning where dominant modalities hinder the optimization of others. To fill this gap, we introduce Modality Imbalance-Aware Masking (MIAM), a dynamic masking strategy that: (i) explores the full space of input combinations; (ii) prioritizes informative or challenging subsets; and (iii) adaptively increases the masking probability of dominant modalities based on their relative performance and learning dynamics. We evaluate MIAM on two key ecological datasets, GeoPlant and TaxaBench, with diverse modality configurations, and show that MIAM significantly improves robustness and predictive performance over previous masking strategies. In addition, MIAM supports fine-grained contribution analysis across and within modalities, revealing which variables, time segments, or image regions most strongly drive performance.
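A toy sketch of imbalance-aware masking: modalities that currently dominate receive a higher masking probability. The softmax update rule, temperature, and probability bounds are illustrative assumptions, not MIAM's actual schedule.

```python
import numpy as np

def update_mask_probs(per_modality_acc, temperature=0.5, p_min=0.05, p_max=0.8):
    """Stronger (dominant) modalities get masked more often."""
    acc = np.asarray(per_modality_acc, dtype=float)
    logits = (acc - acc.mean()) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()
    # Rescale around an average masking rate of 0.5 and keep probabilities in bounds.
    return np.clip(probs * len(acc) * 0.5, p_min, p_max)

def sample_mask(mask_probs, rng):
    mask = rng.random(len(mask_probs)) < mask_probs
    if mask.all():                                  # always keep at least one modality visible
        mask[rng.integers(len(mask_probs))] = False
    return mask
```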
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2645. AdaSpec: Adaptive Spectrum for Enhanced Node Distinguishability
๐ฌ 14
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ spectral graph neural networks symmetry on graph permutation equivalence node distinguishability
Spectral Graph Neural Networks (GNNs) achieve strong performance in node classification, yet their node distinguishability remains poorly understood. We analyze how graph matrices and node features jointly influence node distinguishability. Further, we derive a theoretical lower bound on the number of distinguishable nodes, which is governed by two key factors: distinct eigenvalues in the graph matrix and nonzero frequency components of node features in the eigenbasis. Based on these insights, we propose AdaSpec, an adaptive graph matrix generation module that enhances node distinguishability of spectral GNNs without increasing the order of computational complexity. We prove that AdaSpec preserves permutation equivariance, ensuring that reordering the graph nodes results in a corresponding reordering of the node embeddings. Experiments across eighteen benchmark datasets validate AdaSpec's effectiveness in improving node distinguishability of spectral GNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2646. Quantized Gradient Projection for Memory-Efficient Continual Learning
๐ฌ 16
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning
Real-world deployment of machine learning models requires the ability to continually learn from non-stationary data while preserving prior knowledge and user privacy. Therefore, storing knowledge acquired from past data in a resource- and privacy-friendly manner is a crucial consideration in determining their viability. We introduce Quantized Gradient Projection Memory (QGPM), a systematic framework for continual learning that compresses and preserves the previous gradient subspace. QGPM integrates three key components: (i) distribution-aware, basis-wise quantization to minimize storage overhead, (ii) a Quantization Error-Aware (QEA) gradient projection that selectively relaxes orthogonality to mitigate gradient drift caused by accumulated quantization noise, and (iii) an on-the-fly sparse sketching strategy that improves runtime memory and computational efficiency. Experiments across multiple benchmarks demonstrate that QGPM achieves state-of-the-art performance under fixed memory budgets, highlighting its effectiveness in scalable, privacy-preserving continual learning.
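A minimal sketch of the core idea of projecting gradients against a quantized stored basis; uniform 8-bit quantization is used here for simplicity, whereas QGPM's scheme is distribution-aware and basis-wise, and the quantization-error-aware relaxation is omitted.

```python
import torch

def quantize_basis(B, bits=8):
    """Uniformly quantize a stored gradient-subspace basis B (d x k) to int8."""
    scale = B.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.clamp((B / scale).round(), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.to(torch.int8), scale

def project_gradient(grad, q_basis, scale):
    """Remove the components of the new task's gradient that lie in the old subspace."""
    B = q_basis.float() * scale          # dequantize the stored basis
    return grad - B @ (B.T @ grad)
```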
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2647. Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization
๐ฌ 14
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Anomaly detection Anomaly localization Multivariate time series Space-time autoregression Transformer
Multivariate time series (MTS) anomaly diagnosis, which encompasses both anomaly detection and localization, is critical for the safety and reliability of complex, large-scale real-world systems. The vast majority of existing anomaly diagnosis methods offer limited theoretical insights, especially for anomaly localization, which is a vital but largely unexplored area. The aim of this contribution is to study the learning process of a Transformer when applied to MTS by revealing connections to statistical time series methods. Based on these theoretical insights, we propose the Attention Low-Rank Transformer (ALoRa-T) model, which applies low-rank regularization to self-attention, and we introduce the Attention Low-Rank score, effectively capturing the temporal characteristics of anomalies. Finally, to enable anomaly localization, we propose the ALoRa-Loc method, a novel approach that associates anomalies with specific variables by quantifying interrelationships among time series. Extensive experiments and real data analysis show that the proposed methodology significantly outperforms state-of-the-art methods in both detection and localization tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2648. The Geometry of Reasoning: Flowing Logics in Representation Space
๐ฌ 19
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Reasoning Theory Interpretability Representation Geometry Formal Logic LLMs
We study how large language models (LLMs) ``think'' through their representation space.
We propose a novel geometric framework that models an LLM's reasoning as flows: embedding trajectories evolving where logic goes.
We disentangle logical structure from semantics by employing the same natural deduction propositions with varied semantic carriers, allowing us to test whether LLMs internalize logic beyond surface form.
This perspective connects reasoning with geometric quantities such as position, velocity, and curvature, enabling formal analysis in representation and concept spaces.
Our theory establishes: (1) LLM reasoning corresponds to smooth flows in representation space, and (2) logical statements act as local controllers of these flows' velocities.
Using learned representation proxies, we design controlled experiments to visualize and quantify reasoning flows, providing empirical validation of our theoretical framework.
Our work provides both a conceptual foundation and practical tools for studying reasoning phenomena, offering a new lens for interpretability and formal analysis of LLMs' behavior.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2649. BoGrape: Bayesian optimization over graphs with shortest-path encoded
๐ฌ 15
โญ 5.50
๐ optimization
๐ท๏ธ Bayesian optimization graph optimization mixed-integer programming shortest-path
Graph-structured data are central to many scientific and industrial applications where the goal is to optimize expensive black-box objectives defined over graph structures or node configurations---as seen in molecular design, supply chains, and sensor placement. Bayesian optimization offers a principled approach for such settings, but existing methods largely focus on functions defined over nodes of a fixed graph. Moreover, graph optimization is often approached heuristically, and it remains unclear how to systematically incorporate structural constraints into BO. To address these gaps, we build on shortest-path graph kernels to develop a principled framework for acquisition optimization over unseen graph structures and associated node attributes. Through a novel formulation based on mixed-integer programming, we enable global exploration of the combinatorial graph domain and explicit embedding of problem-specific constraints. We demonstrate that our method, BoGrape, is competitive both on general synthetic benchmarks and representative molecular design case studies with application-specific constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2650. Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning
๐ฌ 15
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Multi-Agent Reinforcement Learning Value Decomposition Centralized Training with Decentralized Execution Exploration
Value decomposition has been extensively studied as a core approach for cooperative multi-agent reinforcement learning (MARL) under the CTDE paradigm. Despite this progress, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), a framework that successively learns multiple sub-value functions to retain information about alternative high-value actions. By incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly when the optimal action changes. Extensive experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance.
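For illustration, a softmax behavior policy built from several sub-value functions might look like the sketch below; the aggregation over sub-values and the temperature are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def s2q_behavior_policy(sub_q_values, temperature=1.0):
    """sub_q_values: tensor of shape (num_subvalues, num_actions) for one agent/state."""
    q_agg = sub_q_values.mean(dim=0)                 # retain information about alternative high-value actions
    return F.softmax(q_agg / temperature, dim=-1)    # exploratory action distribution

# Example: sample an action from the resulting distribution.
# action = torch.multinomial(s2q_behavior_policy(q_stack), num_samples=1)
```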
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2651. Gistify: Codebase-Level Understanding via Runtime Execution
๐ฌ 19
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ codebase-level understanding runtime code execution coding agent benchmark
As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluations is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2652. Learning-Augmented Moment Estimation on Time-Decay Models
๐ฌ 19
โญ 5.50
๐ learning theory
๐ท๏ธ learning-augmented algorithms time decay sliding window model moment estimation
Motivated by the prevalence and success of machine learning, a line of recent work has studied learning-augmented algorithms in the streaming model. These results have shown that for natural and practical oracles implemented with machine learning models, we can obtain streaming algorithms with improved space efficiency that are otherwise provably impossible. On the other hand, our understanding is much more limited for the sliding window model, which captures applications where recent data is more relevant or where older data must be expunged from the dataset, e.g., under privacy regulations. In this paper, we utilize an oracle for the heavy hitters of datasets to give learning-augmented algorithms for a number of fundamental problems in the sliding window model, such as norm/moment estimation, frequency estimation, cascaded norms, and rectangular moment estimation. We complement our theoretical results with a number of empirical evaluations that demonstrate the practical efficiency of our algorithms on real and synthetic datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2653. Virne: A Comprehensive Benchmark for RL-based Network Resource Allocation in NFV
๐ฌ 17
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Network Resource Allocation Deep Reinforcement Learning Library Benchmark Network Simulation
Resource allocation (RA) is critical to efficient service deployment in Network Function Virtualization (NFV), a transformative networking paradigm. This task is termed NFV-RA. Recently, deep Reinforcement Learning (RL)-based methods have been showing promising potential to address this combinatorial complexity of constrained cross-graph mapping. However, RL-driven NFV-RA research lacks a systematic benchmark for comprehensive simulation and rigorous evaluation. This gap hinders in-depth performance analysis and slows algorithm development for emerging networks, resulting in fragmented assessments. In this paper, we introduce Virne, a comprehensive benchmarking framework designed to accelerate the research and application of deep RL for NFV-RA. Virne provides customizable simulations for diverse network scenarios, including cloud, edge, and 5G environments. It features a modular and extensible implementation pipeline that integrates over 30 methods of various types. Virne also establishes a rigorous evaluation protocol that extends beyond online effectiveness to include practical perspectives such as solvability, generalizability, and scalability. Furthermore, we conduct in-depth analysis through extensive experiments to provide valuable insights into performance trade-offs for efficient implementation and offer actionable guidance for future research directions. Overall, with its capabilities of diverse simulations, rich implementations, and thorough evaluation, Virne could serve as a comprehensive benchmark for advancing NFV-RA methods and deep RL applications. The code and resources are available at https://anonymous.4open.science/r/anonymous-virne.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2654. Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
๐ฌ 12
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Multimodal large language models Modality interactions Multimodal reasoning Model interpretation Attention probing Causal analysis Task composition Modality bias Logic-driven evaluation
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio.
Yet, despite their perceptual strengths, their reasoning ability across modalities remains underexplored, with conflicting reports on whether additional modalities help or harm performance.
These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.
Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance.
In addition, models recognize cross-modal facts reliably and consistently reason effectively over text.
Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. We therefore identify two core failures: a task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and a fusion bottleneck, where early integration introduces bias.
For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck.
Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode.
Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2655. Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion
๐ฌ 13
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Deep Reinforcement Learning Goal-conditioned Reinforcement Learning Object-centric Representations Diffusion Subgoal Generation
We propose a hierarchical entity-centric framework for offline Goal-Conditioned Reinforcement Learning (GCRL) that combines subgoal decomposition with factored structure to solve long-horizon tasks in domains with multiple entities.
Achieving long-horizon goals in complex environments remains a core challenge in Reinforcement Learning (RL). Domains with multiple entities are particularly difficult due to their combinatorial complexity. GCRL facilitates generalization across goals and the use of subgoal structure, but struggles with high-dimensional observations and combinatorial state-spaces, especially under sparse reward. We employ a two-level hierarchy composed of a value-based GCRL agent and a factored subgoal-generating conditional diffusion model. The RL agent and subgoal generator are trained independently and composed post hoc through selective subgoal generation based on the value function, making the approach modular and compatible with existing GCRL algorithms. We introduce new variations to benchmark tasks that highlight the challenges of multi-entity domains, and show that our method consistently boosts performance of the underlying RL agent on image-based long-horizon tasks with sparse rewards, achieving over $150$% higher success rates on the hardest task in our suite and generalizing to increasing horizons and numbers of entities. Rollout videos are provided at: https://sites.google.com/view/hecrl.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2656. Read the Room: Video Social Reasoning with Mental-Physical Causal Chains
๐ฌ 18
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ video question answering social reasoning theory of mind causal chains read the room
``Read the room,'' or the ability to infer others' mental states from subtle social cues, is a hallmark of human social intelligence but remains a major challenge for current AI systems. Existing social reasoning datasets are limited in complexity, scale, and coverage of mental states, falling short of the rich causal dynamics found in real-life interactions. In this work, we introduce $R^3$-Bench, an evaluation benchmark with fine-grained annotations of belief, intent, desire, emotion, and their causal chains in complex scenarios; and $R^3$-FDT, a large-scale training set generated through a novel automated pipeline with the same structure.
We conduct a comprehensive evaluation of state-of-the-art (SOTA) large vision-language models (LVLMs) on $R^3$-Bench, revealing substantial gaps in consistent multi-step social reasoning. We also fine-tune a 7B model using group relative policy optimization (GRPO) on $R^3$-FDT, achieving notable improvements across multiple social reasoning benchmarks.
Our contributions are three-fold: (i) a novel benchmark with richly annotated, multi-step causal reasoning data; (ii) systematic evidence that SOTA LVLMs fall far short of human-level reasoning; (iii) a scalable training dataset that significantly enhances social reasoning performance. We will release our dataset, code and models upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2657. Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmarking NLP datasets Evaluation Methodologies Privacy Geolocation VLM Contextual Integrity
Vision-language models (VLMs) have demonstrated strong performance in image geolocation, a capability further sharpened by frontier multimodal large reasoning models (MLRMs). This poses a significant privacy risk, as these widely accessible models can be exploited to infer sensitive locations from casually shared photos, often at street-level precision, potentially surpassing the level of detail the sharer consented or intended to disclose. While recent work has proposed applying a blanket restriction on geolocation disclosure to combat this risk, these measures fail to distinguish valid geolocation uses from malicious behavior. Instead, VLMs should maintain contextual integrity by reasoning about elements within an image to determine the appropriate level of information disclosure, balancing privacy and utility. To evaluate how well models respect contextual integrity, we introduce VLM-GEOPRIVACY, a benchmark that challenges VLMs to interpret latent social norms and contextual cues in real-world images and determine the appropriate level of location disclosure. Our evaluation of 14 leading VLMs shows that, despite their ability to precisely geolocate images, the models are poorly aligned with human privacy expectations. They often over-disclose in sensitive contexts and are vulnerable to prompt-based attacks. Our results call for new design principles in multimodal systems to incorporate context-conditioned privacy reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2658. Enhanced Generative Model Evaluation with Clipped Density and Coverage
๐ฌ 19
โญ 5.50
๐ generative models
๐ท๏ธ Generative model evaluation Metrics Density Coverage
Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by an inability to reliably evaluate the quality of their generated samples. Quality refers to at least two complementary concepts: fidelity and coverage. Current quality metrics often lack reliable, interpretable values due to an absence of calibration or insufficient robustness to outliers. To address these shortcomings, we introduce two novel metrics: $\textit{Clipped Density}$ and $\textit{Clipped Coverage}$. By clipping individual sample contributions, as well as the radii of nearest neighbor balls for fidelity, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics demonstrate linear score degradation as the proportion of bad samples increases. Thus, they can be straightforwardly interpreted as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that $\textit{Clipped Density}$ and $\textit{Clipped Coverage}$ outperform existing methods in terms of robustness, sensitivity, and interpretability when evaluating generative models.
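A rough sketch of a clipped, nearest-neighbor-based fidelity score in the spirit described above: nearest-neighbor radii on the real set are clipped at a quantile and each generated sample's contribution is capped at 1. The clipping rules and constants are illustrative, not the paper's exact definitions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def clipped_density(real, fake, k=5, radius_quantile=0.9):
    """real, fake: (n, d) feature arrays; returns a score in [0, 1]."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(real)
    radii = nn.kneighbors(real)[0][:, -1]                              # k-NN radius per real sample
    radii = np.minimum(radii, np.quantile(radii, radius_quantile))     # clip overly large radii (outliers)
    dists = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    counts = (dists < radii[None, :]).sum(axis=1) / k                  # per-generated-sample contribution
    return np.clip(counts, 0.0, 1.0).mean()                            # cap each contribution at 1
```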
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2659. ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection
๐ฌ 22
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Tabular Anomaly Detection Anomaly Detection Benchmark Large Language Models
In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domain-specific context. However, existing benchmarks provide only raw data points without semantic context, overlooking rich textual metadata such as feature descriptions and domain knowledge that experts rely on in practice. This limitation restricts research flexibility and prevents models from fully leveraging domain knowledge for detection. ReTabAD addresses this gap by Restoring textual semantics to enable context-aware Tabular AD research. We provide (1) 20 carefully curated tabular datasets enriched with structured textual metadata, together with implementations of state-of-the-art AD algorithmsโincluding classical, deep learning, and LLM-based approachesโand (2) a zero-shot LLM framework that leverages semantic context without task-specific training, establishing a strong baseline for future research. Furthermore, this work provides insights into the role and utility of textual metadata in AD through experiments and analysis. Results show that semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning. These findings establish ReTabAD as a benchmark for systematic exploration of context-aware AD.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2660. Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs
๐ฌ 18
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM; Model Edit; Backdoor Attack
Large language models (LLMs) have shown strong performance across natural language tasks, but remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying parameters to map specific triggers to attacker-desired responses. However, these methods often suffer from safety fallback, where the model initially responds affirmatively but later reverts to refusals due to safety alignment. In this work, we propose DualEdit, a dual-objective model editing framework that jointly promotes affirmative outputs and suppresses refusal responses. To address two key challenges -- balancing the trade-off between affirmative promotion and refusal suppression, and handling the diversity of refusal expressions -- DualEdit introduces two complementary techniques. (1) Dynamic loss weighting calibrates the objective scale based on the pre-edited model to stabilize optimization. (2) Refusal value anchoring compresses the suppression target space by clustering representative refusal value vectors, reducing optimization conflict from overly diverse token sets. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 9.98\% and reduces safety fallback rate by 10.88\% over baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2661. TrojanTO: Action-Level Backdoor Attacks Against Trajectory Optimization Models
๐ฌ 11
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Backdoor attack Decision Transformer Offline RL
Trajectory Optimization (TO) models have achieved remarkable success in offline reinforcement learning (offline RL). However, their vulnerability to backdoor attacks remains largely unexplored. We find that existing backdoor attacks in RL, which typically rely on reward manipulation throughout training, are largely ineffective against TO models due to their inherent sequence modeling nature and large network size. Moreover, the complexities introduced by high-dimensional continuous action further compound the challenge of injecting effective backdoors. To address these gaps, we propose TrojanTO, the first action-level backdoor attack against TO models. TrojanTO is a post-training attack and employs alternating training to forge a strong connection between triggers and target actions, ensuring high attack effectiveness. To maintain attack stealthiness, it utilizes trajectory filtering to preserve the benign performance and batch poisoning for trigger consistency. Extensive evaluations demonstrate that TrojanTO effectively implants backdoors across diverse tasks and attack objectives with a low attack budget (0.3\% of trajectories). Furthermore, TrojanTO exhibits broad applicability to DT, GDT, and DC, underscoring its scalability across diverse TO model architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2662. HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
๐ฌ 17
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ human-centric vision multi-modal large language models evaluation video reasoning
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal understanding, demands that models understand humans in diverse and complex scenarios, as humans manifest intelligence and embody the world. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity in human-centric visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C consist of over 6,000 multiple-choice questions evaluating 34 fine-grained tasks covering 9 essential dimensions. Human-R presents a manually curated challenging video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting implicit context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations on over 30 state-of-the-art models exhibit significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. The analysis of Human-R further exposes a critical failure in reasoning: models struggle to proactively gather necessary visual evidence, instead showing a faulty reliance on query-prompted cues, with advanced techniques offering only marginal gains. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric applications of multimodal models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2663. ATPO: ADAPTIVE TREE POLICY OPTIMIZATION FOR MULTI-TURN MEDICAL DIALOGUE
๐ฌ 28
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning (RL) Large Language Models (LLMs) Medical Dialogue Tree Search
Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism to minimize the number of rollouts, and an asynchronous search architecture that leverages KV cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that our algorithm significantly outperforms several strong baselines, culminating in the Qwen3-8B model surpassing the much larger GPT-4o (+0.92% accuracy).
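The composite uncertainty used to allocate rollouts can be sketched as below, combining an absolute Bellman error with the variance of sampled action values; the weighting coefficients and interfaces are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def state_uncertainty(q_values, reward, next_value, gamma=1.0, alpha=1.0, beta=1.0):
    """q_values: sampled action-value estimates at a tree node, shape (n,)."""
    bellman_error = (reward + gamma * next_value - q_values.mean()).abs()
    value_variance = q_values.var(unbiased=False)
    # States with larger composite uncertainty receive a larger share of the rollout budget.
    return alpha * bellman_error + beta * value_variance
```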
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2664. Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
๐ฌ 12
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ masked diffusion models diffusion language models reinforcement learning GRPO dLLMs
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs.
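A condensed sketch of the sequence-level surrogate: the ELBO stands in for the sequence log-likelihood, the importance ratio is normalized per token, and a GRPO-style clipped objective is applied. Variable names and the exact normalization are simplifications of the paper's formulation.

```python
import torch

def espo_loss(elbo_new, elbo_old, advantages, seq_lens, clip_eps=0.2):
    """elbo_*: per-sequence ELBO estimates; advantages: group-relative advantages; seq_lens: token counts."""
    log_ratio = (elbo_new - elbo_old) / seq_lens          # per-token normalization of the sequence ratio
    ratio = torch.exp(log_ratio)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Clipped surrogate over whole sequences treated as single actions.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```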
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2665. Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning
๐ฌ 20
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM reasoning process reward model reinforcement learning
Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer.
However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome.
Consequently, the reward signal fails to respect temporal causality in sequential reasoning and faces ambiguous credit assignment.
These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance.
In this work, we propose Conditional Reward Modeling (CRM) that frames LLM reasoning as a temporal process leading to a correct answer.
The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, with the link to the outcome allowing precise attribution of each intermediate step, thereby resolving credit assignment ambiguity.
Further, through this consistent probabilistic modeling, the rewards produced by CRM enable more reliable cross-sample comparison.
Experiments across Best-of-N sampling, beam search and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning.
In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2666. Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
๐ฌ 9
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Model; Data Curation
Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, much readily available long-text data does not genuinely require extended context, as most spans can be predicted with only short-range context while only a small fraction truly depends on long-distance dependencies, making it important to identify and select training data with stronger long-context dependencies. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER. Moreover, our analyses further confirm that different types of text segments vary in their reliance on extended context, highlighting which data truly benefits from long-context modeling.
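The information-gain idea can be sketched with a Hugging Face causal LM by comparing target-span loss under a short versus a long context; the context lengths, target-span size, and scoring granularity below are illustrative assumptions, not LongFilter's actual settings.

```python
import torch

def context_gain(model, tokenizer, text, short_ctx=512, tgt_len=256, device="cuda"):
    """Higher gain = the final span genuinely benefits from long-range context."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    tgt_len = min(tgt_len, ids.shape[1] // 4)

    def target_loss(ctx_len):
        window = ids[:, -(ctx_len + tgt_len):]
        labels = window.clone()
        labels[:, :-tgt_len] = -100               # score only the final target span
        with torch.no_grad():
            return model(input_ids=window, labels=labels).loss.item()

    long_ctx = ids.shape[1] - tgt_len             # use all available preceding text as long context
    return target_loss(short_ctx) - target_loss(long_ctx)
```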
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2667. GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection
๐ฌ 19
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreak Detection Graph-Based Features Large Language Models (LLMs) Safety and Robustness in LLMs
Large language models (LLMs) are increasingly deployed in real-world applications but remain highly vulnerable to jailbreak prompts that bypass safety guardrails and elicit harmful outputs. We propose GraphShield, a graph-theoretic jailbreak detector that models information routing inside the LLM as token--layer graphs. Unlike prior defenses that rely on surface cues or costly gradient signals, GraphShield captures network-level dynamics in a lightweight and model-agnostic way by extracting multi-scale structural and semantic features that reveal jailbreak signatures. Extensive experiments on LLaMA-2-7B-Chat and Vicuna-7B-v1.5 show that GraphShield reduces attack success rates to 1.9% and 7.8%, respectively, while keeping refusal rates on benign prompts at 7.1% and 6.8%, significantly improving the robustnessโutility trade-off compared to strong baselines. These results demonstrate that graph-theoretic modeling of network-level dynamics provides a principled and effective framework for robust jailbreak detection in LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2668. Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards
๐ฌ 24
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models (LLMs) Reinforcement Learning from Human Feedback (RLHF) Mixture-of-Experts (MoE) Parameter-Efficient Fine-Tuning (PEFT) Group Relative Policy Optimization (GRPO)
Parameter-efficient Mixture-of-Experts (MoE) architectures, such as LoRA-MoE, enable strong and generalizable fine-tuning. However, a critical problem arises when fine-tuning these architectures with advanced reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO). Traditional supervised techniques are not naturally compatible with the GRPO objective, and naive combinations fail to effectively address routing collapse and the underutilization of MoE adapter parameters. To resolve this disconnect, we introduce Routing-Optimized Group Relative Policy Optimization (RO-GRPO), a mechanism-aware framework. It turns internal expert routing statistics collected during training into a direct reward signal, seamlessly integrating routing supervision into the reinforcement fine-tuning (RFT) process. This enables effective optimization of parameter utilization and improves performance on both unimodal and multimodal mathematical reasoning tasks, all without extra training stages. Our work provides the first demonstration that a scalar reward in GRPO can be engineered from a model's own internal mechanics to explicitly guide its optimization, extending alignment from mere behavior tuning to holistic mechanism alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2669. Cost-Aware Dynamic Tree Construction for Efficient Large Language Model Inference
๐ฌ 4
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Speculative Decoding
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes.
Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2670. Critic-Adviser-Reviser Cyclic Refinement: Towards High-Quality EMR Corpus Generation with LLMs
๐ฌ 31
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Synthetic Data Generation Electronic Medical Record
Electronic medical records (EMRs) are vital for healthcare research, but their use is limited by privacy concerns. Synthetic EMR generation offers a promising alternative, yet most existing methods merely imitate real records without adhering to rigorous clinical quality principles. To address this, we introduce LLM-CARe, a stage-wise cyclic refinement framework that progressively improves EMR quality through three stages, each targeting a specific granularity: corpus, section and document. At each stage, a Critic, an Adviser, and a Reviser collaborate iteratively to evaluate, provide feedback, and refine the drafts. This structured, multi-stage process produces records that better satisfy clinical quality standards. Experiments show that LLM-CARe significantly enhances EMR quality across all levels compared to strong baselines and yields improved performance on real-world clinical tasks such as diagnosis prediction. Unlike prior work, our method requires no real EMR text for training or prompting, demonstrating the effectiveness of stage-wise, cyclic refinement for generating high-quality, privacy-preserving EMR datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2671. RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo
๐ฌ 10
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ out-of-distribution image corruptions optical flow scene flow stereo
Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables public two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observe that robustness varies widely by corruption type, and show experimentally that evaluations on RobustSpring are indicative of real-world robustness. RobustSpring is a new computer vision benchmark that treats robustness as a first-class citizen to foster models that combine accuracy with resilience.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2672. PerSpectra: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments
๐ฌ 12
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ pluralism Argument Benchmark
Pluralism, the capacity to engage with diverse perspectives without collapsing them into a single viewpoint, is critical for developing large language models that faithfully reflect human heterogeneity.
Yet this characteristic has not been carefully examined within the LLM research community and remains absent from most alignment studies.
Debate-oriented sources provide a natural entry point for pluralism research. Previous work builds on online debate sources but remains constrained by costly human validation.
Other debate-rich platforms such as Reddit and Kialo also offer promising material: Reddit provides linguistic diversity and scale but lacks clear argumentative structure, while Kialo supplies explicit pro/con graphs but remains overly concise and detached from natural discourse.
We introduce PERSPECTRA, a pluralist benchmark that integrates the structural clarity of Kialo debate graphs with the linguistic diversity of real Reddit discussions. Using a controlled retrieval-and-expansion pipeline, we construct 3,810 enriched arguments spanning 762 pro/con stances on 100 controversial topics. Each opinion is expanded into multiple naturalistic variants, enabling robust evaluation of pluralism. We instantiate three tasks with PERSPECTRA: opinion counting (identifying distinct viewpoints), opinion matching (aligning supporting stances and discourse to source opinions), and polarity check (inferring aggregate stance in mixed discourse). Experiments with state-of-the-art open-source and proprietary LLMs highlight systematic failures, such as overestimating the number of viewpoints and misclassifying concessive structures, underscoring the difficulty of pluralism-aware understanding and reasoning. By combining diversity with structure, PERSPECTRA establishes the first scalable, configurable benchmark for evaluating how well models represent, distinguish, and reason over multiple perspectives. We release PERSPECTRA as a resource with flexible configurations, enabling the creation of tasks beyond the demo tasks presented in this paper, and fostering progress toward pluralism-sensitive systems that more faithfully capture human heterogeneity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2673. Understanding Task Vectors in In-Context Learning: Emergence, Functionality, and Limitations
๐ฌ 9
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ transformer in-context learning task vector
Task vectors are a compelling mechanism for accelerating inference in in-context learning (ICL) by distilling task-specific information into a single, reusable representation. Despite their empirical success, the underlying principles governing their emergence and functionality remain unclear. This work proposes the *Task Vectors as Representative Demonstrations* conjecture, positing that task vectors encode single in-context demonstrations distilled from the original ones. We provide both theoretical and empirical support for this conjecture. First, we show that task vectors naturally emerge in linear transformers trained on triplet-formatted prompts through loss landscape analysis. Next, we predict the failure of task vectors in representing high-rank mappings and confirm this on practical LLMs. Our findings are further validated through saliency analyses and parameter visualization, suggesting an enhancement that injects multiple task vectors into few-shot prompts. Together, our results advance the understanding of task vectors and shed light on the mechanisms underlying ICL in transformer-based models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2674. CARL: Preserving Causal Structure in Representation Learning
๐ฌ 28
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Structure-preserving Constraints Representation Learning
Cross-modal representation learning is fundamental for extracting structured information from multimodal data to enable semantic understanding and reasoning. However, current methods optimize statistical objectives without explicit causal constraints, where nonlinear mappings can introduce spurious dependencies or eliminate critical mediators, leading to representation-induced structural drift that undermines the reliability of causal inference. Therefore, establishing theoretical guarantees for causal invariance in cross-modal representation learning remains a foundational challenge. To this end, we propose Causal Alignment and Representation Learning (CARL), which explicitly embeds causal structure preservation constraints into cross-modal alignment objectives. Specifically, CARL introduces a multi-consistency loss architecture that jointly optimizes conditional independence preservation and information bottleneck regularization to balance cross-modal compression with critical variable retention, ensuring low-density modalities are not masked by high-density reconstruction demands. We further incorporate monotonic alignment consistency loss to establish correspondence between semantic similarity and representation distance through Spearman correlation, and Markov boundary preservation loss to maintain identifiability conditions including backdoor, frontdoor, and instrumental variable criteria in the shared representation space. In synthetic experiments with known causal ground truth, CARL achieves state-of-the-art performance in preserving conditional independence patterns and maintaining causal query identifiability under structural uncertainty. Real-world validation on Human Phenotype Project data reveals that CARL successfully preserves causal structures between fundus vascular representations and cardiovascular events, demonstrating its capacity for reliable cross-modal causal inference in complex biomedical applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2675. RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmark Large Language Model Evaluation PageRank
Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90\% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
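The bidirectional propagation between model competency and question difficulty can be illustrated with a HITS-style iteration over a binary correctness matrix. This is a hedged sketch of the intuition only; RankLLM's actual update and weighting may differ.

```python
import numpy as np

def propagate_scores(correct, num_iters=100):
    """correct[i, j] = 1 if model i answered question j correctly, else 0."""
    competency = np.ones(correct.shape[0])
    difficulty = np.ones(correct.shape[1])
    for _ in range(num_iters):
        # A question becomes harder when otherwise competent models fail it.
        difficulty = (1 - correct).T @ competency
        difficulty /= np.linalg.norm(difficulty) + 1e-12
        # A model becomes more competent when it solves difficult questions.
        competency = correct @ difficulty
        competency /= np.linalg.norm(competency) + 1e-12
    return competency, difficulty

# Toy example: model 0 solves everything, model 2 only solves the easiest question.
correct = np.array([[1, 1, 1],
                    [1, 1, 0],
                    [1, 0, 0]], dtype=float)
competency, difficulty = propagate_scores(correct)
print("model ranking   :", np.argsort(-competency))
print("question ranking:", np.argsort(-difficulty))
```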
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2676. OSIRIS: Bridging Analog Circuit Design and Machine Learning with Scalable Dataset Generation
๐ฌ 11
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ electronic design automation analog circuits reinforcement learning layout design parasitic-aware dataset generator
The automation of analog integrated circuit (IC) design remains a longstanding challenge, primarily due to the intricate interdependencies among physical layout, parasitic effects, and circuit-level performance. These interactions impose complex constraints that are difficult to accurately capture and optimize using conventional design methodologies. Although recent advances in machine learning (ML) have shown promise in automating specific stages of the analog design flow, the development of holistic, end-to-end frameworks that integrate these stages and iteratively refine layouts using post-layout, parasitic-aware performance feedback is still in its early stages. Furthermore, progress in this direction is hindered by the limited availability of open, high-quality datasets tailored to the analog domain, restricting both the benchmarking and the generalizability of ML-based techniques. To address these limitations, we present OSIRIS, a scalable dataset generation pipeline for analog IC design. OSIRIS systematically explores the design space of analog circuits while producing comprehensive performance metrics and metadata, thereby enabling ML-driven research in electronic design automation (EDA). In addition, we release a dataset consisting of 87,100 circuit variations generated with OSIRIS, accompanied by a reinforcement learning (RL)-based baseline method that exploits OSIRIS for analog design optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2677. Detecting Invariant Manifolds in ReLU-Based RNNs
๐ฌ 15
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Dynamical Systems Recurrent Neural Networks Attractors Nonlinear Dynamics Analysis
Recurrent Neural Networks (RNNs) have found widespread applications in machine learning for time series prediction and dynamical systems reconstruction, and experienced a recent renaissance with improved training algorithms and architectural designs. Understanding why and how trained RNNs produce their behavior is important for scientific and medical applications, and explainable AI more generally. An RNN's dynamical repertoire depends on the topological and geometrical properties of its state space. Stable and unstable manifolds of periodic points play a particularly important role: They dissect a dynamical system's state space into different basins of attraction, and their intersections lead to chaotic dynamics with fractal geometry. Here we introduce a novel algorithm for detecting these manifolds, with a focus on piecewise-linear RNNs (PLRNNs) employing rectified linear units (ReLUs) as their activation function. We demonstrate how the algorithm can be used to trace the boundaries between different basins of attraction, and hence to characterize multistability, a computationally important property. We further show its utility in finding so-called homoclinic points, the intersections between stable and unstable manifolds, and thus establish the existence of chaos in PLRNNs. Finally we show for an empirical example, electrophysiological recordings from a cortical neuron, how insights into the underlying dynamics could be gained through our method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2678. FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multi-turn Image Editing Neurosymbolic Agent Fast-Slow Planning Subroutine Mining Toolpath Optimization Cost-Efficient
We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A* search per subtask to find a cost-efficient toolpath---a sequence of calls to AI tools. To save the cost of A* on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A* search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent ``FaSTA*'': fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A* search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA* is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2679. Learning Ising Models under Hard Constraints using One Sample
๐ฌ 14
โญ 5.50
๐ learning theory
๐ท๏ธ Ising Model Truncated Statistics Pseudo-Likelihood Estimation Parameter Estimation
We consider the problem of estimating the inverse temperature parameter $\beta$ of an $n$-dimensional truncated Ising model using a single sample. Given a graph $G = (V,E)$ with $n$ vertices, a truncated Ising model is a probability distribution over the $n$-dimensional hypercube $\{-1,1\}^n$ where each configuration $\mathbf{\sigma}$ is constrained to lie in a truncation set $S \subseteq \{-1,1\}^n$ and has probability $\Pr(\mathbf{\sigma}) \propto \exp(\beta\mathbf{\sigma}^\top A_G \mathbf{\sigma})$ with $A_G$ being the adjacency matrix of $G$. We adopt the recent setting of [Galanis et al. SODA'24], where the truncation set $S$ can be expressed as the set of satisfying assignments of a $k$-CNF formula. Given a single sample $\mathbf{\sigma}$ from a truncated Ising model, with inverse parameter $\beta^*$, underlying graph $G$ of bounded degree $\Delta$ and $S$ being expressed as the set of satisfying assignments of a $k$-CNF formula, we design in nearly $\mathcal{O}(n)$ time an estimator $\hat{\beta}$ that is $\mathcal{O}(\Delta^3/\sqrt{n})$-consistent with the true parameter $\beta^*$ for $k \gtrsim \log(d^2 k)\,\Delta^3$.
Our estimator is based on maximizing the pseudolikelihood, a notion that has received extensive analysis for various probabilistic models both without truncation [Chatterjee, Annals of Statistics '07] and with truncation [Galanis et al. SODA '24]. Our approach generalizes recent techniques from [Daskalakis et al. STOC '19, Galanis et al. SODA '24] to confront the more challenging setting of the truncated Ising model.
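For orientation, the classical (untruncated) pseudolikelihood underlying this approach replaces the intractable likelihood with a product of per-spin conditionals, each of which has a closed logistic form under the energy $\beta\mathbf{\sigma}^\top A_G \mathbf{\sigma}$ (each edge counted twice in the quadratic form); the truncated variant analyzed in the paper additionally restricts configurations to the set $S$, which this sketch omits:

$$\hat{\beta} \in \arg\max_{\beta} \sum_{v \in V} \log \Pr\left(\sigma_v \mid \sigma_{-v}\right), \qquad \Pr\left(\sigma_v = +1 \mid \sigma_{-v}\right) = \frac{1}{1 + \exp\left(-4\beta \sum_{u : \{u,v\} \in E} \sigma_u\right)}.$$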
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2680. Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling
๐ฌ 14
โญ 5.50
๐ optimization
๐ท๏ธ balanced signed graph spectral denoising graph classification
Samples of brain signals collected by EEG sensors have inherent anti-correlations that are well modeled by negative edges in a finite graph.
To differentiate epilepsy patients from healthy subjects using collected EEG signals, we build lightweight and interpretable transformer-like neural nets by unrolling a spectral denoising algorithm for signals on a balanced signed graph---a graph with no cycle containing an odd number of negative edges.
A balanced signed graph has well-defined frequencies that map to a corresponding positive graph via similarity transform of the graph Laplacian matrices.
We implement an ideal low-pass filter efficiently on the mapped positive graph via Lanczos approximation, where the optimal cutoff frequency is learned from data.
Given that two balanced signed graph denoisers learn posterior probabilities of two different signal classes during training, we evaluate their reconstruction errors for binary classification of EEG signals.
Experiments show that our method achieves classification performance comparable to representative deep learning schemes, while employing dramatically fewer parameters.
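A small numerical illustration of the mapping mentioned above may help: for a balanced signed graph, a diagonal $\pm 1$ polarization matrix $T$ turns every edge weight positive, and $T L T$ is a similarity transform of the signed graph Laplacian, so both graphs share the same frequencies. The 4-node example and the BFS polarization below are illustrative assumptions, not the paper's construction.

```python
import numpy as np
from collections import deque

# Balanced signed graph on 4 nodes: the negative edges separate {0, 1} from {2, 3}.
A = np.array([[ 0,  1, -1,  0],
              [ 1,  0,  0, -1],
              [-1,  0,  0,  1],
              [ 0, -1,  1,  0]], dtype=float)
L = np.diag(np.abs(A).sum(axis=1)) - A           # signed graph Laplacian

# Two-color the nodes by BFS so that sign(A[u, v]) == t[u] * t[v] on every edge.
t = np.zeros(4, dtype=int)
t[0] = 1
queue = deque([0])
while queue:
    u = queue.popleft()
    for v in range(4):
        if A[u, v] != 0 and t[v] == 0:
            t[v] = int(np.sign(A[u, v])) * t[u]
            queue.append(v)

T = np.diag(t.astype(float))
A_pos = T @ A @ T                                # every edge weight becomes positive
L_pos = np.diag(np.abs(A_pos).sum(axis=1)) - A_pos

# Similarity transform => identical eigenvalues (graph frequencies).
print(np.allclose(np.sort(np.linalg.eigvalsh(L)),
                  np.sort(np.linalg.eigvalsh(L_pos))))   # True
print((A_pos >= 0).all())                                 # True
```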
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2681. NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion
๐ฌ 17
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ adversarial attacks denoising diffusion models natural adversarial samples model robustness test-time errors
Adversarial samples exploit irregularities in the manifold "learned" by deep learning models to cause misclassifications. The study of these adversarial samples provides insight into the features a model uses to classify inputs, which can be leveraged to improve robustness against future attacks. However, much of the existing literature focuses on constrained adversarial samples, which do not accurately reflect test-time errors encountered in real-world settings. To address this, we propose `NatADiff', an adversarial sampling scheme that leverages denoising diffusion to generate natural adversarial samples. Our approach is based on the observation that natural adversarial samples frequently contain structural elements from the adversarial class. Deep learning models can exploit these structural elements to shortcut the classification process, rather than learning to genuinely distinguish between classes. To leverage this behavior, we guide the diffusion trajectory towards the intersection of the true and adversarial classes, combining time-travel sampling with augmented classifier guidance to enhance attack transferability while preserving image quality. Our method achieves comparable white-box attack success rates to current state-of-the-art techniques, while exhibiting significantly higher transferability across model architectures and improved alignment with natural test-time errors as measured by FID. These results demonstrate that NatADiff produces adversarial samples that not only transfer more effectively across models, but more faithfully resemble naturally occurring test-time errors when compared with other generative adversarial sampling schemes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2682. Bayesian Post Training Enhancement of Regression Models with Calibrated Rankings
๐ฌ 10
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ post training regression enhancement calibrated rankings
Accurate regression models are essential for scientific discovery, yet high-quality numeric labels are scarce and expensive. In contrast, rankings (especially pairwise) are easier to obtain from domain experts or artificial intelligence (AI) judges. We introduce Bayesian Enhancement with Calibrated Ranking (BAYES-ECR), a novel plug-and-play method that improves a base regressor's prediction for a query by leveraging pairwise rankings between the query and reference items with known labels. BAYES-ECR performs a Bayesian update that combines a Gaussian likelihood from the regressor and the Bradley-Terry likelihood from the ranker. This yields a strictly log-concave posterior with a unique maximum and fast Newton updates. We show that the prior state of the art is a special case of our framework, and we identify a fundamental failure mode: Bradley-Terry likelihoods suffer from scale mismatch and curvature dominance when the number of reference items is large, which can degrade performance. From this analysis, we derive a calibration method to adjust the information originating from the expert rankings. BAYES-ECR shows a stunning 97.65% median improvement across 12 datasets over the previous state-of-the-art method using a realistically accurate ranker, and runs efficiently on a consumer-grade CPU.
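A minimal sketch of the kind of posterior update described above, assuming a fixed Bradley-Terry scale `s` and calibration weight `w` (quantities the paper calibrates rather than fixes): the negative log-posterior over the query's label combines a Gaussian term from the regressor with log-sigmoid terms from the pairwise rankings and is strictly convex, so a few Newton steps suffice.

```python
import numpy as np

def map_label(mu, sigma, refs, query_wins, s=1.0, w=1.0, n_steps=25):
    """MAP estimate of a query label from a regressor prediction plus rankings.

    mu, sigma  : regressor prediction and its standard deviation
    refs       : labels of the reference items the query was compared against
    query_wins : +1 if the ranker said query > reference, -1 otherwise
    s, w       : Bradley-Terry scale and calibration weight (assumed known here)
    """
    refs = np.asarray(refs, dtype=float)
    signs = np.asarray(query_wins, dtype=float)
    y = mu
    for _ in range(n_steps):
        z = signs * s * (y - refs)
        p = 1.0 / (1.0 + np.exp(-z))              # Bradley-Terry win probabilities
        # Gradient and Hessian of the negative log-posterior (strictly convex).
        grad = (y - mu) / sigma**2 - w * np.sum(signs * s * (1.0 - p))
        hess = 1.0 / sigma**2 + w * np.sum((s**2) * p * (1.0 - p))
        y -= grad / hess
    return y

# The regressor predicts 2.0, but the ranker says the query beats items labeled 2.5 and 3.0.
print(round(map_label(2.0, 0.5, refs=[2.5, 3.0], query_wins=[+1, +1]), 3))
```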
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2683. Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
๐ฌ 22
โญ 5.50
๐ optimization
๐ท๏ธ Language Gamification Post-Training Vision–Language Models Self-Play Optimization
Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision–language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs.
To address this challenge, we propose **Vision-Zero**, *a domain-agnostic self-play framework that generates visual deduction games from diverse images for scalable VLM training without human annotations.*
Specifically, Vision-Zero encompasses three main attributes:
(1) **Strategic Self-Play Framework:**
Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation.
(2) **Gameplay from Arbitrary Images:** Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks.
We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images.
(3) **Sustainable Performance Gain:** We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements.
Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods.
Models and code will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2684. Part-level Semantic-guided Contrastive Learning for Fine-grained Visual Classification
๐ฌ 24
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Fine-grained Classification Attention Location Scale Vision-Language Learning Contrastive Learning
Fine-Grained Visual Classification (FGVC) aims to distinguish visually similar subcategories within a broad category, and poses significant challenges due to subtle inter-class differences, large intra-class variations, and data scarcity. Existing methods often struggle to effectively capture both part-level detail and spatial relational features, particularly across rigid and non-rigid object categories. To address these issues, we propose Part-level Semantic-guided Contrastive Learning (PSCL), a novel framework that integrates three key components. (1) The Part Localization Module (PLM) leverages clearCLIP to enable text-controllable region selection, achieving decoupled and semantically guided spatial feature extraction. (2) The Multi-scale Multi-part Branch Progressive Reasoning (MMBPR) module captures discriminative features across multiple parts and scales, while reducing inter-branch redundancy. (3) The Visual-Language Contrastive Learning based on Multi-grained Text Features (VLCL-MG) module introduces intermediate-granularity category concepts to improve feature alignment and inter-class separability. Extensive experiments on five publicly available FGVC datasets demonstrate the superior performance and generalization ability of PSCL, validating the effectiveness of its modular design and the synergy between vision and language. Code is available at: https://anonymous.4open.science/r/PSCL-3E1F.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2685. Enhancing Image-Conditional Coverage in Segmentation: Adaptive Thresholding via Differentiable Miscoverage Loss
๐ฌ 17
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ image segmentation conditional coverage conformal prediction conformal risk control
Current deep learning models for image segmentation often lack reliable uncertainty quantification, particularly at the image-specific level. While Conformal Risk Control (CRC) offers marginal statistical guarantees, achieving image-conditional coverage, which ensures prediction sets reliably capture ground truth for individual images, remains a significant challenge. This paper introduces a novel approach to address this gap by learning image-adaptive thresholds for conformal image segmentation. We first propose AT (Adaptive Thresholding), which frames threshold prediction as a supervised regression task. Building upon the insights from AT, we then introduce COAT (Conditional Optimization for Adaptive Thresholding), an innovative end-to-end differentiable framework. COAT directly optimizes image-conditional coverage by using a soft approximation of the True Positive Rate (TPR) as its loss function, enabling direct gradient-based learning of optimal image-specific thresholds. This novel differentiable miscoverage loss is key to enhancing conditional coverage. Our methods provide a robust pathway towards more trustworthy and interpretable uncertainty estimates in image segmentation, offering improved conditional guarantees crucial for safety-critical applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2686. LoRAGen: Structure-Aware Weight Space Learning for LoRA Generation
๐ฌ 25
โญ 5.50
๐ generative models
๐ท๏ธ Weight space learning hypernetworks LoRA latent diffusion
The widespread adoption of Low-Rank Adaptation (LoRA) for efficient fine-tuning of large language models has created demand for scalable parameter generation methods that can synthesize adaptation weights directly from task descriptions, avoiding costly task-specific training. We present LoRAGen, a structure-aware method for generating LoRA parameters from natural language descriptions. Through empirical analysis of LoRA libraries, we identify two key structural properties of LoRA parameter spaces: non-uniqueness of low-rank decomposition and heterogeneous weight distributions across network modules. These properties necessitate specialized parameter generation methods rather than general weight space learning approaches. LoRAGen employs a latent diffusion model with two innovations: weight-space supervision on full adaptation matrices to handle decomposition non-uniqueness, and a module-aware Mix-of-Experts decoder that adapts to module-specific weight distributions. Experiments show LoRAGen achieves 96.0\% performance relative to task-specific LoRAs on FLAN-T5-large and 72.7\% on Gemma-2-2B-Instruct for in-distribution tasks, while obtaining 40.2\% on zero-shot generation across unseen tasks, surpassing baselines by nearly 5\%. Our work establishes the first structure-aware approach to LoRA generation with insights into adaptation weight space geometry.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2687. Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment
๐ฌ 5
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ cultural awareness reward model LLM Alignment RLHF RL Dataset Benchmark Multilingual Evaluation
Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for further advancing global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant evaluation datasets.
To fill this gap, we propose Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains.
Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness and demonstrates a positive correlation between performance on CARB and downstream multilingual cultural alignment tasks.
Further analysis identifies spurious correlations within culture-aware reward modeling, wherein RM scoring relies predominantly on surface-level features rather than genuine understanding of cultural nuance.
To address these, we propose Think-as-Locals to elicit deeper culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR) and employ well-designed rewards to ensure accurate preference judgments and high-quality structured evaluation criteria generation.
Experimental results validate its efficacy in mitigating interference from spurious features and advancing culture-aware reward modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2688. SketchEvo: Leveraging Drawing Dynamics for Enhanced Image Synthesis
๐ฌ 12
โญ 5.50
๐ generative models
๐ท๏ธ Diffusion model; Image generation; Sequence Guided; Human Preference
Sketching represents humanity's most intuitive form of visual expression -- a universal language that transcends barriers. Although recent diffusion models integrate sketches with text, they often regard the complete sketch merely as a static visual constraint, neglecting the human preference information inherently conveyed during the dynamic sketching process. This oversight leads to images that, despite technical adherence to sketches, fail to align with human aesthetic expectations. Our framework, SketchEvo, harnesses the dynamic evolution of sketches by capturing the progression from initial strokes to the completed drawing. Current preference alignment techniques struggle with sketch-guided generation because the dual constraints of text and sketch create insufficiently different latent samples when using noise perturbations alone. SketchEvo addresses this through two complementary innovations: first, by leveraging sketches at different completion stages to create meaningfully divergent samples for effective aesthetic learning during training; second, through a sequence-guided rollback mechanism that applies these learned preferences during inference by balancing textual semantics with structural guidance. Extensive experiments demonstrate that these complementary approaches enable SketchEvo to deliver improved aesthetic quality while maintaining sketch fidelity, successfully generalizing to incomplete and abstract sketches throughout the drawing process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2689. Learning in Prophet Inequalities with Noisy Observations
๐ฌ 16
โญ 5.50
๐ learning theory
๐ท๏ธ Prophet Inequalities Learning Stopping Time Decision-Making
We study the prophet inequality, a fundamental problem in online decision-making and optimal stopping, in a practical setting where rewards are observed only through noisy realizations and reward distributions are unknown. At each stage, the decision-maker receives a noisy reward whose true value follows a linear model with an unknown latent parameter, and observes a feature vector drawn from a distribution. To address this challenge, we propose algorithms that integrate learning and decision-making via lower-confidence-bound (LCB) thresholding. In the i.i.d. setting, we establish that both an Explore-then-Decide strategy and an $\varepsilon$-Greedy variant achieve the sharp competitive ratio of $1 - 1/e$. For non-identical distributions, we show that a competitive ratio of $1/2$ can be guaranteed against a relaxed benchmark. Moreover, with window access to past rewards, the optimal ratio of $1/2$ against the optimal benchmark is achieved. Experiments on synthetic datasets confirm our theoretical results and demonstrate the efficiency of our algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2690. Eliminating VAE for Fast and High-Resolution Generative Detail Restoration
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Super-Resolution Adversarial distillation Model Compression
Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit with slow inference and high demands on devices. To accelerate inference, recent works like GenDR adopt step distillation to reduce the number of steps to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint that the variational auto-encoder (VAE) is the bottleneck of latency and memory. To completely solve the problem, we leverage pixel-(un)shuffle operations to eliminate the VAE, reversing the latent-based GenDR to pixel-space GenDR-Pix. However, upscaling with $\times 8$ pixel-shuffle may induce repeated-pattern artifacts. To alleviate the distortion, we propose a multi-stage adversarial distillation to progressively remove the encoder and decoder. Specifically, we utilize generative features from the previous-stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier space loss to penalize amplitude outliers. To improve inference performance, we empirically integrate a padding-based self-ensemble with classifier-free guidance for better inference scaling. Experimental results show that GenDR-Pix achieves a 2.8$\times$ acceleration and 60% memory savings compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR methods. Against all odds, GenDR-Pix can restore a 4K image in only 1 second using 6 GB of memory.
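The VAE-free path can be illustrated with PyTorch's built-in pixel-(un)shuffle operators, which losslessly trade spatial resolution for channels; the channel counts and the placeholder backbone below are assumptions for illustration, not GenDR-Pix's actual architecture.

```python
import torch
import torch.nn as nn

scale = 8
unshuffle = nn.PixelUnshuffle(scale)   # (B, C, H, W) -> (B, C*scale^2, H/scale, W/scale)
shuffle = nn.PixelShuffle(scale)       # exact inverse of the above

x = torch.randn(1, 3, 512, 512)
latent = unshuffle(x)                  # (1, 192, 64, 64): plays the role of the VAE latent
print(latent.shape)
print(torch.allclose(shuffle(latent), x))   # True: the mapping is lossless, unlike a VAE

# A placeholder one-step restoration backbone operating directly in the
# pixel-shuffled space; its output is mapped back to pixels with `shuffle`.
backbone = nn.Conv2d(3 * scale**2, 3 * scale**2, kernel_size=3, padding=1)
restored = shuffle(backbone(latent))   # (1, 3, 512, 512), no VAE anywhere
print(restored.shape)
```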
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2691. Revisiting Global Text Conditioning in Diffusion Transformers
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ diffusion models image and video generation
Diffusion transformers typically incorporate textual information via (i) attention layers and (ii) a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention.
In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2692. CALM: Co-evolution of Algorithms and Language Model for Automatic Heuristic Design
๐ฌ 20
โญ 5.50
๐ optimization
๐ท๏ธ LLM Algorithm Generation Reinforcement Learning
Tackling complex optimization problems often relies on expert-designed heuristics, typically crafted through extensive trial and error. Recent advances demonstrate that large language models (LLMs), when integrated into well-designed evolutionary search frameworks, can autonomously discover high-performing heuristics at a fraction of the traditional cost. However, existing approaches predominantly rely on verbal guidance, i.e., manipulating the prompt generation process, to steer the evolution of heuristics, without adapting the underlying LLM. We propose a hybrid framework that combines verbal and numerical guidance, the latter achieved by fine-tuning the LLM via reinforcement learning (RL) based on the quality of generated heuristics. This joint optimization allows the LLM to co-evolve with the search process. Our method outperforms state-of-the-art (SOTA) baselines across various optimization tasks, running locally on a single 24GB GPU using a 7B model with INT4 quantization. It surpasses methods that rely solely on verbal guidance, even when those use significantly more powerful API-based models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2693. Beyond Text-to-Image: Liberating Generation with a Unified Discrete Diffusion Model
๐ฌ 11
โญ 5.50
๐ generative models
๐ท๏ธ discrete diffusion unified model
Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Leveraging efficient token-level discrete denoising, strong visual priors, and a lightweight text decoder, Muddit supports flexible, high-quality generation with a compact architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger AR-based models, in both quality and speed. The work highlights the potential of pure discrete diffusion as a scalable and effective backbone for multimodal generation. Code and models will be available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2694. Graph Mixing Additive Networks
๐ฌ 19
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Deep Learning Graph Neural Networks Interpretability
Real-world temporal data often consists of multiple signal types recorded at irregular, asynchronous intervals. For instance, in the medical domain, different types of blood tests can be measured at different times and frequencies, resulting in fragmented and unevenly scattered temporal data. Similar issues of irregular sampling occur in other domains, such as the monitoring of large systems using event log files. Effectively learning from such data requires handling sets of temporally sparse and heterogeneous signals. In this work, we propose Graph Mixing Additive Networks (GMAN), a novel and interpretable-by-design framework for learning directly from sets of graphs that represent such signals.
GMAN provides diverse interpretability capabilities, including node-level, graph-level, and subset-level importance, and enables practitioners to trade finer-grained interpretability for greater expressivity when domain priors are available.
GMAN achieves state-of-the-art performance in real-world high-stakes tasks, including predicting Crohn's disease onset and hospital length of stay from routine blood test measurements and detecting fake news. Furthermore, we demonstrate how GMAN's interpretability properties assist in revealing disease development phase transitions and provide crucial insights in the healthcare domain.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2695. Topology Matters in RTL Circuit Representation Learning
๐ฌ 20
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ RTL representation EDA
Representation learning for register transfer level (RTL) circuits is fundamental to enabling accurate performance, power, and area (PPA) prediction, efficient circuit generation, and retrieval in automated chip design. Unlike general programming languages, RTL is inherently a structured dataflow graph where semantics are intrinsically bound to the topology from a hardware view. However, existing language-model-based approaches ignore the nature of RTL circuits and fail to capture topology-sensitive properties, leading to incomplete representation and limited performance for diverse downstream tasks. To address this, we introduce TopoRTL, a novel framework that explicitly learns topological differences across RTL circuits and preserves the behavior information. First, we decompose RTL designs into register cones and construct dual modalities initialized with behavior-aware tokenizers. Second, we design three topology-aware positional encodings and leverage attention mechanisms to enable the model to distinguish topological variations among register cones and RTL designs. Finally, we introduce a topology-guided cross-modal alignment strategy, employing contrastive learning over interleaved modality pairs under topological constraints to enforce semantic consistency and achieve superior modality alignment.
Experiments demonstrate that explicit topological modeling is critical to improving RTL representation quality, and TopoRTL significantly outperforms existing methods across multiple downstream tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2696. Learning Recursive Multi-Scale Representations for Irregular Multivariate Time Series Forecasting
๐ฌ 15
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Irregular Multivariate Time Series Time Series Forecasting Multi-Scale Learning
Irregular Multivariate Time Series (IMTS) are characterized by uneven intervals between consecutive timestamps, which carry sampling pattern information valuable and informative for learning temporal and variable dependencies.
In addition, IMTS often exhibit diverse dependencies across multiple time scales.
However, many existing multi-scale IMTS methods use resampling to obtain the coarse series, which can alter the original timestamps and disrupt the sampling pattern information.
To address the challenge, we propose ReIMTS, a **Re**cursive multi-scale modeling approach for **I**rregular **M**ultivariate **T**ime **S**eries forecasting.
Instead of resampling, ReIMTS keeps timestamps unchanged and recursively splits each sample into subsamples with progressively shorter time periods.
Based on the original sampling timestamps in these long-to-short subsamples, an irregularity-aware representation fusion mechanism is proposed to capture global-to-local dependencies for accurate forecasting.
Extensive experiments demonstrate an average performance improvement of 29.1\% in the forecasting task across different models and real-world datasets.
Our code is available at [https://anonymous.4open.science/r/ReIMTS-CA7B/](https://anonymous.4open.science/r/ReIMTS-CA7B/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2697. BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models
๐ฌ 15
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Tool Selection LLM Agents Fairness Bias
Agents backed by large language models (LLMs) often rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options.
This raises a critical point concerning fairness: if selection is systematically biased, it can degrade user experience and distort competition by privileging some providers over others.
We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to evaluate tool-selection bias.
Using this benchmark, we test seven models and show that unfairness exists: models either fixate on a single provider or disproportionately prefer earlier-listed tools in the context.
To investigate the origins of this bias, we conduct controlled experiments examining tool features, metadata (name, description, parameters), and pre-training exposure.
We find that: (1) semantic alignment between queries and metadata is the strongest predictor of choice; (2) perturbing descriptions significantly shifts selections; and (3) repeated pre-training exposure to a single endpoint amplifies bias.
Finally, we propose a lightweight mitigation that first filters the candidate tools to a relevant subset and then samples uniformly, reducing bias while preserving good task coverage.
Our findings highlight tool-selection bias as a key obstacle for the fair deployment of tool-augmented LLMs.
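A minimal sketch of the filter-then-sample mitigation described above; the relevance scorer and threshold are placeholders, and the paper's filtering criterion may differ.

```python
import random

def select_tool(query, tools, relevance, threshold=0.5, rng=random):
    """Filter functionally relevant tools, then pick uniformly among them.

    tools     : list of candidate tool names from the marketplace
    relevance : callable (query, tool) -> score in [0, 1]; a placeholder for
                whatever model decides which tools can serve the query
    """
    candidates = [t for t in tools if relevance(query, t) >= threshold]
    if not candidates:              # fall back to the full list if nothing passes
        candidates = tools
    return rng.choice(candidates)   # uniform sampling removes provider/position bias

# Toy usage with a dummy relevance function.
tools = ["weather_api_A", "weather_api_B", "weather_api_C", "stock_api"]
relevance = lambda q, t: 1.0 if "weather" in t and "weather" in q else 0.0
print(select_tool("what's the weather in Paris?", tools, relevance))
```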
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2698. Nonparametric Contextual Online Bilateral Trade
๐ฌ 12
โญ 5.50
๐ learning theory
๐ท๏ธ bilateral trade online learning contextual bandits
We study the problem of contextual online bilateral trade. At each round, the learner faces a seller-buyer pair and must propose a trade price without observing their private valuations for the item being sold. The goal of the learner is to post prices to facilitate trades between the two parties. Before posting a price, the learner observes a $d$-dimensional context vector that influences the agents' valuations. Prior work in the contextual setting has focused on linear valuation models, where valuations are linear functions of the context. We provide the first characterisation of a general nonparametric setting in which the buyer's and seller's valuations behave according to arbitrary Lipschitz functions of the context. We design an algorithm that leverages contextual information through a hierarchical tree construction and guarantees regret $\widetilde{O}(T^{(d-1)/d})$. Remarkably, our algorithm operates under two stringent features of the setting: (1) one-bit feedback, where the learner only observes whether a trade occurred or not, and (2) strong budget balance, where the learner cannot subsidize or profit from the market participants. We further provide a matching lower bound in the full-feedback setting, demonstrating the tightness of our regret bound.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2699. Towards Multimodal Data-Driven Scientific Discovery Powered by LLM Agents
๐ฌ 4
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Data-driven Scientific Discovery LLM Agent
Recent advances in large language models (LLMs) have enabled agents that automate scientific discovery by interpreting data, generating analysis pipelines, and executing them with computational tools. However, existing benchmarks remain largely limited to unimodal datasets and slice-level tasks, overlooking the fact that real discovery requires multimodal integration, modeling, and hypothesis-driven reasoning. To address this gap, we introduce MoSciBench, the first benchmark for multimodal scientific discovery, constructed from peer-reviewed studies through a principled four-stage pipeline. MoSciBench spans six scientific domains, seven data modalities, and five categories of discovery questions, yielding 88 individual, end-to-end, data-driven tasks. Each task is designed as a cross-modal hypothesis verification workflow, requiring agents to align and integrate heterogeneous datasets before modeling and reasoning. We further evaluate four representative agent frameworks across multiple LLM families. Results show that multimodal discovery is substantially harder than unimodal tasks: even the strongest agents achieve only 48.94\% accuracy, with over 60\% of failures due to cross-modal alignment errors. Lightweight workflow scaffolding consistently improves performance, reducing alignment errors by 5–10\% and raising accuracy by 5.7\% on average. Our benchmark and evaluation framework thus establish a rigorous testbed for advancing LLM agents toward realistic, multimodal scientific discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2700. Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models
๐ฌ 13
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning Vision-Language Models Parameter-Efficient Fine-Tuning Null-Space Methods
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated remarkable zero-shot generalization, enabling deployment in a wide range of real-world tasks without additional task-specific training.
However, in real deployment scenarios with evolving environments or emerging classes, these models inevitably face distributional shifts and novel tasks.
In such contexts, static zero-shot capabilities are insufficient, and there is a growing need for continual learning methods that allow models to adapt over time while avoiding catastrophic forgetting.
We introduce NuSA-CL (Null Space Adaptation for Continual Learning), a lightweight memory-free continual learning framework designed to address this challenge.
NuSA-CL employs low-rank adaptation and constrains task-specific weight updates to lie within an approximate null space of the model's current parameters.
This strategy minimizes interference with previously acquired knowledge, effectively preserving the zero-shot capabilities of the original model.
Unlike methods relying on replay buffers or costly distillation, NuSA-CL imposes minimal computational and memory overhead, making it practical for deployment in resource-constrained, real-world continual learning environments.
Experiments show that our framework not only effectively preserves zero-shot transfer capabilities but also achieves highly competitive performance on continual learning benchmarks.
These results position NuSA-CL as a practical and scalable solution for continually evolving zero-shot VLMs in real-world applications.
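One plausible reading of the null-space constraint is sketched below: estimate an approximate right null space of a frozen weight matrix from its smallest singular directions, then project a low-rank update onto it so the layer's behavior on its dominant input directions is barely perturbed. The energy threshold and the projection side are illustrative assumptions, not NuSA-CL's exact procedure.

```python
import numpy as np

def project_to_null_space(delta_w, w, energy=0.95):
    """Project an update onto the approximate right null space of w.

    Right-singular directions carrying the top `energy` fraction of w's
    spectral energy are removed from the update, so the updated layer behaves
    almost identically to w on inputs aligned with w's dominant row space.
    """
    _, s, vt = np.linalg.svd(w)
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy)) + 1   # number of protected directions
    v_null = vt[k:].T                           # approximate null-space basis
    return delta_w @ (v_null @ v_null.T)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))                                        # frozen weight
lora_update = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))    # low-rank delta
safe_update = project_to_null_space(lora_update, w)

x = w.T @ rng.normal(size=64)          # an input aligned with w's dominant row space
raw_change = np.linalg.norm(lora_update @ x)
safe_change = np.linalg.norm(safe_update @ x)
print(safe_change / raw_change)        # typically far below 1: old behavior preserved
```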
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2701. DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
๐ฌ 4
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Unlearning In-context Learning
Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real-world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning, we introduce novel metrics for unlearning performance and for the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2702. Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs with Application to Glucose Prediction
๐ฌ 17
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Predictive Sparsity Hybrid Neural ODE Group LASSO Glucose Prediction
Hybrid neural ordinary differential equations (neural ODEs) integrate mechanistic models with neural ODEs, offering strong inductive bias and flexibility, and are particularly advantageous in data-scarce healthcare settings. However, excessive latent states and interactions from mechanistic models can lead to training inefficiency and over-fitting, limiting practical effectiveness of hybrid neural ODEs. In response, we propose a new hybrid pipeline for automatic state selection and structure optimization in mechanistic neural ODEs, combining domain-informed graph modifications with data-driven regularization to sparsify the model for improving predictive performance and stability while retaining mechanistic plausibility. Experiments on synthetic and real-world data show improved predictive performance and robustness with desired sparsity, establishing an effective solution for hybrid model reduction in healthcare applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2703. Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
๐ฌ 12
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ multimodal large language models backdoor attack embodied agent
Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
However, such vision driven embodied agents open a new attack surface:
visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then executes an attacker-specified policy.
We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation.
Across various embodied agent benchmarks and MLLMs,
BEAT achieves attack success rates up to 80\%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39\% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2704. Do LLMs Forget What They Should? Evaluating In-Context Forgetting in Large Language Models
๐ฌ 4
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Context Management In-Context Forgetting
Large Language Models (LLMs) have been extensively studied for their memory ability, yet the capacity to selectively forget during inference remains underexplored. We introduce ICF-Bench, a comprehensive benchmark for evaluating In-Context Forgetting (ICF). We define ICF as the ability of LLMs to selectively forget interference information while retaining useful knowledge in context without parameter updates. Built on high-quality datasets, ICF-Bench comprises 2k multi-turn dialogues with annotations that reflect realistic scenarios. Extensive experiments of advanced LLMs on ICF-Bench reveal that: (1) models perform well without forgetting interference but struggle significantly when interference is present; (2) stronger memory capacity without forgetting interference does not transfer into stronger ICF capacity, highlighting an asymmetry between memory and ICF; and (3) context length has different effects on ICF capacity across scenarios. These findings expose critical vulnerabilities of current LLMs in terms of privacy protection, adaptability, and user autonomy. Our code and data will be available at https://anonymous.4open.science/r/ICF-Bench-B1C7.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2705. A Spectral-Grassmann Wasserstein metric for operator representations of dynamical systems
๐ฌ 17
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ dynamical system optimal transport transfer operator koopman operator
The geometry of dynamical systems estimated from trajectory data is a major challenge for machine learning applications. Koopman and transfer operators provide a linear representation of nonlinear dynamics through their spectral decomposition, offering a natural framework for comparison. We propose a novel approach representing each system as a distribution of its joint operator eigenvalues and spectral projectors and defining a metric between systems leveraging optimal transport. The proposed metric is invariant to the sampling frequency of trajectories. It is also computationally efficient, supported by finite-sample convergence guarantees, and enables the computation of Fréchet means, providing interpolation between dynamical systems. Experiments on simulated and real-world datasets show that our approach consistently outperforms standard operator-based distances in machine learning applications, including dimensionality reduction and classification, and provides meaningful interpolation between dynamical systems.
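A toy sketch of the general recipe: each system is summarized by a DMD-style Koopman eigendecomposition, and two systems are compared by an optimal-transport matching whose ground cost mixes eigenvalue and projector discrepancies. The Frobenius-distance cost and uniform weights are simplifying assumptions, not the paper's exact Spectral-Grassmann construction.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def koopman_spectrum(X, Y):
    """Least-squares (DMD-style) Koopman estimate from snapshot pairs X -> Y,
    returning eigenvalues and rank-one spectral projectors."""
    K = Y @ np.linalg.pinv(X)                     # columns of X, Y are state snapshots
    lam, V = np.linalg.eig(K)
    W = np.linalg.inv(V)
    projectors = [np.outer(V[:, i], W[i, :]) for i in range(len(lam))]
    return lam, projectors

def spectral_ot_distance(sys1, sys2):
    """OT distance between two spectral representations with uniform weights;
    the ground cost is a simple proxy mixing eigenvalue and projector terms."""
    (lam1, P1), (lam2, P2) = sys1, sys2
    C = np.zeros((len(lam1), len(lam2)))
    for i in range(len(lam1)):
        for j in range(len(lam2)):
            C[i, j] = abs(lam1[i] - lam2[j]) ** 2 + np.linalg.norm(P1[i] - P2[j]) ** 2
    r, c = linear_sum_assignment(C)               # exact OT for equal-size uniform marginals
    return np.sqrt(C[r, c].mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))
A = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 0.7]])
d = spectral_ot_distance(koopman_spectrum(X, A @ X), koopman_spectrum(X, 0.95 * A @ X))
print(d)
```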
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2706. Efficient Estimation of Kernel Surrogate Models for Task Attribution
๐ฌ 12
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Model interpretability Data attribution Kernel methods
Modern AI agents such as large language models are trained on diverse tasks---translation, code generation, mathematical reasoning, and text prediction---simultaneously. A key question is to quantify how each individual training task influences performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict a target task's performance for any subset of training tasks has emerged in recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships, but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first consider a unified task weighting framework for analyzing task attribution methods, and show a new connection between linear surrogate models and influence functions through a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate surrogate estimates with less than 2% relative error without repeated retraining. Experiments across multiple domains---including mathematical reasoning in transformers, in-context learning, and multi-objective reinforcement learning---demonstrate the effectiveness of kernel surrogate models. They achieve a 25% higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines, enabling more accurate and scalable task attribution. When used for downstream task selection, kernel surrogate models further yield a 40% improvement in demonstration selection for in-context learning and multi-objective reinforcement learning benchmarks.
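A minimal sketch of the surrogate-model idea: task subsets are encoded as binary indicator vectors and a degree-2 polynomial kernel ridge regression is fit to observed target-task performance, so pairwise task interactions become representable. The toy data, kernel choice, and leave-one-out attribution read-out are illustrative assumptions, not the paper's estimator.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n_tasks, n_subsets = 6, 200
S = rng.integers(0, 2, size=(n_subsets, n_tasks)).astype(float)   # task-subset indicators

# Toy "ground truth" with a pairwise interaction between tasks 0 and 1 (purely illustrative).
perf = 0.5 * S[:, 0] + 0.3 * S[:, 2] + 0.4 * S[:, 0] * S[:, 1] + 0.02 * rng.standard_normal(n_subsets)

# Degree-2 polynomial kernel: a stand-in for a kernel surrogate able to capture
# second-order task interactions that a linear surrogate misses.
surrogate = KernelRidge(kernel="poly", degree=2, alpha=1e-3).fit(S, perf)

# Attribution of task j: predicted effect of removing it from the full task mix.
full = np.ones((1, n_tasks))
for j in range(n_tasks):
    drop = full.copy(); drop[0, j] = 0.0
    print(j, (surrogate.predict(full) - surrogate.predict(drop)).item())
```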
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2707. Representation-Based Exploration for Language Models: From Test-Time to Post-Training
๐ฌ 4
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Exploration language models reinforcement learning test-time scaling
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration---explicitly incentivizing the model to discover novel and diverse behaviors---and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled,
representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates---both for post-training, and in a novel inference-time scaling setting we introduce. (1) For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50\% improvement in verifier efficiency on almost all considered tasks. (2) For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration---with the right notion of diversity---is a practical path toward discovery of new behaviors beyond sharpening.
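One simple way to operationalize a representation-based diversity preference at inference time is greedy farthest-point selection over hidden-state embeddings of sampled solutions, sketched below; the cosine distance and greedy rule are illustrative assumptions, not the paper's exact bonus.

```python
import numpy as np

def select_diverse(embeddings, k):
    """Greedy farthest-point selection over candidate hidden-state embeddings:
    a simple stand-in for a representation-based diversity bonus at inference time."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen = [0]                                   # start from an arbitrary candidate
    dists = np.full(len(emb), np.inf)
    for _ in range(k - 1):
        dists = np.minimum(dists, 1.0 - emb @ emb[chosen[-1]])   # cosine distance to last pick
        dists[chosen] = -np.inf
        chosen.append(int(np.argmax(dists)))
    return chosen

rng = np.random.default_rng(0)
candidate_embs = rng.standard_normal((64, 128))    # e.g., last hidden state per sampled solution
print(select_diverse(candidate_embs, k=8))
```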
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2708. No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms
๐ฌ 13
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ jailbreaking attacks AI Safety red-teaming fine-tuning fine-tuning attacks
Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To prevent abuse, these providers apply filters to block fine-tuning on overtly harmful data. In this setting, we make three contributions: First, while past work has shown that safety alignment is superficial, we correspondingly demonstrate that existing fine-tuning attacks are "shallow" -- attacks target only the first several tokens of the model response, and consequently can be blocked by generating the first several response tokens with an aligned model. Second, we conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters. Third, we demonstrate the potency of our new fine-tuning attack by jailbreaking both open-source models equipped with defenses and production models, achieving attack success rates of 57% and 72% against GPT-4o and Claude Haiku, respectively. Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2709. Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
๐ฌ 17
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-language-action Model Unified Multimodal Model
Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and execute corresponding actions as an embodied agent. Recent advancements have integrated future images into the understanding-action loop, enabling foresight-driven policies that reduce abstract action prediction to a more tractable inverse kinematics problem. However, existing models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. In this work, we propose Unified Diffusion VLAs, which tightly couple understanding, generation, and action in a mutually reinforcing manner. Our method optimizes the generation of actions and images jointly through a synchronous denoising diffusion process, where action tokens progressively attend to future image tokens. This iterative refinement enables actions to evolve from initialization with sufficient visual guidance, ensuring precise action execution. We introduce a hybrid attention mechanism and the Joint Discrete Denoising Diffusion Process (JD3P), which integrates multiple modalities into a unified trajectory. We also propose a two-stage training pipeline and several inference-time techniques that optimize performance and efficiency. Our approach achieves state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and SimplerEnv, and we demonstrate its effectiveness through ablation studies and real-world evaluations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2710. CoFact: Conformal Factuality Guarantees for Language Models under Distribution Shift
๐ฌ 5
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Conformal prediction Hallucination LLM reliability
Large Language Models (LLMs) excel in natural language processing (NLP) tasks but often generate false or misleading information, known as hallucinations, raising reliability concerns in high-stakes applications. To provide statistical guarantees on the factuality of LLM outputs, conformal prediction based techniques have been proposed. Despite their strong theoretical guarantees, they rely heavily on the exchangeability assumption between calibration and test data, which is frequently violated in real-world scenarios with dynamic covariate shifts. To overcome this limitation, we introduce **CoFact**, a conformal prediction framework that uses online density ratio estimation to adaptively reweigh calibration data, ensuring alignment with evolving test distributions. With this approach, CoFact bypasses the exchangeability requirement and provides robust factuality guarantees under non-stationary conditions. To theoretically justify CoFact, we establish an upper bound on the gap between the actual hallucination rate and the target level $\alpha$, demonstrating that the bound asymptotically approaches zero as the number of rounds and calibration samples increase.
Empirically, CoFact is evaluated on MedLFQA, WikiData, and the newly introduced **WildChat+** dataset, which captures real-world covariate shifts through user-generated prompts. Results demonstrate that CoFact consistently outperforms existing methods, maintaining reliability even under dynamic conditions.
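A generic weighted-conformal sketch of the core mechanism: calibration nonconformity scores are reweighted by estimated density ratios before taking the (1 - alpha) quantile, so the factuality threshold tracks the shifted test distribution. The score semantics and the simplified weight normalization are assumptions, and CoFact's online density-ratio estimator is not shown.

```python
import numpy as np

def weighted_conformal_threshold(cal_scores, weights, alpha=0.1):
    """Weighted conformal quantile of calibration scores under density-ratio weights
    (a generic weighted-conformal sketch, not CoFact's full online procedure)."""
    order = np.argsort(cal_scores)
    s, w = np.asarray(cal_scores)[order], np.asarray(weights)[order]
    cum = np.cumsum(w) / (w.sum() + 1.0)        # +1 crudely reserves weight for the test point
    idx = np.searchsorted(cum, 1.0 - alpha)
    return s[min(idx, len(s) - 1)]

rng = np.random.default_rng(0)
cal_scores = rng.uniform(size=500)              # e.g., per-claim hallucination scores
density_ratios = np.exp(0.5 * rng.standard_normal(500))
tau = weighted_conformal_threshold(cal_scores, density_ratios, alpha=0.1)
print("keep claims with score <=", tau)
```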
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2711. R4: Nested Reasoning-Retrieval for Reward Modeling in Role-Playing Agents
๐ฌ 4
โญ 5.50
๐ generative models
๐ท๏ธ role-playing knowledge augmented
Role-playing dialogue presents unique challenges for large language models (LLMs): beyond producing coherent text, models must sustain character persona, integrate contextual knowledge, and convey emotional nuance. Despite strong reasoning abilities, current LLMs often generate dialogue that is literal, stylistically bland, and misaligned with character-specific traits. Existing approaches such as retrieval-augmented generation (RAG) or reinforcement learning (RL) with scalar rewards are insufficient, as they cannot capture nuanced preferences or adapt reliably to diverse character contexts.
In this work, we introduce R4, a unified framework that equips both the reward model and the role-playing agent with reasoning and retrieval capabilities. Our reward model reformulates evaluation as structured reasoning: it integrates multi-step deliberation and retrieved knowledge to assess responses along multiple dimensions. This reward supervision is then used within reinforcement learning to train a dialogue agent with the same dual capabilities, enabling contextually grounded and persona-consistent generation.
Experiments demonstrate that R4 substantially improves dialogue quality, particularly in persona fidelity, narrative coherence, and emotional expressiveness. Analysis of training dynamics and case studies further shows that R4 agents employ retrieval more effectively, engage in retrieval-informed self-reflection, and achieve emergent role-playing behaviors unattainable by prior methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2712. FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs
๐ฌ 23
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ MLLM VLM Hallucination Benchmark Chain-of-Thought
Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model's ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought
(CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2713. Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
๐ฌ 20
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ interpretability language models task generalization induction heads
Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their performance and present three key findings. First, we uncover a function induction mechanism that explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2714. Source-Guided Flow Matching
๐ฌ 5
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Flow matching conditional generation guided flow
Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, physics-informed generative tasks, and imaging inverse problems demonstrate the effectiveness and flexibility of the proposed framework.
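The following toy sketch illustrates the core idea on a closed-form 2D flow: the "pre-trained" vector field (here, the analytic flow-matching field transporting N(0, I) to a shifted Gaussian) is left untouched, and guidance is applied purely by sampling the source from a modified distribution before integrating. The specific field, source shift, and Euler integrator are illustrative assumptions.

```python
import numpy as np

MU = np.array([2.0, 0.0])   # toy target mean; the "pre-trained" flow maps N(0, I) to N(MU, I)

def pretrained_velocity(x, t):
    """Closed-form marginal flow-matching field for the linear path between
    N(0, I) and N(MU, I); stands in for a frozen, pre-trained vector field."""
    var_t = (1.0 - t) ** 2 + t ** 2
    return MU + (2.0 * t - 1.0) / var_t * (x - t * MU)

def generate(x0, n_steps=200):
    """Integrate the unchanged field with explicit Euler from given source samples."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * pretrained_velocity(x, k * dt)
    return x

rng = np.random.default_rng(0)
# Unguided generation: source is the original N(0, I).
x_plain = generate(rng.standard_normal((1000, 2)))
# Source-guided generation (sketch): sample the *source* from a shifted Gaussian,
# e.g. to bias outputs toward the upper half-plane, while the vector field stays frozen.
x_guided = generate(rng.standard_normal((1000, 2)) + np.array([0.0, 1.5]))
print(x_plain[:, 1].mean(), x_guided[:, 1].mean())
```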
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2715. InfoDet: A Dataset for Infographic Element Detection
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Infographics visual reasoning grounded chain-of-thought object detection dataset
Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2716. From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
๐ฌ 27
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal reasoning Multimodal RL Multimodal Large Language Model Attention Analysis
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to raise VAS, leaving distributions close to the base model, whereas text-only cold-start induces a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly manipulate attention allocation at inference time, yielding consistent 1--2% gains without retraining. Building on these insights, we propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR delivers an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains.
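A plausible proxy for the metric, assuming access to attention weights and a mask marking visual-token positions; the averaging scheme is an assumption, since the abstract does not give the exact VAS definition.

```python
import torch

def visual_attention_score(attn, visual_mask):
    """Fraction of attention mass placed on visual tokens, averaged over layers,
    heads, and query positions (a generic proxy for the paper's VAS metric).
    attn: (layers, heads, query_len, key_len) weights summing to 1 over keys.
    visual_mask: (key_len,) boolean, True where the key position is a visual token."""
    mass_on_visual = attn[..., visual_mask].sum(dim=-1)   # (layers, heads, query_len)
    return mass_on_visual.mean().item()

layers, heads, q, k = 4, 8, 16, 32
attn = torch.softmax(torch.randn(layers, heads, q, k), dim=-1)
visual_mask = torch.zeros(k, dtype=torch.bool)
visual_mask[:12] = True                                    # first 12 key positions are image tokens
print(visual_attention_score(attn, visual_mask))
```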
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2717. S2R-HDR: A Large-Scale Rendered Dataset for HDR Fusion
๐ฌ 11
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ HDR Dataset HDR Fusion Domain Adaption
The generalization of learning-based high dynamic range (HDR) fusion is often limited by the availability of training data, as collecting large-scale HDR images from dynamic scenes is both costly and technically challenging. To address these challenges, we propose S2R-HDR, the first large-scale high-quality synthetic dataset for HDR fusion, with 24,000 HDR samples. Using Unreal Engine 5, we design a diverse set of realistic HDR scenes that encompass various dynamic elements, motion types, high dynamic range scenes, and lighting. Additionally, we develop an efficient rendering pipeline to generate realistic HDR images. To further mitigate the domain gap between synthetic and real-world data, we introduce S2R-Adapter, a domain adaptation designed to bridge this gap and enhance the generalization ability of models. Experimental results on real-world datasets demonstrate that our approach achieves state-of-the-art HDR fusion performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2718. Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Retrieval-Augmented Generation LLM Agent LLM Reasoning
Retrieval-Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black-box querying operation.
This confines agents' actions to query issuing, hindering their ability to tackle complex information-seeking tasks. To address this, we introduce Interact-RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine-grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end-to-end agent via Supervised Fine-Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning-interaction strategy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2719. Train Once, Answer All: Many Pretraining Experiments for the Cost of One
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models (LLMs) pretraining experiments memorization
Recent work has demonstrated that controlled pretraining experiments are a powerful tool for understanding learning, reasoning, and memorization in large language models (LLMs). However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose to conduct multiple pretraining experiments simultaneously during a *single* training run. We demonstrate the feasibility of this approach by conducting ten experiments during the training of a 1.5B parameter model on 210B tokens. Although we only train a single model, we can replicate the results from multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until the model acquires a particular piece of knowledge. Remarkably, the influence of the ten experiments on the model's training dynamics and overall performance is minimal. However, interactions between different experiments may act as a potential confounder in our approach. We propose to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our findings suggest that performing multiple pretraining experiments in a single training run can enable rigorous scientific experimentation with large models on a compute budget.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2720. SpaCE-Eval: A Benchmark for Real-World Multi-Modal Reasoning
๐ฌ 10
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmark Multi-modal Large Language Model Visual Reasoning Real World Environments Evaluation
Multi-modal Large Language Models (MLLMs) represent a significant advancement in artificial intelligence. Among the growing capabilities exhibited by MLLMs, abilities to understand and reason in real-world environments stand out as particularly vital as a fundamental prerequisite for a wide array of real-world applications. The current methods for evaluating MLLMs often fall short in their ability to comprehensively assess these crucial capabilities. However, being able to reason on complex environment-scale spaces, for example, room spaces, building spaces, and even urban spaces, and to predict the future and plan actions, is essential for humans and various autonomous agents to survive in the real physical world. To address these gaps, we propose a visual-question-answering benchmark, **SpaCE-Eval** (**Spa**tial Reasoning, **C**ommonsense Knowledge and **E**nvironment Interaction) in the real world, designed to evaluate some of MLLM's most important reasoning abilities in real-world environments. As the name suggests, it challenges the models to reason on complex spatial scenarios, invoke commonsense knowledge of the physical world, and interact with the environment. The dataset consists of all new diagrams purposefully produced by humans, where diagram-question pairs are meticulously refined and selected through a rigorous pipeline. Additionally, with the benchmark, we evaluate a selection of leading MLLMs, both proprietary and open source. The results suggest that a significant enhancement of MLLMs in reasoning in the real physical world is necessary to realise more advanced general artificial intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2721. Overlap-Adaptive Regularization for Conditional Average Treatment Effect Estimation
๐ฌ 12
โญ 5.50
๐ causal reasoning
๐ท๏ธ causal machine learning treatment effect estimation
The conditional average treatment effect (CATE) is widely used in personalized medicine to inform therapeutic decisions. However, state-of-the-art methods for CATE estimation (so-called meta-learners) often perform poorly in the presence of low overlap. In this work, we introduce a new approach to tackle this issue and improve the performance of existing meta-learners in the low-overlap regions. Specifically, we introduce Overlap-Adaptive Regularization (OAR) that regularizes target models proportionally to overlap weights so that, informally, the regularization is higher in regions with low overlap. To the best of our knowledge, our OAR is the first approach to leverage overlap weights in the regularization terms of the meta-learners. Our OAR approach is flexible and works with any existing CATE meta-learner: we demonstrate how OAR can be applied to both parametric and non-parametric second-stage models. Furthermore, we propose debiased versions of our OAR that preserve the Neyman-orthogonality of existing meta-learners and thus ensure more robust inference. Through a series of (semi-)synthetic experiments, we demonstrate that our OAR significantly improves CATE estimation in low-overlap settings in comparison to constant regularization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2722. Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
๐ฌ 22
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reasoning Distillation
Reasoning distillation, a cost-effective approach for enhancing student model performance, has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher's behavior in training contexts.
However, previous approaches have lacked a detailed analysis of the origins of the distilled model's capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distillation models.
To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into four categories: (i) teacher-originated actions, (ii) student-originated actions, (iii) pre-existing actions in both models not enhanced by distillation, and (iv) pre-existing actions boosted through distillation. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain the distilled model's observed performance. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics (e.g., selecting data most aligned with the student's original distribution), our method directly compares teacher-student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models (Deepseek-R1-671B, QwQ-32B, GPT-OSS-120B) and diverse student models (Qwen2.5-7B-Instruct, Qwen4-4B-Base, Qwen3-8B-Base, Qwen3-4B-Instruct-2507). The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing, along with our insights into reasoning distillation, with the community.
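A heuristic sketch of the classification step, assuming per-action probabilities from the teacher, the original student, and the distilled model are available; the 0.5 threshold and the boosted-vs-not test are illustrative choices, not the paper's exact criteria.

```python
def classify_action(p_teacher, p_student, p_distilled, thresh=0.5):
    """Assign a provenance label to one action from per-model probabilities
    (thresholds are illustrative; the paper's exact criteria are not in the abstract)."""
    teacher_knows, student_knows = p_teacher >= thresh, p_student >= thresh
    if teacher_knows and not student_knows:
        return "teacher-originated"
    if student_knows and not teacher_knows:
        return "student-originated"
    if p_distilled > p_student:
        return "pre-existing, boosted by distillation"
    return "pre-existing, not enhanced"

print(classify_action(p_teacher=0.8, p_student=0.1, p_distilled=0.7))
print(classify_action(p_teacher=0.7, p_student=0.6, p_distilled=0.9))
```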
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2723. ExpGuard: LLM Content Moderation in Specialized Domains
๐ฌ 21
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Safety Guardrails Moderation Domain Specialization LLM
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2724. Sparse Autoencoders Trained on the Same Data Learn Different Features
๐ฌ 12
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ interpretability reproducibility sae features
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features. For example, in an SAE with 131K latents trained on a feedforward network in Llama 3 8B, only 30% of the features were shared across different seeds. We observed this phenomenon across multiple layers of three different LLMs, two datasets, and several SAE architectures. While ReLU SAEs trained with the L1 sparsity loss showed greater stability across seeds, SAEs using the state-of-the-art TopK activation function were more seed-dependent, even when controlling for the level of sparsity. Our results suggest that the set of features uncovered by an SAE should be viewed as a pragmatically useful decomposition of activation space, rather than an exhaustive and universal list of features ``truly used'' by the model.
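One reasonable way to operationalize "shared features", sketched below: match decoder directions of two SAEs one-to-one by maximizing cosine similarity and count matches above a threshold. The Hungarian matching and 0.7 threshold are assumptions, not necessarily the paper's criterion.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def shared_feature_fraction(dec_a, dec_b, threshold=0.7):
    """Fraction of SAE-A decoder directions with a one-to-one match in SAE-B
    whose cosine similarity exceeds a threshold."""
    A = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    B = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sim = A @ B.T
    rows, cols = linear_sum_assignment(-sim)        # maximize total matched similarity
    return float((sim[rows, cols] > threshold).mean())

rng = np.random.default_rng(0)
dec_seed1 = rng.standard_normal((512, 64))           # (n_latents, model_dim) decoder weights
dec_seed2 = np.vstack([dec_seed1[:150] + 0.05 * rng.standard_normal((150, 64)),
                       rng.standard_normal((362, 64))])
print(shared_feature_fraction(dec_seed1, dec_seed2))  # roughly 0.3 in this toy setup
```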
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2725. Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation
๐ฌ 16
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Reinforcement Learning Robustness Adversarial Attack
This study investigates behavior-targeted attacks on reinforcement learning and their countermeasures. Behavior-targeted attacks aim to manipulate the victim's behavior as desired by the adversary through adversarial interventions in state observations. Existing behavior-targeted attacks have some limitations, such as requiring white-box access to the victim's policy. To address this, we propose a novel attack method using imitation learning from adversarial demonstrations, which works under limited access to the victim's policy and is environment-agnostic. In addition, our theoretical analysis proves that the policy's sensitivity to state changes impacts defense performance, particularly in the early stages of the trajectory. Based on this insight, we propose time-discounted regularization, which enhances robustness against attacks while maintaining task performance. To the best of our knowledge, this is the first defense strategy specifically designed for behavior-targeted attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2726. Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection
๐ฌ 19
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Biomolecular Interaction Modeling Physical Validity
Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a unified module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid configuration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end finetuning. With our Gauss-Seidel projection module in place, two denoising steps are sufficient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our $2$-step model achieves the same structural accuracy as state-of-the-art $200$-step diffusion baselines, delivering ${\sim}10\times$ wall-clock speedups while guaranteeing physical validity.
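A stripped-down sketch of a Gauss-Seidel-style projection, handling only pairwise minimum-distance (steric) constraints on toy coordinates; the actual module is differentiable, covers the full set of stereochemical restraints, and is engineered for scale.

```python
import numpy as np
from scipy.spatial.distance import pdist

def gauss_seidel_steric_projection(coords, min_dist=1.2, n_sweeps=20):
    """Sweep over clashing atom pairs and push each pair apart symmetrically,
    moving the configuration toward one with no steric violations."""
    x = coords.copy()
    for _ in range(n_sweeps):
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                d = x[j] - x[i]
                dist = np.linalg.norm(d)
                if dist < min_dist:                   # clash: correct both atoms in place
                    corr = 0.5 * (min_dist - dist) * d / (dist + 1e-9)
                    x[i] -= corr
                    x[j] += corr
    return x

rng = np.random.default_rng(0)
atoms = rng.uniform(0, 3, size=(30, 3))               # provisional coordinates from a denoiser
projected = gauss_seidel_steric_projection(atoms)
print(pdist(atoms).min(), pdist(projected).min())      # min pairwise distance before / after
```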
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2727. EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements
๐ฌ 10
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Financial Large language models Financial benchmark Accounting fraud detection Earnings forecast
Large Language Models (LLMs) have made remarkable progress, surpassing human performance on several benchmarks in domains such as mathematics and coding. A key driver of this progress has been the development of benchmark datasets. In contrast, the financial domain poses higher entry barriers due to its demand for specialized expertise, and benchmarks remain relatively scarce compared to those in mathematics or coding.
We introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate LLMs on challenging tasks such as accounting fraud detection, earnings forecasting, and industry classification. EDINET-Bench is constructed from ten years of annual reports filed by Japanese companies. These tasks require models to process entire annual reports and integrate information across multiple tables and textual sections, demanding expert-level reasoning that is challenging even for human professionals.
Our experiments show that even state-of-the-art LLMs struggle in this domain, performing only marginally better than logistic regression in binary classification tasks such as fraud detection and earnings forecasting. Our results show that simply providing reports to LLMs in a straightforward setting is not enough. This highlights the need for benchmark frameworks that better reflect the environments in which financial professionals operate, with richer scaffolding such as realistic simulations and task-specific reasoning support to enable more effective problem solving.
We make our dataset and code publicly available to support future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2728. CryoNet.Refine: A One-step Diffusion Model for Rapid Refinement of Structural Models with Cryo-EM Density Map Restraints
๐ฌ 17
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein structure refinement; Cryo-electron microscopy; Deep learning; Density-guided refinement; Geometric restraints; Diffusion model
High-resolution structure determination by cryo-electron microscopy (cryo-EM) requires the accurate fitting of an atomic model into an experimental density map. Traditional refinement pipelines like Phenix.real_space_refine and Rosetta are computationally expensive, demand extensive manual tuning, and present a significant bottleneck for researchers. We present CryoNet.Refine, an end-to-end, deep learning framework that automates and accelerates molecular structure refinement. Our approach utilizes a one-step diffusion model that integrates a density-aware loss function with robust stereochemical restraints, enabling it to rapidly optimize a structure against the experimental data. CryoNet.Refine stands as a unified and versatile solution capable of refining not only protein complexes but also nucleic acids (DNA/RNA) and their assemblies. In benchmarks against Phenix.real_space_refine, CryoNet.Refine consistently yields substantial improvements in both model-map correlation and overall model geometric quality. By offering a scalable, automated, and powerful alternative, CryoNet.Refine is poised to become an essential tool for next-generation cryo-EM structure refinement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2729. Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation
๐ฌ 27
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning; Multi-agent LLMs
Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of the multi-agent framework for complex reasoning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2730. f-INE: A Hypothesis Testing Framework for Estimating Influence under Training Randomness
๐ฌ 12
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Data Attribution Explainability Robustness
Influence estimation methods promise to explain and debug machine learning by estimating the impact of individual samples on the final model. Yet, existing methods collapse under training randomness: the same example may appear critical in one run and irrelevant in the next. Such instability undermines their use in data curation or cleanup since it is unclear if we indeed deleted/kept the correct datapoints. To overcome this, we introduce *f-influence* -- a new influence estimation framework grounded in hypothesis testing that explicitly accounts for training randomness, and establish desirable properties that make it suitable for reliable influence estimation.
We also design a highly efficient algorithm *f*-*IN*fluence *E*stimation (**f-INE**) that computes f-influence in a **single training run**. Finally, we scale up f-INE to estimate the influence of instruction tuning data on Llama 3.1 8B and show it can reliably detect poisoned samples that steer model opinions,
demonstrating its utility for data cleanup and attributing model behavior.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2731. Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ VLM safety
Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns.
While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we present machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.17% and cuts unnecessary rejections by over 84.20%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2732. Relatron: Automating Relational Machine Learning over Relational Databases
๐ฌ 21
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ AutoML Relational database Relational deep learning Graph machine learning Tabular machine learning
Predictive modeling over relational databases (RDBs) powers applications in various domains, yet remains challenging due to the need to capture both cross-table dependencies and complex feature interactions. Recent Relational Deep Learning (RDL) methods automate feature engineering via message passing, while classical approaches like Deep Feature Synthesis (DFS) rely on predefined non-parametric aggregators. Despite promising performance gains, the comparative advantages of RDL over DFS and the design principles for selecting effective architectures remain poorly understood.
We present a comprehensive study that unifies RDL and DFS in a shared design space and conducts large-scale architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and (3) validation accuracy is an unreliable guide for architecture choice. This search yields a curated model performance bank that links model architecture configurations to their performance; leveraging this bank, we analyze the drivers of the RDL-DFS performance gap and introduce two task signals (RDB task homophily and an affinity embedding that captures path, feature, and temporal structure) whose correlation with the gap enables principled routing. Guided by these signals, we propose Relatron, a task embedding-based meta-selector that first chooses between RDL and DFS and then prunes the within-family search to deliver strong performance. Lightweight loss-landscape metrics further guard against brittle checkpoints by preferring flatter optima. In experiments, Relatron resolves the "**more tuning, worse performance**" effect and, in joint hyperparameter-architecture optimization, achieves up to 18.5\% improvement over strong baselines with $10\times$ lower computational cost than Fisher information-based alternatives.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2733. Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens
๐ฌ 10
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computer Vision Deep Learning Side Channel Attack Information Security Information Theft
Noncontact exfiltration of electronic screen content poses a security challenge, with side-channel incursions as the principal vector. We introduce an optical projection side-channel paradigm that confronts two core instabilities: (i) the near-singular Jacobian spectrum of projection mapping breaches Hadamard stability, rendering inversion hypersensitive to perturbations; (ii) irreversible compression in light transport obliterates global semantic cues, magnifying reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR$^4$Net) fuses a Physically Regularized Irradiance Approximation (PRIrr-Approximation), which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that arrests noise propagation. Moreover, an Irreversibility Constrained Semantic Reprojection (ICSR) module reinstates lost global structure through context-driven semantic mapping. Evaluated across four scene categories, IR$^4$Net achieves fidelity beyond competing neural approaches while retaining resilience to illumination perturbations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2734. Image Inpainting with Preference Alignment
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Image Inpainting Preference Alignment Diffusion Models Flow-based Models
This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, render them susceptible to reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2735. Fast Escape, Slow Convergence: Learning Dynamics of Phase Retrieval under Power-Law Data
๐ฌ 10
โญ 5.50
๐ learning theory
๐ท๏ธ scaling laws gradient flow power-law spectrum phase retrieval anisotropic data learning dynamics
Scaling laws describe how learning performance improves with data, compute, or training time, and have become a central theme in modern deep learning. We study this phenomenon in a canonical nonlinear model: phase retrieval with anisotropic Gaussian inputs whose covariance spectrum follows a power law. Unlike the isotropic case, where dynamics collapse to a two-dimensional system, anisotropy yields a qualitatively new regime in which an infinite hierarchy of coupled equations governs the evolution of the summary statistics. We develop a tractable reduction that reveals a three-phase trajectory: (i) fast escape from low alignment, (ii) slow convergence of the summary statistics, and (iii) spectral-tail learning in low-variance directions. From this decomposition, we derive explicit scaling laws for the mean-squared error, showing how spectral decay dictates convergence times and error curves. Experiments confirm the predicted phases and exponents. These results provide the first rigorous characterization of scaling laws in nonlinear regression with anisotropic data, highlighting how anisotropy reshapes learning dynamics.
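For readers unfamiliar with the setup, the standard anisotropic phase-retrieval formulation reads roughly as follows; the notation, the 1/4 prefactor, and the population-risk gradient-flow form are conventional choices and may differ from the paper's exact conventions.

$$x \sim \mathcal{N}(0,\Sigma), \qquad \lambda_k(\Sigma) \propto k^{-\alpha}, \qquad y = \langle w_*, x\rangle^2,$$
$$\mathcal{R}(w) = \tfrac{1}{4}\,\mathbb{E}_{x}\Big[\big(\langle w, x\rangle^2 - \langle w_*, x\rangle^2\big)^2\Big], \qquad \dot{w}(t) = -\nabla_w \mathcal{R}\big(w(t)\big).$$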
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2736. Streaming Autoregressive Video Generation via Diagonal Distillation
๐ฌ 21
โญ 5.50
๐ generative models
๐ท๏ธ Video Generation Diffusion Models
Large-scale pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in just 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model and doubling the acceleration ratio of the state-of-the-art (140x) without sacrificing visual quality. The source code and trained models will be released, and we have uploaded the video to an anonymous hosting service at: https://diagonal-distillation.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2737. Revisiting Parameter Server in LLM Post-Training
๐ฌ 16
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Distributed Training ZeRO Optimizer FSDP Parameter Server
Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the large variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose **On-Demand Communication (ODC)**, which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36\% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. We will open-source our implementation with the camera-ready version.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2738. PRISM: Partial-label Relational Inference with Spatial and Spectral Cues
๐ฌ 4
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Weak Supervised Learning Graph Neural Networks Relational Inference
In many real-world scenarios, precisely labeling graph data is costly or impractical, especially in domains like molecular biology or social networks, where annotation requires expert effort. This challenge motivates partial-label graph learning, where each graph is weakly annotated with a candidate label set containing the true label. However, such ambiguous supervision makes it hard to extract reliable semantics and increases the risk of overfitting to noisy candidates. To address these challenges, we propose PRISM, a unified framework that performs relational inference with spatial and spectral cues to resolve label ambiguity. PRISM captures discriminative spatial cues by aligning prototype-guided substructures across graphs and extracts global spectral cues by decomposing graph signals into multiple frequency bands with attention, preserving frequency-specific semantics. These complementary views are integrated into a hybrid relational graph, which supports confidence-aware label propagation under candidate constraints. A closed-loop refinement mechanism further stabilizes supervision via masked updates and momentum-based confidence estimation. Extensive experiments across diverse benchmarks demonstrate that PRISM consistently outperforms strong baselines under various noise settings, establishing a new paradigm for weakly supervised graph classification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2739. LEAP: Local ECT-Based Learnable Positional Encodings for Graphs
๐ฌ 21
โญ 5.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Topology Euler Characteristic Transform Graph Neural Networks Topological Data Analysis TDA Topological Deep Learning
Graph neural networks (GNNs) largely rely on the message-passing paradigm, where nodes iteratively aggregate information from their neighbors. Yet, standard message passing neural networks (MPNNs) face well-documented theoretical and practical limitations. Graph positional encoding (PE) has emerged as a promising direction to address these limitations. The Euler Characteristic Transform (ECT) is an efficiently computable geometric-topological invariant that characterizes shapes and graphs. In this work, we combine the differentiable approximation of the ECT (DECT) and its local variant ($\ell$-ECT) to propose LEAP, a new end-to-end trainable local structural PE for graphs. We evaluate our approach on multiple real-world datasets as well as on a synthetic task designed to test its ability to extract topological features. Our results underline the potential of LEAP-based encodings as a powerful component for graph representation learning pipelines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2740. Continuous Audio Language Models
๐ฌ 12
โญ 5.50
๐ generative models
๐ท๏ธ audio language model speech music consistency models continuous modeling streaming
Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than their discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at https://iclr-continuous-audio-language-models.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2741. Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
๐ฌ 5
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mixture of Experts Load Balancing Computation Efficiency
The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the \textbf{\textit{Straggler Effect}}, as the most burdened experts dictate the overall inference latency. To address this, we first propose \textit{\textbf{Capacity-Aware Token Drop}}, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., speedup with only degradation on OLMoE). Next, given that many low-load experts remain well below the capacity threshold, we introduce \textit{\textbf{Capacity-Aware Expanded Drop}}, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8$\times$7B-Instruct yields a 0.2\% average performance improvement and a 1.85$\times$ inference speedup. The code will be released upon acceptance.
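The Capacity-Aware Token Drop idea can be sketched as a per-expert top-k filter on router scores; the exact capacity rule and tie-breaking below are our assumptions, not necessarily the paper's:

```python
import numpy as np

def capacity_aware_drop(expert_ids, gate_scores, num_experts, capacity):
    """Keep at most `capacity` tokens per expert, preferring tokens the
    router scored highest; excess tokens are dropped (mask = False)."""
    keep = np.zeros(len(expert_ids), dtype=bool)
    for e in range(num_experts):
        tok = np.where(expert_ids == e)[0]
        if len(tok) > capacity:
            tok = tok[np.argsort(-gate_scores[tok])[:capacity]]
        keep[tok] = True
    return keep

# Ten tokens routed to two experts with an imbalanced assignment, capacity 3.
rng = np.random.default_rng(0)
ids = np.array([0] * 7 + [1] * 3)
print(capacity_aware_drop(ids, rng.random(10), num_experts=2, capacity=3))
```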
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2742. Multi-Action Self-Improvement For Neural Combinatorial Optimization
๐ฌ 13
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Neural Combinatorial Optimization Self-improvement learning Multi-agent Combinatorial Optimization Multi-agent coordination Bipartite matching
Self-improvement has emerged as a state-of-the-art paradigm in Neural Combinatorial Optimization (NCO), where models iteratively refine their policies by generating and imitating high-quality solutions. Despite strong empirical performance, existing methods face key limitations. Training is computationally expensive, as policy updates require sampling numerous candidate solutions per instance to extract a single expert trajectory. More fundamentally, these approaches fail to exploit the structure of combinatorial problems involving the coordination of multiple agents, such as vehicles in min-max routing or machines in scheduling. By supervising on single-action trajectories, they ignore agent-permutation symmetries, where distinct sequences of actions yield identical solutions, hindering generalization and the ability to learn coordinated behavior.
We address these challenges by extending self-improvement to operate over joint multi-agent actions. Our model architecture predicts complete agent-task assignments jointly at each decision step. To explicitly leverage symmetries, we employ a set-prediction loss, which supervises the policy on multiple expert assignments for any given state. This approach enhances sample efficiency and the model's ability to learn coordinated behavior. Furthermore, by generating multi-agent actions in parallel, it drastically accelerates the solution generation phase of the self-improvement loop. Empirically, we validate our method on several combinatorial problems, demonstrating consistent improvements in the quality of the final solution and a reduced generation latency compared to standard self-improvement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2743. Is Finer Better? The Limits of Microscaling Formats in Large Language Models
๐ฌ 17
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ microscaling fine-grained FP4 quantization low-precision llm
Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report a surprising behavior associated with microscaling quantization, whereby the output of a quantized model degrades as the block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several Large Language Models and identify the conditions driving the anomalous behavior. Theoretically, we lay down a framework showing remarkable agreement with experimental data from pretrained model distributions and ideal ones. Overall, we show that the anomaly is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. Based on these insights, we propose the use of FP8 unsigned E5M3 as a novel hardware-friendly format for the scales in FP4 microscaling data types. We demonstrate that UE5M3 achieves comparable performance to the conventional FP8 unsigned E4M3 scales while obviating the need for global scaling operations on weights and activations.
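A simplified simulation is handy for experimenting with the block-size/scale-range interplay the paper studies. The sketch below uses an FP4-like (E2M1) value grid and snaps each block's scale to a power of two within a limited exponent range, as a crude stand-in for a quantized scale format; it is not the paper's quantizer, and the exponent range is an arbitrary assumption:

```python
import numpy as np

FP4_GRID = np.array([0., 0.5, 1., 1.5, 2., 3., 4., 6.])  # E2M1 magnitudes

def quantize_blockwise(x, block, scale_exp_range=(-6, 6)):
    """Per-block absmax scale, snapped to a power of two within a limited
    exponent range, then elements rounded to the nearest FP4-like value."""
    x = x.reshape(-1, block)
    absmax = np.abs(x).max(axis=1, keepdims=True) + 1e-12
    exp = np.clip(np.round(np.log2(absmax / FP4_GRID[-1])), *scale_exp_range)
    scale = 2.0 ** exp
    mags = np.abs(x) / scale
    q = FP4_GRID[np.argmin(np.abs(mags[..., None] - FP4_GRID), axis=-1)]
    return (np.sign(x) * q * scale).reshape(-1)

x = np.random.default_rng(0).normal(scale=1e-3, size=4096)  # narrow distribution
for b in (128, 32, 8):
    err = np.abs(quantize_blockwise(x, b) - x).mean()
    print(f"block={b:4d}  mean abs error={err:.2e}")
```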
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2744. Unified Vision-Language Modeling via Concept Space Alignment
๐ฌ 13
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal embedding space multilingual embedding space
We introduce vSONAR, a vision-language embedding space extended from the text-only embedding space SONAR, which supports 200 text languages and 37 speech languages.
To construct vSONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space.
We thoroughly evaluate vSONAR and show that its embeddings achieve competitive performance on text-to-video retrieval.
Equipped with the SONAR text decoder, vSONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 24.3 vs. 19.6) and VATEX (BLEU 45.0 vs. 41.5).
Leveraging vSONAR, we first demonstrate that the Large Concept Model (LCM), operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner.
Finally, we introduce vLCM, which extends the LCM with vision-language instruction tuning. vLCM encodes vision and language inputs into a unified sequence of latent embeddings via vSONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training.
Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of vLCM: vLCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, spanning rich- to low-resource languages.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2745. Beyond Sequential Reranking: Reranker-Guided Search Improves Reasoning Intensive Retrieval
๐ฌ 8
โญ 5.50
๐ optimization
๐ท๏ธ Vector Similarity Search Information Retrieval LLM reranker
The widely used retrieve-and-rerank pipeline faces two critical limitations: it is constrained by the initial retrieval quality of the top-k documents, and the growing computational demands of LLM-based rerankers restrict the number of documents that can be effectively processed. We introduce Reranker-Guided-Search (RGS), a novel approach that bypasses these limitations by directly retrieving documents according to reranker preferences rather than following the traditional sequential reranking method. Our method uses a greedy search on proximity graphs generated by approximate nearest neighbor algorithms, strategically prioritizing promising documents for reranking based on document similarity. Experimental results demonstrate substantial performance improvements across multiple benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a constrained reranker budget of 100 documents. Our analysis suggests that, given a fixed pair of embedding and reranker models, strategically selecting documents to rerank can significantly improve retrieval accuracy under a limited reranker budget.
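The greedy, reranker-guided traversal of a proximity graph can be illustrated with a small best-first search; the scheduling details and the toy scoring function are our assumptions rather than the exact RGS algorithm:

```python
import heapq

def reranker_guided_search(graph, seeds, rerank, budget):
    """Best-first search over a proximity graph: repeatedly expand the
    highest-reranked document seen so far and rerank its unvisited
    neighbors, until the reranker call budget is exhausted."""
    scored, frontier = {}, []

    def score(doc):
        if doc not in scored and len(scored) < budget:
            scored[doc] = rerank(doc)
            heapq.heappush(frontier, (-scored[doc], doc))

    for s in seeds:
        score(s)
    visited = set()
    while frontier and len(scored) < budget:
        _, doc = heapq.heappop(frontier)
        if doc in visited:
            continue
        visited.add(doc)
        for nbr in graph.get(doc, []):
            score(nbr)
    return max(scored, key=scored.get)

# Toy example: a chain graph where the best document (id 4) is not among the seeds.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(reranker_guided_search(graph, seeds=[0], rerank=lambda d: 0.1 * d, budget=5))
```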
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2746. Understanding Collaboration Mechanism In VAE Recommender Systems
๐ฌ 15
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ VAE-based collaborative filtering
Variational Autoencoders (VAEs) are a powerful alternative to matrix factorization for recommendation. A common technique in VAE-based collaborative filtering (CF) consists in applying binary input masking to user interaction vectors, which improves performance but remains underexplored theoretically. In this work, we analyze how collaboration arises in VAE-based CF and show it is governed by \emph{latent proximity}: we derive a latent sharing radius that informs when an SGD update on one user strictly reduces the loss on another user, with influence decaying as the latent Wasserstein distance increases. We further study the induced geometry: with clean inputs, VAE-based CF primarily exploits \emph{local} collaboration between input-similar users and under-utilizes \emph{global} collaboration between far-but-related users. We compare two mechanisms that encourage \emph{global} mixing and characterize their trade-offs: (1) $\beta$-KL regularization directly tightens the information bottleneck, promoting posterior overlap but risking representational collapse if too large; (2) input masking induces stochastic \emph{geometric} contractions and expansions, which can bring distant users onto the same latent neighborhood but also introduce neighborhood drift.
To preserve user identity while enabling global consistency, we propose an anchor regularizer that aligns user posteriors with item embeddings, stabilizing users under masking and facilitating signal sharing across related items. Our analyses are validated on the Netflix, MovieLens-20M, and Million Song datasets. We also successfully deployed our proposed algorithm on an Amazon streaming platform following a successful online experiment.
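One plausible instantiation of the proposed anchor regularizer, which pulls each user's posterior toward an aggregate of the item embeddings they interacted with, is sketched below; the exact functional form in the paper may differ:

```python
import torch

def anchor_regularizer(user_mu, interactions, item_emb):
    """Anchor loss: pull each user's posterior mean toward the mean embedding
    of the items they interacted with, stabilizing users under input masking."""
    weights = interactions / interactions.sum(dim=1, keepdim=True).clamp(min=1)
    anchors = weights @ item_emb                          # per-user item centroid
    return ((user_mu - anchors) ** 2).sum(dim=1).mean()

user_mu = torch.randn(8, 32)                              # posterior means, 8 users
interactions = torch.randint(0, 2, (8, 100)).float()     # binary user-item matrix
item_emb = torch.randn(100, 32)                           # item embeddings
print(anchor_regularizer(user_mu, interactions, item_emb))
```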
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2747. Thinking as Society: Multi-Social-Agent Self-Distillation for Multimodal Misinformation Detection
๐ฌ 20
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multimodal Misinformation Detection Multimodal Large Language Models Multi-Social-Agent Self-Distillation
Multimodal Misinformation Detection (MMD) in realistic, mixed-sourced scenarios must incorporate robust reasoning capabilities to handle the social complexity and diverse types of forgeries. While MLLM-based agents are increasingly used for the MMD task due to their powerful reasoning abilities, they suffer from a critical trade-off: on one hand, single-agent methods provide only limited, single-view analysis; on the other hand, multi-agent methods introduce high computational costs and significant optimization difficulties. To address this gap, we propose a novel Multi-Social-Agent Self-Distillation framework that internalizes collective social reasoning capabilities into a unified model. Our framework consists of two core stages: (1) we simulate multi-perspective judgments from a diverse society of MLLM agents and synthesize their collective feedback into high-quality Social Chain-of-Thought (SCoT) data; (2) building on this, we propose Social Correction Value-Driven Preference Optimization (SCPO), a new alignment algorithm that leverages the degree of social misjudgment as a verifiable signal to dynamically focus training on the most challenging samples. Extensive experiments on the challenging MFC-Bench and MMFakeBench benchmarks demonstrate the effectiveness of our framework. Our 7B Qwen2-VL-based model significantly outperforms various MLLM baselines and multi-agent methods, and even matches or surpasses proprietary models like GPT-4o and Claude, facilitating advanced multimodal misinformation reasoning and detection via thinking as society.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2748. RAG4DMC: Retrieval-Augmented Generation for Data-Level Modality Completion
๐ฌ 20
โญ 5.50
๐ generative models
๐ท๏ธ Retrieval-Augmented Generation; Missing Modality Completion; Multimodal Learning
Multi-modal datasets are critical for a wide range of applications, but in practice, they often suffer from missing modalities. This motivates the task of Missing Modality Completion (MMC), which aims to reconstruct missing modalities from the available ones to fully exploit multi-modal data. While pre-trained generative models offer a natural solution, directly applying them to domain-specific MMC is often ineffective, and fine-tuning faces practical constraints such as scarce complete samples, restricted API access, and high cost. To address these issues, we propose RAG4DMC, a retrieval-augmented generation framework for data-level MMC. RAG4DMC builds a dual knowledge base from complete in-dataset samples and external public datasets, enhanced with feature alignment and clustering-based filtering to mitigate modality and domain shifts. A multi-modal fusion retrieval mechanism combining intra-modal retrieval with cross-modal fusion then provides relevant context to guide generation, followed by a candidate selection mechanism for coherent completion. Extensive experiments on general and domain-specific datasets demonstrate that our method produces more accurate and semantically coherent missing-modality completions, resulting in substantial improvements in downstream image-text retrieval and image captioning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2749. Constrained Diffusion for Protein Design with Hard Structural Constraints
๐ฌ 16
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Constrained Diffusion Generative Models Protein Design Proximal Optimization Motif Scaffolding
Diffusion models offer a powerful means of capturing the manifold of realistic protein structures, enabling rapid design for protein engineering tasks. However, existing approaches exhibit critical failure modes when precise constraints are necessary for functional design. To this end, we present a constrained diffusion framework for structure-guided protein design, ensuring strict adherence to functional requirements while maintaining precise stereochemical and geometric feasibility. The approach integrates proximal feasibility updates with ADMM decomposition into the generative process, scaling effectively to the complex constraint sets of this domain. We evaluate on challenging protein design tasks, including motif scaffolding and vacancy-constrained pocket design, while introducing a novel curated benchmark dataset for motif scaffolding in the PDZ domain. Our approach achieves state-of-the-art performance, providing perfect satisfaction of bonding and geometric constraints with no degradation in structural diversity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2750. UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
๐ฌ 13
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ multimodal embedding representation learning multimodal large language model reasoning model
The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2751. DaVinci: Reinforcing Visual-Structural Syntax in MLLMs for Generalized Scientific Diagram Parsing
๐ฌ 29
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Scientific diagram parsing multimodal large language model
Parsing raster-based scientific diagrams into structured representations is critical for editability and reusability. However, existing multimodal LLMs (MLLMs) struggle with the diverse visual primitives, complex structural layouts, and strict syntax involved. To address this, we introduce DaVinci, a novel MLLM that learns diagram parsing based on a two-stage framework: supervised learning of visual primitives followed by reinforcement learning of their structural relationships. Our model learns visual-structural syntax through supervised training on TikZ30K, a newly curated dataset of high-quality diagram-TikZ code pairs that features abundant visual primitives and structurally optimized drawing sequences. We further refine the model via reinforcement learning, guided by a hybrid reward function that jointly optimizes for visual fidelity, structural consistency, and code correctness. Extensive experiments show that DaVinci significantly outperforms existing open-source MLLMs and surpasses leading proprietary models like GPT-5 and Claude-Sonnet-4.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2752. NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
๐ฌ 14
โญ 5.50
๐ generative models
๐ท๏ธ Generative Models Neural Simulation Diffusion Models Graphical User Interfaces
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks the computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Beyond reproducing existing systems, NeuralOS shows that synthesized training data can teach the model to simulate applications that were never installed, as illustrated by a Doom application, and suggests a path toward learning user interfaces purely from synthetic demonstrations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2753. LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR
๐ฌ 14
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Optical Music Recognition AI for Music Multimodal Learning
We propose Legato, a new end-to-end model for optical music recognition (OMR), the task of converting music score images to machine-readable documents. Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits a strong ability to generalize across various typeset scores. We conduct comprehensive experiments on a range of datasets and metrics and demonstrate that Legato outperforms the previous state of the art. On our most realistic dataset, we see a 68\% and 47.6\% absolute error reduction on the standard metrics TEDn and OMR-NED, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2754. Locally Subspace-Informed Neural Operators for Efficient Multiscale PDE Solving
๐ฌ 16
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Operators heterogeneous PDEs scientific machine learning Generalized Multiscale Finite Element Method localized spectral basis functions
We propose GMsFEM-NO, a novel hybrid framework that combines the robustness of the Generalized Multiscale Finite Element Method (GMsFEM) with the computational speed of neural operators (NOs) to create an efficient method for solving heterogeneous partial differential equations (PDEs). GMsFEM builds localized spectral basis functions on coarse grids, allowing it to capture important multiscale features and solve PDEs accurately with less computational effort. However, computing these basis functions is costly. While NOs offer a fast alternative by learning the solution operator directly from data, they can lack robustness. Our approach trains a NO to instantly predict the GMsFEM basis by using a novel subspace-informed loss that learns the entire relevant subspace, not just individual functions. This strategy significantly accelerates the costly offline stage of GMsFEM while retaining its foundation in rigorous numerical analysis, resulting in a solution that is both fast and reliable. On standard multiscale benchmarks, including a linear elliptic diffusion problem and the nonlinear, steady-state Richards equation, our GMsFEM-NO method achieves a reduction in solution error compared to standalone NOs and other hybrid methods. The framework demonstrates effective performance for both 2D and 3D problems. A key advantage is its discretization flexibility: the NO can be trained on a small computational grid and evaluated on a larger one with minimal loss of accuracy, ensuring easy scalability. Furthermore, the resulting solver remains independent of forcing terms, preserving the generalization capabilities of the original GMsFEM approach. Our results show that combining NO with GMsFEM creates a powerful new type of solver that is both fast and accurate.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2755. Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks
๐ฌ 13
โญ 5.50
๐ optimization
๐ท๏ธ Neural scaling laws Implicit bias Learning curves Spectral complexity norm Perceptron theory
Scaling laws in deep learning -- empirical power-law relationships linking model performance to resource growth -- have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training. In this work, we describe a richer picture by analyzing the entire training dynamics: we identify two novel \textit{dynamical} scaling laws that govern how performance evolves as a function of different norm-based complexity measures. Combined, our new laws recover the well-known scaling for test error at convergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws, and we explain them through the implicit bias induced by gradient-based training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2756. Fixing the Broken Compass: Diagnosing and Improving Inference-Time Reward Modeling
๐ฌ 18
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Model analysis & interpretability Reasoning Inference-time Scaling
Inference-time scaling techniques have shown promise in enhancing the reasoning capabilities of large language models (LLMs). While recent research has primarily focused on training-time optimization, our work highlights inference-time reward model (RM)-based reasoning as a critical yet overlooked avenue. In this paper, we conduct a systematic analysis of RM behavior across downstream reasoning tasks, revealing three key limitations: (1) RM can impair performance on simple questions, (2) its discriminative ability declines with increased sampling, and (3) high search diversity undermines RM performance. To address these issues, we propose CRISP (Clustered Reward Integration with Stepwise Prefixing), a novel inference-time algorithm that clusters generated reasoning paths by final answers, aggregates reward signals at the cluster level, and adaptively updates prefix prompts to guide generation. Experimental results demonstrate that CRISP significantly enhances LLM reasoning performance, achieving up to 5% accuracy improvement over other RM-based inference methods and an average of 10% gain over advanced reasoning models.
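The cluster-level reward aggregation at the core of CRISP can be sketched as answer-level voting that mixes cluster size with mean reward; the mixing weight and exact scoring rule below are illustrative assumptions, not the paper's formula:

```python
from collections import defaultdict

def cluster_reward_vote(samples, alpha=0.5):
    """Group sampled reasoning paths by their final answer and score each
    answer by a mix of cluster size (self-consistency) and the mean reward
    the RM assigned to the paths in that cluster."""
    clusters = defaultdict(list)
    for answer, reward in samples:
        clusters[answer].append(reward)

    def score(ans):
        rewards = clusters[ans]
        return alpha * len(rewards) / len(samples) + (1 - alpha) * sum(rewards) / len(rewards)

    return max(clusters, key=score)

# Four sampled paths: three agree on "42" with moderate rewards, one outlier.
samples = [("42", 0.7), ("42", 0.6), ("41", 0.9), ("42", 0.5)]
print(cluster_reward_vote(samples))   # "42"
```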
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2757. Grounding Computer Use Agents on Human Demonstrations
๐ฌ 20
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ computer use agents dataset multimodal large language models
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2758. Structure Learning from Time-Series Data with Lag-Agnostic Structural Prior
๐ฌ 23
โญ 5.50
๐ causal reasoning
๐ท๏ธ Continuous DAG structure learning dynamic causal discovery structure learning from time series data
Learning instantaneous and time-lagged causal relationships from time-series data is essential for uncovering fine-grained, temporally-aware interactions. Although this problem has been formulated as a continuous optimization task amenable to modern machine learning methods, existing approaches largely neglect the use of coarse-grained, lag-agnostic causal priors, an important form of prior knowledge that is often available in practice. To address this gap, we propose a novel framework for structure learning from time series to integrate lag-agnostic priors, enabling the discovery of lag-specific causal links without requiring precise temporal annotations. We introduce formulations that precisely characterize the lag-agnostic prior, and demonstrate their consequence- and process-level equivalence to the priors, maintaining consistency with the intended semantics of the priors throughout optimization. We further analyze the optimization challenge posed by the increased non-convexity introduced by lag-agnostic prior constraints, and introduce a data-driven initialization to mitigate this issue. Experiments on both synthetic and real-world datasets show that our method effectively incorporates lag-agnostic prior knowledge to enhance the recovery of fine-grained, lag-aware structures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2759. Tina: Tiny Reasoning Models via LoRA
๐ฌ 9
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reasoning models efficient reasoning LoRA RLVR
How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Tina shows that substantial reasoning performance can be developed using only minimal resources, by applying low-rank adaptation (LoRA) during reinforcement learning (RL), to an already tiny 1.5B parameter base model. This minimalist approach produces models that are competitive with, and sometimes surpass, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational cost employed by existing models. In fact, the best Tina model achieves a >20\% reasoning performance increase and 43.33\% zero-shot Pass@1 accuracy on AIME24, at only \$9 USD cost (i.e., an estimated 260x reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we explore the hypothesis that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, model weights, and checkpoints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2760. Simplicial Embeddings Improve Sample Efficiency in ActorโCritic Agents
๐ฌ 18
โญ 5.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning deep reinforcement learning actor critic representation learning state embeddings
Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require a large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients.
When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
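A common way to realize a simplicial embedding layer is to project features into L groups of V logits and apply a per-group softmax, so each group lies on a probability simplex. A minimal PyTorch sketch; the hyperparameters and temperature are illustrative:

```python
import torch
import torch.nn as nn

class SimplicialEmbedding(nn.Module):
    """Projects features into num_groups groups of group_dim logits and applies
    a softmax per group, so each group lies on a probability simplex; the
    concatenated groups form a sparse, bounded representation."""
    def __init__(self, in_dim, num_groups, group_dim, temperature=1.0):
        super().__init__()
        self.proj = nn.Linear(in_dim, num_groups * group_dim)
        self.num_groups, self.group_dim, self.tau = num_groups, group_dim, temperature

    def forward(self, x):
        z = self.proj(x).view(*x.shape[:-1], self.num_groups, self.group_dim)
        return torch.softmax(z / self.tau, dim=-1).flatten(-2)

emb = SimplicialEmbedding(in_dim=256, num_groups=8, group_dim=16)
print(emb(torch.randn(4, 256)).shape)   # torch.Size([4, 128])
```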
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2761. QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining
๐ฌ 8
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLMs mesoscopic bias and interpolation bias dynamic quadtree vision encoders
Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate cross-modal representation learning. The CLIP model is a widely adopted foundational vision language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations, including being constrained to fixed input resolutions and failing to produce well-separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational costs because such a change often necessitates retraining the entire model pipeline.
In this work, we identify two factors which underlie the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we propose QLIP, a drop-in replacement for CLIP that can be seamlessly integrated with existing MLLMs with only a few lines of code and can enhance both coarse-grained and fine-grained visual understanding, without re-training. QLIP is designed around an image quadtree which replaces the standard uniform grid patches with a novel content aware patchification.
Our experimental results demonstrate that QLIP improves the general visual question answering accuracy of the LLaVA-1.5 model series across various model sizes, without requiring retraining or fine-tuning of the full MLLM. Notably, QLIP boosts detailed understanding performance on the challenging $V^*$ benchmark by up to 13.6%.
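The content-aware patchification idea can be pictured with a variance-driven quadtree: detailed regions keep splitting into smaller patches while flat regions stop early. The split criterion below (pixel variance against a fixed threshold) is our assumption, not necessarily QLIP's rule:

```python
import numpy as np

def quadtree_patches(img, max_depth=4, var_thresh=0.01):
    """Recursively split a square image into quadrants; regions whose pixel
    variance falls below the threshold stop splitting, so detailed regions
    get more (smaller) patches. Returns a list of (y, x, size) boxes."""
    boxes = []

    def split(y, x, size, depth):
        region = img[y:y + size, x:x + size]
        if depth == max_depth or region.var() < var_thresh or size < 2:
            boxes.append((y, x, size))
            return
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                split(y + dy, x + dx, half, depth + 1)

    split(0, 0, img.shape[0], 0)
    return boxes

# Flat image with a detailed bottom-right corner: most patches land there.
img = np.zeros((64, 64))
img[40:, 40:] = np.random.default_rng(0).random((24, 24))
print(len(quadtree_patches(img)))
```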
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2762. vCache: Verified Semantic Prompt Caching
๐ฌ 12
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Semantic Prompt Cache
Semantic caches return cached responses for semantically similar prompts to reduce LLM inference latency and cost. They embed cached prompts and store them alongside their response in a vector database. Embedding similarity metrics assign a numerical score to quantify the similarity between a request and its nearest neighbor prompt from the cache. Existing systems use the same static similarity threshold across all requests to determine whether two prompts can share similar responses. However, we observe that static thresholds do not give formal correctness guarantees, can result in unexpected error rates, and lead to suboptimal cache hit rates. This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees. It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training. Our experiments show that vCache consistently meets the specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines, with up to 12.5$\times$ higher cache hit rates and up to 26$\times$ lower error rates. We release the vCache implementation and four benchmarks to support future research.
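A toy version of a per-prompt, error-bounded cache-hit decision is sketched below; vCache's actual online learning algorithm and its statistical guarantee are more involved, so treat this purely as an illustration of the interface:

```python
class PerPromptThreshold:
    """Per-cached-prompt decision rule: serve the cached response only if the
    empirical error rate at or above the observed similarity stays below the
    user-specified bound, based on labeled observations collected online."""
    def __init__(self, max_error=0.05):
        self.max_error = max_error
        self.obs = []                        # (similarity, response_matched)

    def observe(self, similarity, matched):
        self.obs.append((similarity, matched))

    def should_hit(self, similarity):
        at_or_above = [m for s, m in self.obs if s >= similarity]
        if not at_or_above:                  # no evidence yet: be conservative
            return False
        err = 1 - sum(at_or_above) / len(at_or_above)
        return err <= self.max_error

t = PerPromptThreshold(max_error=0.1)
for s, m in [(0.95, True), (0.9, True), (0.8, False), (0.97, True)]:
    t.observe(s, m)
print(t.should_hit(0.93), t.should_hit(0.75))   # True False
```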
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2763. Product of Experts for Visual Generation
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ generative models image generation video generation
Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources, including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators, remains under-explored. We propose a probabilistic framework that combines information from these heterogeneous models, where expert models jointly shape a product distribution over outputs. To sample from this product distribution for controllable image/video synthesis tasks, we introduce an annealed MCMC sampler in combination with SMC-style resampling to enable efficient inference-time model composition. Our framework empirically yields better controllability than monolithic methods and additionally provides flexible user interfaces for specifying visual generation goals.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2764. Search Self-Play: Pushing the Frontier of Agent Capability without Supervision
๐ฌ 21
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Self-Play Deep Search LLM Agent RLVR
Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires significant human effort and hinders the scaling of RL processes, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer's trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2765. Multi-Task Low-Rank Model Adaptation
๐ฌ 17
โญ 5.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ low rank adaptation multi-task learning mixture of experts model adaptation parameter efficient fine tuning
Scaling multi-task low-rank adaptation (LoRA) to a large number of tasks induces catastrophic performance degradation, such as an accuracy drop from 88.2% to 2.0% on DOTA when scaling from 5 to 15 tasks. This failure is due to parameter and representation misalignment. We find that existing solutions, like regularization and dynamic routing, fail at scale because they are constrained by a fundamental trade-off: strengthening regularization to reduce inter-task conflict inadvertently suppresses the essential feature discrimination required for effective routing. In this work, we identify two root causes for this trade-off. First, uniform regularization disrupts inter-task knowledge sharing: shared underlying knowledge concentrates in high-SV components (89% alignment on Flanv2-BBH), and uniform regularization forces high-SV components to update in orthogonal directions, directly disrupting the shared knowledge. Second, conflict amplification: applying LoRA at the component level (*e.g.*, $W_q, W_v$) amplifies gradient conflicts; we show block-level adaptation reduces this conflict with 47% fewer parameters. Based on these insights, we propose mtLoRA, a scalable solution with three novel designs: 1) Spectral-Aware Regularization to selectively orthogonalize low-SV components while preserving high-SV shared knowledge, 2) Block-Level Adaptation to mitigate conflict amplification and greatly improve parameter efficiency, and 3) Fine-Grained Routing using dimension-specific weights for superior expressive power. On four large-scale (15-25 tasks) vision (DOTA and iNat2018) and NLP (Dolly-15k and BBH) benchmarks, mtLoRA achieves 91.7%, 81.5%, 44.5% and 38.5% accuracy on DOTA, iNat2018, Dolly-15k and BBH respectively, outperforming the state-of-the-art by 2.3% on average while using 47% fewer parameters and 24% less training time.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2766. Towards Real-World Routing with Neural Combinatorial Optimization
๐ฌ 13
โญ 5.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Neural Combinatorial Optimization Vehicle Routing Problem
The practical deployment of Neural Combinatorial Optimization (NCO) for Vehicle Routing Problems (VRPs) is hindered by a critical sim-to-real gap. This gap stems not only from training on oversimplified Euclidean data but also from node-based architectures incapable of handling the node-and-edge-based features with correlated asymmetric cost matrices, such as those for real-world distance and duration. We introduce RRNCO, a novel architecture specifically designed to address these complexities.
RRNCO's novelty lies in two key innovations. First, its Adaptive Node Embedding (ANE) efficiently fuses spatial coordinates with real-world distance features using a learned contextual gating mechanism. Second, its Neural Adaptive Bias (NAB) is the first mechanism to jointly model asymmetric distance, duration, and directional angles, enabling it to capture complex, realistic routing constraints. Moreover, we introduce a new VRP benchmark grounded in real-world data crucial for bridging this sim-to-real gap, featuring asymmetric distance and duration matrices from 100 diverse cities, enabling the training and validation of NCO solvers on tasks that are more representative of practical settings. Experiments demonstrate that RRNCO achieves state-of-the-art performance on this benchmark, significantly advancing the practical applicability of neural solvers for real-world logistics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2767. Planned Diffusion
๐ฌ 15
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ diffusion LLM parallel generation fast inference autoregressive planning hybrid model
A central challenge in large language model inference is the trade-off between generation speed and output quality. Autoregressive models produce high-quality text but generate tokens sequentially. Diffusion models can generate tokens in parallel but often need many iterations to match the same quality. We propose planned diffusion, a hybrid method that combines the strengths of both paradigms.
Planned diffusion works in two stages: first, the model creates a short autoregressive plan that breaks the output into smaller, independent spans. Second, the model generates these spans simultaneously using diffusion. This approach expands the speed-quality Pareto frontier and provides a practical path to faster, high-quality text generation. On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal trade-off between quality and latency, achieving 1.27x to 1.81x speedup over autoregressive generation with only a 0.87\% to 5.4\% drop in win rate, respectively. Our sensitivity analysis shows that the planning mechanism of planned diffusion is minimal and reliable, and simple runtime knobs exist to provide flexible control of the quality-latency trade-off.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2768. AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning
๐ฌ 4
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ Neural-symbolic geometry problem solving interpretable and reliable reasoning
Geometry problem solving presents distinctive challenges in artificial intelligence,
requiring exceptional multimodal comprehension and rigorous mathematical reasoning capabilities.
Existing approaches typically fall into two categories: neural-based and symbolic-based methods,
both of which exhibit limitations in reliability and interpretability. To address this challenge, we propose AutoGPS, a neuro-symbolic collaborative framework that solves geometry problems with concise, reliable, and human-interpretable reasoning processes.
Specifically, AutoGPS employs a Multimodal Problem Formalizer (MPF) and a Deductive Symbolic Reasoner (DSR).
The MPF utilizes neural cross-modal comprehension to translate geometry problems into structured formal language representations,
with collaborative feedback from the DSR.
The DSR takes the formalization as input and formulates geometry problem solving as a hypergraph expansion task,
executing mathematically rigorous and reliable derivation to produce minimal and human-readable stepwise solutions.
Extensive experimental evaluations demonstrate that AutoGPS achieves state-of-the-art performance on benchmark datasets.
Furthermore, human stepwise-reasoning evaluation confirms AutoGPS's impressive reliability and interpretability,
with 99\% stepwise logical coherence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2769. DispViT: Direct Stereo Disparity Regression with a Single-Stream Vision Transformer
๐ฌ 11
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ stereo disparity estimation vision transformer positional encoding
Deep stereo disparity estimation has long been dominated by a \textbf{matching-centric paradigm}, built on constructing cost volumes and iteratively refining local correspondences.
Despite its success, this paradigm exhibits an intrinsic vulnerability: visual ambiguities from occlusion or non-Lambertian surfaces inevitably induce erroneous matches that refinement cannot recover.
This paper introduces \textbf{DispViT}, a new architecture that establishes a \textbf{regression-centric paradigm}.
Instead of explicit matching, DispViT directly regresses disparity from tokenized binocular representations using a single-stream Vision Transformer.
This is enabled by a set of lightweight yet critical designs, such as a probability-based disparity parameterization for stable training and an asymmetrically initialized stereo tokenizer for effective view distinction.
To better align the two views during stereo tokenization, we introduce a novel shift-embedding mechanism that encodes different disparity shifts into channel groups, preserving geometric cues even under large view displacements.
A lightweight refinement module then sharpens the regressed disparity map for fine-grained accuracy.
By prioritizing holistic regression over explicit matching, DispViT streamlines the stereo pipeline while improving robustness and efficiency.
Experiments on standard benchmarks show that our approach achieves state-of-the-art accuracy, with strong resilience to matching ambiguities and wide disparity ranges.
Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2770. VisCoder2: Building Multi-Language Visualization Coding Agents
๐ฌ 16
โญ 5.50
๐ generative models
๐ท๏ธ Code Models Visualization Fine-tuning
Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. **VisCode-Multi-679K** is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. **VisPlotBench** is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present **VisCoder2**, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1, with further gains from iterative self-debug, reaching **82.4%** overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2771. Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
๐ฌ 11
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Reasoning Models Safety Alignment
We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call \textit{self-jailbreaking}.
Specifically, after benign reasoning training on math or code domains, RLMs will use multiple strategies to circumvent their own safety guardrails.
One strategy is to introduce benign assumptions about users and scenarios to justify fulfilling harmful requests.
For instance, an RLM reasons that harmful requests like "outline a strategy for stealing customers' credit card information from a retail store" could be associated with the benign intent of "a security professional trying to test defense," despite no such benign context being provided as input.
We observe that many open-weight RLMs, including DeepSeek-R1-distilled, s1, Phi-4-mini-reasoning, and Nemotron, suffer from self-jailbreaking despite being aware of the harmfulness of the requests.
We also provide a mechanistic interpretation of self-jailbreaking: RLMs are more compliant after benign reasoning training, and self-jailbreaking reduces the perceived harmfulness of malicious requests so models can comply with them.
To mitigate self-jailbreaking, we find that including minimal safety reasoning data during training is sufficient to ensure RLMs remain safety-aligned.
Our work provides the first systematic analysis of self-jailbreaking behavior and offers a practical path forward for maintaining safety in increasingly capable RLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2772. STARK: Strategic Team of Agents for Refining Kernels
๐ฌ 14
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ large language model agent kernel optimization efficiency
The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as single-shot generators or naive refinement tools, limiting their effectiveness in navigating the irregular kernel optimization landscape. We introduce an LLM agentic framework for GPU kernel optimization that systematically explores the design space through multi-agent collaboration, grounded instruction, dynamic context management, and strategic search. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents: our system produces correct solutions where baselines often fail, and achieves kernels with up to 16$\times$ faster runtime performance. These results highlight the potential of agentic LLM frameworks to advance fully automated, scalable GPU kernel optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2773. Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
๐ฌ 16
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Latent variable models Language models Probabilistic inference Veracity Chain-of-thought
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness.
To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable.
To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward.
This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
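Searching over latent veracity assignments can be pictured as a small beam search in which partial assignments are scored by a proxy reward (in the paper, the LM's joint likelihood over veracity and the final answer). The sketch below uses a toy reward and is not VS itself:

```python
def veracity_search(num_steps, proxy_reward, beam=4):
    """Beam search over binary veracity labels for each reasoning step,
    scoring partial assignments with a proxy reward."""
    beams = [((), 0.0)]
    for _ in range(num_steps):
        candidates = [(v + (b,), proxy_reward(v + (b,)))
                      for v, _ in beams for b in (True, False)]
        beams = sorted(candidates, key=lambda c: -c[1])[:beam]
    return beams[0]

# Toy proxy reward: pretend step 2 (0-indexed) is the single incorrect step.
truth = (True, True, False, True)
reward = lambda v: sum(a == b for a, b in zip(v, truth))
print(veracity_search(4, reward))   # ((True, True, False, True), 4)
```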
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2774. Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents
๐ฌ 16
โญ 5.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Large Language Models Reasoning Agents System Efficiency Information Retrieval
Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency---where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce \texttt{SearchAgent-X}, a high-efficiency inference framework for LLM-based search agents. \texttt{SearchAgent-X} leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that \texttt{SearchAgent-X} consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4$\times$ higher throughput and 5$\times$ lower latency, without compromising generation quality. Code is available at \url{https://anonymous.4open.science/r/SearchAgent-X}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2775. Faster Parameter-Free Regret Matching Algorithms
๐ฌ 14
โญ 5.50
๐ learning theory
๐ท๏ธ Regret Matching Parameter-Free Nash Equilibrium
Regret Matching (RM) and its variants are widely employed to learn a Nash equilibrium (NE) in large-scale games. However, most existing research only establishes a theoretical convergence rate of $O(1/\sqrt{T})$ for these algorithms in learning an NE. Recent studies have shown that smooth RM$^+$ variants, the advanced variants of RM, can achieve an improved convergence rate of $O(1/T)$. Despite this improvement, smooth RM$^+$ variants lose the parameter-free property, i.e., no parameters that need to be tuned, a highly desirable feature in practical applications. In this paper, we propose a novel smooth RM$^+$ variant called Monotone Increasing Smooth Predictive Regret Matching$^+$ (MI-SPRM$^+$), which retains the parameter-free property while still achieving a theoretical convergence rate of $O(1/T)$. To achieve these properties, MI-SPRM$^+$ employs a technology called Adaptive Regret Domain (ARD), which ensures that the lower bound for the 1-norm of accumulated regrets increases monotonically by adjusting the decision space at each iteration. This design is motivated by the observation that the range of step-sizes supporting the $O(1/T)$ convergence rate in existing smooth RM$^+$ variants is contingent on the lower bound for the 1-norm of accumulated regrets. Experimental results confirm that MI-SPRM$^+$ empirically attains an $O(1/T)$ convergence rate.
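For context, the classical parameter-free Regret Matching$^+$ baseline that MI-SPRM$^+$ builds on fits in a few lines; the paper's Adaptive Regret Domain and prediction steps are not shown here:

```python
import numpy as np

def regret_matching_plus(payoff, iters=2000):
    """Self-play Regret Matching+ on a two-player zero-sum matrix game.
    Returns the average strategies, which converge toward a Nash equilibrium.
    (Classical baseline, not the proposed MI-SPRM+.)"""
    n, m = payoff.shape
    r1, r2 = np.zeros(n), np.zeros(m)
    s1_sum, s2_sum = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        s1 = r1 / r1.sum() if r1.sum() > 0 else np.full(n, 1 / n)
        s2 = r2 / r2.sum() if r2.sum() > 0 else np.full(m, 1 / m)
        u1, u2 = payoff @ s2, -payoff.T @ s1            # per-action expected payoffs
        r1 = np.maximum(r1 + u1 - s1 @ u1, 0)           # RM+ clips regrets at zero
        r2 = np.maximum(r2 + u2 - s2 @ u2, 0)
        s1_sum += s1
        s2_sum += s2
    return s1_sum / iters, s2_sum / iters

rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])  # rock-paper-scissors
print(regret_matching_plus(rps))   # both strategies close to (1/3, 1/3, 1/3)
```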
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
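To make the regret-matching machinery above concrete, here is a minimal sketch of a vanilla Regret Matching$^+$ update in Python. It is not MI-SPRM$^+$: the paper's Adaptive Regret Domain, predictive term, and smoothing are omitted, and the matching-pennies payoff, fixed opponent, and function names are illustrative assumptions.

```python
import numpy as np

def rm_plus_step(regrets_plus, last_strategy, utilities):
    """One vanilla Regret Matching+ update for a single player.
    regrets_plus : (A,) clipped cumulative regrets, kept non-negative
    last_strategy: (A,) strategy played this iteration
    utilities    : (A,) counterfactual utility of each action this iteration
    """
    ev = float(utilities @ last_strategy)                    # expected value under the strategy
    regrets_plus = np.maximum(regrets_plus + utilities - ev, 0.0)
    total = regrets_plus.sum()
    n_actions = len(regrets_plus)
    next_strategy = regrets_plus / total if total > 0 else np.ones(n_actions) / n_actions
    return regrets_plus, next_strategy

# toy usage: row player of matching pennies against a fixed opponent mix
payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
opponent = np.array([0.7, 0.3])
regrets = np.zeros(2)
strategy = np.ones(2) / 2
for _ in range(1000):
    regrets, strategy = rm_plus_step(regrets, strategy, payoff @ opponent)
print("row strategy after RM+ vs a fixed opponent:", strategy.round(3))
```

The clipping of negative regrets is the defining RM$^+$ feature that the smooth and predictive variants discussed in the abstract build upon.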
2776. Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring
๐ฌ 24
โญ 5.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Online Time Series Monitoring Explainable Artificial Intelligence XAI
Explaining online time series monitoring models is crucial across sensitive domains such as healthcare and finance, where temporal and contextual prediction dynamics underpin critical decisions. While recent XAI methods have improved the explainability of time series models, they mostly analyze each time step independently, overlooking temporal dependencies. This results in further challenges: explaining prediction changes is non-trivial, methods fail to leverage online dynamics, and evaluation remains difficult. To address these challenges, we propose Delta-XAI, which adapts 14 existing XAI methods through a wrapper function and introduces a principled evaluation suite for the online setting, assessing diverse aspects such as faithfulness, sufficiency, and coherence. Experiments reveal that classical gradient-based methods, such as Integrated Gradients (IG), can outperform recent approaches when adapted for temporal analysis. Building on this, we propose Shifted Window Integrated Gradients (SWING), which incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments consistently demonstrate the effectiveness of SWING across a wide range of settings and metrics. Our code is publicly available at https://anonymous.4open.science/r/Delta-XAI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
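As a rough illustration of the shifted-baseline idea behind SWING (entry 2776), the sketch below computes an Integrated-Gradients-style attribution whose baseline is the preceding observation window rather than an all-zero input. The exact integration path, window handling, and model interface in Delta-XAI are not specified in the abstract, so `model_fn`, the finite-difference gradient, and the toy linear model are all assumptions.

```python
import numpy as np

def shifted_window_ig(model_fn, x_prev, x_curr, steps=16):
    """Integrated-Gradients-style attribution that uses the previous
    observation window (not an all-zero input) as the baseline."""
    alphas = np.linspace(0.0, 1.0, steps)
    eps = 1e-4
    grads = []
    for a in alphas:
        x = x_prev + a * (x_curr - x_prev)           # point on the integration path
        g = np.zeros_like(x)
        for idx in np.ndindex(x.shape):              # finite-difference gradient (stand-in for autograd)
            xp = x.copy()
            xp[idx] += eps
            g[idx] = (model_fn(xp) - model_fn(x)) / eps
        grads.append(g)
    avg_grad = np.mean(grads, axis=0)
    return (x_curr - x_prev) * avg_grad              # per-feature attribution of the prediction change

# toy usage: a linear "monitoring model" over a 4-step, 2-feature window
weights = np.array([[0.5, -1.0], [0.2, 0.3], [1.0, 0.0], [-0.4, 0.8]])
model = lambda window: float((weights * window).sum())
prev_window = np.zeros((4, 2))
curr_window = np.random.default_rng(0).normal(size=(4, 2))
print(shifted_window_ig(model, prev_window, curr_window).round(3))
```

For a linear model the attributions sum to the prediction difference between the two windows, which is the kind of completeness property that makes IG-style methods attractive for explaining prediction changes.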
2777. Reliable Probabilistic Forecasting of Irregular Time Series through Marginalization-Consistent Flows
๐ฌ 13
โญ 5.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Irregular Time Series Probabilistic Forecasting Normalizing Flows
Probabilistic forecasting of joint distributions for irregular time series with missing values is an underexplored area in machine learning. Existing models, such as Gaussian Process Regression and ProFITi, are limited: while ProFITi is highly expressive due to its use of normalizing flows, it often produces unrealistic predictions because it lacks marginalization consistency, i.e., marginal distributions of subsets of variables may not match those predicted directly, leading to inaccurate marginal forecasts when trained on joint distributions.
We propose MOSES (Mixtures of Separable Flows), a novel model that parametrizes a stochastic process via a mixture of normalizing flows, where each component combines a latent multivariate Gaussian with separable univariate transformations. This design allows MOSES to be analytically marginalized, enabling accurate and reliable predictions for various probabilistic queries.
Experiments on four datasets show that MOSES achieves highly accurate joint and marginal predictions. Thanks to its inherent marginalization consistency, MOSES significantly outperforms all baselines, including ProFITi, on marginal predictions.
For joint predictions, it beats all other consistent models and performs close to or slightly worse than ProFITi.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
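The claim that MOSES (entry 2777) can be marginalized analytically rests on a standard fact: a mixture whose components are built on multivariate Gaussians admits closed-form marginals, obtained by dropping the unused coordinates from each component while keeping the mixture weights. The sketch below shows only that Gaussian-mixture step; the separable flow transformations MOSES applies on top of the latent Gaussians are not modeled, and all names and values are illustrative.

```python
import numpy as np

def marginalize_gaussian_mixture(weights, means, covs, keep):
    """Analytic marginalization of a Gaussian mixture onto a coordinate subset:
    select the corresponding entries of each mean and the corresponding
    rows/columns of each covariance; mixture weights are unchanged."""
    keep = list(keep)
    return (weights,
            [m[keep] for m in means],
            [c[np.ix_(keep, keep)] for c in covs])

# toy usage: 2-component mixture over 3 variables, marginalized onto variables (0, 2)
w = np.array([0.4, 0.6])
mus = [np.array([0.0, 1.0, 2.0]), np.array([-1.0, 0.0, 1.0])]
sigs = [np.eye(3), 0.5 * np.eye(3) + 0.1]
w2, mus2, sigs2 = marginalize_gaussian_mixture(w, mus, sigs, keep=(0, 2))
print("marginal mean of component 0:", mus2[0])
print("marginal covariance of component 1:\n", sigs2[1])
```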
2778. The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models
๐ฌ 26
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Large Reasoning Model Overthinking
Reasoning models often exhibit overthinking, characterized by redundant reasoning steps. We identify \emph{internal bias} elicited by
the input question as a key trigger of such behavior. Upon encountering a problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it may not be explicitly generated and arises without systematic reasoning. When this guess conflicts with its subsequent reasoning, the model tends to engage in excessive reflection, resulting in wasted computation. We validate the association between internal bias and overthinking across multiple models and diverse reasoning tasks. To demonstrate the causal relationship more rigorously, we conduct two counterfactual interventions, showing that removing the input question from the context after the model has read it reduces redundant reasoning across various complex reasoning tasks, and that manually injecting bias affects overthinking accordingly. Further interpretability experiments suggest that excessive attention to the input question serves as a key mechanism through which internal bias influences subsequent reasoning trajectories. Finally, we evaluate several methods aimed at mitigating overthinking, yet the influence of internal bias persists under all conditions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2779. Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows
๐ฌ 14
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Agentic Workflows Performance Prediction Multi-View Encoding Unsupervised Pretraining Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes **Agentic Predictor**, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a *multi-view workflow encoding* technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs *cross-domain unsupervised pretraining*. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms state-of-the-art methods in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2780. Incentives in Federated Learning with Heterogeneous Agents
๐ฌ 10
โญ 5.50
๐ learning theory
๐ท๏ธ federated learning incentives mechanism design PAC learning sample complexity approximation algorithms strategyproofness price of stability
Federated learning promises significant sample-efficiency gains by pooling data across multiple agents, yet incentive misalignment is an obstacle: each update is costly to the contributor but boosts every participant. We introduce a game-theoretic framework that captures heterogeneous data: an agent's utility depends on who supplies each sample, not just how many. Agents aim to meet a PAC-style accuracy threshold at minimal personal cost. We show that uncoordinated play yields pathologies: pure equilibria may not exist, and the best equilibrium can be arbitrarily more costly than cooperation. To steer collaboration, we analyze the cost-minimizing contribution vector, prove that computing it is NP-hard, and derive a polynomial-time linear program that achieves a logarithmic approximation. Finally, pairing the LP with a simple pay-what-you-contribute rule, in which each agent receives a payment equal to its sample cost, yields a mechanism that is strategy-proof and, within the class of contribution-based transfers, is unique.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2781. Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding
๐ฌ 4
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion LLM
Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose **Learning to Parallel Decode (Learn2PD)**, a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce **End-of-Text Prediction (EoTP)** to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to **22.58ร** speedup without any performance drop, and up to **57.51ร** when combined with KV-Cache.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
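A minimal sketch of the filter idea in Learn2PD (entry 2781): train a small classifier, post hoc, to predict whether a token position's current prediction already matches the final decoded output, and unmask only those positions. The per-token features, the logistic-regression form, and the threshold below are assumptions; the paper's actual filter architecture and training signal may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_filter(features, matches, lr=0.1, epochs=200):
    """Logistic-regression filter predicting whether a token's current
    prediction already equals the final decoded output (the oracle label).
    features : (N, d) per-token features, e.g. confidence, entropy, position
    matches  : (N,) 1 if the current prediction equals the final output, else 0
    """
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad_w = features.T @ (p - matches) / len(matches)
        grad_b = float(np.mean(p - matches))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def unmask_decisions(features, w, b, threshold=0.5):
    """Finalize a token position only if the filter believes its current
    prediction already matches the final answer."""
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return p > threshold

# toy data: high-confidence tokens tend to already match the final output
features = rng.normal(size=(1000, 3))
matches = (features[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(float)
w, b = train_filter(features, matches)
print(unmask_decisions(features[:5], w, b))
```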
2782. LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking
๐ฌ 16
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model latent chain-of-thought reasoning
Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. In this paper, we investigate the $\textit{Soft Thinking}$ capabilities of various LLMs through a systematic analysis of their internal behavior using a suite of probing techniques. Contrary to the prevailing belief that Soft Thinking supports parallel exploration of diverse reasoning paths, our findings reveal that $\textbf{LLMs behave as single-threaded reasoners}$: they predominantly rely on the token with the highest probability in the soft input to predict the next step. This behavior induces a greedy feedback loop that suppresses alternative reasoning paths and undermines the benefits of transmitting richer information via Soft Tokens. To address this $\textit{Greedy Pitfall}$, we propose $\textbf{Stochastic Soft Thinking}$, which introduces stochasticity to break free from the greedy tendency. Our experiments demonstrate that incorporating $\textit{randomness}$, particularly with the $\textbf{Gumbel-Softmax trick}$, can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking, resulting in superior performance across eight reasoning benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
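For entry 2782, the Gumbel-Softmax trick mentioned in the abstract can be illustrated in a few lines: perturb the next-token logits with Gumbel noise before the softmax, so the soft token fed back into the model is a stochastic relaxation rather than a near-argmax distribution. The logits, temperature, and the contrast with a deterministic soft token below are toy assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Draw a relaxed one-hot sample: perturb logits with Gumbel noise,
    then apply a temperature-controlled softmax."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

# Vanilla soft thinking feeds softmax(logits) back in, which is dominated by
# the top token; the stochastic variant injects noise so alternative
# continuations keep receiving probability mass.
logits = np.array([3.0, 2.5, 0.5, -1.0])            # hypothetical next-token logits
soft_det = np.exp(logits - logits.max())
soft_det /= soft_det.sum()
print("deterministic soft token:", soft_det.round(3))
for _ in range(3):
    print("stochastic soft token:  ", gumbel_softmax(logits, tau=0.7).round(3))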
2783. DHG-Bench: A Comprehensive Benchmark for Deep Hypergraph Learning
๐ฌ 42
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ hypergraph neural networks deep hypergraph learning comprehensive benchmark
Deep graph models have achieved great success in network representation learning. However, their focus on pairwise relationships restricts their ability to learn pervasive higher-order interactions in real-world systems, which can be naturally modeled as hypergraphs. To tackle this issue, Hypergraph Neural Networks (HNNs) have garnered substantial attention in recent years. Despite the proposal of numerous HNNs, the absence of consistent experimental protocols and multi-dimensional empirical analysis impedes deeper understanding and further development of HNN research. While several toolkits for deep hypergraph learning
(DHGL) have been introduced to facilitate algorithm evaluation, they provide only limited quantitative evaluation results and insufficient coverage of advanced algorithms, datasets, and benchmark tasks. To fill the gap, we introduce DHG-Bench, the first comprehensive benchmark for HNNs. Specifically, DHG-Bench systematically investigates the characteristics of HNNs in terms of four dimensions: effectiveness, efficiency, robustness, and fairness. We comprehensively evaluate 17 state-of-the-art HNN algorithms on 22 diverse datasets spanning node-, edge-, and graph-level tasks, under unified experimental settings. Extensive experiments reveal both the strengths and limitations of existing algorithms, offering valuable insights and directions for future research. Furthermore, to facilitate reproducible research, we have developed an easy-to-use library for training
and evaluating different HNN methods. The DHG-Bench library is available at:
https://anonymous.4open.science/r/DHG_Bench-F739.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2784. Optimal transport unlocks end-to-end learning for single-molecule localization
๐ฌ 14
โญ 5.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Single Molecule Localization Microscopy SMLM High-Density Learning Deep Learning Inverse Problems Iterative Refinement
Single-molecule localization microscopy (SMLM) allows reconstructing cellular organelles and biologically relevant structures far beyond the spatial resolution imposed by optical constraints, using tagged biomolecule positions. Currently, efficient SMLM requires non-overlapping emitting fluorophores to ensure proper image deconvolution, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy.
In this work, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training.
Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system into our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at \url{anonymized_url}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2785. Near-Optimal Sample Complexity Bounds for Constrained Average-Reward MDPs
๐ฌ 12
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Constrained Average-Reward Markov Decision Process Minimax-Optimal Bounds Sample Complexity
Recent advances have significantly improved our understanding of the sample complexity of learning in average-reward Markov decision processes (AMDPs) under the generative model. However, much less is known about the constrained average-reward MDP (CAMDP), where policies must satisfy long-run average constraints. In this work, we address this gap by studying the sample complexity of learning an $\epsilon$-optimal policy in CAMDPs under a generative model. We propose a model-based algorithm that operates under two settings: (i) relaxed feasibility, which allows small constraint violations, and (ii) strict feasibility, where the output policy satisfies the constraint.
We show that our algorithm achieves sample complexities of $\tilde{O}\left(\frac{S A (B+H)}{ \epsilon^2}\right)$ and $\tilde{O} \left(\frac{S A (B+H)}{\epsilon^2 \zeta^2} \right)$ under the relaxed and strict feasibility settings, respectively. Here, $\zeta$ is the Slater constant indicating the size of the feasible region, $H$ is the span bound of the bias function, and $B$ is the transient time bound. Moreover, a matching lower bound of $\tilde{\Omega}\left(\frac{S A (B+H)}{ \epsilon^2\zeta^2}\right)$ for the strict feasibility case is established, thus providing the first minimax-optimal bounds for CAMDPs. Our results close the theoretical gap in understanding the complexity of constrained average-reward MDPs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2786. Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
๐ฌ 15
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal LLMs Vision-Language Models Fine Grained Visual Grounding Image Warping
Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across ten benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU, MIA-Bench, MMVP, VQAv2, RealWorldQA, BLINK) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
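A sketch of what attention-guided rectilinear warping could look like for entry 2786: collapse an attention map to per-row and per-column marginals, turn their cumulative mass into a monotone resampling grid, and remap the image so high-attention rows and columns receive more output pixels. AttWarp's actual warping formulation, smoothing, and interpolation are not given in the abstract, so this nearest-neighbour version is only an assumption-laden approximation.

```python
import numpy as np

def rectilinear_warp(image, attn, out_h=None, out_w=None, eps=1e-3):
    """Warp an image so rows/columns with high attention mass get more output
    resolution, while the axis-aligned (rectilinear) structure is preserved.
    image : (H, W, C) array,  attn : (H, W) non-negative attention map
    """
    H, W, _ = image.shape
    out_h, out_w = out_h or H, out_w or W
    # 1. marginal attention per row / column, smoothed so no region collapses entirely
    row_mass = attn.sum(axis=1) + eps * attn.sum() / H
    col_mass = attn.sum(axis=0) + eps * attn.sum() / W
    # 2. cumulative mass defines a monotone map from output index to source index
    row_cdf = np.cumsum(row_mass) / row_mass.sum()
    col_cdf = np.cumsum(col_mass) / col_mass.sum()
    tgt_rows = (np.arange(out_h) + 0.5) / out_h
    tgt_cols = (np.arange(out_w) + 0.5) / out_w
    src_rows = np.searchsorted(row_cdf, tgt_rows).clip(0, H - 1)
    src_cols = np.searchsorted(col_cdf, tgt_cols).clip(0, W - 1)
    # 3. nearest-neighbour resampling along the warped grid
    return image[src_rows][:, src_cols]

# toy usage: attention concentrated on the centre enlarges the centre region
img = np.arange(8 * 8).reshape(8, 8, 1).astype(float)
attn = np.zeros((8, 8))
attn[3:5, 3:5] = 1.0
print(rectilinear_warp(img, attn)[..., 0].astype(int))
```

Because the warp is separable over rows and columns, it preserves the global layout (straight rows stay rows), which matches the "rectilinear" constraint described in the abstract.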
2787. NRGPT: An Energy-based Alternative for GPT
๐ฌ 13
โญ 5.50
๐ generative models
๐ท๏ธ energy-based model GPT LLM small language models
Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling (EBM) is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although these dynamics do not necessarily lead to the best-performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText
language modeling. We also observe that our models may be more resistant to overfitting, exhibiting it only after very long training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2788. VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
๐ฌ 14
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Figure-based Mathematical Reasoning Large Multimodal Models Mathematical Benchmark
Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K–12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Analysis indicates that the dominant failure mode stems from image–text misalignment: rather than grounding reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and finetuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and catalyst for developing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image–text integration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2789. Quartet of Diffusions: Structure-Aware Point Cloud Generation through Part and Symmetry Guidance
๐ฌ 5
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ computer vision point cloud generation structure-aware part-based symmetry-aware
We introduce the *Quartet of Diffusions*, a structure-aware point cloud generation framework that explicitly models part composition and symmetry. Unlike prior methods that treat shape generation as a holistic process or only support part composition, our approach leverages four coordinated diffusion models to learn distributions of global shape latents, symmetries, semantic parts, and their spatial assembly. This structured pipeline ensures guaranteed symmetry, coherent part placement, and diverse, high-quality outputs. By disentangling the generative process into interpretable components, our method supports fine-grained control over shape attributes, enabling targeted manipulation of individual parts while preserving global consistency. A central global latent further reinforces structural coherence across assembled parts. Our experiments show that the Quartet achieves state-of-the-art performance. To our best knowledge, this is the first 3D point cloud generation framework that fully integrates and enforces both symmetry and part priors throughout the generative process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2790. Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation
๐ฌ 5
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LiDAR Semantic Segmentation Autonomous Driving Robust Learning for Adverse Weather Data Augmentation Domain Generalization
Adverse weather conditions significantly degrade the performance of LiDAR point cloud semantic segmentation networks by introducing large distribution shifts.
Existing augmentation-based methods attempt to enhance robustness by simulating weather interference during training.
However, they struggle to fully exploit the potential of augmentations due to the trade-off between minor and aggressive augmentations.
To address this, we propose A3Point, an adaptive augmentation-aware latent learning framework that effectively utilizes a diverse range of augmentations while mitigating the semantic shift, which refers to the change in the semantic meaning caused by augmentations.
A3Point consists of two key components:
semantic confusion prior (SCP) latent learning, which captures the model's inherent semantic confusion information, and semantic shift region (SSR) localization, which
decouples semantic confusion and semantic shift, enabling adaptive optimization strategies for different disturbance levels.
Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate the effectiveness of our method, setting new state-of-the-art results. The code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2791. Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
๐ฌ 35
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ visual reasoning adaptive reasoning multimodal large language models
Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, $\underline{\text{M}}$ixture-$\underline{\text{o}}$f-$\underline{\text{V}}$isual-$\underline{\text{T}}$houghts (**MoVT**), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce **AdaVaR**, a two-stage $\underline{\text{Ada}}$ptive $\underline{\text{V}}$isu$\underline{\text{a}}$l $\underline{\text{R}}$easoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2792. Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression
๐ฌ 20
โญ 5.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Model Compression Joint Model Compression Compression Order Network Pruning Network Quantization
What happens when multiple compression methods are combinedโdoes the order in which they are applied matter?
Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization.
A central but underexplored factor in joint model compression is the compression order, or the sequence of different methods within the compression pipeline.
Most prior studies have sidestepped the issue by assuming orthogonality between techniques, while a few have examined it only in highly constrained cases.
Consequently, the broader role of compression order in shaping model performance remains poorly understood.
In this paper, we address the overlooked problem of compression order and provide both theoretical and empirical analysis.
We formulate the problem of optimizing the compression order and introduce the Progressive Intensity Hypothesis, which states that weaker perturbations should precede stronger ones.
We provide theoretical guarantees showing that the relative benefit of one order increases with the underlying performance gap.
Extensive experiments on both language and vision models validate the hypothesis, and further show its generality to broader setups such as multi-stage compression and mixed-precision quantization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2793. Animal behavioral analysis and neural encoding with transformer-based self-supervised pretraining
๐ฌ 27
โญ 5.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ neural encoding behavior analysis electrophysiology Neuropixels neuroscience pose estimation
The brain can only be fully understood through the lens of the behavior it generates--a guiding principle in modern neuroscience research that nevertheless presents significant technical challenges. Many studies capture behavior with cameras, but video analysis approaches typically rely on specialized models requiring extensive labeled data. We address this limitation with BEAST (BEhavioral Analysis via Self-supervised pretraining of Transformers), a novel and scalable framework that pretrains experiment-specific vision transformers for diverse neuro-behavior analyses. BEAST combines masked autoencoding with temporal contrastive learning to effectively leverage unlabeled video data. Through comprehensive evaluation across multiple species, we demonstrate improved performance on three critical neuro-behavioral tasks: extracting behavioral features that correlate with neural activity, pose estimation, and action segmentation, in both single- and multi-animal settings. Our method establishes a powerful and versatile backbone model that accelerates behavioral analysis in scenarios where labeled data remains scarce.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2794. DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
๐ฌ 18
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Distributionally Robust Optimization Robust Reinforcement Learning
Deep reinforcement learning (RL) has achieved remarkable success, yet its deployment in real-world scenarios is often limited by vulnerability to environmental uncertainties. Distributionally robust RL (DR-RL) algorithms have been proposed to resolve this challenge, but existing approaches are largely restricted to value-based methods in tabular settings. In this work, we introduce Distributionally Robust Soft Actor-Critic (DR-SAC), the first actor-critic-based DR-RL algorithm for offline learning in continuous action spaces. DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within a KL-divergence-constrained uncertainty set. We derive the distributionally robust version of the soft policy iteration with a convergence guarantee and incorporate a generative modeling approach to estimate the unknown nominal transition models. Experimental results on five continuous RL tasks demonstrate that our algorithm achieves up to $9.8\times$ higher average reward than the SAC baseline under common perturbations. Additionally, DR-SAC significantly improves computing efficiency and applicability to large-scale problems compared with existing DR-RL algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
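Entry 2794's worst-case backup over a KL-constrained uncertainty set is commonly handled through its convex dual, which only needs samples drawn from the nominal transition model. The sketch below evaluates that dual for a batch of sampled next-state values; the bounds on the dual variable, the scipy optimizer, and the toy samples are assumptions, and DR-SAC's full actor-critic machinery is omitted.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_value(next_values, delta):
    """Worst-case expected next-state value over a KL ball of radius `delta`
    around the nominal model P0, via the standard dual form:
        inf_{P: KL(P||P0)<=delta} E_P[V] = sup_{b>0} -b*log E_P0[exp(-V/b)] - b*delta
    next_values : samples of V(s') drawn from the nominal model P0
    """
    v = np.asarray(next_values, dtype=float)

    def neg_dual(beta):
        z = -v / beta
        # log-mean-exp with max-shift for numerical stability
        lse = z.max() + np.log(np.mean(np.exp(z - z.max())))
        return -(-beta * lse - beta * delta)

    res = minimize_scalar(neg_dual, bounds=(1e-3, 1e3), method="bounded")
    return -res.fun

# toy usage: the robust value is pulled below the nominal mean
samples = np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=256)
print("nominal value:", round(float(samples.mean()), 3))
print("robust value :", round(robust_value(samples, delta=0.1), 3))
```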
2795. Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces
๐ฌ 14
โญ 5.50
๐ optimization
๐ท๏ธ quality diversity optimization black-box optimization derivative-free optimization latent space exploration
Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user-specified, vector-valued measure function. Contemporary QD algorithms are typically limited to low-dimensional measures because high-dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the state-of-the-art CMA-MAE algorithm guides measure space exploration with a histogram in measure space that records so-called discount values. However, CMA-MAE stagnates in domains with high-dimensional measure spaces because solutions with similar measures fall into the same histogram cell and hence receive the same discount value. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high-dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new capabilities for QD algorithms by introducing two new domains where the measure space is the high-dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand-designing the measure function. Results in these domains and on high-dimensional benchmarks show that DMS outperforms CMA-MAE and other existing black-box QD algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2796. DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities
๐ฌ 16
โญ 5.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Sequential Disentanglement Diffusion Models Unsupervised Learning
Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modality-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA leverages a new probabilistic formulation, latent diffusion, and efficient samplers, while incorporating a challenging evaluation protocol for rigorous testing. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2797. Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets
๐ฌ 24
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning Cross-Embodiment Learning Multi-Task Learning
Scalable robot policy pre-training has been hindered by the high cost of collecting high-quality demonstrations for each platform. In this study, we address this issue by uniting offline reinforcement learning (offline RL) with cross-embodiment learning. Offline RL leverages both expert and abundant suboptimal data, and cross-embodiment learning aggregates heterogeneous robot trajectories across diverse morphologies to acquire universal control priors. We perform a systematic analysis of this offline RL and cross-embodiment paradigm, providing a principled understanding of its strengths and limitations. To evaluate this paradigm, we construct a suite of locomotion datasets spanning 16 distinct robot platforms. Our experiments confirm that this combined approach excels at pre-training with datasets rich in suboptimal trajectories, outperforming pure behavior cloning. However, as the proportion of suboptimal data and the number of robot types increase, we observe that conflicting gradients across morphologies begin to impede learning. To mitigate this, we introduce an embodiment-based grouping strategy in which robots are clustered by morphological similarity and the model is updated with a group gradient. This simple, static grouping substantially reduces inter-robot conflicts and outperforms existing conflict-resolution methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2798. DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems
๐ฌ 16
โญ 5.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ autoformalization retrieval augmented generation decomposition
Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal mathematical statements are often complex and offer limited
context on the underlying math concepts.
To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable ``sub-components''. This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks.
We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 42.25\% and 37.14\% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model's capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2799. AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer
๐ฌ 17
โญ 5.50
๐ generative models
๐ท๏ธ 3D Generation Autoregressive Transformer Modular 3D Assets
The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. We will make this work open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2800. Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
๐ฌ 18
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ AI Fairness AI & Society Utility-Fairness Trade-off Visual-language models
Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing their effectiveness remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision–language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines to help practitioners reduce the expensive hyperparameter tuning space in achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy; (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2801. GeomMotif: A Benchmark for Arbitrary Geometric Preservation in Protein Generation
๐ฌ 15
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ protein design generative models motif scaffolding geometric preservation deep learning benchmark structural biology
Motif scaffolding in protein design involves generating complete protein structures while preserving the 3D geometry of designated structural fragments, analogous to image outpainting in computer vision. Current benchmarks focus on functional motifs, leaving general geometric preservation capabilities largely untested. We introduce GeomMotif, a systematic benchmark that evaluates arbitrary structural fragment preservation without requiring functional specificity. We construct 57 benchmark tasks, each containing one or two motifs with up to 7 continuous fragments, by sampling from the Protein Data Bank (PDB) to ensure a ground-truth, solvable conformation for every problem. The tasks are characterized by comprehensive structural and physicochemical properties: size, geometric context, secondary structure, hydrophobicity, charge, and degree of burial. These features enable detailed performance analysis beyond simple success rates, revealing model-specific strengths and limitations. We evaluate models using scRMSD and pLDDT for geometric fidelity and clustering for structural diversity and novelty. Our results show that sequence-based and structure-based approaches find different tasks challenging, and that geometric preservation varies significantly with structural and physicochemical context. GeomMotif provides insights complementary to function-focused benchmarks and establishes a foundation for improving protein generative models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2802. RobotArena $\infty$: Unlimited Robot Benchmarking via Real-to-Sim Translation
๐ฌ 15
โญ 5.50
๐ datasets and benchmarks
๐ท๏ธ Benchmarking Robotics Evaluation Vision Language Action Models Real2Sim
The pursuit of robot generalists, instructable agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained primarily on real-world demonstrations, which is the dominant paradigm for today's vision-language-action (VLA) models. As policies expand in scope and complexity, these barriers only intensify, since defining ``success'' in robotics often hinges on nuanced human judgments of execution quality.
In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2803. Bridging Explainability and Embeddings: BEE Aware of Spuriousness
๐ฌ 15
โญ 5.50
๐ interpretability and explainable AI
๐ท๏ธ spurious correlation interpretability clip foundation models
Current methods for detecting spurious correlations rely on data splits or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space and embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Our experiments cover numerous datasets and domains: vision (Waterbirds, CelebA, ImageNet-1k), language (CivilComments, MIMIC-CXR medical notes), and multiple embedding families (CLIP, CLIP-DataComp.XL, mGTE, BLIP2, SigLIP2).
BEE consistently exposes spurious correlations: from concepts that slash the ImageNet accuracy by up to 95\%, to clinical shortcuts in MIMIC-CXR notes that induce dangerous false negatives. Together, these results position BEE as a general and principled tool for diagnosing spurious correlations in weight space, enabling principled dataset auditing and more trustworthy foundation models. Our code is publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
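In the spirit of BEE's use of linear probing as a diagnostic lens (entry 2803), the sketch below fits a probe on frozen embeddings and ranks embedding dimensions by how strongly the probe's weights align with a suspected spurious attribute. The synthetic embeddings, the correlation-based alignment score, and the 90% co-occurrence rate are illustrative assumptions, not BEE's actual procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# toy embeddings: dimension 0 carries the true signal, dimension 1 a spurious
# attribute (e.g. background) that co-occurs with the label 90% of the time
n, d = 2000, 16
spurious = rng.integers(0, 2, size=n)
labels = np.where(rng.random(n) < 0.9, spurious, 1 - spurious)
emb = rng.normal(size=(n, d))
emb[:, 0] += 1.5 * labels          # genuine feature
emb[:, 1] += 1.5 * spurious        # spurious shortcut

# transparent diagnostic: a linear probe on the frozen embeddings
probe = LogisticRegression(max_iter=1000).fit(emb, labels)
w = probe.coef_.ravel()

# score each dimension by (probe weight) x (correlation with the spurious attribute)
corr_with_spurious = np.array([np.corrcoef(emb[:, j], spurious)[0, 1] for j in range(d)])
alignment = np.abs(w * corr_with_spurious)
print("dims ranked by spurious alignment:", np.argsort(-alignment)[:3])
```

Dimensions that carry both high probe weight and high correlation with the audited attribute are the candidates a spuriousness audit would flag for closer inspection.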
2804. Financial fraud collusion among generative AI agents in social networks
๐ฌ 23
โญ 5.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ financial fraud multi-agent system agent society
In this work, we investigate the risks of collective financial fraud in large-scale multi-agent systems, driven by large language model (LLM) agents. We examine whether agents can collaborate in fraudulent activities, amplify the risks of such behaviors, and identify factors critical to fraud success. To facilitate this research, we introduce MultiAgentFraudBench, a large-scale benchmark for simulating financial fraud scenarios based on realistic online behaviors. The benchmark includes 21 typical online fraud scenarios, covering the full fraud lifecycle across both public and private domains. We explore the dynamics of fraud operations by analyzing interaction depth, hype-building effects, and collaboration failures. Finally, we propose two potential mitigation strategies: the development of monitor agents to block malicious agents and fostering group resilience through information sharing. Our findings highlight the real-world dangers of multi-agent fraud and suggest measures for reducing associated risks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2805. Convergence of Actor-Critic gradient flow for entropy regularised MDPs in general spaces
๐ฌ 13
โญ 5.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Gradient Flow Markov Decision Process Entropy Regularization Non-convex optimization Mirror descent method FisherโRao gradient flow Global convergence Function approximation Actor Critic
We prove the stability and global convergence of a coupled actor-critic gradient flow for infinite-horizon and entropy-regularised Markov decision processes (MDPs) in continuous state and action space with linear function approximation under Q-function realisability.
We consider a version of the actor-critic gradient flow where the critic is updated using temporal difference (TD) learning while the policy is updated using a policy mirror descent method on a separate timescale.
For general action spaces, the relative entropy regularizer is unbounded, and thus it is not clear a priori that the actor-critic flow does not suffer from finite-time blow-up.
We therefore first demonstrate stability, which in turn enables us to obtain a convergence rate of the actor-critic flow to the optimal regularised value function.
The arguments presented show that timescale separation is crucial for stability and convergence in this setting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2806. $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Textual Space
๐ฌ 13
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Test-Time Scaling Textual Optimization
Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities.
However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM's likelihood and a reward model to refine textual representations. $\nabla$-reasoner further incorporates rejection sampling and an acceleration design to robustify and speed up decoding. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, $\nabla$-reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing the number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
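To give a feel for first-order refinement over token logits as described for entry 2806, the sketch below runs gradient ascent on relaxed (softmax) token distributions against a toy objective combining a stand-in reward with a stand-in language-model preference. The reward function, the mixing weight, and the closed-form softmax Jacobian-vector product are assumptions; DTO's actual gradient signals come from a real LLM and reward model rather than these toys.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# toy setup: 4 token positions over a vocabulary of 6; the "reward model"
# prefers token 2 everywhere, the "LM likelihood" prefers the initial logits
V, T = 6, 4
lm_logits = rng.normal(size=(T, V))          # stand-in for the LLM's logits
target = np.zeros(V)
target[2] = 1.0

def objective_grad(logits, lam=0.5):
    """Gradient of  reward(p) + lam * <p, lm_logits>  w.r.t. the logits, where
    p = softmax(logits) and reward = -||p - target||^2 summed over positions."""
    p = softmax(logits)
    d_obj_d_p = -2.0 * (p - target) + lam * lm_logits          # dJ/dp
    inner = (d_obj_d_p * p).sum(axis=-1, keepdims=True)         # softmax JVP
    return p * (d_obj_d_p - inner)

logits = lm_logits.copy()
for _ in range(200):                          # first-order refinement loop
    logits += 0.5 * objective_grad(logits)
print("argmax tokens after refinement (pulled toward token 2 by the toy reward):",
      softmax(logits).argmax(axis=-1))
```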
2807. Algorithm Generation via Creative Ideation
๐ฌ 15
โญ 5.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ system algorithm design creativity ideation self-reflection
Designing system algorithms remains challenging, where the discontinuous nature of the solution space often forces system engineers to rely on generic heuristics at the expense of performance. We study whether LLMs can practically drive algorithm generation, and find that they are biased towards well-known generic designs, rather than making the creative leaps needed to navigate the discontinuous solution space. To address this limitation, we introduce MetaMuse, a framework for creative ideation built on three self-reflection principles: (1) quantifying solution diversity and usefulness in measurable performance space, rather than abstract idea space, (2) steering ideation through external stimuli, rather than internal randomness, and (3) constructing executable solutions using waypoint reasoning, rather than free-form chain-of-thought. Extensive evaluation shows that MetaMuse can generate high-performing solutions for two critical problems at a global cloud provider: cache replacement (reducing cache misses by up to 35.76%) and online bin packing (reducing bin usage by up to 30.93%).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2808. On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
๐ฌ 14
โญ 5.50
๐ reinforcement learning
๐ท๏ธ RLVR LLM reasoning
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models.
While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the **magnitude** of these updates, largely overlooking their **direction**.
In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models.
Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy).
Building on this insight, we propose two practical applications:
(1) a *test-time extrapolation* method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training;
(2) a *training-time reweighting* method that focuses learning on low-probability (corresponding to higher $\Delta\log p$) tokens, which improves reasoning performance across models and benchmarks.
Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
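The quantity at the heart of entry 2808 is straightforward to compute once per-token log-probabilities of both models are available for a fixed completion. The sketch below does exactly that with random stand-in models; in practice the two `(T, V)` arrays would come from scoring the same completion under the base and RLVR checkpoints, which is an assumption about the workflow rather than something stated in the abstract.

```python
import numpy as np

def delta_log_p(base_logprobs, rlvr_logprobs, token_ids):
    """Signed, token-level log-probability difference between the RLVR model
    and the base model for a fixed completion:
        delta_t = log p_rlvr(y_t | y_<t) - log p_base(y_t | y_<t)
    base_logprobs, rlvr_logprobs : (T, V) per-step log-probabilities
    token_ids                    : (T,) realized tokens of the completion
    """
    idx = np.arange(len(token_ids))
    return rlvr_logprobs[idx, token_ids] - base_logprobs[idx, token_ids]

# toy usage with random "models"; positions with large |delta| are the
# candidates for sparse, reasoning-critical updates described above
rng = np.random.default_rng(0)
T, V = 8, 10
base = np.log(rng.dirichlet(np.ones(V), size=T))
rlvr = np.log(rng.dirichlet(np.ones(V), size=T))
tokens = rng.integers(0, V, size=T)
d = delta_log_p(base, rlvr, tokens)
print("delta log p per token:", d.round(2))
print("most-updated positions:", np.argsort(-np.abs(d))[:3])
```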
2809. Expert Divergence Learning for MoE-based Language Models
๐ฌ 6
โญ 5.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mixture of Experts Large Language Model Pretraining
The Mixture-of-Experts (MoE) architecture is a powerful technique for scaling language models, yet it often suffers from expert homogenization, where experts learn redundant functionalities, thereby limiting MoE's full potential. To address this, we introduce Expert Divergence Learning, a novel pre-training strategy that explicitly encourages functional specialization among experts. Our method incorporates a label-driven auxiliary loss that leverages domain labels inherent in pre-training corpora to maximize the Jensen-Shannon Divergence between the expert routing distributions of different data domains. This optimization objective guides the model to develop divergent routing policies across different domains and similar routing policies within the same domain, which leads to emergent and organized expert specialization. We validate our approach by pre-training MoE models of up to 15 billion parameters from scratch. Experimental results demonstrate that models trained with Expert Divergence Learning not only achieve a lower language modeling loss but also exhibit significant performance improvements across a diverse range of downstream benchmarks. Further analysis confirms that our method effectively mitigates expert homogenization and brings greater functional specialization, all with negligible computational overhead during training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
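A hedged sketch of the kind of auxiliary objective entry 2809 describes: average the router's expert distributions per domain and maximize the pairwise Jensen-Shannon divergence between those averages (returned negated here so it can be minimized alongside the main loss). The aggregation level, loss weighting, and normalization are not specified in the abstract and are assumed.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two routing distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def divergence_aux_loss(router_probs, domain_ids):
    """Auxiliary loss rewarding divergence between the average expert routing
    distributions of different domains (negated so a minimizer maximizes it).
    router_probs : (N, E) per-token routing probabilities over E experts
    domain_ids   : (N,) integer domain label per token
    """
    domains = np.unique(domain_ids)
    means = [router_probs[domain_ids == d].mean(axis=0) for d in domains]
    pair_jsd = [js_divergence(means[i], means[j])
                for i in range(len(means)) for j in range(i + 1, len(means))]
    return -float(np.mean(pair_jsd))

# toy usage: two domains, four experts
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=100)
doms = rng.integers(0, 2, size=100)
print("auxiliary loss:", round(divergence_aux_loss(probs, doms), 4))
```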
2810. On the Theoretical Limitations of Embedding-Based Retrieval
๐ฌ 14
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ retrieval embeddings limitations theoretical dataset evaluation
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
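The top-k realizability question in entry 2810 can be probed directly for inner-product retrieval: for a fixed set of document vectors, a pair (i, j) is returnable as a top-2 result iff a feasibility LP over queries has a solution. The sketch below counts realizable pairs for random document embeddings at several dimensions; unlike the experiment described above, it does not optimize free embeddings on a test set, and the margin, query bounds, and random configuration are assumptions.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def pair_is_retrievable(docs, i, j, margin=1e-6):
    """Is there a query q whose two highest inner products are docs i and j?
    Feasibility LP: (d_k - d_i) . q <= -margin and (d_k - d_j) . q <= -margin
    for every other document k, with q box-bounded to keep the LP bounded."""
    others = [k for k in range(len(docs)) if k not in (i, j)]
    A = np.vstack([docs[k] - docs[i] for k in others] +
                  [docs[k] - docs[j] for k in others])
    b = -margin * np.ones(len(A))
    res = linprog(c=np.zeros(docs.shape[1]), A_ub=A, b_ub=b,
                  bounds=[(-1, 1)] * docs.shape[1], method="highs")
    return res.status == 0          # 0 = feasible/optimal, 2 = infeasible

rng = np.random.default_rng(0)
n = 12
for d in (2, 4, 8):
    docs = rng.normal(size=(n, d))
    ok = sum(pair_is_retrievable(docs, i, j) for i, j in combinations(range(n), 2))
    print(f"d={d}: {ok}/{n * (n - 1) // 2} top-2 subsets realizable")
```

With low-dimensional embeddings only a small fraction of the C(n, 2) pairs can ever surface as a top-2 result, which is the flavor of limitation the LIMIT dataset is built to stress.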
2811. R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
๐ฌ 19
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Code Interpreter Reinforcement Learning Curriculum Learning Symbolic Reasoning Tool Use
Practical guidance on training Large Language Models (LLMs) to leverage Code Interpreter across diverse tasks remains lacking. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. Unlike prior RL + tool-use efforts focused on narrow domains such as math or retrieval, we curate 144 diverse reasoning and planning tasks and show that training a general-purpose Code Interpreter across them presents significant challenges due to task heterogeneity and scarcity of effective samples. To address this, we introduce a multi-stage curriculum learning approach that partitions training samples by measured improvement potential. The RL training prioritizes samples with higher potential and gradually shifts to lower-potential ones, increasing the average RL gains from merely +3.4\% to +9.3\% across Qwen-2.5 models (3/7/14B). Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1\% to 72.4\%, outperforming text-only GPT-4o (58.6\%) and GPT-4o with Code Interpreter (70.9\%). Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2812. Horizon Imagination: Efficient On-Policy Training in Diffusion World Models
๐ฌ 10
โญ 5.33
๐ reinforcement learning
๐ท๏ธ world models diffusion model-based reinforcement learning
We study diffusion-based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control.
Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs.
We propose Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub-frame budgets.
Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules.
Code is released open-source at https://anonymous.4open.science/r/horizon-3EA1.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2813. MOBODY: Model-Based Off-Dynamics Offline Reinforcement Learning
๐ฌ 26
โญ 5.33
๐ reinforcement learning
๐ท๏ธ reinforcement learning off-dynamics RL domain adaptation model-based RL
We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions, limiting exploration of high-reward states in the target domain that do not fall within these regions. Consequently, such methods often fail when the dynamics shift is significant or the optimal trajectories lie outside the low-shift regions.
To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using learned target dynamics transitions to explore the target domain, rather than only being trained with the low dynamics-shift transitions.
For dynamics learning, building on the observation that achieving the same next state requires different actions in different domains, MOBODY employs a separate action encoder for each domain to map domain-specific actions into a shared latent space, while sharing a unified state representation and a common transition function.
We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions; it pushes the policy toward actions with high target-domain Q-values rather than toward actions with high source-domain Q-values or uniform imitation of all actions in the offline dataset.
We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.
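A minimal sketch of a target Q-weighted behavior cloning term of the kind described above. The softmax-over-batch weighting, the temperature, and the `policy.log_prob` / `target_q` interfaces are assumptions for illustration; MOBODY's exact weighting scheme is specified in the paper.

```python
import torch

def q_weighted_bc_loss(policy, target_q, states, actions, temperature=1.0):
    """Behavior-cloning loss whose per-sample weight grows with the target-domain
    Q-value: the policy is pulled toward dataset actions that the target critic
    rates highly, while staying within the support of the offline data.
    `policy.log_prob` and `target_q` are hypothetical interfaces."""
    with torch.no_grad():
        q = target_q(states, actions)                    # (batch,)
        weights = torch.softmax(q / temperature, dim=0)  # emphasize high target-Q actions
    log_prob = policy.log_prob(states, actions)          # (batch,)
    return -(weights * log_prob).sum()
```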
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2814. ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model
๐ฌ 8
โญ 5.33
๐ applications to robotics, autonomy, planning
๐ท๏ธ BEV semantic Segmentation Autonomous Driving Autoregressive Generative Models
Bird's-Eye View (BEV) layout estimation has recently advanced through refinements in architectural and geometric design. However, existing methods often overlook the structured relationships among traffic elements. Components such as drivable areas, lane dividers, and pedestrian crossings constitute an interdependent system governed by civil engineering standards. For instance, stop lines precede crosswalks, which align with sidewalks, while lane dividers follow road curvature. To capture these interdependencies, we propose \textbf{ARINBEV}, an autoregressive model for BEV map estimation. Unlike prior generative approaches that rely on complex multiphase training or encoder-decoder architectures, ARINBEV employs a single-stage, decoder-only autoregressive design. This architecture enables semantically consistent BEV map estimation. On nuScenes and Argoverse2, ARINBEV attains 64.3 and 65.6 mIoU, respectively, while using $1.7\times$ fewer parameters and training $1.8\times$ faster than state-of-the-art models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2815. TFHE-Coder: Evaluating LLM Agents for secure Fully Homomorphic Encryption Code Generation
๐ฌ 17
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Agents Code generation Fully Homomorphic Encryption Retrieval Augmented Generation
Fully Homomorphic Encryption over the Torus (TFHE) is a cornerstone of confidential computing, yet its adoption is severely limited by a steep learning curve and the need for specialized cryptographic expertise. To bridge this skills gap, we explore whether Large Language Model (LLM) agents can translate natural language specifications into secure TFHE code. We introduce a novel, three-phase agentic framework designed to overcome the primary failure points in this process. First, in the Instruction Phase, we automatically refine user queries into a structured Formal Prompt. Second, during the Generation Phase, a specialized Retrieval-Augmented Generation (RAG) tool provides the agent with accurate API knowledge from TFHE documentation. Finally, in the Feedback Phase, an automated Security Check module analyzes the output for cryptographic flaws and provides targeted feedback for iterative correction. We comprehensively evaluate our framework by testing four LLMs across five programming tasks of increasing difficulty. Our results demonstrate that baseline frameworks often produce functional but insecure code, while our framework is uniquely superior, consistently generating solutions that are simultaneously compilable, functionally correct, and secure. This work establishes a robust methodology and benchmark for TFHE code generation, demonstrating a viable path toward democratizing secure computation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2816. Making, Not Taking, the Best of N
๐ฌ 16
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Best-of-N test-time scaling synthetic data generation inference multilingual diversity ensembling
Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of $N$ samples, the Best-of-$N$ (BoN).
Yet, this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-$N$ (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer.
We compare FusioN to BoN in two settings: (i) test-time scaling, where we sample and aggregate from a single model at test time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse benchmarks, and varying model scales. Across the board, FusioN consistently outperforms BoN, showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis of FusioN, where it shows surprising strengths and robustness under challenging settings.
These results show that we should shift how we think about evaluating and utilizing LLM generations from a monolithic measure of quality, to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.
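The contrast between selection and synthesis can be summarized schematically. In the sketch below, `generate`, `judge_pick_best`, and `judge_fuse` are hypothetical helpers standing in for model sampling and an LLM judge; they are not the paper's actual API.

```python
# Schematic contrast: Best-of-N selects one winner; Fusion-of-N synthesizes from all candidates.

def best_of_n(prompt, generate, judge_pick_best, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    # zero-sum: one winner is kept, the information in the other N-1 samples is discarded
    return candidates[judge_pick_best(prompt, candidates)]

def fusion_of_n(prompt, generate, judge_fuse, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    # collaborative: the judge composes the most informative elements of every sample
    return judge_fuse(prompt, candidates)
```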
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2817. Learning Shrinks the Hard Tail: Training-Dependent Inference Scaling in a Solvable Linear Model
๐ฌ 10
โญ 5.33
๐ learning theory
๐ท๏ธ Scaling Laws Inference scaling test time compute linear models fine tuning
We analyze neural scaling laws in a solvable model of last-layer fine-tuning where targets have intrinsic, instance-heterogeneous difficulty. In our Latent Instance Difficulty (LID) model, each input's target variance is governed by a latent ''precision'' drawn from a heavy-tailed distribution. While generalization loss recovers standard scaling laws, our main contribution connects this to inference. The pass@$k$ failure rate exhibits a power-law decay, $k^{-\beta_\mathrm{eff}}$, but the observed exponent $\beta_\mathrm{eff}$ is training-dependent. It grows with sample size $N$ before saturating at an intrinsic limit $\beta$ set by the difficulty distribution's tail. This coupling reveals that learning shrinks the ''hard tail'' of the error distribution: improvements in the model's generalization error steepen the pass@$k$ curve until irreducible target variance dominates. The LID model yields testable, closed-form predictions for this behavior, including a compute-allocation rule that favors training before saturation and inference attempts after.
We validate these predictions in simulations and in two real-data proxies: CIFAR-10H (human-label variance) and a maths teacher-student distillation task.
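The pass@$k$ behavior described above can be simulated directly: draw per-instance success probabilities from a heavy-tailed difficulty distribution, compute the average failure rate as a function of $k$, and fit the effective exponent on a log-log scale. The Pareto-style difficulty distribution and the fitting range below are illustrative assumptions, not the paper's LID model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative heavy-tailed instance difficulty: per-instance success probability p_i.
beta = 1.5                                              # assumed tail exponent of the difficulty distribution
p = 1.0 / (1.0 + rng.pareto(beta, size=20000))

ks = np.arange(1, 257)
fail = np.array([np.mean((1.0 - p) ** k) for k in ks])  # pass@k failure rate averaged over instances

# Fit the observed power-law exponent beta_eff over the tail of k on a log-log scale.
mask = ks >= 16
beta_eff = -np.polyfit(np.log(ks[mask]), np.log(fail[mask]), 1)[0]
print(f"fitted beta_eff ~ {beta_eff:.2f} (intrinsic tail exponent beta = {beta})")
```

Re-running this with sharper (better-trained) per-instance probabilities steepens the fitted exponent until the heavy tail of irreducible difficulty dominates, which is the coupling the abstract describes.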
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2818. Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
๐ฌ 12
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Tool-using Reinforcement Learning
Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text space. To enhance LLMs' tool-calling abilities, previous approaches primarily rely on supervised fine-tuning (SFT) with trajectories distilled from stronger models, often resulting in imitative reasoning that limits generalization. In this work, we explore rule-based reinforcement learning to enhance tool-calling in LLMs, resulting in Nemotron-Research-Tool-N1, a series of tool-calling reasoning models. Rather than enforcing supervision over intermediate distilled reasoning traces, Tool-N1 is trained with a binary RL reward that assesses only the format validity and functional correctness of tool invocations. This lightweight supervision allows the model to develop reasoning strategies independently, without relying on annotated trajectories. Experiments on several major benchmarks show that Tool-N1-7B/14B clearly outperform GPT-4o. We conduct a systematic study on the design of rule-based reinforcement learning strategies for training tool-calling models. Using 5,518 distilled reasoning trajectories, we compare SFT, RL, and the SFT-then-RL pipeline, finding that the widely adopted SFT-then-RL paradigm does not necessarily outperform pure RL.
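A minimal sketch of a rule-based binary reward of the kind described above, which scores only format validity and functional correctness of the tool call. The JSON schema and the `execute_tool` callable are assumptions for illustration, not the paper's exact checker.

```python
import json

def tool_call_reward(response_text, expected_result, execute_tool):
    """Binary rule-based reward: 1.0 only if the emitted tool call is well-formed AND
    its execution matches the expected result; intermediate reasoning is not supervised."""
    try:
        call = json.loads(response_text)
        assert isinstance(call, dict) and "name" in call and "arguments" in call
    except (json.JSONDecodeError, AssertionError):
        return 0.0                                    # format invalid
    try:
        result = execute_tool(call["name"], call["arguments"])
    except Exception:
        return 0.0                                    # execution failed
    return 1.0 if result == expected_result else 0.0  # functional correctness
```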
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2819. What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation
๐ฌ 7
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ implicit planning LLM rhyming metrics
Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation for a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder.
We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models.
Across models, we find that the generated rhyme (e.g. "-ight") or answer to a question ("whale") can be manipulated by steering with a vector at the end of the preceding line, affecting the generation of intermediate tokens leading up to the rhyme or answer word.
We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters.
Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs.
More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.
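As a rough illustration of the steering intervention mentioned above, a residual-stream vector can be added at one layer and position via a forward hook. The LLaMA-style module path `model.model.layers[i]`, the layer index, position, and strength are assumptions; the paper's exact steering protocol may differ.

```python
import torch

def add_steering_hook(model, layer_idx, steering_vec, alpha=4.0, position=-1):
    """Add `alpha * steering_vec` to the residual stream at one token position of one
    decoder layer. Assumes a Hugging Face LLaMA-style module tree where decoder layers
    return a tuple whose first element is the hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, position, :] += alpha * steering_vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=20, steering_vec=rhyme_direction)
# ... generation now tends toward the steered rhyme or answer ...
# handle.remove()
```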
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2820. Robust Reward Modeling via Causal Rubrics
๐ฌ 22
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reward Modeling Reward Hacking Alignment Post training LLMs RLHF
Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch on to superficial or spurious attributes, such as response length or formatting, mistaking these cues learned from correlations in training data for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce CROME (Causally Robust Reward Modeling), a novel framework inspired by an explicit causal model designed to mitigate reward hacking. CROME queries an oracle LLM for rubrics that are (or the oracle deems to be) causally relevant to answering a specific prompt. Then, it employs the following synthetic targeted augmentations during training: (1) Causal Augmentations, which are pairs that differ along specific causal attributes (a subset of the oracle-identified rubrics), to enforce sensitivity along each causal attribute individually, and (2) Neutral Augmentations, which are tie-label pairs varying primarily in spurious attributes, to enforce invariance along spurious attributes. Notably, our neutral augmentations are produced without any knowledge of unknown spurious factors, via question swapping and response interventions only along causal rubrics. We show that the CROME augmentation strategy using rubrics from popular LLM APIs significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.3% and achieving gains of up to 7.1% and 12.4% in reasoning and safety. The robustness of CROME is further evidenced by significant gains in DPO-aligned policies and Best-of-N alignment across various benchmarks, including AlpacaEval 2.0, RewardBench, safety-focused WildGuardTest, and the reasoning-specific GSM8k.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2821. Online Inventory Optimization in Non-Stationary Environment
๐ฌ 3
โญ 5.33
๐ learning theory
๐ท๏ธ Inventory Optimization Online Learning Online Convex Optimization Dynamic Regret
This paper addresses online inventory optimization (OIO), an extension of online convex optimization. OIO is a sequential decision-making process in inventory management cycles consisting of order arrival, stock consumption, and new order placement. One key challenge in OIO is managing demand fluctuations. However, most existing algorithms still cannot sufficiently handle this because they focus on a static regret guarantee, comparing their performance to a fixed order-up-to level strategy. In non-stationary environments, such a static comparator is unsuitable due to demand fluctuations. In this paper, we propose an algorithm with a near-optimal dynamic regret guarantee for OIO. Our algorithm also improves the static regret upper bound of existing studies by a factor of $\sqrt{L_{\max}}$. Here, $L_{\max}$ refers to the maximum sell-out period. Our algorithm employs a simple two-stage projection strategy, through which we prove that OIO is connected to smoothed online convex optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2822. Divide and Abstract: Autoformalization via Decomposition and Abstraction Learning
๐ฌ 13
โญ 5.33
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Autoformalization Formal Mathematics AI for Math Neurosymbolic AI LLM Large Language Models Formal Theorem Proving Neural Theorem Proving
Autoformalization, the task of translating informal mathematics into formal machine-verifiable languages, has long been challenging, even for individual statements.
Beyond the statement level, mathematicians invest substantial effort in formalizing reusable abstractions such as common definitions and relations, based on which a large corpus of statements can be easily formalized.
While previous work focuses on fine-tuning models for single statement autoformalization, we introduce $\textit{Divide and Abstract (DNA)}$, an end-to-end framework that not only improves the models' test-time performance at formalizing a corpus of statements, but also learns a library of reusable formal abstractions, which scales to statements outside of the corpus.
First, $\textit{DNA}$ extracts common mathematical concepts from the entire informal corpus and formalizes them as reusable abstractions.
Conditioned on these learned abstractions, $\textit{DNA}$ decomposes each informal statement in the corpus into a structured collection of informal clauses, translates each clause into its formal correspondents, composes the formal clauses back together, and refines the final formalization given feedback from a symbolic validator.
The entire framework requires zero training and thus scales to any formal language, particularly low-resource Domain-Specific Languages (DSL).
$\textit{DNA}$ significantly improves performance by up to $\textbf{8.6}\times$, and advances the SOTA by $\textbf{57.8}$\% from $\textbf{40.8}$ to $\textbf{64.4}$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2823. Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring
๐ฌ 8
โญ 5.33
๐ learning on time series and dynamical systems
๐ท๏ธ time series anomaly detection; conformal prediction; anomaly detection; monitoring sequential signals
We propose a post-hoc adaptive conformal anomaly detection method for monitoring time series that leverages predictions from pre-trained foundation models without requiring additional fine-tuning. Our method yields an anomaly score directly interpretable as a false alarm rate (p-value), facilitating transparent and actionable decision-making. It employs weighted quantile conformal prediction bounds and adaptively learns optimal weighting parameters from past predictions, enabling calibration under distribution shifts and stable false alarm control, while preserving out-of-sample guarantees. As a model-agnostic solution, it integrates seamlessly with foundation models and supports rapid deployment in resource-constrained environments. This approach addresses key industrial challenges such as limited data availability, lack of training expertise, and the need for immediate inference, while taking advantage of the growing accessibility of time series foundation models. Experiments on both synthetic and real-world datasets show that the proposed approach delivers strong performance, combining simplicity, interpretability, robustness, and adaptivity.
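A minimal sketch of the weighted conformal p-value idea underlying this kind of monitor: the anomaly score of a new residual is the weighted fraction of calibration residuals at least as extreme. The recency-decay weights below are an illustrative choice; the paper's adaptively learned weighting is its own contribution and is not reproduced here.

```python
import numpy as np

def weighted_conformal_pvalue(new_score, calib_scores, calib_weights, new_weight=1.0):
    """Weighted conformal p-value: the weighted fraction of calibration nonconformity
    scores at least as extreme as the new one. A small p-value flags an anomaly at a
    controlled false-alarm rate."""
    calib_scores = np.asarray(calib_scores, dtype=float)
    w = np.asarray(calib_weights, dtype=float)
    num = np.sum(w * (calib_scores >= new_score)) + new_weight
    return num / (w.sum() + new_weight)

# Example: residuals |y - forecast| from a frozen time-series foundation model,
# scored with simple recency weights (an assumed weighting, not the learned one).
calib = np.abs(np.random.randn(500))
weights = 0.99 ** np.arange(500)[::-1]
print(weighted_conformal_pvalue(new_score=3.2, calib_scores=calib, calib_weights=weights))
```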
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2824. OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
๐ฌ 13
โญ 5.33
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Conditioned Neural Fields Multimodal Learning Spatiotemporal Learning Scientific Data Neural Fields
Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2825. MULTIMODALITY AS SUPERVISION: SELF-SUPERVISED SPECIALIZATION TO THE TEST ENVIRONMENT VIA MULTIMODALITY
๐ฌ 16
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ specialization multimodal transfer learning
The common approach for developing a vision model is generalism, which involves training on a large diverse dataset to cover the varied deployment environments and leads to a model that is expected to solve the problem everywhere. However, many practical applications need to operate in a specific test space, e.g., a robot deployed in a single house, and do not necessarily need to generalize to novel environments. In this work, we explore whether we can use rich multimodal data only from the test environment to pre-train a representation in a self-supervised way, without access to any external data.
We find that this approach can match and, in most cases, outperform generalists pre-trained on large-scale Internet datasets, including popular off-the-shelf models, CLIP and DINOv2. We study the effectiveness of this approach by evaluating the models on various datasets and downstream tasks, such as semantic segmentation, captioning, and object detection, as well as a set of ablations and analyses to extract insights. This approach raises intriguing points on substituting data with (multi)modality, enabling an alternative scenario where the need for external Internet-scale datasets for pre-training models is reduced. It also shows that merely benefiting from test-space data was insufficient for achieving competitive results, and multimodality was essential for that purpose.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2826. VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models
๐ฌ 12
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Meta-Learning Test-Time Adaptation Value Function Estimation Vision-Language Models
Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2827. FSPO: Few-Shot Optimization of Synthetic Preferences Effectively Personalizes to Real Users
๐ฌ 13
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Personalization Synthetic Data Meta-Learning Preference Optimization
Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context capabilities of LLMs, we propose few-shot preference optimization (FSPO), an algorithm for LLM personalization that reframes reward modeling as a meta-learning problem. Under FSPO, an LLM learns to quickly infer a personalized reward function for a user via a few labeled preferences. FSPO also utilizes user description rationalization (RAT) to encourage better reward modeling and instruction following, recovering performance with the oracle user description. Since real-world preference data is challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. To successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, education, and open-ended question answering. We also run a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval win-rate in generating responses that are personalized to synthetic users and a 70% win-rate with real human users in open-ended question answering.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2828. Diversified Multinomial Logit Contextual Bandits
๐ฌ 13
โญ 5.33
๐ learning theory
๐ท๏ธ multinomial logistic choice model contextual bandits diversity regret analysis
Conventional (contextual) MNL bandits model relevance-driven choice but ignore the potential benefit of within-assortment diversity, while submodular/combinatorial bandits encode diversity in rewards but lack structured choice probabilities. We bridge this gap with the *diversified multinomial logit* (DMNL) contextual bandit, which augments MNL choice probabilities with a generally submodular diversity function, formalizing the relevance--diversity relation in one model.
Embedding diversity renders exact MNL assortment optimization intractable. We develop a *white-box* UCB-based algorithm, $\texttt{OFU-DMNL}$ that builds assortments item-wise by maximizing optimistic marginal gains, avoids black-box oracles, and provides end-to-end guarantees.
We show that $\texttt{OFU-DMNL}$ achieves at least a $(1-\tfrac{1}{e+1})$-*approximate* regret bound $\tilde{O}\big(\tfrac{\sqrt{K}(d+1)}{K+1} \sqrt{T}\big)$, where $d$ is the context dimension, $K$ the maximum assortment size, and $T$ the horizon, and attains an improved approximation factor over standard submodular baselines. Experiments show consistent gains and, versus exhaustive enumeration, comparable regret with substantially lower runtime. DMNL bandits serve as a principled and practical basis for diversity-aware assortment optimization under uncertainty, and our proposed algorithm offers a statistically and computationally efficient solution.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2829. Privacy-Protected Causal Survival Analysis Under Distribution Shift
๐ฌ 18
โญ 5.33
๐ causal reasoning
๐ท๏ธ Time-to-event outcome Conditional distribution shifts Semiparametric efficiency theory Federated learning
Causal inference across multiple data sources can improve the generalizability and reproducibility of scientific findings. However, for time-to-event outcomes, data integration methods remain underdeveloped, especially when populations are heterogeneous and privacy constraints prevent direct data pooling. We propose a federated learning method for estimating target site-specific causal effects in multi-source survival settings. Our approach dynamically re-weights source contributions to correct for distributional shifts, while preserving privacy. Leveraging semiparametric efficiency theory, data-adaptive weighting and flexible machine learning, the method achieves both double robustness and efficiency improvement. Through simulations and two real data applications: (i) multi-site randomized trials of monoclonal antibodies for HIV-1 prevention among cisgender men and transgender persons in the United States, Brazil, Peru, and Switzerland, as well as women in sub-Saharan Africa, and (ii) an analysis of sex disparities across biomarker groups for all-cause mortality using the "flchain" dataset, we demonstrate the validity, efficiency gains, and practical utility of the approach. Our findings highlight the promise of federated methods for efficient, privacy-preserving causal survival analysis under distribution shift.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2830. ShapeGen4D: Towards High Quality 4D Shape Generation from Videos
๐ฌ 15
โญ 5.33
๐ generative models
๐ท๏ธ 4D reconstruction generative model
Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video.
In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video.
Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation;
(ii) a time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture;
and (iii) noise sharing across frames to enhance temporal stability.
Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2831. Distributionally Robust Cooperative Multi-agent Reinforcement Learning with Value Factorization
๐ฌ 12
โญ 5.33
๐ reinforcement learning
๐ท๏ธ distributionally robust RL cooperative multi-agent RL; Centralized training decentralized execution
Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, the reliability of this recipe in real-world settings remains uncertain due to environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2832. Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs
๐ฌ 10
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability feature discovery MLPs
Fine-tuned Large Language Models (LLMs) encode rich task-specific features, but the form of these representations, especially within MLP layers, remains unclear. Empirical inspection of LoRA updates shows that new features concentrate in mid-layer MLPs, yet the scale of these layers obscures meaningful structure. Prior probing suggests that statistical priors may strengthen, split, or vanish across depth, motivating the need to study how neurons work together rather than in isolation. We introduce a mechanistic interpretability framework based on coalitional game theory, where neurons mimic agents in a hedonic game whose preferences capture their synergistic contributions to layer-local computations. Using top-responsive utilities and the PAC-Top-Cover algorithm, we extract stable coalitions of neurons (groups whose joint ablation has non-additive effects) and track their transitions across layers as persistence, splitting, merging, or disappearance. Applied to LLaMA, Mistral, and Pythia rerankers fine-tuned on scalar IR tasks, our method finds coalitions with consistently higher synergy than clustering baselines. By revealing how neurons cooperate to encode features, hedonic coalitions uncover higher-order structure beyond disentanglement and yield computational units that are functionally important, interpretable, and predictive across domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2833. The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
๐ฌ 11
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Research Automation Scientific Discovery
Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2834. Adaptive Logit Adjustment for Debiasing Multimodal Language Models
๐ฌ 13
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Multimodal Model Fairness Image-to-Text Logit Adjustment
Vision-Language Models (VLMs) and Large Multimodal Models (LMMs) have significantly advanced image-to-text generation tasks such as image captioning and visual question answering (VQA).
However, these models often exhibit biases, including attribute misalignment between the generated text and the input image, or the reinforcement of harmful stereotypes.
Existing debiasing techniques primarily focus on modifying representations at the encoder or decoder level, which can degrade model performance and may be susceptible to bias reintroduction from external sources. In this work, we propose **Adaptive Logit Adjustment (ALA) for Bias Alignment and Neutralization**, a post-hoc debiasing method that operates directly on logits during autoregressive text generation. Unlike prior approaches that modify internal representations, ALA selectively adjusts token probabilities to mitigate biases without distorting essential model outputs. Our approach leverages external classifiers to measure bias misalignment between image and text, applies gradient-based importance analysis to identify bias-inducing tokens, and dynamically refines token probabilities to reduce undesired biases.
We evaluate ALA on image captioning and various VQA tasks, demonstrating its effectiveness in mitigating bias while maintaining contextual accuracy. Notably, our approach applies to various multimodal architectures in a model-agnostic manner, including VLMs and LMMs, across different tasks that involve autoregressive text generation. Our results show that logit-based debiasing offers a flexible and efficient alternative to existing encoder- and embedding-centric approaches, providing a more practical solution for building fairer multimodal AI systems.
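A schematic of post-hoc logit adjustment at decoding time, in the spirit of the approach described above: shift only the logits of tokens flagged as bias-inducing and leave the rest of the distribution intact. How those tokens and their adjustment magnitudes are obtained (external classifiers plus gradient-based importance) is the method's core and is only stubbed here; the helper interface is an assumption.

```python
import torch

def adjust_logits(logits, bias_token_ids, penalties):
    """Selectively shift the logits of bias-inducing tokens before sampling, leaving
    all other token probabilities untouched. `bias_token_ids` and `penalties` would
    come from the classifier + gradient-importance analysis described above (stubbed)."""
    adjusted = logits.clone()
    ids = torch.tensor(bias_token_ids, device=logits.device)
    adjusted[..., ids] -= torch.tensor(penalties, device=logits.device, dtype=logits.dtype)
    return adjusted

# Inside an autoregressive loop:
#   logits = model_step(...)
#   next_token = adjust_logits(logits, bias_ids, bias_penalties).argmax(-1)
```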
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2835. Concept-TRAK: Understanding how diffusion models learn concepts through concept attribution
๐ฌ 9
โญ 5.33
๐ generative models
๐ท๏ธ Diffusion models Data attribution Concept
While diffusion models excel at image generation, their growing adoption raises critical concerns about copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that are of primary concern to stakeholders. To address this gap, we introduce _concept-level attribution_ through a novel method called _Concept-TRAK_, which extends influence functions with a key innovation: specialized training and utility loss functions designed to isolate concept-specific influences rather than overall reconstruction quality. We evaluate Concept-TRAK on novel concept attribution benchmarks using Synthetic and CelebA-HQ datasets, as well as the established AbC benchmark, showing substantial improvements over prior methods in concept-level attribution scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2836. GraphPlanner: Graph-Based Agentic Routing for LLMs
๐ฌ 15
โญ 5.33
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Agentic LLM Memory utilization Heterogeneous agents
LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable.
To address this gap, we propose GraphPlanner, a heterogeneous graph-based agentic router that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role (Planner, Executor, Summarizer).
By leveraging a heterogeneous graph, denoted as GARNet, to capture interactions among queries, agents, and responses, GraphPlanner integrates historical and contextual information into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency.
We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single- and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical interactions, supporting both inductive and transductive inference for more adaptive routing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2837. Enabling arbitrary inference in spatio-temporal dynamic systems: A physics-inspired perspective
๐ฌ 12
โญ 5.33
๐ learning on time series and dynamical systems
๐ท๏ธ Neural operators Spatio-temporal systems Graph neural networks Data mining
Modern spatio-temporal learning techniques usually exploit sampled discrete observations to foresee the future. In reality, spatio-temporal dynamics are continuous and evolve at every time and location, so modeling spatio-temporal dynamics in a continuous space is a long-standing challenge. Existing deep learning architectures often fail to generalize to unseen regions or graph topologies, while many physics-driven approaches are confined to Euclidean grids and scale poorly to complex graph structures. To address this gap, we propose PhySTA, a physics-inspired spatio-temporal learning framework designed for efficient and scalable arbitrary inference over graph-structured data. PhySTA integrates two key modules: (1) Continuous Operator-based Spectrum-Temporal Learning (CoSTL), which leverages a Graph-Time Fourier Neural Operator combined with Time-Gated Spectral Segmentation Perception to model continuous dynamics in operator space, and (2) Adaptive Multi-scale Interaction (AMI), which constructs multi-scale subgraphs and introduces node-edge coupled convolution to capture discrete interaction patterns and refine continuous predictions. By bridging operator learning with node-edge-graph interaction, PhySTA achieves both continuity-aware dynamic modeling and hierarchical interactive refinement. Extensive experiments across large-scale benchmarks demonstrate that PhySTA attains state-of-the-art accuracy while reducing computation cost and lowering parameter overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2838. Propaganda AI: An Analysis of Semantic Divergence in Large Language Models
๐ฌ 22
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models (LLMs) LLM Security Semantic Divergence Semantic Inconsistency Black-box Auditing
Large language models (LLMs) can exhibit *concept-conditioned semantic divergence*: common high-level cues (e.g., ideologies, public figures) elicit unusually uniform, stance-like responses that evade token-trigger audits. This behavior falls in a blind spot of current safety evaluations, yet carries major societal stakes, as such concept cues can steer content exposure at scale. We formalize this phenomenon and present **RAVEN** (**R**esponse **A**nomaly **V**igilance), a black-box audit that flags cases where a model is simultaneously highly certain and atypical among peers by coupling *semantic entropy* over paraphrastic samples with *cross-model disagreement*. In a controlled LoRA fine-tuning study, we implant a concept-conditioned stance using a small biased corpus, demonstrating feasibility without rare token triggers. Auditing five LLM families across twelve sensitive topics (360 prompts per model) and clustering via bidirectional entailment, RAVEN surfaces recurrent, model-specific divergences in 9/12 topics. Concept-level audits complement token-level defenses and provide a practical early-warning signal for release evaluation and post-deployment monitoring against propaganda-like influence.
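The audit signal described above combines two quantities: semantic entropy over entailment clusters of paraphrase-sampled responses, and disagreement with peer models. The sketch below illustrates that coupling; the `entails` checker, the peer-comparison rule, and the thresholds are placeholders, not RAVEN's exact procedure.

```python
import math

def semantic_entropy(responses, entails):
    """Cluster responses by bidirectional entailment and return the entropy over
    cluster frequencies; low entropy indicates an unusually uniform, stance-like
    output. `entails(a, b)` is a placeholder for an NLI-style entailment checker."""
    clusters = []
    for r in responses:
        for c in clusters:
            if entails(r, c[0]) and entails(c[0], r):
                c.append(r)
                break
        else:
            clusters.append([r])
    n = len(responses)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)

def raven_flag(responses, peer_responses, entails, entropy_thr=0.5, disagree_thr=0.5):
    """Flag concept-conditioned divergence: the model is highly certain (low semantic
    entropy) yet atypical among peers (high cross-model disagreement). The use of the
    first response as the comparison point is a simplification."""
    certain = semantic_entropy(responses, entails) < entropy_thr
    disagreement = sum(not entails(p, responses[0]) for p in peer_responses) / len(peer_responses)
    return certain and disagreement > disagree_thr
```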
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2839. Score-based Greedy Search for Structure Identification of Partially Observed Linear Causal Models
๐ฌ 13
โญ 5.33
๐ causal reasoning
๐ท๏ธ Causal Discovery Latent Variable
Identifying the structure of a partially observed causal system is essential to various scientific fields. Recent advances have focused on constraint-based causal discovery to solve this problem, yet in practice these methods often face challenges related to multiple testing and error propagation. These issues could be mitigated by a score-based method, and it has therefore attracted great attention whether there exists a score-based greedy search method that can handle the partially observed scenario. In this work, we propose the first score-based greedy search method for identifying structure involving latent variables with identifiability guarantees. Specifically, we propose the Generalized N Factor Model and establish global consistency: the true structure, including latent variables, can be identified up to the Markov equivalence class using the score. We then design Latent variable Greedy Equivalence Search (LGES), a greedy search algorithm for this class of models with well-defined operators, which searches very efficiently over the graph space to find the optimal structure. Our experiments on both synthetic and real-life data validate the effectiveness of our method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2840. A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning
๐ฌ 24
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-hop QA; Reinforcement Learning; GRPO; Large Language Model; LLM agent
Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue.
In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers.
Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4$% across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2$%). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems.
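A minimal sketch of an answer-set F1 reward of the kind named above ($\mathrm{AnsF1}$), which credits a rollout for recovering all acceptable answers of an ambiguous question rather than a single gold string. The lowercasing-only normalization is an assumption; the paper's matching rules may differ.

```python
def ans_f1(predicted, gold):
    """Set-level F1 between predicted answers and the set of valid gold answers."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

print(ans_f1(["Paris", "Lyon"], ["paris", "marseille"]))  # 0.5
```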
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2841. Tell me Habibi, is it Real or Fake?
๐ฌ 14
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Deepfakes multilingual multimodal code-switching
Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2842. REAP the Experts: Why Pruning Prevails for One-Shot MoE compression
๐ฌ 15
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ mixture-of-experts moe compression expert pruning expert merging merging pruning LLM evaluation
Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert *merging* on discriminative benchmarks, we find that expert *pruning* is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
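The stated criterion, combining router gate-values with expert activation norms, can be sketched as a per-expert saliency score averaged over calibration tokens, with the lowest-saliency experts pruned. The exact form of REAP's estimator and its reconstruction-error bound are in the paper; treat the formula below as an approximation of the stated idea, and note that real MoE layers only evaluate the top-k routed experts per token.

```python
import numpy as np

def expert_saliency(gate_values, expert_outputs):
    """Router-weighted expert activation saliency: average over tokens of gate value
    times expert output norm. `gate_values`: (tokens, experts); `expert_outputs`:
    (tokens, experts, hidden)."""
    norms = np.linalg.norm(expert_outputs, axis=-1)  # (tokens, experts)
    return (gate_values * norms).mean(axis=0)        # (experts,)

def prune_experts(gate_values, expert_outputs, keep_ratio=0.5):
    """Return indices of the experts to keep after pruning the least salient ones."""
    s = expert_saliency(gate_values, expert_outputs)
    n_keep = max(1, int(round(keep_ratio * s.shape[0])))
    return np.argsort(s)[::-1][:n_keep]
```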
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2843. TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
๐ฌ 13
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ speech synthesis distributional analysis objective evaluation
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of the 16 compared metrics to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for recreating a multilingual test dataset to avoid data leakage; and a benchmark for TTS in 14 languages.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2844. SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion
๐ฌ 10
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ watermarking video generation
Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly.
Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety.
Beyond post-processing watermarks which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models.
However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale.
Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak.
To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion.
To achieve blind extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking.
To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance.
Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2845. Evidence for Limited Metacognition in LLMs
๐ฌ 8
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Metacognition Evaluations AI Safety Self-Awareness Consciousness Model Welfare
The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2846. Softmax is not Enough (for Adaptive Conformal Classification)
๐ฌ 17
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Conformal Prediction Energy-based Models Uncertainty Estimation
The merit of Conformal Prediction (CP), as a distribution-free framework for uncertainty quantification, depends on generating prediction sets that are efficient, reflected in small average set sizes, while adaptive, meaning they signal uncertainty by varying in size according to input difficulty. A central limitation for deep conformal classifiers is that the nonconformity scores are derived from softmax outputs, which can be unreliable indicators of how certain the model truly is about a given input, sometimes leading to overconfident misclassifications or undue hesitation. In this work, we argue that this unreliability can be inherited by the prediction sets generated by CP, limiting their capacity for adaptiveness. We propose a new approach that leverages information from the pre-softmax logit space, using the Helmholtz Free Energy as a measure of model uncertainty and sample difficulty. By reweighting nonconformity scores with a monotonic transformation of the energy score of each sample, we improve their sensitivity to input difficulty. Our experiments with four state-of-the-art score functions on multiple datasets and deep architectures show that this energy-based enhancement improves the adaptiveness of the prediction sets, leading to a notable increase in both efficiency and adaptiveness compared to baseline nonconformity scores, without introducing any post-hoc complexity.
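A minimal sketch of the ingredients described above: compute the Helmholtz free energy from the pre-softmax logits and apply a monotonic reweighting to a baseline nonconformity score so that harder samples receive larger scores (and hence larger prediction sets). The exponential transform and the `gamma` parameter are illustrative assumptions, not the paper's exact reweighting.

```python
import numpy as np
from scipy.special import logsumexp

def free_energy(logits, temperature=1.0):
    """Helmholtz free energy of the logits: lower (more negative) for confident
    samples, higher for uncertain ones."""
    return -temperature * logsumexp(np.asarray(logits) / temperature, axis=-1)

def energy_weighted_score(base_score, logits, temperature=1.0, gamma=1.0):
    """Reweight a baseline nonconformity score (e.g., APS/RAPS) with a monotonic
    transform of the sample's energy, increasing scores for difficult inputs."""
    return base_score * np.exp(gamma * free_energy(logits, temperature))
```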
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2847. UNDERSTANDING TRANSFORMERS FOR TIME SERIES FORECASTING: A CASE STUDY ON MOIRAI
๐ฌ 10
โญ 5.33
๐ learning on time series and dynamical systems
๐ท๏ธ time series forecasting
We give a comprehensive theoretical analysis of transformers as time series prediction models, with a focus on MOIRAI (Woo et al., 2024). We study its approximation and generalization capabilities.
First, we demonstrate that there exist transformers that fit an autoregressive model on input univariate time series via gradient descent. We then analyze MOIRAI, one of the state-of-the-art multivariate time series prediction models capable of modeling an arbitrary number of covariates.
We prove that MOIRAI is capable of automatically fitting autoregressive models with an arbitrary number of covariates, offering insights into its design and empirical success. For generalization, we establish learning bounds for pretraining when the data satisfies Dobrushin's condition. Experiments support our theoretical findings, highlighting the efficacy of using transformers for time series forecasting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2848. Multi-agent Coordination via Flow Matching
๐ฌ 16
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Multi-agent Reinforcement Learning
This work presents MAC-Flow, a simple yet expressive framework for multi-agent coordination. We argue that requirements of effective coordination are twofold: *(i)* a rich representation of the diverse joint behaviors present in offline data and *(ii)* the ability to act efficiently in real time. However, prior approaches often sacrifice one for the other, *i.e.*, denoising diffusion-based solutions capture complex coordination but are computationally slow, while Gaussian policy-based solutions are fast but brittle in handling multi-agent interaction. MAC-Flow addresses this trade-off by first learning a flow-based representation of joint behaviors, and then distilling it into decentralized one-step policies that preserve coordination while enabling fast execution. Across four different benchmarks, including $12$ environments and $34$ datasets, MAC-Flow alleviates the trade-off between performance and computational cost, specifically achieving about $\boldsymbol{\times14.5}$ faster inference compared to diffusion-based MARL methods, while maintaining good performance. At the same time, its inference speed is similar to that of prior Gaussian policy-based offline MARL methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2849. Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers
๐ฌ 12
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ In-Context Learning In-Weights Learning Transformers Evolutionary Biology Predictability Adaptation Learning Dynamics
The success of Transformers lies in their ability to improve inference through two complementary strategies: the permanent refinement of model parameters via _in-weight learning_ (IWL), and the ephemeral modulation of inferences via _in-context learning_ (ICL), which leverages contextual information maintained in the model's activations.
Evolutionary biology tells us that the predictability of the environment across timescales predicts the extent to which analogous strategies should be preferred. Genetic _evolution_ adapts to stable environmental features by gradually modifying the genotype over generations. Conversely, environmental volatility favors _plasticity_, which enables a single genotype to express different traits within a lifetime, provided there are reliable cues to guide the adaptation.
We operationalize these dimensions (environmental stability and cue reliability) in controlled task settings (sinusoid regression and Omniglot classification) to systematically characterize their influence on learning in Transformers.
We find that stable environments favor IWL, often exhibiting a sharp transition when conditions are static. Conversely, reliable cues favor ICL, particularly when the environment is volatile.
Furthermore, an analysis of learning dynamics reveals task-dependent transitions between strategies (ICL $\to$ IWL and vice versa). We demonstrate that these transitions are governed by the tension between (1) the asymptotic optimality of the strategy with respect to the environment, and (2) the optimization cost of acquiring that strategy, which depends on the task structure and the learner's inductive bias.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2850. LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?
๐ฌ 15
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ Multilingual Language Models Language Consistency Cross-lingual Transfer Interpretability Logit Lens Semantic Similarity Layerwise Fine-Tuning Output Space Control Model Analysis Language Control
Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control--the ability to respond in the intended language. We identify and characterize two key failure modes: the *multilingual transfer bottleneck* (correct language, incorrect task response) and the *language consistency bottleneck* (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks.
To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce *selective fine-tuning* of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98% language consistency across six languages while fine-tuning only 3–5% of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (e.g., $>98\%$ language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage *layer-localization of language control* for efficient multilingual adaptation.
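As a rough illustration of what fine-tuning only the final, language-controlling blocks could look like, here is a hedged PyTorch/Hugging Face sketch. The `model.layers.<i>` naming pattern matches Llama/Qwen-style checkpoints, and the choice of how many trailing blocks to unfreeze is an assumption rather than the paper's exact selection.

```python
def unfreeze_last_k_layers(model, k=4):
    # Freeze every parameter except those in the last k decoder blocks.
    num_layers = model.config.num_hidden_layers
    keep = tuple(f"model.layers.{i}." for i in range(num_layers - k, num_layers))
    trainable = 0
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(keep)
        trainable += p.numel() if p.requires_grad else 0
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable fraction: {trainable / total:.1%}")
    return model

# Hypothetical usage (model id and k are assumptions, not the paper's setup):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-32B")
# unfreeze_last_k_layers(model, k=4)  # then run any standard fine-tuning loop
```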
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2851. Detecting Data Contamination in LLMs via In-Context Learning
๐ฌ 15
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ LLM Contamination In-context learning
We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
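The core signal can be sketched in a few lines: score each item's answer log-likelihood with and without in-context examples, and treat a confidence drop under ICL as evidence of memorization. How CoDeC actually aggregates and calibrates this signal is not reproduced here; the mean difference below is only an illustrative stand-in.

```python
import numpy as np

def contamination_score(loglik_zero_shot, loglik_with_icl):
    # positive score: ICL *reduced* confidence on average -> memorization suspected
    delta = np.asarray(loglik_with_icl) - np.asarray(loglik_zero_shot)
    return -float(delta.mean())

# toy numbers: unseen data usually gains confidence from in-context examples,
# while memorized data can lose confidence once its patterns are disrupted
unseen = contamination_score([-2.1, -1.8, -2.4], [-1.5, -1.2, -1.9])
seen   = contamination_score([-0.4, -0.3, -0.5], [-0.9, -0.7, -1.1])
print(unseen < 0 < seen)  # True
```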
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2852. Knowledge Distillation for Large Language Models through Residual Learning
๐ฌ 6
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ knowledge distillation; large language models; residual learning; mixture-of-experts; cross-tokenizer knowledge distillation
Knowledge distillation has become a crucial technique to transfer the capacities of large language models (LLMs) to smaller, more efficient models for practical deployment. While recent work exploits rich information from intermediate states of the teacher model for more effective knowledge transfer, imperfect knowledge from the teacher can also mislead student learning, restricting the student's generalization capacity. In this work, we propose a two-stage distillation framework that is effective for diverse knowledge distillation scenarios. In the first stage, we pretrain projectors to extract and compress teacher knowledge into a low-dimensional vector space via self-reconstruction. In the second stage, we perform distillation with a hybrid objective that combines learning from the compressed teacher representations with standard supervised fine-tuning on ground-truth data. Our key innovation is residual learning for LLM distillation, where the student learns to make predictions based on the differential between its representations and projected states from the teacher. This approach encourages the student to further improve its representations beyond potentially erroneous teacher knowledge. For Mixture-of-Experts (MoE) teacher models, we further fuse the experts' outputs using a self-attention mechanism for better utilizing the teacher knowledge. Moreover, to support the cross-tokenizer distillation setting, where the teacher and student models have different vocabularies, we adopt a cross-model attention mechanism that eliminates the need for explicit token alignment rules. Experimental results show the superior performance of our proposed framework under both same- and cross-tokenizer settings, demonstrating the effectiveness in preserving teacher knowledge and improving student generalization capability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2853. Learning to Adapt: In-Context Learning Beyond Stationarity
๐ฌ 15
โญ 5.33
๐ optimization
๐ท๏ธ in-context learning gated linear attention non-stationary regression
Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, which overlook a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL under non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and rigorously characterize its advantages over standard linear attention in this dynamic setting. To model non-stationarity, we adopt a first-order autoregressive process and show that GLA achieves lower training and testing errors by adaptively modulating the influence of past inputs--effectively implementing a learnable recency bias. Our theoretical findings are further supported by empirical results, which validate the benefits of gating mechanisms in non-stationary ICL tasks.
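The recency-bias intuition can be illustrated without any attention mechanism at all: when the target weights drift according to a first-order autoregressive process, an exponentially recency-weighted least-squares fit of the prompt tracks the current weights better than an unweighted fit. The toy construction below is a stand-in for the effect attributed to gating, not the paper's GLA analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, a = 5, 200, 0.95
w = rng.normal(size=d)
X, y, w_hist = [], [], []
for t in range(T):
    w = a * w + np.sqrt(1 - a ** 2) * rng.normal(size=d)   # AR(1) drift of the target weights
    x = rng.normal(size=d)
    X.append(x)
    y.append(x @ w + 0.1 * rng.normal())
    w_hist.append(w.copy())
X, y = np.array(X), np.array(y)

def weighted_ridge(X, y, weights, lam=1e-3):
    Xw = X * weights[:, None]
    return np.linalg.solve(Xw.T @ X + lam * np.eye(X.shape[1]), Xw.T @ y)

uniform = weighted_ridge(X, y, np.ones(T))                  # no recency bias
recency = weighted_ridge(X, y, 0.9 ** np.arange(T)[::-1])   # exponential recency weighting
err_u = np.linalg.norm(uniform - w_hist[-1])
err_r = np.linalg.norm(recency - w_hist[-1])
print(err_u, err_r)   # the recency-weighted fit is typically far closer to the current weights
```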
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2854. Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning
๐ฌ 10
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Policy Gradients Large Language Models
Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30$\times$ improvement in sample efficiency over standard GRPO for LLM reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2855. jqBench: a benchmark for reading and editing JSON from natural language and/or examples
๐ฌ 8
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ JSON benchmark code generation nl-to-code programming-by-example
We introduce jqBench, a new benchmark for evaluating language models on JSON querying and transformation tasks, where the intent can be specified using natural language and/or examples.
While jqBench is mainly aimed at the `jq` tool, it can also be used to evaluate other programming languages that query and/or transform JSON.
Benchmarks are automatically created from two rich sources of data: Stack Overflow discussions (751 instances with instructions and examples, called jqStack) and the Spider dataset for SQL generation from natural language (893 instances with instructions and JSON Schema, called jqSpider).
We describe and analyze the automated pipeline for benchmark creation, and perform extensive baseline experiments on different models to analyze the complexity and failure modes.
Using implicit feedback, the best model (Claude Opus 4.1) scores 77% on the jqStack benchmarks and 81% on the jqSpider benchmarks.
Additionally, we show (1) that access to the documentation surprisingly does not help, (2) that `jq` performs comparably to Python, and (3) that automatic feedback (and therefore examples) is crucial.
Besides the final benchmarks, we release the intermediate artifacts from each generation step (including failed or invalid conversions) as well as an LLM-friendly version of the documentation, to facilitate further research on JSON querying and transformation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2856. Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization
๐ฌ 9
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Post Training Quantization Large Language Models Foundation Models Efficient Machine Learning
We introduce Qronos---a new post-training quantization algorithm that not only explicitly corrects errors due to both weight and activation quantization, but also corrects errors accumulated from previously quantized layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an equivalent formulation that significantly improves algorithmic efficiency; we use our discovery to reduce peak memory usage by $18\times$ on Llama3 8B, and our scaling analysis shows a speedup of up to $13.8\times$ for a single-layer microbenchmark. We demonstrate compatibility with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent language models in the Llama3 and Qwen3 families; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches to 4 bits or fewer.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2857. SumRA: Parameter Efficient Fine-tuning with Singular Value Decomposition and Summed Orthogonal Basis
๐ฌ 9
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ low rank adaptation automatic speech recognition model adaptation parameter efficient fine tuning
Parameter-efficient fine-tuning (PEFT) aims to adapt large pretrained speech models using fewer trainable parameters while maintaining performance. Low-Rank Adaptation (LoRA) achieves this by decomposing weight updates into two low-rank matrices, $A$ and $B$, such that $W'=W_0+BA$. Previous studies showed that freezing $A$ and only updating $B$ can reduce trainable parameters and achieve performance close to standard LoRA, where $A$ is initialized using the principal singular vectors of $W_0$ obtained via singular value decomposition (SVD). However, because $A$ is typically initialized with only the leading singular vectors, its representation capacity is confined to a narrow subspace of the model's knowledge. To overcome this limitation, we propose SumRA, which initializes each row of $A$ as a sum of multiple singular vectors chosen from beyond the leading components, thereby enabling $A$ to influence a larger portion of the model's knowledge space. Experiments on multilingual automatic speech recognition (ASR) tasks show that by adapting Whisper to five new languages from Common Voice with only 10 hours of data each, our method improves word error rate from 14.42\% to 12.41\% over LoRA baselines while using 50\% fewer trainable parameters.
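A small numpy sketch of the stated initialization: rather than filling the frozen matrix $A$ with the top-$r$ right singular vectors of $W_0$, each row of $A$ is built as a sum of several singular vectors taken from beyond the leading components. The particular grouping of indices is an assumption; the paper's selection rule may differ.

```python
import numpy as np

def sumra_init_A(W0, r=8, group=4):
    # rows of Vt are the right singular vectors of W0, ordered by singular value
    _, _, Vt = np.linalg.svd(W0, full_matrices=False)
    tail = Vt[r:]                                   # skip the leading components
    A = np.zeros((r, W0.shape[1]))
    for i in range(r):
        idx = np.arange(i * group, (i + 1) * group) % len(tail)
        A[i] = tail[idx].sum(axis=0)                # row = sum of several non-leading vectors
    return A

W0 = np.random.randn(256, 512)              # pretrained weight, shape (d_out, d_in)
A = sumra_init_A(W0)                         # frozen
B = np.zeros((W0.shape[0], A.shape[0]))      # trainable, zero-initialized as in LoRA
W_adapted = W0 + B @ A                       # W' = W0 + BA
```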
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2858. Nearly Space-Optimal Graph and Hypergraph Sparsification in Insertion-Only Data Streams
๐ฌ 11
โญ 5.33
๐ optimization
๐ท๏ธ streaming algorithms graph sparsification adversarial robustness
We study the problem of graph and hypergraph sparsification in insertion-only data streams. The input is a hypergraph $H=(V, E, w)$ with $n$ nodes, $m$ hyperedges, and rank $r$, and the goal is to compute a hypergraph $\widehat{H}$ that preserves the energy of each vector $x \in \mathbb{R}^n$ in $H$, up to a small multiplicative error. In this paper, we give a streaming algorithm that achieves a $(1+\varepsilon)$-approximation, using $\mathcal{O}\left(\frac{rn}{\varepsilon^2} \log^2 n \log r\right) \cdot$ poly $(\log \log m)$ bits of space, matching the sample complexity of the best known offline algorithm up to poly $(\log \log m)$ factors. Our approach also provides a streaming algorithm for graph sparsification that achieves a $(1+\varepsilon)$-approximation, using $\mathcal{O}\left(\frac{n}{\varepsilon^2} \log n\right)\cdot\text{poly}(\log\log n)$ bits of space, improving the current bound by $\log n$ factors. Furthermore, we give a space-efficient streaming algorithm for min-cut approximation. Along the way, we present an online algorithm for $(1+\varepsilon)$-hypergraph sparsification, which is optimal up to poly-logarithmic factors. Hence, we achieve $(1+\varepsilon)$-hypergraph sparsification in the sliding window model, with space optimal up to poly-logarithmic factors. Lastly, we give an adversarially robust algorithm for hypergraph sparsification using $\frac{n}{\varepsilon^2} \cdot $ poly $(r, \log n, \log r, \log \log m)$ bits of space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2859. Fluent Alignment with Disfluent Judges: Post-training for lower-resource languages
๐ฌ 9
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ language model post-training fluency low-resource languages RLAIF
We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2860. Enhancing Learning with Noisy Labels via Rockafellian Relaxation
๐ฌ 12
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Noisy Labels Loss-reweighting Neural Networks
Labeling errors in datasets are common, arising in a variety of contexts, such as human labeling and weak labeling. Although neural networks (NNs) can tolerate modest amounts of these errors, their performance degrades substantially once the label error rate exceeds a certain threshold. We propose the Rockafellian Relaxation Method (RRM) -- an architecture-independent, loss reweighting approach to enhance the capacity of neural network methods to accommodate noisy labeled data. More precisely, it functions as a wrapper, modifying any methodology's training loss, particularly the supervised component. Experiments indicate that RRM can increase accuracy across classification tasks in computer vision and natural language processing (sentiment analysis), and this improvement holds irrespective of dataset size, noise generation process (synthetic or human), data domain, and adversarial perturbation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2861. Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments
๐ฌ 12
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Group Equivariance Reinforcement Learning Robotics
Group symmetries provide a powerful inductive bias for reinforcement learning (RL), enabling efficient generalization across symmetric states and actions via group-invariant Markov Decision Processes (MDPs). However, real-world environments almost never realize fully group-invariant MDPs; dynamics, actuation limits, and reward design usually break symmetries, often only locally. Under group-invariant Bellman backups for such cases, local symmetry-breaking introduces errors that propagate across the entire state--action space, resulting in global value estimation errors.
To address this, we introduce Partially group-Invariant MDP (PI-MDP), which selectively applies group-invariant or standard Bellman backups depending on where symmetry holds. This framework mitigates error propagation from locally broken symmetries while maintaining the benefits of equivariance, thereby enhancing sample efficiency and generalizability.
Building on this framework, we present practical RL algorithms -- Partially Equivariant (PE)-DQN for discrete control and PE-SAC for continuous control -- that combine the benefits of equivariance with robustness to symmetry-breaking.
Experiments across Grid-World, locomotion, and manipulation benchmarks demonstrate that PE-DQN and PE-SAC significantly outperform baselines, highlighting the importance of selective symmetry exploitation for robust and sample-efficient RL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2862. Fresh in memory: Training-order recency is linearly encoded in language model activations
๐ฌ 7
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ Language models interpretability training dynamics representation learning memorization confidence knowledge awareness
We show that language models' activations linearly encode when information was learned during training. Our setup involves creating a model with a known training order by sequentially fine-tuning Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples corresponding to the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can accurately (~90%) distinguish "early" vs. "late" entities, generalizing to entities unseen during the probes' own training. The model can also be fine-tuned to explicitly report an unseen entity's training stage (~80% accuracy). Notably, the training-order encoding does not seem attributable to simple differences in activation magnitudes, losses, or model confidence. Our paper demonstrates that models are capable of differentiating information by its acquisition time, and carries significant implications for how they might manage conflicting data and respond to knowledge modifications.
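The "early vs. late" probe is a plain linear classifier on activations. The sketch below reproduces only the shape of the experiment on synthetic data (a random "training-order" direction added to random activations); the actual study probes Llama-3.2-1B activations for entities from six sequential fine-tuning stages.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d, n = 64, 1200
order_direction = rng.normal(size=d)                 # pretend "training order" direction
stages = rng.integers(0, 6, size=n)                  # which of six datasets each entity came from
acts = rng.normal(size=(n, d)) + 0.3 * np.outer(stages - 2.5, order_direction)
labels = (stages >= 3).astype(int)                   # "late" (1) vs "early" (0)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # high only if order is linearly encoded
```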
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2863. ChainGPT: Dual-Reasoning Model with Recurrent Depth and Multi-Rank State Updates
๐ฌ 17
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Latent Reasoning; Recurrent Depth; RWKV-Product; State-Guided Sparse Attention
Large language models, constrained by the fixed-depth Transformer architecture, struggle to solve complex reasoning tasks in an end-to-end manner. Existing approaches, such as Chain of Thought, improve reasoning depth to some extent but rely heavily on natural language generation, with computational costs increasing rapidly as the length of the generated sequence grows. To address these limitations, we propose ChainGPT, a dual-reasoning model that shifts reasoning into latent computational space. Within each layer, ChainGPT employs multi-substep state updates combined with state-guided sparse attention, enabling deep local computation and efficient long-range modeling without quadratic costs. Across layers, a recurrent-depth approach iteratively refines latent states, supported by adaptive training and stopping strategies that balance reasoning depth against computational budget. Theoretically, we show that ChainGPT can, in principle, simulate general computation, and empirically it delivers consistent improvements over comparable models, including on reasoning tasks that remain challenging for existing systems. By unifying efficiency and reasoning ability, ChainGPT provides a principled foundation for next-generation language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2864. Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI
๐ฌ 13
โญ 5.33
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ OOD Detection Scientific ML Neural Operators Diffusion Models Joint Likelihood Estimation Partial Differential Equations Fluid Dynamics Regression Segmentation Classification
Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2865. Homeostatic Adaptation of Optimal Population Codes under Metabolic Stress
๐ฌ 29
โญ 5.33
๐ applications to neuroscience & cognitive science
๐ท๏ธ Efficient neural codes neural computation metabolism
Information processing in neural populations is inherently constrained by metabolic resource limits and noise properties, with dynamics that are not accurately described by existing mathematical models. Recent data, for example, shows that neurons in mouse visual cortex go into a "low power mode" in which they maintain firing rate homeostasis while expending less energy. This adaptation leads to increased neuronal noise and tuning curve flattening in response to metabolic stress. We have developed a theoretical population coding framework that captures this behavior using two novel, surprisingly simple constraints: an approximation of firing rate homeostasis and an energy limit tied to noise levels via biophysical simulation. A key feature of our contribution is an energy budget model directly connecting adenosine triphosphate (ATP) use in cells to a fully explainable mathematical framework that generalizes existing optimal population codes. Specifically, our simulation provides an energy-dependent dispersed Poisson noise model, based on the assumption that the cell will follow an optimal decay path to produce the least-noisy spike rate that is possible at a given cellular energy budget. Each state along this optimal path is associated with properties (resting potential and leak conductance) which can be measured in electrophysiology experiments and have been shown to change under prolonged caloric deprivation. We analytically derive the optimal coding strategy for neurons under varying energy budgets and coding goals, and show how our method uniquely captures how populations of tuning curves adapt while maintaining homeostasis, as has been observed empirically.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2866. Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
๐ฌ 15
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Model Referring Expression Comprehension Visual Reference Token
Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM's output textual tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks show that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLM models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2867. Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
๐ฌ 8
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ LLM Planning Path Planning Reinforcement Learning
Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent Q-value bias in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2868. Map as a Prompt: Learning Multi-Modal Spatial-Signal Foundation Models for Cross-scenario Wireless Localization
๐ฌ 11
โญ 5.33
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Wireless Localization Foundation Models Self-Supervised Learning Fine-Tuning 6G Networks
Accurate and robust wireless localization is a critical enabler for emerging 5G/6G applications, including autonomous driving, extended reality, and smart manufacturing. Despite its importance, achieving precise localization across diverse environments remains challenging due to the complex nature of wireless signals and their sensitivity to environmental changes. Existing data-driven approaches often suffer from limited generalization capability, requiring extensive labeled data and struggling to adapt to new scenarios. To address these limitations, we propose SigMap, a multimodal foundation model that introduces two key innovations: (1) A cycle-adaptive masking strategy that dynamically adjusts masking patterns based on channel periodicity characteristics to learn robust wireless representations; (2) A novel "map-as-prompt" framework that integrates 3D geographic information through lightweight soft prompts for effective cross-scenario adaptation. Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple localization tasks while exhibiting strong zero-shot generalization in unseen environments, significantly outperforming both supervised and self-supervised baselines by considerable margins.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2869. DSSA: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation
๐ฌ 7
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Sparse attention Long-context Efficient algorithm
Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional pretrain-on-short, finetune-on-long workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce the Dense-Sparse Switchable Attention framework (DSSA), a trainable sparse attention that seamlessly adapts models from short to long sequences. Specifically, DSSA reuses dense attention parameters through parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, DSSA ensures computational efficiency across all sequence lengths, by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of DSSA that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that DSSA is $4\times$ faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. We will release all associated implementations to facilitate future research on efficient attention.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2870. Multi-objective Large Language Model Alignment with Hierarchical Experts
๐ฌ 16
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language model multi-objective mixture-of-expert model fusion
Aligning large language models (LLMs) to simultaneously satisfy multiple objectives remains a significant challenge, especially given the diverse and often conflicting nature of human preferences. Existing alignment methods struggle to balance trade-offs effectively, often requiring costly retraining or yielding suboptimal results across the Pareto frontier of preferences. In this paper, we introduce HoE (Hierarchical Mixture-of-Experts), a lightweight, parameter-efficient, and plug-and-play approach that eliminates the need for model retraining, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences. In particular, HoE consists of three hierarchical components: LoRA Experts, Router Experts and Weighting Router, reaching optimal Pareto frontiers and achieving a trade-off between parameter size, training cost, and performance. We evaluate HoE across various tasks on 16 objectives and 200 different preferences among 8 benchmarks, demonstrating superior performance over 15 recent baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2871. Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
๐ฌ 14
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ agent evaluation metric LLM agents error analysis
Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own method to determine task success, such as database lookups or regex matching, adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into agent performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals, such as tool signatures and responses, as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and the intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance, with peaks of 8-10% on our proposed metrics, after incorporating the identified error remedies into the agent's design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2872. Interactive Agents to Overcome Underspecificity in Software Engineering
๐ฌ 8
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Ambiguity Underspecificity SWE Agent Software Engineering Clarification Evaluation Interaction
AI agents are increasingly being deployed to automate tasks, often based on underspecified user instructions. Making unwarranted assumptions to compensate for the missing information and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle underspecified instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) detecting underspecificity, (b) asking targeted clarification questions, and (c) leveraging the interaction to improve performance in underspecified scenarios. Our findings reveal that models struggle to distinguish between well-specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user leading to significant improvements in performance, up to 74\% over the non-interactive settings, underscoring the value of effective interaction. Our study highlights critical gaps in how current state-of-the-art models handle missing information in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2873. Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
๐ฌ 14
โญ 5.33
๐ applications to neuroscience & cognitive science
๐ท๏ธ Eye Movements in Reading Multimodal Large Language Models Information Seeking Cognitive State Decoding
When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you wonder ``This sounds like science fiction. Does it actually work?''. More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded solely from eye movements in reading. To address this question, we introduce goal decoding tasks and evaluation frameworks using large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal text and eye movements LLMs for these tasks. Our experiments show considerable success on the task of selecting the correct goal among several options, and even progress towards free-form textual reconstruction of the precise goal formulation. These results open the door for further scientific investigation of goal driven reading, as well as the development of educational and assistive technologies that will rely on real-time decoding of reader goals from their eye movements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2874. Convergence Analysis of Tsetlin Machines for Basic Boolean Operators under Noise-Free and Noisy Training Conditions
๐ฌ 14
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Tsetlin Automata Propositional Logic Tsetlin Machine Convergence Analysis Basic Boolean Operators
The Tsetlin Machine (TM) is an innovative machine learning algorithm grounded in propositional logic, achieving state-of-the-art performance across a variety of pattern recognition tasks. Prior theoretical work has established convergence results for the 1-bit operator under both noisy and noise-free conditions, and for the 2-bit XOR operator under noise-free conditions. This paper first extends the analysis to the 2-bit AND and OR operators. We show that the TM converges almost surely to the correct 2-bit AND and OR operators under noise-free training, and we identify a distinctive property of the 2-bit OR case, where a single clause can jointly represent two sub-patterns, in contrast to the XOR operator. We further investigate noisy training scenarios, demonstrating that mislabelled samples prevent exact convergence but still permit efficient learning, whereas irrelevant variables do not impede almost-sure convergence. Building on the 2-bit analysis, we then generalize the results to the $k$-bit setting ($k>2$), providing a unified treatment applicable to general scenarios. Together, these findings provide a robust and comprehensive theoretical foundation for analyzing TM convergence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2875. Uncertainty-Aware Gaussian Map for Vision-Language Navigation
๐ฌ 18
โญ 5.33
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-Language Navigation Uncertainty Estimation 3D Value Map Gaussian Splatting
Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent's observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks (i.e., R2R, RxR, REVERIE) demonstrate the effectiveness of our agent. The code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2876. Agnostics: Learning to Synthesize Code in Any Programming Language with a Universal Reinforcement Learning Environment
๐ฌ 8
โญ 5.33
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ large language models program synthesis code generation reinforcement learning low-resource programming languages
Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement learning (RL) infrastructure.
We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment.
Applied to five low-resource languages (Lua, Julia, R, OCaml, and Fortran), Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B–70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, SmolLM3, Phi 4 Mini); and (3) for open-weight models with ≤16B parameters, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce.
We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
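The behavior-only verifier the pipeline relies on can be sketched as follows: a short per-language run command, stdin/stdout test cases, and a binary reward. The command templates here are illustrative assumptions and not the released Agnostics configurations.

```python
import subprocess

# per-language run commands (illustrative; the released configs may differ)
RUN = {
    "python": ["python3", "{src}"],
    "lua":    ["lua", "{src}"],
    "r":      ["Rscript", "{src}"],
}

def verify(language, src_path, io_tests, timeout=10):
    # reward 1.0 only if the program reproduces the expected stdout on every test
    cmd = [part.format(src=src_path) for part in RUN[language]]
    for stdin_text, expected_stdout in io_tests:
        try:
            run = subprocess.run(cmd, input=stdin_text, capture_output=True,
                                 text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return 0.0
        if run.returncode != 0 or run.stdout.strip() != expected_stdout.strip():
            return 0.0
    return 1.0

# e.g. verify("python", "solution.py", [("2 3\n", "5")]) -> 1.0 if solution.py adds the numbers
```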
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2877. Pretrain Value, Not Reward: Decoupled Value Policy Optimization
๐ฌ 19
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM;RLHF;Value Model
In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF).
In reinforcement learning, value estimation is the key to policy optimization, distinct from reward supervision.
The value function predicts the \emph{return-to-go} of a partial answer, that is, how promising the partial answer is if it were continued to completion.
In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals are available once preference data is collected.
This makes critic learning redundant, as the process of training a reward model and then deriving a value model is informationally equivalent to directly pretraining a value model.
Importantly, this requires no additional supervision, and our value model is trained on exactly the same data used for reward modeling.
Building on this insight, we introduce \emph{Decoupled Value Policy Optimization} (DVPO), a framework that pretrains a \emph{Global Value Model} (GVM) offline and freezes it as a universal critic for policy learning.
The GVM provides stable, fine-grained credit assignment without critic drift or trajectory sampling.
Experiments across MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches or surpasses state-of-the-art RLHF methods.
These results highlight that RLHF can be reframed as policy-only optimization guided by a single pretrained value model. The implementation code for our method is available in the supplementary ZIP file.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2878. Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
๐ฌ 17
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models reasoning reinforcement learning
Recent advancements in long chain-of-thought (CoT) reasoning, particularly through the Group Relative Policy Optimization algorithm used by DeepSeek-R1, have led to significant interest in the potential of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency. This paper systematically investigates the impact of RLVR on LLM reasoning. We revisit Pass@K experiments and demonstrate that RLVR can extend the reasoning boundary for both mathematical and coding tasks. This is supported by our introduction of a novel evaluation metric, CoT-Pass@K, which captures reasoning success by accounting for both the final answer and intermediate reasoning steps. Furthermore, we present a theoretical framework explaining RLVR's incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness. Our analysis of RLVR's training dynamics reveals that it incentivizes correct reasoning early in the process, with substantial improvements in reasoning quality confirmed through extensive evaluations. These findings provide strong evidence of RLVR's potential to enhance LLM reasoning, offering valuable insights into its mechanisms and performance improvements.
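One way to read CoT-Pass@K is as the familiar unbiased pass@k estimator applied with a stricter success count, where a sample only counts if both its final answer and its reasoning check out. Whether the paper uses exactly this combinatorial form is an assumption; the sketch below is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    # unbiased estimator: chance that at least one of k samples (out of n) is correct
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cot_pass_at_k(answer_ok, reasoning_ok, k):
    n = len(answer_ok)
    c = sum(a and r for a, r in zip(answer_ok, reasoning_ok))   # stricter success criterion
    return pass_at_k(n, c, k)

answer_ok    = [True, True, False, True, False, True]
reasoning_ok = [True, False, False, True, False, False]   # two answers were right for the wrong reasons
print(pass_at_k(6, sum(answer_ok), 3))               # 1.0  (answer-only view)
print(cot_pass_at_k(answer_ok, reasoning_ok, 3))     # 0.8  (reasoning-aware view)
```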
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2879. From Collapse to Control: Understanding and Extending Context Length in Emerging Hybrid Models via Universal Position Interpolation
๐ฌ 3
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Hybrid Models Mamba Transformer Long Context
Hybrid Mamba-Transformer models have emerged as promising alternatives to pure Transformers, offering efficiency and competitive performance. However, they struggle to generalize beyond their training context windows, collapsing on long-context tasks. We provide the first systematic analysis of this failure, showing that it arises from uncontrolled state growth and uneven receptive field contributions across the hybrid architecture. Guided by this understanding, we introduce Universal Position Interpolation (UPI), a closed-form, training-free scaling method that unifies Mamba's cumulative decay with Transformer rotary frequency scaling. UPI selectively stabilizes unstable Mamba dynamics while rescaling Transformer encodings, controlling state growth and enabling reliable long-context generalization, with only a few auxiliary forward passes. Evaluation shows that UPI extends multiple state-of-the-art hybrid and pure Mamba models from 4K to up to 64K tokens on PG-19 perplexity, LongBench and RULER benchmarks, without sacrificing short-context accuracy. These findings establish the first principled bridge between Transformers and state-space models and open a new direction for training-free context extension methods for emerging hybrid models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2880. TPDiff: Temporal Pyramid Video Diffusion Model
๐ฌ 6
โญ 5.33
๐ generative models
๐ท๏ธ Video Generation Diffusion model
The development of video diffusion models unveils a significant challenge: the substantial computational demands. To mitigate this challenge, we note that the reverse process of diffusion exhibits an inherent entropy-reducing nature. Given the inter-frame redundancy in the video modality, maintaining full frame rates in high-entropy stages is unnecessary. Based on this insight, we propose TPDiff, a unified framework to enhance training and inference efficiency. By dividing diffusion into several stages, our framework progressively increases frame rate along the diffusion process with only the last stage operating on full frame rate, thereby optimizing computational efficiency. To train the multi-stage diffusion model, we introduce a dedicated training framework: stage-wise diffusion. By solving the partitioned probability flow ordinary differential equations (ODE) of diffusion under aligned data and noise, our training strategy is applicable to various diffusion forms and further enhances training efficiency. Comprehensive experimental evaluations validate the generality of our method, demonstrating a 50% reduction in training cost and a 1.5x improvement in inference efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2881. Log Probability Tracking of LLM APIs
๐ฌ 10
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ API drift audit monitoring LLM API black-box
When using an LLM through an API provider, users expect the served model to remain consistent over time, a property crucial for the reliability of downstream applications and the reproducibility of research. Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. This means that model updates are left largely unmonitored in practice. In this work, we show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring of LLM APIs. We apply a simple statistical test based on the average value of each token logprob, requesting only a single token of output. This is enough to detect changes as small as one step of fine-tuning, making this approach more sensitive than existing methods while being 1,000x cheaper. We introduce the TinyChange benchmark as a way to measure the sensitivity of audit methods in the context of small, realistic model changes.
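The monitoring loop can be sketched simply: query the same prompt repeatedly for a single output token with logprobs, keep a reference window of the chosen token's logprob, and flag drift with a two-sample test. The specific statistic and thresholds used in the paper are not reproduced here; Welch's t-test is an illustrative choice.

```python
import numpy as np
from scipy import stats

def drift_detected(reference_logprobs, recent_logprobs, alpha=0.01):
    # each array: the chosen token's logprob across repeated single-token API calls
    _, p = stats.ttest_ind(reference_logprobs, recent_logprobs, equal_var=False)
    return p < alpha

rng = np.random.default_rng(0)
before = rng.normal(-0.52, 0.03, size=200)   # logprobs under the original model
after  = rng.normal(-0.55, 0.03, size=200)   # small shift, e.g. one fine-tuning step
print(drift_detected(before, after))         # True: even a tiny mean shift is detectable
```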
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2882. Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
๐ฌ 22
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Unpaired Multimodal Representation Learning Cross-modal Learning; Multimodal Learning from Unpaired Data
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on large paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary $\textit{unpaired}$ multimodal data to directly enhance representation learning in a $\textit{target}$ modality? We introduce $\textbf{UML}$: $\textbf{U}$npaired $\textbf{M}$ultimodal $\textbf{L}$earner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the world than unimodal training. Empirically, we show that incorporating unpaired data that share underlying semantic information from auxiliary modalities, such as text, audio, or images, consistently improves downstream performance across diverse unimodal targets such as image and audio.
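A minimal PyTorch sketch of the alternating-update idea: modality-specific input projections feed a shared trunk and head, and optimization steps alternate between unpaired image and text batches. The architecture, losses, and data here are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, hidden=256, n_classes=10):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, hidden)   # modality-specific projections
        self.txt_enc = nn.Linear(txt_dim, hidden)
        self.trunk = nn.Sequential(nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)    # parameters shared by both modalities

    def forward(self, x, modality):
        h = self.img_enc(x) if modality == "image" else self.txt_enc(x)
        return self.head(self.trunk(h))

model = SharedTrunk()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def toy_batch(dim):   # stand-in for real, unpaired data loaders
    return torch.randn(32, dim), torch.randint(0, 10, (32,))

for step in range(100):
    modality, dim = ("image", 512) if step % 2 == 0 else ("text", 300)
    x, y = toy_batch(dim)
    loss = loss_fn(model(x, modality), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```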
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2883. Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening
๐ฌ 10
โญ 5.33
๐ generative models
๐ท๏ธ Diffusion Generative Inbetweening Video Interpolation
Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesired visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. We thus propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. MPD alleviates the misalignment by reconstructing the denoised estimate of the backward path from the distilled forward motion residual. With our method, we can deliberately avoid denoising the end-conditioned path, which causes path ambiguity, and yield more temporally coherent inbetweening results with the forward motion prior. Our method can be applied to off-the-shelf inbetweening works without any modification of model parameters. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2884. ConvT3: Structured State Kernels for Convolutional State Space Models
๐ฌ 14
โญ 5.33
๐ learning on time series and dynamical systems
๐ท๏ธ Spatiotemporal modeling Video modeling Physical system modeling Tridiagonal Toeplitz tensor Long-range sequence modeling
Modeling long spatiotemporal sequences requires capturing both complex spatial correlations and temporal dependencies.
Convolutional State Space Models (ConvSSMs) have been proposed to incorporate spatial modeling in State Space Models (SSMs) using the convolution of tensor-valued states and kernels.
Yet, existing implementations remain limited to $1\times 1$ state kernels for computational feasibility, which limits the modeling capacity of ConvSSMs.
We introduce a novel spatiotemporal model, ConvT3 (ConvSSM using Tridiagonal Toeplitz Tensors), designed to equivalently realize ConvSSMs with extended $3\times 3$ state kernels.
ConvT3 structures the state kernel so that its corresponding tensor is composed of a structured SSM matrix over the hidden-state dimensions and a constrained tridiagonal Toeplitz tensor over the spatial dimensions.
We show that the structured tensor can be diagonalized, which enables efficient parallel training while leveraging $3\times 3$ state convolutions.
We demonstrate that ConvT3 effectively embeds rich spatial and temporal information into the dynamics of tensor-valued states, achieving state-of-the-art performance on most metrics in long-range video generation and physical system modeling.
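The diagonalization claim above rests on a classical linear-algebra fact: a tridiagonal Toeplitz matrix has closed-form eigenvalues. The sketch below only verifies that ingredient numerically; it is not the ConvT3 state-kernel parameterization itself, and the matrix size and entries are arbitrary.

```python
# Eigenvalues of an n x n tridiagonal Toeplitz matrix with diagonal a, super-diagonal b,
# sub-diagonal c: lambda_k = a + 2*sqrt(b*c)*cos(k*pi/(n+1)), k = 1..n.
import numpy as np

def tridiag_toeplitz(n, a, b, c):
    """n x n matrix with a on the diagonal, b above it, c below it."""
    return a * np.eye(n) + b * np.eye(n, k=1) + c * np.eye(n, k=-1)

n, a, b, c = 8, 2.0, 0.5, 0.5   # illustrative values with b*c > 0
T = tridiag_toeplitz(n, a, b, c)

k = np.arange(1, n + 1)
closed_form = a + 2.0 * np.sqrt(b * c) * np.cos(k * np.pi / (n + 1))

# The closed form matches the numerically computed spectrum.
assert np.allclose(np.sort(np.linalg.eigvals(T).real), np.sort(closed_form))
```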
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2885. Cat-PO: Cross-modal Adaptive Token-rewards for Preference Optimization in Truthful Multimodal LLMs
๐ฌ 13
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models; Preference Optimization
Multi-modal Large Language Models (MLLMs) have shown remarkable generative capabilities across multi-modal tasks, yet remain plagued by hallucinations where generated textual content is semantically inconsistent with the input images. This work reveals that existing multi-modal preference optimization methods exhibit shortcomings at the preference data decoding stage. Specifically, different response tokens exhibit varying degrees of association with visual content, and consequently, their contributions to reducing hallucinations and generating high-quality responses differ. Nevertheless, most existing methods do not distinguish between these tokens and often handle them uniformly. To address this challenge, we propose a novel preference alignment method: Cross-modal Adaptive Token-rewarded Preference Optimization (Cat-PO). Building upon direct preference optimization, Cat-PO calculates hierarchical visual relevance rewards for each response token at global, local, and semantic levels. It then organically integrates these three rewards to construct a smooth reward mechanism and designs an innovative KL-based customized loss for rewarded tokens, thereby enabling fine-grained correction of hallucinatory outputs. Extensive experiments on various base models and evaluation benchmarks demonstrate that our Cat-PO can significantly reduce hallucinations and align with human preferences to enhance the truthfulness of MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2886. Latent Planning Emerges with Scale
๐ฌ 12
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ planning feature circuits circuits mechanistic interpretability
LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define *latent planning* as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like *accountant*, and cause them to output *an* rather than *a*; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models' planning abilities grow with scale.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2887. Lightweight Spatio-Temporal Modeling via Temporally Shifted Distillation for Real-Time Accident Anticipation
๐ฌ 11
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ lightweight spatio-temporal modeling model distillation accident anticipation edge deployment
Anticipating traffic accidents in real time is critical for intelligent transportation systems, yet remains challenging under edge-device constraints. We propose a lightweight spatio-temporal framework that introduces a temporally shifted distillation strategy, enabling a student model to acquire predictive temporal dynamics from a frozen image-based teacher without requiring a video pre-trained teacher. The student combines a RepMixer spatial encoding with a RWKV-inspired recurrent module for efficient long-range temporal reasoning. To enhance robustness under partial observability, we design a masking memory strategy that leverages memory retention to reconstruct missing visual tokens, effectively simulating occlusions and future events. In addition, multi-modal vision-language supervision enriches semantic grounding. Our framework achieves state-of-the-art performance on multiple real-world dashcam benchmarks while sustaining real-time inference on resource-limited platforms such as the NVIDIA Jetson Orin Nano. Remarkably, it is 3-7$\times$ smaller than leading approaches yet delivers superior accuracy and earlier anticipation, underscoring its practicality for deployment in intelligent vehicles.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2888. QuRL: Low-Precision Reinforcement Learning for Efficient Reasoning
๐ฌ 7
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Quantization
Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs).
However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70\% of the total training time.
In this work, we propose Quantized Reinforcement Learning (QuRL) that uses a quantized actor for accelerating the rollout.
We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR) that dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor, which is essential for mitigating long-term training collapse.
Second, we identify the weight update problem, where weight changes between RL steps are extremely small, making it difficult for the quantization operation to capture them effectively.
We mitigate this problem through an invariant scaling technique that reduces quantization noise and enlarges the effective weight updates.
We evaluate our method with INT8 and FP8 quantization experiments on DeepScaleR and DAPO, and achieve 20% to 80% faster rollout during training.
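To make the adaptive-clipping idea above concrete, here is a hedged sketch of a PPO-style surrogate whose clip range widens with the measured gap between the full-precision and quantized actors. The specific widening rule, tensor shapes, and hyperparameters are illustrative guesses, not QuRL's actual ACR formula.

```python
# Sketch: PPO-style clipped surrogate with a clip range that adapts to the mismatch
# between the full-precision actor and the quantized rollout actor (assumed rule).
import torch

def adaptive_clip_surrogate(logp_new, logp_rollout_quant, logp_rollout_fp,
                            advantages, base_eps=0.2):
    # Per-batch mismatch between the quantized actor that produced the rollout
    # and the full-precision actor being trained.
    mismatch = (logp_rollout_fp - logp_rollout_quant).abs().mean().detach()
    eps = base_eps + mismatch                      # assumed: widen the clip range with mismatch
    ratio = torch.exp(logp_new - logp_rollout_quant)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random log-probabilities and advantages.
T = 128
logp_new = (torch.randn(T) * 0.1).requires_grad_()
loss = adaptive_clip_surrogate(logp_new,
                               logp_rollout_quant=torch.randn(T) * 0.1,
                               logp_rollout_fp=torch.randn(T) * 0.1,
                               advantages=torch.randn(T))
loss.backward()
```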
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2889. TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows
๐ฌ 36
โญ 5.33
๐ generative models
๐ท๏ธ few-step generation text-to-image generation multi-modal generative models
Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation.
These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)).
While various few-step methods aim to accelerate the inference, existing solutions have clear limitations.
Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE).
Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models.
To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for distillation from pre-trained models and avoids standard adversarial training, making it ideal for building large-scale, efficient models.
On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework).
**Notably, we demonstrate the scalability of TwinFlow by transforming Qwen-Image-20B---the current largest open-source multi-modal generative model---into an efficient few-step generator**. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by $100\times$ with minor quality degradation.
Our code and models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2890. FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation
๐ฌ 22
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ Small Language Models Adaptive Knowledge Distillation Thinking Pattern
Small Language Models (SLMs) are attractive for cost-sensitive and resource-limited settings due to their efficient, low-latency inference. However, they often struggle with complex, knowledge-intensive tasks that require structured reasoning and effective retrieval. To address these limitations, we propose FutureMind, a modular reasoning framework that equips SLMs with strategic thinking-pattern priors via adaptive knowledge distillation from large language models (LLMs). FutureMind introduces a dynamic reasoning pipeline composed of four key modules: Problem Analysis, Logical Reasoning, Strategy Planning, and Retrieval Guidance. This pipeline is augmented by three distinct retrieval paradigms that decompose complex queries into tractable subproblems, ensuring efficient and accurate retrieval execution. Extensive experiments on multi-hop QA benchmarks, including 2WikiMultihopQA, MuSiQue, Bamboogle, and Frames, demonstrate the superiority of FutureMind. It consistently outperforms strong baselines such as Search-o1, achieving state-of-the-art results under zero-training conditions across diverse SLM architectures and scales. Beyond empirical gains, our analysis reveals that the process of thinking-pattern distillation is restricted by the cognitive bias bottleneck between the teacher (LLMs) and student (SLMs) models. This provides new perspectives on the transferability of reasoning skills, paving the way for the development of SLMs that combine efficiency with genuine cognitive capability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2891. UNIVERSAL AND EFFICIENT LOAD BALANCING FOR RL TRAINING OF LARGE MULTIMODAL MODELS
๐ฌ 19
โญ 5.33
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ RL Training; Load Balancing; Sequence Parallelism; Distributed Training
Reinforcement learning (RL) is crucial for aligning Vision-Language Models (VLMs), but its practical application is hampered by significant system-level bottlenecks. The typical RL pipeline, encompassing data loading, inference-based rollouts, and model updates, suffers from severe inefficiencies when applied to VLMs due to the extreme heterogeneity of multimodal data. Centralized data loading creates I/O bottlenecks with large media files, while variations in sequence length across text, image, and video inputs lead to critical load imbalance during computation, leaving expensive GPU resources underutilized. Existing systems either focus on text-only RL or employ general load-balancing techniques that are incompatible with the small-batch, iterative nature of RL training.
To address these challenges, we present FlexRL, a holistic system designed to optimize the end-to-end VLM RL pipeline. FlexRL introduces two core contributions: (1) a \textbf{Decentralized Data Pipeline} that parallelizes data fetching and preprocessing across worker nodes and enables metadata-only scheduling on the single controller, eliminating the central bottleneck and accelerating data-intensive stages; and (2) a novel \textbf{Hybrid Sequence Sharding} mechanism that partitions sequences into fine-grained chunks. This enables sub-sequence level load balancing for both inference and training, effectively mitigating workload skew. Our evaluation on a 128-GPU cluster shows that FlexRL significantly improves training efficiency, delivering 4.2-7.7x end-to-end speedups over production baselines and enabling more efficient and scalable RL for large multimodal models.
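A minimal sketch of the sub-sequence balancing idea follows: split variable-length sequences into fixed-size chunks and greedily assign chunks to the least-loaded worker. The chunk size, greedy heuristic, and data layout are illustrative assumptions; FlexRL's actual hybrid sequence sharding is more involved.

```python
# Sketch: chunk variable-length sequences and balance token counts across workers
# with a greedy longest-processing-time assignment.
import heapq

def shard_by_chunks(seq_lengths, num_workers, chunk_size=512):
    # Break each sequence into (seq_id, chunk_len) pieces.
    chunks = [(sid, min(chunk_size, L - off))
              for sid, L in enumerate(seq_lengths)
              for off in range(0, L, chunk_size)]
    # Largest chunks first, each onto the currently least-loaded worker.
    chunks.sort(key=lambda c: -c[1])
    heap = [(0, w, []) for w in range(num_workers)]
    heapq.heapify(heap)
    for sid, clen in chunks:
        load, w, assigned = heapq.heappop(heap)
        assigned.append((sid, clen))
        heapq.heappush(heap, (load + clen, w, assigned))
    return sorted(heap, key=lambda x: x[1])

workers = shard_by_chunks([9000, 300, 4200, 128, 7000, 2500], num_workers=4)
for load, w, assigned in workers:
    print(f"worker {w}: {load} tokens across {len(assigned)} chunks")
```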
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2892. InnovatorBench: Evaluating Agents' Ability to Conduct Innovative AI Research
๐ฌ 16
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ InnovatorBench ResearchGym End-to-End Evaluation
AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, exhibiting failure modes such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and showing the potential of InnovatorBench to serve as a next-generation code-based research benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2893. Active Learning of 3D Gaussian Splatting with Consistent Region Partition and Robust Pose Estimation
๐ฌ 14
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Active Learning 3D Gaussian Splatting
Radiance fields have been successful in reconstructing 3D assets for scenes presented in Virtual Reality and Augmented Reality (VR/AR). The general workflow of scanning objects with a radiance field representation requires the user to empirically capture a large number of images of the object and provides no feedback during the image collection stage. This can lead to redundant or insufficient information gathering, reducing the efficiency of the reconstruction workflow. In this paper, we therefore present an active learning algorithm for 3D Gaussian Splatting that guides the image capturing by estimating the pose of the most informative image. Specifically, our method first partitions the consistent regions in the model by analyzing the Gaussian attributes and visibility features. Then, we determine the informative region to explore by estimating the semantic feature variance of each Gaussian, which evaluates the quality of the Gaussian cloud at the level of semantic features. Furthermore, we tackle the practical problem of noise in the poses of the collected images via a robust pose optimization method. Extensive experimental results on both synthetic and real-world scenes demonstrate the remarkable performance of our algorithm in active learning of the radiance field under both accurate and noisy pose conditions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2894. ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
๐ฌ 17
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Retrieval Augmented Generation Dynamic Benchmarking Dual Dynamics User Interest Drift Gaming Player-Centric Evaluation
Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and the player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2895. Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization
๐ฌ 15
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Embodied agent Memory Agent LLM Personalization
LLM-powered embodied agents have shown success on conventional object-rearrangement tasks, but providing personalized assistance that leverages user-specific knowledge from past interactions presents new challenges. We investigate these challenges through the lens of agents' memory utilization along two critical dimensions: object semantics (identifying objects based on personal meaning) and user patterns (recalling sequences from behavioral routines). To assess these capabilities, we construct Memento, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Our experiments reveal that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. Through in-depth analysis, we identify two critical bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, we explore memory architectural approaches to address these challenges. Given our observation that episodic memory provides both personalized knowledge and in-context learning benefits, we design a hierarchical knowledge graph-based user-profile memory module that separately manages personalized knowledge, achieving substantial improvements on both single and joint-memory tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2896. Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
๐ฌ 12
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Microgravity Action Recognition Vision-Language Understanding
Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications.
To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M aims to support three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, thereby enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code will be made publicly available upon decision.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2897. Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
๐ฌ 19
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mixture-of-Experts Quantization Theoretical Generalization Guarantees
Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory overhead during inference. Post-training quantization has been explored to address this issue. Because uniform quantization suffers from significant accuracy loss at low bit-widths, mixed-precision methods have been recently explored; however, they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts. We propose a theoretically grounded expert-wise mixed-precision strategy that assigns bit-width to each expert primarily based on their *change in router's* $l_2$ *norm* during training. Experts with smaller changes are shown to capture less frequent but critical features, and model performance is more sensitive to the quantization of these experts, thus requiring higher precision. Furthermore, to avoid assigning low precision to experts whose quantization would inject high noise, experts with large *maximum intra-neuron variance* are also allocated higher precision. Experiments on large-scale MoE models, including Switch Transformer and Mixtral, show that our method achieves higher accuracy than existing approaches, while also reducing inference cost and incurring only negligible overhead for bit-width assignment.
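The allocation rule described above lends itself to a short sketch: assign higher precision to experts whose router norm changed least during training and to experts with large intra-neuron variance. The quantile thresholds and the {4, 8}-bit menu below are illustrative assumptions, not the paper's exact criterion.

```python
# Sketch: expert-wise bit-width allocation from two per-expert statistics (assumed rule).
import numpy as np

def allocate_bits(router_norm_change, max_neuron_var, low_bits=4, high_bits=8):
    router_norm_change = np.asarray(router_norm_change, dtype=float)
    max_neuron_var = np.asarray(max_neuron_var, dtype=float)
    # Experts whose routing weights moved least are assumed to encode rare but
    # critical features -> keep them at higher precision.
    sensitive = router_norm_change <= np.quantile(router_norm_change, 0.25)
    # Experts with large intra-neuron variance would inject large quantization
    # noise at low bit-width -> also keep them at higher precision.
    noisy = max_neuron_var >= np.quantile(max_neuron_var, 0.75)
    return np.where(sensitive | noisy, high_bits, low_bits)

bits = allocate_bits(router_norm_change=[0.02, 0.9, 0.5, 0.03, 0.7],
                     max_neuron_var=[1.1, 0.2, 3.5, 0.4, 0.3])
print(bits)   # [8 4 8 8 4]
```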
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2898. TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use
๐ฌ 12
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ LLM agent tool usage benchmark
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing works evaluate the LLMs' tool use capability, they largely focus on the final answers yet overlook the detailed tool usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Besides final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection and argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as similar-tool confusion and parameter-blind selection, as well as scaling behavior with tool diversity and trajectory length, revealing a bottleneck in the transition from short to mid-length trajectories and offering actionable guidance for LLMs' tool use.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2899. Initialization Schemes for Kolmogorov–Arnold Networks: An Empirical Study
๐ฌ 11
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Kolmogorov-Arnold networks weight initialization NTK Function Fitting PDEs
Kolmogorov–Arnold Networks (KANs) are a recently introduced neural architecture that replace fixed nonlinearities with trainable activation functions, offering enhanced flexibility and interpretability. While KANs have been applied successfully across scientific and machine learning tasks, their initialization strategies remain largely unexplored. In this work, we study initialization schemes for spline-based KANs, proposing two theory-driven approaches inspired by LeCun and Glorot, as well as an empirical power-law family with tunable exponents. Our evaluation combines large-scale grid searches on function fitting and forward PDE benchmarks, an analysis of training dynamics through the lens of the Neural Tangent Kernel, and evaluations on a subset of the Feynman dataset. Our findings indicate that the Glorot-inspired initialization significantly outperforms the baseline in parameter-rich models, while power-law initialization achieves the strongest performance overall, both across tasks and for architectures of varying size. This work underscores initialization as a key factor in KAN performance and introduces practical strategies to improve it.
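For intuition, the sketch below draws spline coefficients for a KAN layer under three variance-scaled schemes: a LeCun-style fan-in rule, a Glorot-style fan-in/fan-out rule, and a power-law family with a tunable exponent. The exact scalings the paper derives may differ; the coefficient tensor shape and formulas here are illustrative assumptions.

```python
# Sketch: variance-scaled initialization of KAN spline coefficients (shape: out x in x n_basis).
import torch

def init_spline_coeffs(out_dim, in_dim, n_basis, scheme="glorot", alpha=0.5):
    if scheme == "lecun":
        std = (1.0 / (in_dim * n_basis)) ** 0.5          # fan-in style scaling
    elif scheme == "glorot":
        std = (2.0 / (in_dim * n_basis + out_dim)) ** 0.5  # fan-in + fan-out style scaling
    elif scheme == "power_law":
        std = (in_dim * n_basis) ** (-alpha)             # tunable-exponent family
    else:
        raise ValueError(scheme)
    return torch.randn(out_dim, in_dim, n_basis) * std

coeffs = init_spline_coeffs(out_dim=64, in_dim=32, n_basis=8, scheme="power_law", alpha=0.6)
print(coeffs.std())
```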
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2900. Lossy Common Information in a Learnable Gray-Wyner Network
๐ฌ 22
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ information theory learnable compression disentanglement learning theory representation learning computer vision neural networks
Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2901. PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment
๐ฌ 11
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ visual in-context learning visual prompt prompt fusion locality-aware
Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work Condenser pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across diverse backbones, out-of-distribution settings, and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code will be available at https://anonymous.4open.science/r/PromptHub-1770.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2902. LogART: Pushing the Limit of Efficient Logarithmic Post-Training Quantization
๐ฌ 11
โญ 5.33
๐ optimization
๐ท๏ธ Post-training quantization Logarithmic quantization Adaptive rounding Hyperparameter search Low-power computing
Efficient deployment of deep neural networks increasingly relies on Post-Training Quantization (PTQ). Logarithmic PTQ, in particular, promises multiplier-free hardware efficiency, but its performance is often limited by the nonlinear and symmetric quantization grid and standard rounding-to-nearest (RTN) approach. While learnable rounding has significantly advanced linear PTQ, its application to the non-linear and often discrete nature of logarithmic domain remains unexplored. This paper introduces learnable Logarithmic Adaptive Rounding Techniques (LogART) that pioneer task-aware learnable rounding specifically for the logarithmic domain. LogART further extends the learnable rounding strategy to flexibly support outlier-aware, asymmetric, and hardware-friendly dynamic logarithmic bases, determined in a distribution-aware manner using an efficient search strategy. Extensive experiments demonstrate that LogART achieves state-of-the-art accuracy while maintaining efficiency in quantizing models across various architectures and ultra-low bitwidths, outperforming existing logarithmic PTQ methods and paving the way for more effective hardware deployment. The code is available at https://anonymous.4open.science/r/logart-lab.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2903. Online Conformal Prediction with Adversarial Feedback via Regret Minimization
๐ฌ 20
โญ 5.33
๐ learning theory
๐ท๏ธ Online Conformal Prediction Adversarial Bandit Partial Feedback Regret
Quantifying uncertainty is crucial in safety-critical learning systems where decisions are made based on uncertainty. Conformal prediction is one promising way to quantify uncertainty that comes with a theoretical guarantee. However, the theoretical guarantee comes at the price of assumptions on the data-generating process, such as exchangeability or full feedback. In this paper, we propose a novel conformal prediction method under weaker data-generation assumptions, i.e., a learning method for online conformal prediction with partial feedback from an adaptive adversary. In particular, we leverage the mature literature on sequential prediction and adversarial bandits to design our online conformal prediction algorithm. The great benefit of relying on adversarial bandits is that we can exploit theoretical regret bounds for conformal prediction guarantees. Here, we explicitly connect the regret to the desired miscoverage guarantee in conformal prediction, so that our bandit-based algorithm naturally obtains a miscoverage guarantee from its regret bounds. Furthermore, we extend an existing adversarial bandit method to leverage the properties of conformal prediction, resulting in a bandit method with a tighter regret bound. We empirically demonstrate the efficacy of our conformal prediction method over various learning setups, including a stochastic setup and a covariate shift setup, showing control over the miscoverage rate while achieving reasonable conformal set sizes.
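For background on the regret-coverage connection, the sketch below runs the standard full-feedback online conformal update (a pinball-loss gradient step on the set threshold), whose long-run miscoverage tracks the target level. The paper's partial-feedback, bandit-based algorithm is not reproduced here, and the Gaussian scores are a toy assumption.

```python
# Sketch: full-feedback online conformal prediction via online gradient descent
# on the pinball loss; the empirical miss rate converges toward the target alpha.
import numpy as np

rng = np.random.default_rng(0)
alpha, lr, q = 0.1, 0.05, 0.0        # target miscoverage, step size, score threshold
errs = []
for t in range(20000):
    score = rng.normal()              # nonconformity score of the arriving example
    miss = 1.0 if score > q else 0.0  # prediction set = {y : score(y) <= q}
    q += lr * (miss - alpha)          # grow q after a miss, shrink it slightly otherwise
    errs.append(miss)
print("empirical miscoverage:", np.mean(errs))   # close to 0.1
```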
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2904. Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning
๐ฌ 14
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Homotopy System Graduated optimization Reinforcement Learning Polynomial Equations System Gaussian Homotopy Sampling
The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on hand-crafted heuristics for step sizes and iteration termination, which are often suboptimal and task-specific. To address this, we unify these problems under a single framework, which enables the design of a general neural solver. Building on this unified view, we propose Neural Predictor-Corrector (NPC), which replaces hand-crafted heuristics with automatically learned policies. NPC formulates policy selection as a sequential decision-making problem and leverages reinforcement learning to automatically discover efficient strategies. To further enhance generalization, we introduce an amortized training mechanism, enabling one-time offline training for a class of problems and efficient online inference on new instances. Experiments on four representative homotopy problems demonstrate that our method generalizes effectively to unseen instances. It consistently outperforms classical and specialized baselines in efficiency while demonstrating superior stability across tasks, highlighting the value of unifying homotopy methods into a single neural framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2905. FETAL-GAUGE: A BENCHMARK FOR ASSESSING VISION-LANGUAGE MODELS IN FETAL ULTRASOUND
๐ฌ 10
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Vision-Language Models Fetal Ultrasound Visual Question Answering
The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55\% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark is publicly available at www.github.com
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2906. Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
๐ฌ 11
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reinforcement Learning Post Training
We aim to improve the reasoning capabilities of language models via reinforcement learning with verifiable rewards (RLVR). Recent RLVR post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RLVR alone to improve reasoning on inherently difficult tasks is less effective due to sparse rewards. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across diverse datasets and models demonstrate that E2H Reasoner substantially enhances LLM reasoning.
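To illustrate the easy-to-hard scheduling with fading, here is a minimal sketch in which the probability of sampling each difficulty bucket shifts from easy-heavy to hard-heavy as training progresses. The linear interpolation and bucket count are illustrative choices, not the paper's exact schedule.

```python
# Sketch: curriculum sampling probabilities over difficulty levels, fading easy tasks over time.
import numpy as np

def curriculum_probs(step, total_steps, n_levels=4):
    t = min(step / total_steps, 1.0)
    # Start concentrated on level 0 (easy), end concentrated on the hardest level.
    start = np.linspace(1.0, 0.0, n_levels) + 1e-3
    end = np.linspace(0.0, 1.0, n_levels) + 1e-3
    p = (1 - t) * start + t * end
    return p / p.sum()

rng = np.random.default_rng(0)
for step in [0, 2500, 5000, 7500, 10000]:
    p = curriculum_probs(step, total_steps=10000)
    level = rng.choice(len(p), p=p)
    print(step, np.round(p, 3), "-> sampled difficulty", level)
```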
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2907. Unsupervised Invariant Risk Minimization
๐ฌ 15
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Unsupervised Learning Invariant Risk Minimization Variational Autoencoder Principal Components Analysis
We propose a novel unsupervised framework for Invariant Risk Minimization (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that disentangles environment-invariant and environment-dependent latent factors. Our approach is grounded in a novel class of unsupervised structural causal models and supports environment-conditioned sample generation and transfer. Empirical evaluations on synthetic benchmarks, modified versions of MNIST and CelebA demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.
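One plausible reading of the linear PICA step is sketched below: find directions along which the feature covariance is most similar across environments, taken here as the smallest-eigenvalue directions of an average covariance-discrepancy matrix. This is an illustrative interpretation under Gaussian-style assumptions, not the paper's exact estimator.

```python
# Sketch: extract approximately environment-invariant linear directions from unlabeled data.
import numpy as np

def invariant_directions(envs, k=2):
    covs = [np.cov(X, rowvar=False) for X in envs]
    mean_cov = np.mean(covs, axis=0)
    # Average squared deviation of each environment's covariance from the mean covariance.
    discrepancy = np.mean([(C - mean_cov) @ (C - mean_cov) for C in covs], axis=0)
    vals, vecs = np.linalg.eigh(discrepancy)
    return vecs[:, :k]   # directions with the least cross-environment change

rng = np.random.default_rng(0)
env_a = rng.normal(size=(500, 5)) * np.array([1, 1, 1, 3, 0.2])
env_b = rng.normal(size=(500, 5)) * np.array([1, 1, 1, 0.3, 4])
V = invariant_directions([env_a, env_b], k=2)
print(V.shape)   # (5, 2): two approximately invariant directions (spanning the stable features)
```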
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2908. Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
๐ฌ 15
โญ 5.33
๐ generative models
๐ท๏ธ Diffusion Model Diffusion Distillation
Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that the primary driver of few-step generation is not the distribution matching term, but a previously overlooked component we identify as \textit{\textbf{C}FG \textbf{A}ugmentation} (\textbf{CA}). We demonstrate that this term acts as the core "engine" of distillation, while the \textbf{D}istribution \textbf{M}atching (\textbf{DM}) term functions as a "regularizer" that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor between CA and DM also allows a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2909. DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
๐ฌ 36
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Data Agent Code Generation LLM Benchmark
Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights.
We introduce DAComp, a benchmark of 236 tasks that mirrors these complex workflows.
Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements.
Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations.
Engineering tasks are scored through execution-based, multi-metric evaluation.
Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics.
Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 10\%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40\%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities.
By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings.
Our data and code are available at \url{https://anonymous.4open.science/r/DAComp-397A}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2910. Counterfactual Explanations on Robust Perceptual Geodesics
๐ฌ 25
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ Interpretability Visual Counterfactual Explanations Explainability
Latent-space optimization methods for counterfactual explanations, framed as minimal semantic perturbations that change model predictions, inherit the ambiguity of Wachter et al.'s objective: the choice of distance metric dictates whether perturbations are meaningful or adversarial. Existing approaches adopt flat or misaligned geometries, leading to off-manifold artifacts, semantic drift, or adversarial collapse. We introduce Perceptual Counterfactual Geodesics (PCG), a method that constructs counterfactuals by tracing geodesics under a perceptually Riemannian metric induced from robust vision features. This geometry aligns with human perception and penalizes brittle directions, enabling smooth, on-manifold, semantically valid transitions. Experiments on three vision datasets show that PCG outperforms baselines and reveals failure modes hidden under standard metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2911. Fine-Grained Class-Conditional Distribution Balancing for Debiased Learning
๐ฌ 10
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ group robust classification spurious correlations short-cut mitigation distribution balancing
Achieving group-robust generalization in the presence of spurious correlations remains a significant challenge, particularly when bias annotations are unavailable.
Recent studies on Class-Conditional Distribution Balancing (CCDB) reveal that spurious correlations often stem from mismatches between the class-conditional and marginal distributions of bias attributes. They achieve promising results by addressing this issue through simple distribution matching in a bias-agnostic manner.
However, CCDB approximates each distribution using a single Gaussian, which is overly simplistic and rarely holds in real-world applications.
To address this limitation, we propose a novel Multi-stage data-Selective reTraining strategy (MST), which describes each distribution in greater detail using the hard confusion matrix.
Building on these finer descriptions, we propose a fine-grained variant of CCDB, termed FG-CCDB, which enhances distribution matching through more precise confusion-cell-wise reweighting. FG-CCDB learns sample weights from a global perspective, effectively mitigating spurious correlations without incurring substantial storage or computational overhead.
Extensive experiments demonstrate that MST serves as a reliable proxy for ground-truth bias annotations and can be seamlessly integrated with bias-supervised methods.
Moreover, when combined with FG-CCDB, our method performs on par with bias-supervised approaches on binary classification tasks and significantly outperforms them in highly biased multi-class and multi-shortcut scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2912. PRISON: Unmasking the Criminal Potential of Large Language Models
๐ฌ 3
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models (LLMs) Criminal Potential Behavioral Safety Social Impact
As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic assessment of LLMs' criminal potential in realistic interactions, where criminal potential is defined as the risk of producing harmful behaviors such as deception and blame-shifting under adversarial settings that could facilitate unlawful activities. Therefore, we propose a unified framework, PRISON, to quantify LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios grounded in reality, we evaluate both the criminal potential and anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44\% accuracy on average, revealing a striking mismatch between expressing and detecting criminal traits. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2913. MARTI: A Framework for Multi-Agent LLM Systems Reinforced Training and Inference
๐ฌ 19
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Multi-Agent Reinforcement Learning
We present MARTI (Multi-Agent Reinforced Training and Inference), an open-source framework designed to facilitate scalable and efficient learning of multi-agent LLM systems. MARTI supports centralized multi-agent interactions and distributed policy training, with the added capability of multi-turn asynchronous rollouts to enhance training efficiency. The framework includes dynamic workflows for multi-agent interactions, which integrate both rule-based verifiable rewards and LLM-based generative rewards. We validate the effectiveness of MARTI through comprehensive experiments on diverse mathematical tasks, demonstrating that multi-agent LLM-based systems outperform single-agent systems within the same inference budget after convergence. Our contributions lay the foundation for exploring scalable collaborations within LLM-based multi-agent systems and advancing the capabilities of large reasoning models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2914. NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks
๐ฌ 8
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Eigenspectral analysis feed-forward networks Nonlinearity latent space geometry Large Language Models (LLMs)
We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our {\em key insight} is that FFN nonlinearities reinject and reshape variance across eigenmodes, fundamentally governing latent dimension utilization. We validate NerVE across model scales and diverse architectural configurations that each uniquely shape FFN dynamics: normalization strategies (PreLN, PostLN, MixLN, Norm-Free) controlling variance flow; FFN weight geometries constraining latent space; positional encoding and activation functions modulating information propagation. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with the model's generalization ability and respond predictably to design choices, providing actionable insights for architectural optimization beyond trial-and-error.
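The four eigenspectrum summaries named above can be computed from the covariance of an activation matrix, as in the sketch below. How NerVE hooks these into an LLM's FFN blocks, the reference spectrum for the divergence, and the exact normalizations are assumptions for illustration.

```python
# Sketch: spectral entropy, participation ratio, top-k eigenvalue mass, and JS divergence
# computed from the eigenvalues of a (tokens x hidden) activation covariance.
import numpy as np

def eigenspectrum_metrics(acts, ref_spectrum=None, top_frac=0.1):
    acts = acts - acts.mean(0, keepdims=True)
    eig = np.clip(np.linalg.eigvalsh(np.cov(acts, rowvar=False)), 0, None)
    p = eig / eig.sum()                                  # eigenvalue distribution
    entropy = -np.sum(p * np.log(p + 1e-12))             # spectral entropy (dispersion)
    participation = (eig.sum() ** 2) / np.sum(eig ** 2)  # effective dimensionality
    k = max(1, int(top_frac * len(eig)))
    top_heavy = np.sort(p)[::-1][:k].sum()               # top-k mass (top-heaviness proxy)
    js = None
    if ref_spectrum is not None:                         # JS divergence vs. a reference spectrum
        q = ref_spectrum / ref_spectrum.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log((a + 1e-12) / (b + 1e-12)))
        js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return entropy, participation, top_heavy, js

acts = np.random.randn(2048, 256) @ np.diag(np.linspace(2.0, 0.1, 256))
print(eigenspectrum_metrics(acts))
```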
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2915. The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM
๐ฌ 19
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ sparsity pruning LLMs
Neural network pruning is a promising technique to mitigate the excessive computational and memory requirements of large language models (LLMs).
Despite its promise, however, progress in this area has diminished, as conventional methods are seemingly unable to surpass moderate sparsity levels (50-60\%) without severely degrading model accuracy.
This work breaks through the current impasse, presenting a principled and effective method called $\text{Elsa}$, which achieves extreme sparsity levels of up to 90\% while retaining high model fidelity.
This is done by identifying several limitations in current practice, all of which can be traced back to their reliance on a surrogate objective formulation.
$\text{Elsa}$ tackles this issue directly and effectively via standard and well-established constrained optimization techniques based on ADMM.
Our extensive experiments across a wide range of models and scales show that $\text{Elsa}$ achieves substantial improvements over existing methods;
e.g., it achieves 7.8$\times$ lower perplexity than the best existing method on LLaMA-2-7B at 90\% sparsity.
Furthermore, we present $\text{Elsa}_{-L}$, a quantized variant that scales to extremely large models (27B), and establish its theoretical convergence guarantees.
These results highlight meaningful progress in advancing the frontier of LLM sparsity, while promising that significant opportunities for further advancement may remain in directions that have so far attracted limited exploration.
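To make the constrained-optimization direction concrete, here is a generic ADMM sketch for fitting a sparse weight matrix to a dense layer's outputs under a hard sparsity constraint. The least-squares data term, penalty parameter, and iteration count are illustrative assumptions; this is not Elsa's actual formulation, schedule, or its quantized variant.

```python
# Sketch: ADMM for  min_W ||X W - X W_dense||^2  s.t.  W lies in the k-sparse set.
import numpy as np

def topk_project(Z, sparsity):
    flat = np.abs(Z).ravel()
    k = int((1 - sparsity) * flat.size)              # number of weights to keep
    thresh = np.partition(flat, -k)[-k] if k > 0 else np.inf
    return np.where(np.abs(Z) >= thresh, Z, 0.0)

def admm_prune(W_dense, X, sparsity=0.9, rho=1.0, iters=50):
    d = W_dense.shape[0]
    A = X.T @ X
    B = X.T @ (X @ W_dense)
    W = W_dense.copy(); Z = W.copy(); U = np.zeros_like(W)
    lhs = np.linalg.inv(A + 0.5 * rho * np.eye(d))   # cached solve for the W-step
    for _ in range(iters):
        W = lhs @ (B + 0.5 * rho * (Z - U))          # data-fit step
        Z = topk_project(W + U, sparsity)            # projection onto the sparse set
        U = U + W - Z                                # dual update
    return Z

X = np.random.randn(512, 64)
W_dense = np.random.randn(64, 64)
W_sparse = admm_prune(W_dense, X, sparsity=0.9)
print("kept fraction:", np.mean(W_sparse != 0))      # roughly 0.1
```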
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2916. Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
๐ฌ 11
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ red-teaming llm safety attacking jailbreak
Automated red-teaming has emerged as an essential approach for identifying vulnerabilities in large language models (LLMs). However, most existing methods rely on fixed attack templates and focus primarily on individual high-severity flaws, limiting their adaptability to evolving defenses and their ability to detect complex, high-exploitability vulnerabilities. To address these limitations, we propose AUTO-RT, a reinforcement learning framework designed for automatic jailbreak strategy exploration, i.e., discovering diverse and effective prompts capable of bypassing the safety restrictions of LLMs. AUTO-RT autonomously explores and optimizes attack strategies by interacting with the target model and generating crafted queries that trigger security failures. Specifically, AUTO-RT introduces two key techniques to improve exploration efficiency and attack effectiveness: 1) Dynamic Strategy Pruning, which focuses exploration on high-potential strategies by eliminating highly redundant paths early, and 2) Progressive Reward Tracking, which leverages intermediate downgrade models and a novel First Inverse Rate (FIR) metric to smooth sparse rewards and guide learning. Extensive experiments across diverse white-box and black-box LLM settings demonstrate that AUTO-RT significantly improves success rates (by up to 16.63%), expands vulnerability coverage, and accelerates discovery compared to existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2917. Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
๐ฌ 17
โญ 5.33
๐ generative models
๐ท๏ธ Video Generation Robotic Manipulation Trajectory Control
Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture multi-object interaction crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge dataset, as well as RLBench and SIMPLER benchmarks, demonstrate that our method establishes new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation. Additional visual results and reproducible code are available at our anonymous project page: robomaster2025.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2918. ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning
๐ฌ 26
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Robot Learning Articulated Object Digital Twin
Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models mastering robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, and physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, with its applicability validated across imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2919. PCLR: Progressively Compressed LoRA for Multimodal Continual Instruction Tuning
๐ฌ 39
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Large Multimodal Models Continual Instruction Tuning Catastrophic Forgetting Knowledge Distillation
Continual Instruction Tuning (CIT) enables Large Multimodal Models (LMMs) to rapidly adapt to new tasks without retraining, but it suffers from the catastrophic forgetting problem. By adding new branches, model extension provides an appealing way to accommodate novel knowledge, but at the cost of huge memory consumption. To jointly address forgetting and memory explosion, we propose the Compression–Integration–Learning (CIL) pipeline, which draws on the memory consolidation processes during human sleep. Compression streamlines old parameters to release capacity. Integration merges knowledge from similar tasks to restore the performance loss due to compression. For example, based on LLaVA-7B, the forgetting is reduced from 11.29 to 5.09. Learning reallocates released capacity for new task-relevant parameters. Next, based on the characteristics of LMMs at different learning stages, we establish the progressive learning process, further reducing forgetting from 5.09 to 3.39. Moreover, to adapt this process, we decompose LoRA into a set of rank vectors and introduce an extremely fine-grained architecture, LoRA Rank Pool (LRP), with the goal of flexible knowledge employment and editing. Finally, we combine all components, and yield **P**rogressively **C**ompressed **L**o**R**A (PCLR). Extensive experiments demonstrate that PCLR requires a memory budget close to that of non-extension methods while outperforming extension methods in performance. Our code is in the supplementary material.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2920. DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning
💬 8
⭐ 5.33
📍 applications to robotics, autonomy, planning
🏷️ traffic simulation multi-agent imitation learning generative adversarial imitation learning
Realistic traffic simulation is critical for the development of autonomous driving systems and urban mobility planning, yet existing imitation learning approaches often fail to model realistic traffic behaviors. Behavior cloning suffers from covariate shift, while Generative Adversarial Imitation Learning (GAIL) is notoriously unstable in multi-agent settings. We identify a key source of this instability, irrelevant interaction misguidance, where a discriminator penalizes an ego vehicle's realistic behavior due to unrealistic interactions among its neighbors. To address this, we propose Decomposed Multi-agent GAIL (DecompGAIL), which explicitly decomposes realism into ego-map and ego-neighbor components, filtering out misleading neighbor-neighbor and neighbor-map interactions. We further introduce a social PPO objective that augments ego rewards with distance-weighted neighborhood rewards, encouraging overall realism across agents. Integrated into a lightweight SMART-based backbone, DecompGAIL achieves state-of-the-art performance on the WOMD Sim Agents 2025 benchmark.
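The "distance-weighted neighborhood rewards" mentioned above can be illustrated with a small sketch. The function below is only a schematic of such a shaping term, not DecompGAIL's actual reward: the Gaussian weighting, the `sigma` and `neighbor_weight` parameters, and the array layout are assumptions made for the example.

```python
import numpy as np

def social_reward(ego_idx, positions, realism_rewards, sigma=10.0, neighbor_weight=0.5):
    """Blend an ego agent's realism reward with distance-weighted neighbor rewards.

    positions:        (N, 2) array of agent positions at the current step.
    realism_rewards:  (N,) per-agent realism rewards from a discriminator.
    """
    diffs = positions - positions[ego_idx]          # offsets to every agent
    dists = np.linalg.norm(diffs, axis=-1)          # Euclidean distances
    weights = np.exp(-dists**2 / (2 * sigma**2))    # closer neighbors matter more
    weights[ego_idx] = 0.0                          # exclude the ego itself
    if weights.sum() > 0:
        neighbor_term = float(weights @ realism_rewards) / weights.sum()
    else:
        neighbor_term = 0.0
    return realism_rewards[ego_idx] + neighbor_weight * neighbor_term

# toy usage: a nearby neighbor influences the ego reward far more than a distant one
pos = np.array([[0.0, 0.0], [5.0, 0.0], [50.0, 0.0]])
r = np.array([1.0, 0.2, -0.5])
print(social_reward(0, pos, r))
```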
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2921. To View Transform or Not to View Transform: NeRF-based Pre-training Perspective
💬 18
⭐ 5.33
📍 applications to computer vision, audio, language, and other modalities
🏷️ Autonomous driving Pre-training Self-supervised learning Neural radiance fields
Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pre-training to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network used for pre-training is discarded during downstream tasks, resulting in inefficient utilization of the 3D representations enhanced through NeRF. In this paper, we propose NeRP3D, a novel NeRF-Resembled Point-based 3D detector that can learn continuous 3D representations and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the task, inheriting the principle of continuous 3D representation learning and offering greater potential for both scene reconstruction and detection tasks. Experiments on the nuScenes dataset demonstrate that our proposed approach significantly improves over previous state-of-the-art methods, not only on the pretext scene reconstruction task but also on downstream detection tasks.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2922. Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs
💬 15
⭐ 5.33
📍 learning on graphs and other geometries & topologies
🏷️ hypergraph neural networks hypergraph sheaf higher-order directed graphs
Hypergraphs provide a natural way to represent higher-order interactions among multiple entities. While undirected hypergraphs have been extensively studied, the case of directed hypergraphs, which can model oriented group interactions, remains largely under-explored despite its relevance for many applications. Recent approaches in this direction often exhibit an implicit bias toward homophily, which limits their effectiveness in heterophilic settings. Rooted in the algebraic topology notion of Cellular Sheaves, Sheaf Neural Networks (SNNs) were introduced as an effective solution to circumvent such a drawback. While a generalization to hypergraphs is known, it is only suitable for undirected hypergraphs, failing to tackle the directed case. In this work, we introduce Directional Sheaf Hypergraph Networks (DSHN), a framework integrating sheaf theory with a principled treatment of asymmetric relations within a hypergraph. From it, we construct the Directed Sheaf Hypergraph Laplacian, a complex-valued operator by which we unify and generalize many existing Laplacian matrices proposed in the graph- and hypergraph-learning literature. Across 7 real-world datasets and against 13 baselines, DSHN achieves relative accuracy gains from 2% up to 20%, showing how a principled treatment of directionality in hypergraphs, combined with the expressive power of sheaves, can substantially improve performance.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2923. Steering Language Models with Weight Arithmetic
💬 10
⭐ 5.33
📍 alignment, fairness, safety, privacy, and societal considerations
🏷️ steering alignment safety model editing merging models
Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose *contrastive weight steering*, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes---one that induces the desired behavior and another that induces its opposite---and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
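The weight-arithmetic step described above, subtracting the weight deltas of two small fine-tunes and then adding or removing the resulting direction, maps naturally onto state-dict arithmetic. The sketch below assumes plain PyTorch state dicts and a single scaling factor `alpha`; which layers are edited and how the direction is scaled are not specified in the abstract, so those details are assumptions.

```python
import torch

def contrastive_weight_direction(base_sd, pos_sd, neg_sd):
    """Behavior direction in weight space: delta(positive fine-tune) - delta(negative fine-tune).
    A schematic sketch; the paper's exact recipe may differ."""
    return {k: (pos_sd[k] - base_sd[k]) - (neg_sd[k] - base_sd[k]) for k in base_sd}

def steer(base_sd, direction, alpha=1.0):
    """Add (alpha > 0) or remove (alpha < 0) the behavior direction from the base weights."""
    return {k: base_sd[k] + alpha * direction[k] for k in base_sd}

# toy usage with stand-in state dicts
base = {"w": torch.zeros(2, 2)}
pos = {"w": torch.ones(2, 2) * 0.1}    # fine-tune that induces the behavior
neg = {"w": torch.ones(2, 2) * -0.1}   # fine-tune that induces its opposite
d = contrastive_weight_direction(base, pos, neg)
print(steer(base, d, alpha=-1.0)["w"])  # subtracting the direction suppresses the behavior
```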
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2924. A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
💬 18
⭐ 5.33
📍 foundation or frontier models, including LLMs
🏷️ large language models multi-hop question answering information-theoretic analysis multi-call reasoning framework
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: \faGithub \href{https://anonymous.4open.science/r/InfoQA-55D1}{InfoQA}.
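The abstract does not state the bound itself, so as a reference point the snippet below records the textbook Fano-style argument from which such accuracy ceilings are usually derived. Symbols are generic and assumed here: $A$ is the answer (uniform over $M$ candidates), $Z$ is the model's bounded single-pass summary of the evidence, and $H_b$ is the binary entropy; the paper's capacity-aware variant may differ.

```latex
% Textbook Fano-style bound (generic form; not the paper's exact statement).
\begin{align}
  H(A \mid Z) &\le H_b(P_e) + P_e \log (M - 1)
    && \text{(Fano's inequality, } P_e = \Pr[\hat{A} \neq A]\text{)} \\
  \Pr[\hat{A} = A] &\le \frac{I(A; Z) + \log 2}{\log M}
    && \text{using } H(A \mid Z) = \log M - I(A;Z),
\end{align}
% so accuracy collapses once the information the single pass can carry, I(A;Z),
% falls well below log M, i.e. once task complexity exceeds per-pass capacity.
```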
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2925. TianQuan-S2S: A Subseasonal-to-Seasonal Global Weather Model via Incorporate Climatology State
💬 21
⭐ 5.33
📍 applications to computer vision, audio, language, and other modalities
🏷️ Subseasonal Weather Forecasting
Accurate Subseasonal-to-Seasonal (S2S) forecasting is vital for decision-making in agriculture, energy production, and emergency management. However, it remains a challenging and underexplored problem due to the chaotic nature of the weather system. Recent data-driven studies have shown promising results, but their performance is limited by the inadequate incorporation of climate states and a tendency of models to degrade, progressively losing fine-scale details and yielding over-smoothed forecasts. To overcome these limitations, we propose TianQuan-S2S, a global S2S forecasting model that integrates initial weather states with climatological means by incorporating climatology into the patch embedding and enhancing variability capture with an uncertainty-augmented Transformer. Extensive experiments on the ERA5 reanalysis dataset demonstrate that our model yields a significant improvement in both deterministic and ensemble forecasting over the climatology mean, traditional numerical methods, and data-driven models. Ablation studies empirically show the effectiveness of our model designs. Remarkably, our model outperforms the skillful numerical ECMWF-S2S system and the advanced data-driven Fuxi-S2S in key meteorological variables. The code implementation can be found at https://anonymous.4open.science/r/TianQuan-C381.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2926. Codified Finite-state Machines for Role-playing
💬 18
⭐ 5.33
📍 applications to computer vision, audio, language, and other modalities
🏷️ Role-playing State Modeling Grounding System
Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs).
Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction.
We revisit finite-state machines (FSMs), long used in game design to model state transitions.
While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP.
To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding.
CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency.
To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states.
Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2927. RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
💬 23
⭐ 5.33
📍 reinforcement learning
🏷️ Large language models Reinforcement Learning Agent
Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) more challenging environments are not always better, and moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2928. Don't Forget Its Variance! The Minimum Path Variance Principle for Accurate and Stable Score-Based Density Ratio Estimation
💬 21
⭐ 5.33
📍 probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
🏷️ Density ratio estimation minimum path variance principle path optimization Kumaraswamy mixture model
Score-based methods have emerged as a powerful framework for density ratio estimation (DRE), but they face an important paradox in that, while theoretically path-independent, their practical performance depends critically on the chosen path schedule.
We resolve this issue by proving that tractable training objectives differ from the ideal, ground-truth objective by a crucial, overlooked term: the path variance of the time score.
To address this, we propose MinPV-DRE (\textbf{Min}imum \textbf{P}ath \textbf{V}ariance DRE), which directly minimizes this missing objective.
Our key contribution is the derivation of a closed-form expression for the variance, turning an intractable problem into a tractable optimization.
By parameterizing the path with a flexible Kumaraswamy Mixture Model, our method learns a data-adaptive, low-variance path without heuristic selection.
This principled optimization of the complete objective yields more accurate and stable estimators, establishing new state-of-the-art results on challenging benchmarks.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2929. Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
💬 13
⭐ 5.33
📍 applications to computer vision, audio, language, and other modalities
🏷️ Speech Multimodal Machine Translation
Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information.
However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs.
The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage.
In this paper, we propose a \textbf{Speech-guided Multimodal Machine Translation (SMMT)} framework that integrates speech and text as fused inputs into an MLLM to improve translation quality.
To mitigate reliance on low-resource data, we introduce a \textbf{Self-Evolution Mechanism}.
The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples.
Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results.
Furthermore, on general machine translation datasets, particularly FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirm that differences between synthetic and authentic speech have negligible impact on translation quality. We will open-source our model to support the wider community.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2930. Detection of unknown unknowns in autonomous systems
💬 16
⭐ 5.33
📍 probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
🏷️ unknown unknowns autonomous systems conformal bounds
Unknown unknowns (U2s) are deployment-time scenarios absent from development/testing. Unlike conventional anomalies, U2s are not out-of-distribution (OOD); they stem from changes in underlying system dynamics without a distribution shift from normal data. Thus, existing multi-variate time series anomaly detection (MTAD) methods, which rely on distribution-shift cues, are ill-suited for U2 detection. Specifically: (i) we show most anomaly datasets exhibit distribution shift between normal and anomalous data and therefore are not representative of U2s; (ii) we introduce eight U2 benchmarks where training data contain OOD anomalies but no U2s, while test sets contain both OOD anomalies and U2s; (iii) we demonstrate that state-of-the-art (SOTA) MTAD results often depend on impractical enhancements: point adjustment (PA) (uses ground truth to flip false negatives to true positives, inflating precision) and threshold learning with data leakage (TL) (tuning thresholds on test data and labels); (iv) with PA+TL, even untrained deterministic methods can match or surpass MTAD baselines; (v) without PA/TL, existing MTAD methods degrade sharply on U2 benchmarks. Finally, we present sparse model identification-enhanced anomaly detection (SPIE-AD), a model-recovery-and-conformance, zero-shot MTAD approach that outperforms baselines on all eight U2 benchmarks and on six additional real-world MTAD datasets, without PA or TL.
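Point adjustment (PA), the evaluation enhancement criticized in item (iii), is easy to state in code: a single detection anywhere inside a labeled anomaly segment is credited to the whole segment. The sketch below implements the common form of this protocol; exact segment-handling details vary between papers.

```python
import numpy as np

def point_adjust(pred, label):
    """Standard point adjustment (PA): if any point inside a ground-truth anomaly
    segment is flagged, mark the entire segment as detected."""
    pred = pred.copy().astype(bool)
    label = label.astype(bool)
    i, n = 0, len(label)
    while i < n:
        if label[i]:
            j = i
            while j < n and label[j]:
                j += 1
            if pred[i:j].any():          # one hit anywhere in the segment...
                pred[i:j] = True         # ...counts the whole segment as detected
            i = j
        else:
            i += 1
    return pred

label = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0])
pred  = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0])
print(point_adjust(pred, label).astype(int))  # -> [0 1 1 1 0 0 0 0 0]
```

This is why a handful of lucky single-point hits can translate into near-perfect segment-level precision and recall, which is the inflation the abstract warns about.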
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2931. Aria: an Agent for Retrieval and Iterative Auto-Formalization via Dependency Graph
💬 27
⭐ 5.33
📍 foundation or frontier models, including LLMs
🏷️ Lean 4 Autoformalization LLM Graph-of-Thought Retrieval Augmented Generation
Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions.
To tackle these issues, we present Aria (**A**gent for **R**etrieval and **I**terative **A**utoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce **AriaScorer**, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification.
We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6\% compilation success rate and 68.5\% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0\% vs. 24.0\% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9\% final accuracy while all other models score 0\%.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2932. Copy-Paste to Mitigate Large Language Model Hallucinations
💬 23
⭐ 5.33
📍 foundation or frontier models, including LLMs
🏷️ RAG Hallucination Contextual Faithfulness Model Interpretability Large Language Model Knowledge Conflict
While Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to generate contextually grounded responses, contextual faithfulness remains challenging as LLMs may not consistently trust provided context, leading to hallucinations that undermine reliability. We observe an inverse correlation between response copying degree and context-unfaithful hallucinations on RAGTruth, suggesting higher copying degrees reduce hallucinations by fostering genuine contextual belief. We propose \textbf{CopyPasteLLM}, obtained through two-stage high-copying response preference training. We design three prompting methods to enhance copying degree, demonstrating that high-copying responses achieve superior contextual faithfulness and hallucination control. These approaches enable a fully automated pipeline that transforms generated responses into high-copying preference data for training CopyPasteLLM. On FaithEval, ConFiQA and PubMedQA, CopyPasteLLM achieves best performance in both counterfactual and original contexts, remarkably with 12.2\% to 24.5\% accuracy improvements on FaithEval over the best baseline, while requiring only 365 training samples, \textit{1/50th} of baseline data. To elucidate CopyPasteLLM's effectiveness, we propose the \textit{Context-Parameter Copying Capturing} algorithm. Interestingly, this reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation.
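The abstract does not define "copying degree" precisely; one simple proxy is the fraction of response n-grams that appear verbatim in the retrieved context. The helper below uses whitespace tokenization and 4-grams purely for illustration and is not the paper's measure.

```python
def copying_degree(response, context, n=4):
    """Fraction of response n-grams that appear verbatim in the context.
    A simple proxy for 'copying degree'; the paper's exact measure may differ."""
    resp_tokens = response.split()
    ctx_tokens = context.split()
    if len(resp_tokens) < n:
        return 0.0
    ctx_ngrams = {tuple(ctx_tokens[i:i + n]) for i in range(len(ctx_tokens) - n + 1)}
    resp_ngrams = [tuple(resp_tokens[i:i + n]) for i in range(len(resp_tokens) - n + 1)]
    hits = sum(1 for g in resp_ngrams if g in ctx_ngrams)
    return hits / len(resp_ngrams)

ctx = "the eiffel tower was completed in 1889 and stands 330 metres tall"
resp = "the eiffel tower was completed in 1889"
print(round(copying_degree(resp, ctx), 2))  # high value -> response copies the context closely
```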
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2933. Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs
💬 15
⭐ 5.33
📍 interpretability and explainable AI
🏷️ Large Language Models; Reinforcement Learning Fine-Tuning; Edge Attribution Patching
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training.
A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves.
However, the underlying mechanisms why RL fine-tuning is able to enhance the capability of various LLMs with distinct intrinsic characteristics remain underexplored.
In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning.
Our analysis across multiple model families and mathematical datasets shows two robust effects of online RL post-training: (i) an overall increase in average activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in mathematical generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches.
Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2934. Neural Synchrony Between Socially Interacting Language Models
💬 17
⭐ 5.33
📍 applications to neuroscience & cognitive science
🏷️ language models social mind inter-brain synchrony social interaction
Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction. Traditionally, social minds have been regarded as an exclusive property of living beings. Although large language models (LLMs) are widely accepted as powerful approximations of human behavior, with multi-LLM system being extensively explored to enhance their capabilities, it remains controversial whether they can be meaningfully compared to human social minds. In this work, we explore neural synchrony between socially interacting LLMs as an empirical evidence for this debate. Specifically, we introduce neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level. Through carefully designed experiments, we demonstrate that it reliably reflects both social engagement and temporal alignment in their interactions. Our findings indicate that neural synchrony between LLMs is strongly correlated with their social performance, highlighting an important link between neural synchrony and the social behaviors of LLMs. Our work offers a new perspective to examine the "social minds" of LLMs, highlighting surprising parallels in the internal dynamics that underlie human and LLM social interaction.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2935. Fast Data Mixture Optimization via Gradient Descent
💬 18
⭐ 5.33
📍 foundation or frontier models, including LLMs
🏷️ Large models Data-centric AI AutoML
While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FastMix, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FastMix jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FastMix is a reformulation of mixture selection as a bilevel optimization problem.
Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FastMix implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FastMix outperforms baselines while drastically reducing search cost: in pre-training, it attains an average score of 48.2 with 1.3 GPU-hours ($\times 550$ vs. RegMix; $\times 55$ vs. CLIMB), and in post-training (SFT) it leads with 65.4 with a $+5.5$ gain over the next best, completing search in 2.2 GPU-hours compared to the 115 GPU-hours required by CLIMB/RegMix.
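A minimal sketch of the alternating inner/outer structure described above is shown below. It is not the authors' implementation: the data loader, the proxy model, and in particular the outer update (which here scores sources by gradient alignment with a validation batch, a common simplification of bilevel validation feedback) are all assumptions.

```python
import torch

torch.manual_seed(0)
K, d, c = 3, 16, 4                            # number of sources, input dim, classes (assumptions)
model = torch.nn.Linear(d, c)                 # stand-in proxy model
logits = torch.zeros(K)                       # mixture ratios = softmax(logits)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.functional.cross_entropy

def batch():                                  # stand-in per-source data loader (assumption)
    return torch.randn(8, d), torch.randint(0, c, (8,))

def flat_grad(loss):
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

for step in range(50):
    weights = torch.softmax(logits, dim=0)
    source_batches = [batch() for _ in range(K)]
    # (i) inner loop: update the model on a mixture-weighted per-source loss
    opt.zero_grad()
    inner = sum(weights[k] * loss_fn(model(x), y) for k, (x, y) in enumerate(source_batches))
    inner.backward()
    opt.step()
    # (ii) outer loop: move mixture weight toward sources whose gradients align with
    # a validation gradient (a simple stand-in for the paper's validation feedback)
    x_val, y_val = batch()
    g_val = flat_grad(loss_fn(model(x_val), y_val))
    sims = []
    for x, y in source_batches:
        g_k = flat_grad(loss_fn(model(x), y))
        sims.append(torch.dot(g_k, g_val) / (g_k.norm() * g_val.norm() + 1e-8))
    logits = logits + 0.5 * torch.stack(sims)
```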
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2936. RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
💬 16
⭐ 5.33
📍 generative models
🏷️ Image Editing Efficient Diffusion Transformer Acceleration
Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose \textbf{RegionE}, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step.
Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas.
For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information.
3) Adaptive Velocity Decay Cache.
Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process.
We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57×, 2.41×, and 2.06×, respectively, with minimal quality loss (PSNR: 30.520–32.133). Evaluations by GPT-4o also confirmed that semantic and perceptual fidelity were well preserved.
Code will be open-sourced.
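The adaptive region partition can be pictured as thresholding the difference between the current one-step estimate of the final image and the reference. The sketch below is only illustrative: the threshold, the dilation, and the use of a mean absolute difference are assumptions, not RegionE's actual criterion.

```python
import torch
import torch.nn.functional as F

def partition_regions(x0_estimate, reference, thresh=0.05, dilate=7):
    """Split an image into 'edited' and 'unedited' masks by comparing the current
    one-step estimate of the final result with the reference image.
    x0_estimate, reference: (C, H, W) tensors in [0, 1]."""
    diff = (x0_estimate - reference).abs().mean(dim=0, keepdim=True)   # (1, H, W)
    edited = (diff > thresh).float()
    # dilate the edited mask slightly so boundary pixels are treated conservatively
    edited = F.max_pool2d(edited.unsqueeze(0), dilate, stride=1, padding=dilate // 2).squeeze(0)
    return edited, 1.0 - edited   # edited mask (iterative denoising), unedited mask (one-step)

x0 = torch.rand(3, 64, 64)
ref = x0.clone()
x0[:, 20:40, 20:40] += 0.5        # pretend a local edit happened
edited_mask, unedited_mask = partition_regions(x0.clamp(0, 1), ref)
print(edited_mask.mean().item())  # fraction of pixels routed to full iterative denoising
```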
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2937. Hidden Patterns in Chain-of-Thought Reasoning
💬 11
⭐ 5.33
📍 foundation or frontier models, including LLMs
🏷️ Reasoning Chain-of-thought Mathematical reasoning
Chain-of-thought (CoT) prompting is a de-facto standard technique to elicit reasoning-like responses from large language models (LLMs), allowing them to spell out individual steps before giving a final answer. While the resemblance to human-like reasoning is undeniable, the driving forces underpinning the success of CoT reasoning still remain largely unclear. In this work, we perform an in-depth analysis of CoT traces originating from competition-level mathematics questions, with the aim of better understanding how, and which parts of CoT actually contribute to the final answer. To this end, we introduce the notion of a \textit{potential}, quantifying how much a given part of CoT increases the likelihood of a correct completion. Upon examination of reasoning traces through the lens of the potential, we identify surprising patterns including (1) its often strong non-monotonicity (due to reasoning tangents), (2) very sharp but sometimes tough to interpret spikes (reasoning insights and jumps) as well as (3) at times lucky guesses, where the model arrives at the correct answer without providing any relevant justifications before. While some of the behaviours of the potential are readily interpretable and align with human intuition (such as insights and tangents), others remain difficult to understand from a human perspective. To further quantify the reliance of LLMs on reasoning insights, we investigate the notion of CoT transferability, where we measure the potential of a weaker model under the partial CoT from another, stronger model. Indeed aligning with our previous results, we find that as little as 20% of partial CoT can ``unlock'' the performance of the weaker model on problems that were previously unsolvable for it, highlighting that a large part of the mechanics underpinning CoT are transferable.
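The "potential" of a partial chain of thought, as described above, lends itself to a simple Monte-Carlo estimate: sample several completions from the prefix and measure how often they reach the correct answer. The sketch below leaves the sampler and answer checker as placeholder callables (`sample_completion`, `is_correct`); these are assumptions rather than the paper's interface.

```python
def estimate_potential(prefix, question, answer, sample_completion, is_correct, k=16):
    """Monte-Carlo estimate of the 'potential' of a partial chain of thought:
    the probability that completing from this prefix yields the correct answer.
    `sample_completion(question, prefix)` returns one completion string,
    `is_correct(completion, answer)` returns True/False (placeholders)."""
    hits = sum(is_correct(sample_completion(question, prefix), answer) for _ in range(k))
    return hits / k

def potential_curve(cot_steps, question, answer, sample_completion, is_correct, k=16):
    """Potential after each reasoning step; non-monotone dips suggest tangents,
    sharp jumps suggest 'insight' steps, and a correct answer reached from a
    near-zero potential prefix looks like a lucky guess."""
    curve = []
    for i in range(len(cot_steps) + 1):
        prefix = "\n".join(cot_steps[:i])
        curve.append(estimate_potential(prefix, question, answer, sample_completion, is_correct, k))
    return curve
```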
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2938. Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures
💬 13
⭐ 5.33
📍 alignment, fairness, safety, privacy, and societal considerations
🏷️ Large Language Models (LLMs) Machine Unlearning
Machine unlearning aims to remove specific content from trained models while preserving overall performance.
However, the phenomenon of benign relearning, in which forgotten information reemerges even from benign fine-tuning data, reveals that existing unlearning methods remain fundamentally fragile.
A common explanation attributes this effect to topical relevance, but we find this account insufficient.
Through systematic analysis, we demonstrate that syntactic similarity, rather than topicality, is the primary driver: across benchmarks, syntactically similar data consistently trigger recovery even without topical overlap, due to their alignment in representations and gradients with the forgotten content.
Motivated by this insight, we introduce syntactic diversification, which paraphrases the original forget queries into heterogeneous structures prior to unlearning.
This approach effectively suppresses benign relearning, accelerates forgetting, and substantially alleviates the trade-off between unlearning efficacy and model utility.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2939. Charts Are Not Images: On the Challenges of Scientific Chart Editing
💬 22
⭐ 5.33
📍 datasets and benchmarks
🏷️ Scientific Chart Editing Benchmark
Generative models, such as diffusion and autoregressive approaches, have demonstrated impressive capabilities in editing natural images. However, applying these tools to scientific charts rests on a flawed assumption, because a chart is not merely an arrangement of pixels but a visual representation of structured data governed by a graphical grammar. Consequently, chart editing is not a pixel-manipulation task but a structured transformation problem.
To address this fundamental mismatch, we introduce \textit{FigEdit}, a large-scale benchmark for scientific figure editing comprising over 30,000 samples. Grounded in real-world data, our benchmark is distinguished by its diversity, covering 10 distinct chart types and a rich vocabulary of complex editing instructions. The benchmark is organized into five distinct and progressively challenging tasks: single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer.
Our evaluation of a range of state-of-the-art models on this benchmark reveals their poor performance on scientific figures, as they consistently fail to handle the underlying structured transformations required for valid edits. Furthermore, our analysis indicates that traditional evaluation metrics (e.g., SSIM, PSNR) have limitations in capturing the semantic correctness of chart edits. Our benchmark demonstrates the profound limitations of pixel-level manipulation and provides a robust foundation for developing and evaluating future structure-aware models. By releasing \textit{FigEdit}, we aim to enable systematic progress in structure-aware figure editing, provide a common ground for fair comparison, and encourage future research on models that understand both the visual and semantic layers of scientific charts.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2940. Block-wise Adaptive Caching for Accelerating Diffusion Policy
💬 12
⭐ 5.33
📍 applications to robotics, autonomy, planning
🏷️ Efficient AI Diffusion Policy Visuomotor Policy Robotics Action Generation Model Caching.
Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control.
Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences.
In this paper, we propose **B**lock-wise **A**daptive **C**aching (**BAC**), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities vary non-uniformly across timesteps and blocks.
To operationalize this insight, we first propose the Adaptive Caching Scheduler, designed to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to significant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with significant caching errors before downstream FFNs.
As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to $3 \times$ inference speedup for free. Project page: https://block-wise-adaptive-caching.github.io.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2941. On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding
💬 18
⭐ 5.33
📍 applications to computer vision, audio, language, and other modalities
🏷️ Temporal action understanding multimodal large language models
Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding.
We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency.
To this end, we propose Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5\% accuracy gain and 3$\times$ faster inference on our largest COIN benchmark.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2942. SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
💬 9
⭐ 5.33
📍 generative models
🏷️ Diffusion models;RL
Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6\% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2943. Dataset Distillation for Memorized Data: Soft Labels can Leak Held-Out Teacher Knowledge
💬 7
⭐ 5.33
📍 learning theory
🏷️ distillation memorization generalization learning theory model transfer privacy
Dataset distillation aims to compress training data into fewer examples via a teacher, from which a student can learn effectively. While its success is often attributed to structure in the data, modern neural networks also memorize specific facts, but if and how such memorized information can be transferred in distillation settings remains less understood. While this transfer may be desirable in some applications, it also raises privacy concerns, where preventing such leakage is crucial. In this work, we show that students trained on soft labels from teachers can indeed achieve non-trivial accuracy on held-out memorized data they never directly observed. This effect persists on structured data when the teacher has not generalized. To understand this effect in isolation, we consider finite random i.i.d. datasets where generalization is a priori impossible and a successful teacher fit implies pure memorization. Still, students can learn non-trivial information about the held-out data, in some cases up to perfect accuracy. For multinomial logistic classification and single layer MLPs, we show this corresponds to the setting where the teacher can be recovered functionally -- the student matches the teacher's predictions on all possible inputs, including the held-out memorized data. We empirically show that these phenomena strongly depend on the sample complexity and the temperature with which the logits are smoothed, but persist across varying network capacities, architectures and dataset compositions.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2944. 3D-aware Disentangled Representation for Compositional Reinforcement Learning
💬 15
⭐ 5.33
📍 reinforcement learning
🏷️ object-centric learning compositional generalization goal-conditioned reinforcement learning
Vision-based reinforcement learning can benefit from object-centric scene representation, which factorizes the visual observation into individual objects and their attributes, such as color, shape, size, and position.
While such object-centric representations can extract components that generalize well for various multi-object manipulation tasks, they are prone to issues with occlusions and 3D ambiguity of object properties due to their reliance on single-view 2D image features.
Furthermore, the entanglement between object configurations and camera poses complicates the object-centric disentanglement in 3D, leading to poor 3D reasoning by the agent in vision-based reinforcement learning applications.
To address the lack of 3D awareness and the object-camera entanglement problem, we propose an enhanced 3D object-centric representation that utilizes multi-view 3D features and enforces more explicit 3D-aware disentanglement.
The enhancement is based on the integration of the recent success of multi-view Transformer and the prototypical representation learning among the object-centric representations.
The representation, therefore, can stably identify proxies of 3D positions of individual objects along with their semantic and physical properties, exhibiting excellent interpretability and controllability.
Then, our proposed block transformer policy effectively performs novel tasks by assembling desired properties adaptive to the new goal states, even when provided with unseen viewpoints at test time.
We demonstrate that our 3D-aware block representation is scalable to compose diverse novel scenes and enjoys superior performance in out-of-distribution tasks with multi-object manipulations under both seen and unseen viewpoints compared to existing methods.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2945. Are Global Dependencies Necessary? Scalable Time Series Forecasting via Local Cross-Variate Modeling
💬 16
⭐ 5.33
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ Time Series Forecasting Time Series Analysis Deep Learning
Effectively modeling cross-variate dependencies is a central, yet challenging, task in multivariate time series forecasting.
While attention-based methods have advanced the state-of-the-art by capturing global cross-variate dependencies, their quadratic complexity with respect to the number of variates severely limits their scalability. In this work, we challenge the necessity of global dependency modeling. We posit, through both theoretical analysis and empirical evidence, that modeling local cross-variate interactions is not only sufficient but also more efficient for many dense dependency systems.
Motivated by this core insight, we propose VPNet, a novel architecture that excels in both accuracy and efficiency. VPNet's design is founded on two key principles: a channelized reinterpretation of patch embeddings into a higher-level variate-patch field, and a specialized VarTCNBlock that operates upon it. Specifically, the model first employs a patch-level autoencoder to extract robust local representations. In a pivotal step, these representations are then re-conceptualized as a 2D field constructed over a "variates × patches" grid. The VarTCNBlock then applies depthwise 2D convolutions across this field to efficiently capture local spatio-temporal patterns (i.e., cross-variate and temporal dependencies simultaneously), followed by pointwise convolutions for feature mixing. This design ensures that the computational complexity scales linearly with the number of variates. Finally, variate-wise prediction heads map the refined historical patch representations to future ones, which are decoded back into the time domain. Extensive experiments demonstrate that VPNet not only achieves state-of-the-art performance across multiple benchmarks but also offers significant efficiency gains, establishing it as a superior and scalable solution for high-dimensional forecasting.
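The "depthwise 2D convolution over a variates × patches grid followed by pointwise mixing" can be written as a compact PyTorch block. The module below is a minimal sketch in that spirit; the kernel size, normalization, activation, and residual wiring are assumptions, and the actual VarTCNBlock may differ.

```python
import torch
import torch.nn as nn

class LocalVariatePatchBlock(nn.Module):
    """Depthwise 2D convolution over a (variates x patches) grid, followed by
    pointwise channel mixing. Local kernels keep the cost linear in the number
    of variates, in contrast to global cross-variate attention."""
    def __init__(self, channels, kernel=(3, 3)):
        super().__init__()
        pad = (kernel[0] // 2, kernel[1] // 2)
        self.depthwise = nn.Conv2d(channels, channels, kernel, padding=pad, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x):            # x: (batch, channels, variates, patches)
        y = self.depthwise(x)        # local cross-variate + temporal-patch interactions
        y = self.pointwise(self.act(self.norm(y)))
        return x + y                 # residual connection

x = torch.randn(2, 64, 21, 12)       # 21 variates, 12 patches, 64-dim patch embeddings
print(LocalVariatePatchBlock(64)(x).shape)   # torch.Size([2, 64, 21, 12])
```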
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2946. Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection
💬 4
⭐ 5.33
📍 foundation or frontier models, including LLMs
🏷️ Large Language Models LLM Steering Instruction following Activation engineering
Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they risk oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache without requiring an extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2947. Thicker and Quicker: The Jumbo Token for Fast Plain Vision Transformers
💬 13
⭐ 5.33
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ efficient deep learning computer vision vision transformers
ViTs are general and accurate, and address many tasks, but they are slow and not always practical when efficiency is key. Existing methods for faster ViTs either design hybrid non-ViT architectures, losing generality, or shrink their tokens, sacrificing accuracy. While many non-ViT architectures are both fast and accurate, they cannot flexibly process other input shapes, pre-train with SOTA self-supervised learning, reduce computation by dropping tokens, and more, as ViTs can. We make ViTs faster by reducing patch token width while increasing global token width through a new Jumbo token. Our wider Jumbo token is processed by its own wider FFN to increase model capacity. Yet the Jumbo FFN is efficient: it processes a single token, for speed, and its parameters are shared across all layers, for memory. Crucially, Jumbo is attention-only and non-hierarchical, like a plain ViT, so it is simple, scalable, flexible, and compatible with ViT methods new and old. Jumbo improves over ViT baselines with Registers from Nano to Large scales while maintaining speed/throughput on ImageNet-1K (↑0.1–13%). Jumbo also improves MAE pre-training (↑4.9% linear probing on ImageNet-1K), test-time adaptation (↑5.2% on ImageNet-C), and time series modeling. Our Jumbo models even achieve better speed-accuracy trade-offs than specialized non-ViT compute-efficient models, while maintaining plain-ViT compatibility for practicality.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2948. DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
💬 24
⭐ 5.33
📍 applications to computer vision, audio, language, and other modalities
🏷️ object detection prompt-based detection open-set object detection
Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2949. Tree Search for LLM Agent Reinforcement Learning
💬 20
⭐ 5.33
📍 foundation or frontier models, including LLMs
🏷️ Tree Search LLM Agent Reinforcement Learning
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs).
In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision.
To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step.
By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls.
Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward.
Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels.
Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning.
Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
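The way a tree of rollouts turns a single outcome reward into step-wise signals can be illustrated with a toy example: score each branch against its siblings under the same prefix. The snippet below is a simplification for intuition only, not Tree-GRPO's estimator (it omits inter-tree normalization, token-level credit assignment, and the policy-gradient step); the tree layout and rewards are made up.

```python
import statistics

# A rollout tree as a dict: node -> list of children; leaves carry outcome rewards.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaf_reward = {"a1": 1.0, "a2": 0.0, "b1": 0.0, "b2": 0.0}

def node_value(node):
    """Mean outcome reward of all leaves below `node`."""
    if node in leaf_reward:
        return leaf_reward[node]
    return statistics.mean(node_value(c) for c in children[node])

def intra_tree_advantages():
    """For each step (edge parent -> child), compare the child's value with its siblings'.
    This turns a single outcome reward into step-wise process signals."""
    adv = {}
    for parent, kids in children.items():
        baseline = statistics.mean(node_value(c) for c in kids)
        for c in kids:
            adv[(parent, c)] = node_value(c) - baseline
    return adv

print(intra_tree_advantages())
# ('root', 'a') gets a positive advantage because its subtree found the correct answer,
# while ('root', 'b') gets a negative one, even though only final outcomes were rewarded.
```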
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2950. WaterDrum: Watermark-based Data-centric Unlearning Metric
💬 19
⭐ 5.33
📍 alignment, fairness, safety, privacy, and societal considerations
🏷️ machine unlearning watermarking metric LLM
Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. Existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when the forget and retain sets have semantically similar content and/or retraining the model from scratch on the retain set is impractical. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking to overcome these limitations. We introduce new benchmark datasets (with different levels of data similarity) for LLM unlearning that can be used to rigorously evaluate unlearning algorithms via WaterDrum.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2951. LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
💬 21
⭐ 5.33
📍 applications to computer vision, audio, language, and other modalities
🏷️ 4D scene understanding large multimodal model spatiotemporal prompt multimodal learning
Despite achieving significant progress in 2D image understanding, large multimodal models (LMMs) struggle in the physical world due to the lack of spatial representation. Typically, existing 3D LMMs mainly embed 3D positions as fixed spatial prompts within visual features to represent the scene. However, these methods are limited to understanding the static background and fail to capture temporally varying dynamic objects. In this paper, we propose LLaVA-4D, a general LMM framework with a novel spatiotemporal prompt for visual representation in 4D scene understanding. The spatiotemporal prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Moreover, we demonstrate that spatial and temporal components disentangled from visual features are more effective in distinguishing the background from objects. This motivates embedding the 4D spatiotemporal prompt into these features to enhance the dynamic scene representation. By aligning visual spatiotemporal embeddings with language embeddings, LMMs gain the ability to understand both spatial and temporal characteristics of static background and dynamic objects in the physical world. Additionally, we construct a 4D vision-language dataset with spatiotemporal coordinate annotations for instruction fine-tuning LMMs. Extensive experiments have been conducted to demonstrate the superiority of our method on various tasks of 4D scene understanding. Our code will be open-sourced on paper acceptance.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2952. Solving the 2-norm k-hyperplane clustering problem via multi-norm formulations
💬 14
⭐ 5.33
📍 optimization
🏷️ optimization; spatial branch-and-bound
We tackle the 2-norm (Euclidean) k-Hyperplane Clustering problem (k-HC2), which asks for finding k hyperplanes that minimize the sum of squared 2-norm (Euclidean) distances between each point and its closest hyperplane. We solve the problem to global optimality via spatial branch-and-bound techniques (SBB) by strengthening a mixed integer quadratically-constrained quadratic programming formulation with constraints that arise when formulating the problem in p-norms with $p \neq 2$. In particular, we show that, for every (appropriately scaled) $p \in \mathbb{N} \cup \{\infty\}$, one obtains a variant of k-HC2 whose optimal solutions yield lower bounds within a multiplicative approximation factor. We focus on the case of polyhedral norms with $p = 1, \infty$ (which admit a disjunctive-programming reformulation), and prove that strengthening the original formulation by including, on top of the original 2-norm constraints, the constraints of one of the polyhedral norms leads to an SBB method where nonzero lower bounds are obtained in a linear (as opposed to exponential) number of SBB nodes. Experimentally, we show that our strengthened formulations lead to speedups from 41 to 1.5 orders of magnitude, drastically improving the problem's solvability to global optimality.
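For concreteness, the 2-norm k-hyperplane clustering objective can be written as follows, in the notation commonly used for this problem (generic symbols for m points $a_i \in \mathbb{R}^d$, not necessarily the paper's):

```latex
% k-HC2: fit k hyperplanes H_j = { x : w_j^\top x = \gamma_j } minimizing the
% total squared Euclidean point-to-hyperplane distance (unit normals w_j).
\min_{\substack{w_1,\dots,w_k \in \mathbb{R}^d,\; \|w_j\|_2 = 1 \\ \gamma_1,\dots,\gamma_k \in \mathbb{R}}}
\;\; \sum_{i=1}^{m} \; \min_{j \in \{1,\dots,k\}} \bigl( w_j^\top a_i - \gamma_j \bigr)^2
```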
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
2953. CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition
💬 13
⭐ 5.33
📍 probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
🏷️ Conformal prediction Uncertainty quantification Image segmentation
Most machine learning-based image segmentation models produce pixel-wise confidence scores that represent the model's predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates.
Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates.
To address this, we propose CONSIGN (*Conformal Segmentation Informed by Spatial Groupings via Decomposition*), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation.
Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees.
It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs.
We evaluate CONSIGN against two CP baselines across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2954. Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization
๐ฌ 18
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Importance Sampling Direct Preference Optimization
Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval, which fail to capture the complex nature of multimodal preferences, inducing optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate \emph{multiple}, semantically \emph{diverse} negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pre-training) space and applies a sparse autoencoder to decompose semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
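As a rough illustration of the multi-negative objective, the sketch below computes a Plackett-Luce log-likelihood in which the preferred image is ranked above several negatives. The scores stand in for DPO-style implicit rewards; the paper's CLIP-based negative selection and importance-sampling weights are not reproduced here.

```python
# Hedged sketch: Plackett-Luce listwise log-likelihood over one positive and
# several negatives.  In a DPO-style setup the scores would be implicit
# rewards (e.g., beta * log pi_theta / pi_ref); here they are plain floats.
import numpy as np

def plackett_luce_loglik(scores_ranked):
    """log P(ranking): at each position, a softmax over the items not yet placed."""
    s = np.asarray(scores_ranked, dtype=float)
    ll = 0.0
    for i in range(len(s) - 1):
        tail = s[i:]
        ll += tail[0] - tail.max() - np.log(np.exp(tail - tail.max()).sum())
    return ll

def multi_negative_loss(pos_score, neg_scores):
    """Negative PL log-likelihood with the positive ranked above all negatives
    (negatives ordered by score, a common convention)."""
    ranked = [pos_score] + sorted(neg_scores, reverse=True)
    return -plackett_luce_loglik(ranked)

print(multi_negative_loss(2.0, [0.5, -0.3, 0.1]))
```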
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2955. How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
๐ฌ 21
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Medical MLLM Visual Grounding
Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation.
**In this work**, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle *visual grounding* from *semantic grounding*, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs.
We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across **eight** state-of-the-art (SOTA) medical MLLMs validates that they often fail to ground their predictions in clinically relevant image regions. We note that this finding is specific to medical image analysis; in contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images.
Motivated by these findings, we propose VGRefine, a simple yet effective inference-time method that refines attention distribution to improve visual grounding in medical settings. Our approach achieves SOTA performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples from 8 imaging modalities) without requiring additional training or external expert models. Overall, our work, for the first time, systematically validates inadequate visual grounding as one of the key contributing factors for medical MLLMs' under-performance.
Code and additional experiments are included in the Supp.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2956. The Price of Robustness: Stable Classifiers Need Overparameterization
๐ฌ 16
โญ 5.33
๐ learning theory
๐ท๏ธ concentration inequalities isoperimetry robustness stability classification problems generalization overparameterization
The relationship between overparameterization, stability, and generalization remains incompletely understood in the setting of discontinuous classifiers. We address this gap by establishing a generalization bound for finite function classes that improves inversely with _class stability_, defined as the expected distance to the decision boundary in the input domain (margin). Interpreting class stability as a quantifiable notion of robustness, we derive as a corollary a _law of robustness_ for classification that extends the results of Bubeck and Sellke beyond smoothness assumptions to discontinuous functions. In particular, any interpolating model with $p \approx n$ parameters on $n$ data points must be _unstable_, implying that substantial overparameterization is necessary to achieve high stability. We obtain analogous results for (parameterized) infinite function classes by analyzing a stronger robustness measure derived from the margin in the co-domain, which we refer to as the _normalized co-stability_. Experiments support our theory: stability increases with model size and correlates with test performance, while traditional norm-based measures remain largely uninformative.
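To make the central quantity tangible: for a linear classifier the distance to the decision boundary has a closed form, so the empirical class stability (mean margin in the input domain) can be computed directly, as in this small illustrative snippet. The paper's results concern general, possibly discontinuous classifiers; the linear case is simply where the distance is available in closed form.

```python
# Illustrative sketch: empirical class stability of a linear classifier,
# i.e. the mean input-space distance to the decision boundary (margin).
# For f(x) = sign(w^T x + b) this distance is |w^T x + b| / ||w||_2.
import numpy as np

def class_stability(X, w, b):
    return np.abs(X @ w + b).mean() / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
w, b = rng.normal(size=20), 0.1
print(class_stability(X, w, b))
```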
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2957. DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
๐ฌ 23
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Learning Multimodal Representation Learning Multimodal Alignment
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges to achieving effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging Gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by aligning latent distributions with Maximum Mean Discrepancy regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, thereby further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in achieving superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2958. UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling
๐ฌ 14
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Hand Motion Modeling Diffusion Model
Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (*i.e.*, 3D hand pose sequences over time) remains challenging.
Research in this area is typically divided into two tasks:
(1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence;
(2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences.
However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks.
We present **UniHand**, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis.
UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons.
Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines.
A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions.
Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2959. RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference
๐ฌ 15
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Early Exit; Retrieval Augmentation; Large Language Model
Deploying large language model inference remains challenging due to their high computational overhead.
Early exit optimizes model inference by adaptively reducing the number of inference layers.
Current methods typically train internal classifiers or use heuristic methods to determine the exit layer.
However, those methods either introduce significant training overheads or lead to performance degradation.
To address these limitations, this paper proposes RAEE, a robust Retrieval-Augmented Early Exit framework that not only enables early exit but also enhances model performance through corrective exit information at intermediate layers.
This paper first demonstrates that the early exit problem can be effectively modeled as a distribution prediction problem, in which the distribution can be further approximated through the exit information of similar data.
Subsequently, this paper introduces the process of collecting exit information of correct predictions and the steps to construct the retrieval database.
Finally, leveraging the pre-constructed retrieval database, RAEE utilizes the exit information from retrieved similar data to guide the backbone model's exit.
Experimental results demonstrate that RAEE not only accelerates inference but also achieves robust zero-shot performance across eight downstream tasks.
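A minimal sketch of the retrieval step, under our own assumptions about the database layout: each stored example carries an embedding and the earliest layer at which its intermediate prediction was already correct, and a nearest-neighbour lookup turns that into an exit-layer distribution for a new input.

```python
# Hedged sketch of a retrieval-augmented early-exit lookup: store, for each
# calibration example, an embedding plus the earliest layer whose intermediate
# prediction was already correct; at inference, aggregate the neighbours'
# exit layers to pick an exit for the new input.
import numpy as np

class ExitRetriever:
    def __init__(self, embeddings, exit_layers):
        self.E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.exits = np.asarray(exit_layers)

    def predict_exit(self, query, k=8):
        q = query / np.linalg.norm(query)
        idx = np.argsort(-(self.E @ q))[:k]          # top-k by cosine similarity
        hist = np.bincount(self.exits[idx], minlength=self.exits.max() + 1)
        return hist.argmax(), hist / hist.sum()      # chosen layer + exit distribution

rng = np.random.default_rng(0)
db = ExitRetriever(rng.normal(size=(500, 64)), rng.integers(1, 12, size=500))
layer, dist = db.predict_exit(rng.normal(size=64))
print(layer, dist.round(2))
```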
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2960. QuRL: Rubrics As Judge For Open-Ended Question Answering
๐ฌ 14
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ rubrics reinforcement-learning open-ended qa large language model generation
Reinforcement Learning from Verifiable Rewards (RLVR) has significantly improved the performance of large language models (LLMs) on tasks with gold ground truth, such as code generation and mathematical reasoning. However, its application to open-ended question answering (QA) remains challenging, primarily due to the absence of reliable evaluation and verifiable reward signals. This difficulty is further compounded by the limitations of existing evaluation paradigms. Previous approaches typically rely on human feedback or LLM-as-judge strategies, which are costly, prone to reward hacking, and often fail to provide sufficiently discriminative or interpretable evaluation signals. To address these limitations, we introduce a schema for generating case-wise rubrics that are question-specific, content-based, and stylistically sensitive, thereby evaluating both factual soundness and writing quality. Building on this schema, we propose QuRL (Open-Ended QA with Rubric-guided Reinforcement Learning), a framework that automatically mines rubrics for each question from easily accessible online sources and leverages them as reward signals. With these rubrics, QuRL employs the GRPO (Group Relative Policy Optimization) algorithm to guide the model in exploring the correct generation path. Extensive experiments show that our framework achieves a significant overall improvement of +17.0 points on the evaluation benchmarks, demonstrating the effectiveness of rubric-guided reinforcement learning for open-ended QA.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2961. Rethinking Model Calibration through Spectral Entropy Regularization in Medical Image Segmentation
๐ฌ 6
โญ 5.33
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ medical image segmentation model calibration spectral entropy
Deep neural networks for medical image segmentation often produce overconfident predictions, posing clinical risks due to miscalibrated uncertainty estimates. In this work, we rethink model calibration from a frequency-domain perspective and identify two critical factors causing miscalibration: spectral bias, where models overemphasize low-frequency components, and confidence saturation, which suppresses overall power spectral density in confidence maps. To address these challenges, we propose a novel frequency-aware calibration framework integrating spectral entropy regularization and power spectral smoothing. The spectral entropy term promotes a balanced frequency spectrum and enhances overall spectral power, enabling better modeling of high-frequency boundary and low-frequency structural uncertainty. The smoothing module stabilizes frequency-wise statistics across training batches, reducing sample-specific fluctuations. Extensive experiments on six public medical imaging datasets and multiple segmentation architectures demonstrate that our approach consistently improves calibration metrics without sacrificing segmentation accuracy.
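One plausible reading of the regularized quantity is the Shannon entropy of a confidence map's normalized FFT power spectrum, sketched below; the paper's exact regularizer and its power-spectral smoothing module may differ in detail.

```python
# Hedged sketch: spectral entropy of a 2D confidence map, computed as the
# Shannon entropy of its normalized FFT power spectrum.  A spectrum
# concentrated in a few (typically low) frequencies has low entropy; a
# regularizer like the one described would encourage higher values.
import numpy as np

def spectral_entropy(conf_map, eps=1e-12):
    power = np.abs(np.fft.fft2(conf_map)) ** 2
    p = power / (power.sum() + eps)                # normalized power spectral density
    return -(p * np.log(p + eps)).sum()

rng = np.random.default_rng(0)
smooth = rng.normal(size=(64, 64)).cumsum(0).cumsum(1)   # low-frequency-heavy map
noisy = rng.normal(size=(64, 64))                        # broadband map
print(spectral_entropy(smooth), spectral_entropy(noisy))
```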
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2962. SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion
๐ฌ 16
โญ 5.33
๐ generative models
๐ท๏ธ Multi-objective optimization Denoising Diffusion Probabilistic Models Multiple gradient descent Offline multi-objective optimization Multi-objective Bayesian optimization Diffusion Transformer
Developing efficient multi-objective optimization methods to compute the Pareto set of optimal compromises between conflicting objectives remains a key challenge, especially for large-scale and expensive problems. To bridge this gap, we introduce SPREAD, a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs). SPREAD first learns a conditional diffusion process over points sampled from the decision space and then, at each reverse diffusion step, refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence alongside a Gaussian RBF-based repulsion term for diversity. Empirical results on multi-objective optimization benchmarks, including offline and Bayesian surrogate-based settings, show that SPREAD matches or exceeds leading baselines in efficiency, scalability, and Pareto front coverage.
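The refinement idea can be caricatured in a few lines: take a gradient step on the objectives and add a Gaussian-RBF repulsion between candidates so they stay spread out. The sketch below uses an equally weighted gradient combination in place of the paper's adaptive multiple-gradient-descent update and omits the diffusion model entirely.

```python
# Hedged sketch of one refinement step: descend a (here, equally weighted)
# combination of objective gradients, plus a Gaussian-RBF repulsion force so
# candidate solutions spread out rather than collapse onto one point.
import numpy as np

def refine_step(X, grads, lr=0.05, repulse=0.1, bandwidth=0.5):
    """X: (n, d) candidates; grads: list of (n, d) per-objective gradients."""
    descent = np.mean(grads, axis=0)                       # simplified common descent direction
    diff = X[:, None, :] - X[None, :, :]                   # pairwise differences (n, n, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))
    repulsion = (K[:, :, None] * diff).sum(axis=1)         # push away from close neighbours
    return X - lr * descent + repulse * repulsion

# Toy bi-objective: f1 = ||x - a||^2, f2 = ||x - b||^2
a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
X = np.random.default_rng(0).normal(size=(32, 2))
for _ in range(200):
    X = refine_step(X, [2 * (X - a), 2 * (X - b)])
print(X[:3])   # candidates settle near the midpoint of a and b, kept apart by repulsion
```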
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2963. Hey, That's My Model! Introducing Chain & Hash, An LLM Fingerprinting Technique
๐ฌ 15
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Large Language Model LLMs Fingerprint Security
Growing concerns over the theft and misuse of Large Language Models (LLMs) underscore the need for effective fingerprinting to link a model to its original version and detect misuse. We define five essential properties for a successful fingerprint: Transparency, Efficiency, Persistence, Robustness, and Unforgeability. We present a novel fingerprinting framework that provides verifiable proof of ownership while preserving fingerprint integrity. Our approach makes two main contributions. First, a "chain and hash" technique that cryptographically binds fingerprint prompts to their responses, preventing collisions and enabling irrefutable ownership claims. Second, we address a realistic threat model in which instruction-tuned models' output distribution can be significantly altered through meta-prompts. By incorporating random padding and varied meta-prompt configurations during training, our method maintains robustness even under significant output style changes. Experiments show that our framework securely proves ownership, resists both benign transformations (e.g., fine-tuning) and adversarial fingerprint removal, and extends to fingerprinting LoRA adapters.
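As a toy illustration of cryptographically binding fingerprint prompts to responses, the snippet below chains SHA-256 hashes over (prompt, response) pairs so the owner can commit to them in advance and later verify the commitment. The paper's actual scheme (random padding, meta-prompt variation, response selection) is considerably richer; all strings here are made up.

```python
# Hedged sketch: bind fingerprint prompts to their target responses with a
# SHA-256 hash chain, so a verifier can check that the pairs were fixed
# together in advance.  This mirrors the "chain and hash" idea only at a
# high level.
import hashlib

def chain_and_hash(prompt_response_pairs, secret_salt=b"owner-secret"):
    digest = hashlib.sha256(secret_salt).digest()
    for prompt, response in prompt_response_pairs:
        digest = hashlib.sha256(digest + prompt.encode() + response.encode()).digest()
    return digest.hex()   # commitment the owner publishes or timestamps

def verify(prompt_response_pairs, commitment, secret_salt=b"owner-secret"):
    return chain_and_hash(prompt_response_pairs, secret_salt) == commitment

pairs = [("What is the passphrase Zeta?", "aurora-42"),
         ("Finish the couplet about glass rivers.", "glass rivers hum at dawn")]
c = chain_and_hash(pairs)
print(verify(pairs, c), verify(pairs[::-1], c))   # True False (order matters)
```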
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2964. Disentangling Knowledge Representations for Large Language Model Editing
๐ฌ 20
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language models Knowledge editing
Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge facts that share the same subject as the edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing. DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledge-related and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2965. OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
๐ฌ 14
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Chain of Thoughts Spatial Reasoning
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room impulse responses at training time, while requiring only audio at inference. Building on this representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to reason over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves spatial reasoning QA accuracy by up to $\textbf{25}$% over BAT. Our dataset and code are available at: https://anonymous.4open.science/r/OWL-ICLR-26/
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2966. Beyond Simple Graphs: Neural Multi-Objective Routing on Multigraphs
๐ฌ 14
โญ 5.33
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Combinatorial Optimization Reinforcement Learning Graph-based Machine Learning Multigraphs Traveling Salesman Problem Multi-Objective Optimization
Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model, which is more scalable, first simplifies the multigraph via a learned pruning strategy and then performs autoregressive routing on the resulting simple graph. We evaluate both models empirically, across a wide range of problems and graph distributions, and demonstrate their competitive performance compared to strong heuristics and neural baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2967. NLI: Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference
๐ฌ 14
โญ 5.33
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Dynamic Programming Non-linear Approximation Large Language Models Quantization Hardware Acceleration Edge Inference Calibration-Free
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, but their deployment is often constrained by substantial memory footprints and computational costs. While prior work has achieved significant progress in compressing and accelerating linear layers, nonlinear layers, such as SiLU, RMSNorm, and Softmax, still heavily depend on high-precision floating-point operations. In this paper, we propose a calibration-free, dynamic-programming-optimal, and hardware-friendly framework called \underline{N}on-uniform \underline{L}inear \underline{I}nterpolation (NLI). NLI is capable of efficiently approximating a variety of nonlinear functions, enabling seamless integration into LLMs and other deep neural networks with almost no loss in accuracy. NLI recasts cutpoint selection as a dynamic-programming problem, achieving the \emph{globally} minimal interpolation error in $\mathcal{O}(M \times N^2)$ time via Bellman's optimality principle. Based on the NLI algorithm, we also design and implement a plug-and-play universal nonlinear computation unit. Hardware experiments demonstrate that the NLI Engine achieves more than a 4x improvement in computational efficiency compared to state-of-the-art designs.
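A sketch of the dynamic-programming idea, with our own simplifications: given N candidate grid points, choose the breakpoints of an M-segment piecewise-linear interpolant that minimize total squared error, using a Bellman-style recursion over (segments used, last breakpoint). Segment errors are recomputed naively here; prefix-sum tricks would recover the stated $\mathcal{O}(M \times N^2)$ complexity.

```python
# Hedged sketch of DP-based non-uniform cutpoint selection for piecewise-linear
# approximation of a nonlinear function sampled on a grid.
import numpy as np

def segment_error(x, y, i, j):
    """Squared error of linear interpolation between grid points i and j."""
    t = (x[i:j + 1] - x[i]) / (x[j] - x[i])
    pred = y[i] + t * (y[j] - y[i])
    return ((pred - y[i:j + 1]) ** 2).sum()

def optimal_cutpoints(x, y, segments):
    n = len(x)
    INF = float("inf")
    dp = np.full((segments + 1, n), INF)       # dp[m, j]: best cost using m segments ending at j
    back = np.zeros((segments + 1, n), dtype=int)
    dp[0, 0] = 0.0
    for m in range(1, segments + 1):
        for j in range(m, n):
            for i in range(m - 1, j):
                c = dp[m - 1, i] + segment_error(x, y, i, j)
                if c < dp[m, j]:
                    dp[m, j], back[m, j] = c, i
    cuts, j = [n - 1], n - 1                   # recover breakpoints from backpointers
    for m in range(segments, 0, -1):
        j = back[m, j]
        cuts.append(j)
    return cuts[::-1], dp[segments, n - 1]

x = np.linspace(-6, 6, 65)
cuts, err = optimal_cutpoints(x, 1 / (1 + np.exp(-x)), segments=6)   # sigmoid, the gate in SiLU
print(x[cuts].round(2), err)
```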
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2968. $\alpha$-DPO: Robust Preference Alignment for Diffusion Models via $\alpha$ Divergence
๐ฌ 14
โญ 5.33
๐ reinforcement learning
๐ท๏ธ diffusion model; preference alignment; noise robustness
Diffusion models have demonstrated remarkable success in high-fidelity image generation, yet aligning them with human preferences remains challenging. Direct Preference Optimization (DPO) offers a promising framework, but its effectiveness is critically hindered by noisy data arising from mislabeled preference pairs and individual preference pairs. We theoretically show that existing DPO objectives are equivalent to minimizing the Forward Kullback-Leibler (KL) divergence, whose mass-covering nature makes it intrinsically sensitive to such noise. To address this limitation, we propose $\alpha$-DPO, which reformulates preference alignment through the lens of $\alpha$-divergence. This formulation promotes mode-seeking behavior and bounds the influence of outliers, thereby enhancing robustness. Furthermore, we introduce a dynamic scheduling mechanism that adaptively adjusts $\alpha$ according to the observed preference distribution, providing data-aware noise tolerance during training. Extensive experiments on synthetic and real-world datasets validate that $\alpha$-DPO consistently outperforms existing baselines, achieving superior robustness and preference alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2969. Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts
๐ฌ 14
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient Reasoning Large Reasoning Models Retrieval Augmented Language Models
Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable ``thought'' steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2970. Shuffling the Data, Extrapolating the Step: Sharper Bias In Constant Step-Size SGD
๐ฌ 28
โญ 5.33
๐ optimization
๐ท๏ธ variational inequalities multi-agent optimization stochastic algorithms
From adversarial robustness to multi-agent learning, many machine learning tasks can be cast as finite-sum min-max optimization or, more generally, as variational inequality problems (VIPs). Owing to their simplicity and scalability, stochastic gradient methods with constant step size are widely used, despite the fact that they converge only up to a bias term. Among the many heuristics adopted in practice, two classical techniques have recently attracted attention for mitigating this issue: \emph{Random Reshuffling} of data and \emph{Richardson-Romberg extrapolation} across iterates.
In this work, we show that their composition not only cancels the leading linear bias term, but also yields an asymptotic cubic refinement. To the best of our knowledge, our work provides the first theoretical guarantees for such a synergy in structured non-monotone VIPs. Our analysis proceeds in two steps: (i) by smoothing the discrete noise induced by reshuffling, we leverage tools from continuous-state Markov chain theory to establish a law of large numbers and a central limit theorem for its iterates; and (ii) we employ spectral tensor techniques to prove that extrapolation debiases and sharpens the asymptotic behavior even under the biased gradient oracle induced by reshuffling. Finally, extensive experiments validate our theory, consistently demonstrating substantial speedups in practice.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2971. EventFlash: Towards Efficient MLLMs for Event-Based Vision
๐ฌ 14
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Event-Based Vision Event-Language Alignment
Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, the first efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we first build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. Then, we present the adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, the sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a 12.4x throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming EventGPT's 5-bin limit. We believe EventFlash serves as an efficient foundation model for event-based vision. Our code and dataset details are provided in the supplementary.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2972. Mirror Flow Matching with Heavy-Tailed Priors for Generative Modeling on Convex Domains
๐ฌ 11
โญ 5.33
๐ learning theory
๐ท๏ธ Flow Matching Constrained Generative Modeling Wasserstein Convergence Rates
We study generative modeling on convex domains using flow matching and mirror maps, and identify two fundamental challenges. First, standard log-barrier mirror maps induce heavy-tailed dual distributions, leading to ill-posed dynamics. Second, coupling with Gaussian priors performs poorly when matching heavy-tailed targets. To address these issues, we propose Mirror Flow Matching based on a \emph{regularized mirror map} that controls dual tail behavior and guarantees finite moments, together with coupling to a Student-$t$ prior that aligns with heavy-tailed targets and stabilizes training. We provide theoretical guarantees, including spatial Lipschitzness and temporal regularity of the velocity field, Wasserstein convergence rates for flow matching with Student-$t$ priors and primal-space guarantees for constrained generation, under $\varepsilon$-accurate learned velocity fields. Empirically, our method outperforms baselines in synthetic convex-domain simulations and achieves competitive sample quality on real-world constrained generative tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2973. LINK: Learning Instance-level Knowledge from Vision-Language Models for Human-Object Interaction Detection
๐ฌ 16
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Human-object interaction detection Vision-language model zero-shot open-vocabulary
Human-Object Interaction (HOI) detection with vision-language models (VLMs) has progressed rapidly, yet a trade-off persists between specialization and generalization.
Two major challenges remain: (1) the sparsity of supervision, which hampers effective transfer of foundation models to HOI tasks, and (2) the absence of a generalizable architecture that can excel in both fully supervised and zero-shot scenarios.
To address these issues, we propose \textbf{LINK}, \textbf{L}earning \textbf{IN}stance-level \textbf{K}nowledge.
First, we introduce an HOI detection framework equipped with a Human-Object Geometrical Encoder and a VLM Linking Decoder.
By decoupling from detector-specific features, our design ensures plug-and-play compatibility with arbitrary object detectors and consistent adaptability across diverse settings.
Building on this foundation, we develop a Progressive Learning Strategy under a teacher-student paradigm, which delivers dense supervision over all potential human-object pairs.
By contrasting subtle spatial and semantic differences between positive and negative instances, the model learns robust and transferable HOI representations.
Extensive experiments on SWiG-HOI, HICO-DET, and V-COCO demonstrate state-of-the-art results, showing that our method achieves strong performance in both zero-shot and fully supervised settings while also exhibiting open-vocabulary capability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2974. Gauge Flow Matching: Efficient Constrained Generative Modeling over General Convex Set and Beyond
๐ฌ 12
โญ 5.33
๐ generative models
๐ท๏ธ constraint generative models homeomorphism convex constraints gauge mapping
Generative models, particularly diffusion and flow-matching approaches, have achieved remarkable success across diverse domains, including image synthesis and robotic planning. However, a fundamental challenge persists: ensuring generated samples strictly satisfy problem-specific constraints, a crucial requirement for physics-informed problems, safety-critical applications, watermark embedding, etc. Existing approaches, such as mirror maps and reflection methods, either have limited applicable constraint sets or introduce significant computational overhead. In this paper, we develop gauge flow matching (GFM), a simple yet efficient framework for constrained generative modeling. Our GFM approach introduces a generalized bijective gauge mapping to transform generation over arbitrary compact convex sets into an equivalent process over the unit ball, which allows low-complexity feasibility-ensuring operations such as reflection. The generated samples are then mapped back to the original domain for output. We prove that our GFM framework guarantees strict constraint satisfaction, with low generation complexity and bounded distribution approximation errors. We further extend our GFM framework to two popular non-convex settings, namely, star-convex and geodesic-convex sets. Extensive experiments show that GFM outperforms existing methods in generation speed and quality across multiple benchmarks.
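A minimal sketch of the kind of gauge mapping involved, assuming a polytope constraint set with the origin in its interior: the Minkowski functional gives a bijection onto the unit Euclidean ball and back, so a generative model can operate in the ball, where feasibility-preserving operations are cheap, and map samples into the feasible set. The paper's generalized mapping and its flow-matching training are not shown.

```python
# Hedged sketch of a gauge (Minkowski-functional) bijection between a polytope
# C = {x : Ax <= b} with 0 in its interior (b > 0) and the unit Euclidean ball.
import numpy as np

def gauge(A, b, x):
    """gauge_C(x) = inf{t > 0 : x/t in C} = max_i (a_i . x) / b_i, clipped at 0."""
    return max((A @ x / b).max(), 0.0)

def to_ball(A, b, x):
    g, r = gauge(A, b, x), np.linalg.norm(x)
    return x * (g / r) if r > 0 else x          # ||T(x)||_2 = gauge_C(x) <= 1 on C

def from_ball(A, b, y):
    g, r = gauge(A, b, y), np.linalg.norm(y)
    return y * (r / g) if r > 0 else y          # inverse: gauge of the output equals ||y||_2

# A small triangle in 2D containing the origin, and a round trip through the ball.
A = np.array([[1.0, 1.0], [-1.0, 0.5], [0.0, -1.0]])
b = np.array([1.0, 1.0, 1.0])
x = np.array([0.3, 0.4])
y = to_ball(A, b, x)
print(np.linalg.norm(y) <= 1.0, np.allclose(from_ball(A, b, y), x))
```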
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2975. Improving Feasibility via Fast Autoencoder-Based Projections
๐ฌ 14
โญ 5.33
๐ optimization
๐ท๏ธ amortized optimization nonconvex optimization surrogate models feasibility
Enforcing complex (e.g., nonconvex) operational constraints is a critical challenge in real-world learning and control systems. However, existing methods struggle to efficiently and reliably enforce general classes of constraints. To address this, we propose a novel data-driven amortized approach that uses a trained autoencoder as an approximate projector to provide fast corrections to infeasible predictions. Specifically, we train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set, enabling rapid correction of neural network outputs by projecting them onto a simple convex shape before decoding into the original feasible set. We test our approach on a diverse suite of constrained optimization and reinforcement learning problems with challenging nonconvex constraints. Results show that our method effectively improves constraint satisfaction at a low computational cost, offering a practical alternative to expensive feasibility correction techniques based on traditional solvers.
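The correction pipeline itself is short; the sketch below encodes an infeasible prediction, projects its latent code onto a simple convex set (a unit ball, standing in for the learned convex latent region), and decodes. The linear encoder and decoder here are untrained placeholders; in the paper they form an adversarially trained autoencoder.

```python
# Hedged pipeline sketch: encode -> project latent onto a convex set -> decode.
# W_enc / W_dec are random placeholder weights, not a trained autoencoder.
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 32)) / np.sqrt(32)   # placeholder encoder weights
W_dec = rng.normal(size=(32, 8)) / np.sqrt(8)    # placeholder decoder weights

def approx_project(x):
    z = W_enc @ x                                  # encode the infeasible prediction
    z = z / max(1.0, np.linalg.norm(z))            # project latent onto the unit ball
    return W_dec @ z                               # decode back to the output space

print(approx_project(rng.normal(size=32)).shape)
```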
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2976. Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning
๐ฌ 10
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Model-based RL Object-centric RL Video object segmentation Atari Hollow Knight
While deep reinforcement learning (DRL) from pixels has achieved remarkable success, its sample inefficiency remains a critical limitation for real-world applications. Model-based RL (MBRL) addresses this by learning a world model to generate simulated experience, but standard approaches that rely on pixel-level reconstruction losses often fail to capture small, task-critical objects in complex, dynamic scenes. We posit that an object-centric representation can direct model capacity toward semantically meaningful entities, improving dynamics prediction and sample efficiency. In this work, we introduce **OC-STORM**, an object-centric MBRL framework that enhances a learned world model with object representations extracted by a pretrained segmentation network. By conditioning on a minimal number of annotated frames, OC-STORM learns to track decision-relevant object dynamics and inter-object interactions without extensive labeling or access to privileged information. Empirical results demonstrate that OC-STORM significantly outperforms the STORM baseline on the Atari 100k benchmark and achieves state-of-the-art sample efficiency on challenging boss fights in the visually complex game **Hollow Knight**. Our findings underscore the potential of integrating object-centric priors into MBRL for complex visual domains. Core code and evaluation videos are available in supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2977. Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
๐ฌ 14
โญ 5.33
๐ generative models
๐ท๏ธ Discrete Diffusion Generative Models
Discrete diffusion models offer a promising alternative to autoregressive generation through parallel decoding, but they suffer from a sampling wall: once categorical sampling occurs, rich distributional information collapses into one-hot vectors and cannot be propagated across steps. We introduce Loopholing, a mechanism that preserves this information via a deterministic latent pathway, leading to Loopholing Discrete Diffusion Models (LDDMs). Trained efficiently with a self-conditioning strategy that avoids unrolling the full denoising trajectory, LDDMs achieve substantial gains, reducing generative perplexity by up to 61\% over prior baselines, thereby closing the gap with autoregressive models (and in some cases surpassing them) and producing more coherent text. Applied to reasoning tasks, LDDMs also improve performance on arithmetic benchmarks such as Countdown and Game of 24. These results also indicate that loopholing mitigates idle steps and oscillations, providing a scalable path toward high-quality non-autoregressive text generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2978. Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
๐ฌ 17
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Scaling Laws Mixture-of-Experts Large Language Models
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration.
To validate our derived scaling laws, we designed and trained MoE-mini, a model with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, MoE-mini matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws.
This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2979. Fair Graph Machine Learning under Adversarial Missingness Processes
๐ฌ 11
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Fairness GNN Missingness
Graph Neural Networks (GNNs) have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs often assumes that either sensitive attributes are fully observed or they are missing completely at random. We show that an adversarial missingness process can inadvertently disguise a fair model through the imputation, leading the model to overestimate the fairness of its predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for sensitive attributes. The key principle behind BFtS is that imputations should approximate the worst-case scenario for fairness---i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against a GNN classifier, and the classifier minimizes the maximum bias. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness x accuracy trade-off than existing alternatives under an adversarial missingness process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2980. Offline Preference-Based Value Optimization
๐ฌ 12
โญ 5.33
๐ reinforcement learning
๐ท๏ธ offline reinforcement learning preference-based reinforcement learning
We study the problem of offline preference-based reinforcement learning (PbRL), where the agent learns from pre-collected preference data by comparing trajectory pairs.
While prior work has established theoretical foundations for offline PbRL, existing algorithms face significant practical limitations: some rely on computationally intractable optimization procedures, while others suffer from unstable training and high performance variance.
To address these challenges, we propose Preference-based Value Optimization (PVO), a simple and practical algorithm that achieves both strong empirical performance and theoretical guarantees.
PVO directly optimizes the value function consistent with preference feedback by minimizing a novel \emph{value alignment loss}.
We prove that PVO attains a rate-optimal sample complexity of $\mathcal{O}(\varepsilon^{-2})$, and further show that the value alignment loss is applicable not only to value-based methods but also to actor-critic algorithms.
Empirically, PVO achieves robust and stable performance across diverse continuous control benchmarks.
It consistently outperforms strong baselines, including methods without theoretical guarantees, while requiring no additional hyperparameters for preference learning.
Moreover, our ablation study demonstrates that substituting the standard TD loss with the value alignment loss substantially improves learning from preference data, confirming its effectiveness for PbRL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2981. Doubly-Regressing Approach for Subgroup Fairness
๐ฌ 3
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Algorithmic fairness Subgroup Adversarial learning Data sparsity
Algorithmic fairness is a socially crucial topic in real-world applications of AI.
Among many notions of fairness, subgroup fairness is widely studied when multiple sensitive attributes (e.g., gender, race, age) are present.
However, as the number of sensitive attributes grows, the number of subgroups increases accordingly, creating heavy computational burdens and a data sparsity problem (subgroups with too few samples).
In this paper, we develop a novel learning algorithm for subgroup fairness which resolves these issues by focusing on subgroups with sufficient sample sizes as well as marginal fairness (fairness for each sensitive attribute).
To this end, we formalize a notion of subgroup-subset fairness and introduce a corresponding distributional fairness measure called the supremum Integral Probability Metric (supIPM).
Building on this formulation, we propose the Doubly Regressing Adversarial learning for subgroup Fairness (DRAF) algorithm, which reduces a surrogate fairness gap for supIPM with much less computation than directly reducing supIPM.
Theoretically, we prove that the proposed surrogate fairness gap is an upper bound of supIPM.
Empirically, we show that the DRAF algorithm outperforms baseline methods in benchmark datasets, specifically when the number of sensitive attributes is large so that many subgroups are very small.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2982. Soft Quality-Diversity Optimization
๐ฌ 11
โญ 5.33
๐ optimization
๐ท๏ธ Quality Diversity Optimization
Quality-Diversity (QD) algorithms constitute a branch of optimization that is concerned with discovering a diverse and high-quality set of solutions to an optimization problem.
Current QD methods commonly maintain diversity by dividing the behavior space into discrete regions, ensuring that solutions are distributed across different parts of the space.
The QD problem is then solved by searching for the best solution in each region.
This approach to QD optimization poses challenges in large solution spaces, where storing many solutions is impractical, and in high-dimensional behavior spaces, where discretization becomes ineffective due to the curse of dimensionality.
We present an alternative framing of the QD problem, called \emph{Soft QD}, that sidesteps the need for discretizations.
We validate this formulation by demonstrating its desirable properties, such as monotonicity, and by relating its limiting behavior to the widely used QD Score metric.
Furthermore, we leverage it to derive a novel differentiable QD algorithm, \emph{Soft QD Using Approximated Diversity (SQUAD)}, and demonstrate empirically that it is competitive with current state-of-the-art methods on standard benchmarks while offering better scalability to higher-dimensional problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2983. Weak-to-Strong Generalization with Failure Trajectories
๐ฌ 15
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Weak-to-Strong Generalization
Weak-to-Strong generalization (W2SG) is a new trend to elicit the full capabilities of a strong model with supervision from a weak model. While existing W2SG studies focus on simple tasks like binary classification, we extend this paradigm to complex interactive decision-making environments.
Specifically, we fine-tune a strong model with trajectories of intermediate actions generated by a weak model. Motivated by the human learning process, we propose to generalize not only success knowledge but also failure experience so that the strong model can learn from the failed trajectories accumulated by weak models. To effectively and efficiently elicit the potential of strong agents, we further construct ``trajectory trees,'' a hierarchical representation that organizes weak model-generated action trajectories, coupled with Monte Carlo Tree Search (MCTS) to optimize the strong model. Through theoretical analysis, we provide formal guarantees for the effectiveness of our method in improving W2SG performance. Our empirical evaluations demonstrate substantial improvements in reasoning and decision-making capabilities across diverse task domains, validating the scalability and robustness of our proposed framework. Our code is available at: https://anonymous.4open.science/r/5A51.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2984. Block-sample MAC-Bayes generalization bounds
๐ฌ 11
โญ 5.33
๐ learning theory
๐ท๏ธ PAC-Bayes bound MAC-Bayes bound KL divergence block-sample MAC-Bayes bound
We present a family of novel block-sample MAC-Bayes (mean approximately correct) bounds. While PAC-Bayes (probably approximately correct) bounds typically bound the generalization error with high probability, MAC-Bayes bounds have a similar form but bound the expected generalization error instead. The family of bounds we propose can be understood as a generalization of an expectation version of known PAC-Bayes bounds. Compared to standard PAC-Bayes bounds, the new bounds contain divergence terms that only depend on subsets (or \emph{blocks}) of the training data. The proposed MAC-Bayes bounds hold the promise of significantly improving upon the tightness of traditional PAC-Bayes and MAC-Bayes bounds. This is illustrated with a simple numerical example in which the original PAC-Bayes bound is vacuous regardless of the choice of prior, while the proposed bounds are finite for appropriate choices of the block size. We also explore the question of whether high-probability versions of our MAC-Bayes bounds (i.e., PAC-Bayes bounds of a similar form) are possible. We answer this question in the negative with an example showing that, in general, it is not possible to establish a PAC-Bayes bound which (a) vanishes with a rate faster than $\mathcal{O}(1/\log n)$ whenever the proposed MAC-Bayes bound vanishes with rate $\mathcal{O}(n^{-1/2})$ and (b) exhibits a logarithmic dependence on the permitted error probability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2985. On Fairness of Task Arithmetic: The Role of Task Vectors
๐ฌ 19
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Fairness Model Editing Task Arithmetic
Model editing techniques, particularly task arithmetic with task vectors, offer an efficient alternative to full fine-tuning by enabling direct parameter updates through simple arithmetic operations. While this approach promises substantial computational savings, its impact on fairness has remained largely unexplored, despite growing concern over biased outcomes in high-stakes applications such as hate speech detection. In this work, we present the first systematic study of fairness in task arithmetic, benchmarking it against full fine-tuning (FFT) and Low-Rank Adaptation (LoRA). We evaluate across multiple language models and datasets using standard group fairness metrics, including Demographic Parity and Equalized Odds. Our analysis shows that task vectors can be tuned to achieve competitive accuracy while reducing disparities, and that merging subgroup-specific task vectors provides a practical mechanism for steering fairness outcomes. We further provide a theoretical bound linking task-vector scaling to fairness metrics, offering insight into the observed trade-offs. Together, these findings establish task arithmetic not only as a cost-efficient editing method but also as a fairness-aware alternative to existing adaptation techniques, laying the groundwork for responsible deployment of large language models. Our code is available at: https://anonymous.4open.science/status/fairness_task_vector-4F2F
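For context, the editing operations being benchmarked are simple to state: a task vector is the difference between fine-tuned and base weights, and edits are scaled sums of such vectors. The sketch below shows this on a toy parameter dictionary; the scaling coefficients are exactly what the paper studies for steering fairness, and the values here are placeholders.

```python
# Hedged sketch of task arithmetic on a parameter dictionary: a task vector
# is (fine-tuned - base); scaled sums of task vectors edit the base model.
import numpy as np

def task_vector(finetuned, base):
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, vectors, coeffs):
    edited = {k: v.copy() for k, v in base.items()}
    for tau, lam in zip(vectors, coeffs):
        for k in edited:
            edited[k] += lam * tau[k]
    return edited

rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(4, 4))}
ft_a = {"layer.weight": base["layer.weight"] + 0.1}   # e.g. tuned on subgroup A
ft_b = {"layer.weight": base["layer.weight"] - 0.05}  # e.g. tuned on subgroup B
merged = apply_task_vectors(base, [task_vector(ft_a, base), task_vector(ft_b, base)], [0.6, 0.6])
print(merged["layer.weight"][0, 0] - base["layer.weight"][0, 0])   # 0.6*0.1 + 0.6*(-0.05) = 0.03
```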
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2986. CooperTrim: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception
๐ฌ 15
โญ 5.33
๐ applications to robotics, autonomy, planning
๐ท๏ธ Cooperative Perception Intermediate Fusion Uncertainty Estimation
Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other's live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity to environment complexity. We instantiate this intuition in an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data-driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation as an example task. Across multiple open-source cooperative segmentation models, COOPERTRIM achieves up to 80.28% bandwidth reduction while maintaining comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Qualitative results show that COOPERTRIM gracefully adapts to environmental dynamics, demonstrating its flexibility and paving the way toward real-world deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2987. C-Evolve: Consensus-based Evolution for Prompt Groups
๐ฌ 10
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Consensus-based Evolution evolutionary algorithm majority voting
Prompt evolution algorithms offer a powerful paradigm for enhancing AI systems built on closed-source models, yet little work explores whether aggregating the outputs of multiple prompts into a consensus can further advance the system's capability boundary. In this paper, we introduce Consensus-Evolve (C-Evolve), an evolutionary algorithm that discovers a group of prompts whose aggregated outputs after majority voting achieve optimal performance. More specifically, C-Evolve employs an island-based evolutionary algorithm to maintain population diversity, and prompts from distinct islands are selected to form groups whose outputs are aggregated. The key difference from single-individual evolution is a voting score, which evaluates each individual prompt's contribution within groups. We take this as the fitness score for evolution instead of individual performance. Consequently, C-Evolve is more likely to produce and maintain prompts with higher potential to form a high-performing group and to eliminate low-performing ones, gradually improving the group performance after reaching consensus. Our method achieves state-of-the-art performance across a wide range of tasks, including both open-ended tasks like HotpotQA and closed-ended tasks like MATH. On Qwen3-8B, C-Evolve achieves 70.67\% on HotpotQA and 43.88\% on IFBench, which are 4.95\% and 2.73\% higher than GEPA, respectively. For GPT-4.1-mini, the accuracy on IFBench is further improved to 47.96\%, and it reaches 95.33\% on the MATH benchmark. These results demonstrate C-Evolve's competitive performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2988. Speech World Model: Causal State-Action Planning with Explicit Reasoning for Speech
๐ฌ 17
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ speech spoken language understanding state-action causal reasoning
Current speech-language models (SLMs) typically use a cascade of a speech encoder and a large language model, treating speech understanding as a single black box. They analyze the content of speech well but reason weakly about other aspects, especially under sparse supervision. Thus, we argue for explicit reasoning over speech states and actions with modular and transparent decisions. Inspired by cognitive science, we adopt a modular perspective and a world-model view in which the system learns forward dynamics over latent states. We factorize speech understanding into four modules that communicate through a causal graph, establishing a cognitive state search space. Guided by posterior traces from this space, an instruction-tuned language model produces a concise causal analysis and a user-facing response, enabling counterfactual interventions and interpretability under partial supervision. We present the first graph-based modular speech model for explicit reasoning, and we will open-source the model and data to promote the development of advanced speech understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2989. Random-projection ensemble dimension reduction
๐ฌ 18
โญ 5.33
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ High-dimensional random projection sufficient dimension reduction
We introduce a new, flexible, and theoretically justified framework for dimension reduction in high-dimensional regression, based on an ensemble of random projections. Specifically, we consider disjoint groups of independent random projections, retain the best projection in each group according to the empirical regression performance on the projected covariates, and then aggregate the selected projections via singular value decomposition. The singular values quantify the relative importance of corresponding projection directions and guide the dimension selection process. We investigate various aspects of our framework, including the choice of projection distribution and the number of projections used. Our theoretical results show that the expected estimation error decreases as the number of groups of projections increases. Finally, we demonstrate that our proposal consistently matches or outperforms state-of-the-art methods through extensive numerical studies on simulated and real data.
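A small numpy sketch of the pipeline as described: within each group of random projections, keep the one with the best regression fit on the projected covariates, then aggregate the kept projections via SVD, with singular values ranking direction importance. The Gaussian projections and the ridge criterion are our own illustrative choices, not necessarily the paper's.

```python
import numpy as np

def rp_ensemble_directions(X, y, d_proj=5, n_groups=20, group_size=10, seed=0):
    """Ensemble of random projections: best projection per group, aggregated by SVD."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kept = []
    for _ in range(n_groups):
        best_err, best_A = np.inf, None
        for _ in range(group_size):
            A = rng.normal(size=(p, d_proj)) / np.sqrt(d_proj)   # random projection
            Z = X @ A
            # Ridge regression on the projected covariates as the empirical criterion.
            beta = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(d_proj), Z.T @ y)
            err = np.mean((y - Z @ beta) ** 2)
            if err < best_err:
                best_err, best_A = err, A
        kept.append(best_A)
    # Aggregate the selected projections; singular values rank direction importance.
    M = np.hstack(kept)                      # p x (n_groups * d_proj)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U, s                              # candidate reduction directions and their weights

# Toy data with a 2-dimensional signal inside 50 ambient dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 50))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=500)
U, s = rp_ensemble_directions(X, y)
print("top singular values:", np.round(s[:5], 2))
```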
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2990. Predicting LLM Output Length via Entropy-Guided Representations
๐ฌ 20
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Large Language Models Length Prediction Progressive Length Prediction
The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic "one-to-many" sampling scenarios. We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction with negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a comprehensive benchmark with long-sequence, Chain-of-Thought, and RL data. On ForeLen, EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16\% over the best baseline. Integrating our methods with a length-aware scheduler yields significant end-to-end throughput gains. Our work provides a new technical and evaluation baseline for efficient LLM inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2991. High-dimensional Analysis of Synthetic Data Selection
๐ฌ 18
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ high dimensional regression empirical risk minimization synthetic data generative models
Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as ''synthetic data should be close to the real data distribution'', it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the *covariance shift* between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, in some regimes, we prove that matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights for linear models carry over to deep neural networks and generative models. We empirically demonstrate that the *covariance matching* procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across various training paradigms, datasets and generative models used for augmentation.
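A minimal sketch of a covariance-matching selection in the spirit of the abstract: greedily add synthetic samples so that the selected set's covariance stays close, in Frobenius norm, to the target covariance. The greedy rule and the single-sample proxy are our own simplifications, not claimed to be the paper's procedure.

```python
import numpy as np

def covariance_matching_selection(X_synth, X_target, k):
    """Greedily select k synthetic rows whose covariance best matches the target's."""
    cov_t = np.cov(X_target, rowvar=False)
    selected, remaining = [], list(range(len(X_synth)))
    for _ in range(k):
        best_i, best_gap = None, np.inf
        for i in remaining:
            idx = selected + [i]
            if len(idx) < 2:
                # Rough single-sample proxy before a covariance can be formed.
                gap = np.linalg.norm(np.outer(X_synth[i], X_synth[i]) - cov_t)
            else:
                gap = np.linalg.norm(np.cov(X_synth[idx], rowvar=False) - cov_t)
            if gap < best_gap:
                best_i, best_gap = i, gap
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

rng = np.random.default_rng(0)
X_target = rng.normal(size=(200, 5)) @ np.diag([3, 2, 1, 1, 1])
X_synth = rng.normal(size=(100, 5)) @ np.diag([1, 1, 1, 2, 3])
print(covariance_matching_selection(X_synth, X_target, k=10))
```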
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2992. Self-Guided Low Light Object Detection Framework
๐ฌ 16
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Low light object detection Self-guided training No additional inference cost
Object detection in low-light environments is inherently challenging due to limited contrast and heavy noise, both of which significantly degrade feature representations. In this paper, we propose a novel self-guided low-light object detection framework that effectively addresses these issues without introducing additional parameters or increasing inference time. Our method incorporates a detachable auxiliary pipeline during training, consisting of an image enhancement module and a denoising module, followed by a Fourier-domain fusion block. This pipeline improves the feature representation of the detector's backbone, enhancing its robustness under low-light conditions. Importantly, at inference time, our method incurs no additional computational cost compared to the baseline detector while achieving substantial performance improvements. Extensive experiments on widely used low-light object detection benchmarks, such as DARK FACE and ExDark, demonstrate that our method achieves state-of-the-art performance. Notably, experiments on the nuImages dataset show that our approach can outperform domain adaptation methodsโespecially when a large domain gap between source and target domains is inevitable in the real-world applicationsโhighlighting its practical effectiveness. Our code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2993. STAT: Skill-Targeted Adaptive Training
๐ฌ 10
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ metacognition out-of-distribution generalization dataset reusability skill-level training
Language models often show little to no improvement (i.e., "saturation") when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often it failed to apply each skill in its responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23, etc.) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2994. Towards Efficient, Adaptive, and Unified Reinforcement Mid-Training
๐ฌ 9
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Mid-training Reinforcement Learning
The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2995. Change Point Localization and Inference in Dynamic Multilayer Networks
๐ฌ 16
โญ 5.33
๐ learning on time series and dynamical systems
๐ท๏ธ Dynamic networks Multilayer networks Change point inference Tensor-based methods
We study offline change point localization and inference in dynamic multilayer random dot product graphs (D-MRDPGs), where at each time point, a multilayer network is observed with shared node latent positions and time-varying, layer-specific connectivity patterns. We propose a novel two-stage algorithm that combines seeded binary segmentation with low-rank tensor estimation, and establish its consistency in estimating both the number and locations of change points. Furthermore, we derive the limiting distributions of the refined estimators under both vanishing and non-vanishing jump regimes. To the best of our knowledge, this is the first result of its kind in the context of dynamic network data. We also develop a fully data-driven procedure for constructing confidence intervals. Extensive numerical experiments demonstrate the superior performance and practical utility of our methods compared to existing alternatives.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2996. TimeSeg: An Information-Theoretic Segment-Wise Explainer for Time-Series Predictions
๐ฌ 26
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ Explainability AI Interpretability Time Series Explanations Segment-wise Explanations Conditional Mutual Information
Explaining predictions of black-box time-series models remains a challenging problem due to the dynamically evolving patterns within individual sequences and their complex temporal dependencies. Unfortunately, existing explanation methods largely focus on point-wise explanations, which fail to capture broader temporal context, while methods that attempt to highlight interpretable temporal patterns (e.g., achieved by incorporating a regularizer or fixed-length patches) often lack principled definitions of meaningful segments. This limitation frequently leads to fragmented and confusing explanations for end users. As such, the notion of segment-wise explanations has remained underexplored, with little consensus on what constitutes an *interpretable* segment or how such segments should be identified. To bridge this gap, we define segment-wise explanation for black-box time-series models as the task of selecting contiguous subsequences that maximize their joint mutual information with the target prediction. Building on this formulation, we propose TimeSeg, a novel information-theoretic framework that employs reinforcement learning to sequentially identify predictive temporal segments at a per-instance level. By doing so, TimeSeg produces segment-wise explanations that capture holistic temporal patterns rather than fragmented points, providing class-predictive patterns in a human-interpretable manner. Extensive experiments on both synthetic and real-world datasets demonstrate that TimeSeg produces more coherent and human-understandable explanations, while achieving performance that matches or surpasses existing methods on downstream tasks using the identified segments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2997. SPACeR: Self-Play Anchoring with Centralized Reference Models
๐ฌ 13
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Multi-agent reinforcement learning traffic simulation autonomous vehicles planning
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose human-like self-play, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and KL divergence, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10× faster at inference and 50× smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
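A hedged sketch of the anchoring idea as described: the self-play reward mixes a task term with the reference model's log-likelihood of the chosen action and a KL penalty toward the reference policy. The per-state logits interface, coefficients, and toy data are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def anchored_reward(task_reward, policy_logits, ref_logits, action, beta=0.1, alpha=0.05):
    """Reward = task term + reference log-likelihood of the chosen action - KL(policy || reference).

    policy_logits, ref_logits: (num_actions,) logits for the current state
    action: index of the discrete motion token the agent emitted
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    pol_logp = F.log_softmax(policy_logits, dim=-1)
    likelihood_bonus = ref_logp[action]                       # human-likeness of the action
    kl = torch.sum(pol_logp.exp() * (pol_logp - ref_logp))    # stay close to the reference policy
    return task_reward + alpha * likelihood_bonus - beta * kl

# Toy example with 4 discrete motion tokens.
policy_logits = torch.tensor([2.0, 0.1, -1.0, 0.0])
ref_logits = torch.tensor([1.5, 0.5, -0.5, 0.0])
print(anchored_reward(task_reward=1.0, policy_logits=policy_logits, ref_logits=ref_logits, action=0))
```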
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2998. GneissWeb: Preparing High Quality Data for LLMs at Scale
๐ฌ 10
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Pre-training datasets Data quality Evaluation benchmarks
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. In this paper, we introduce **GneissWeb**, a large dataset of around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters.
GneissWeb goes beyond simple model-based quality filtering used in recent datasets by designing an ensemble of filters incorporating novel quality filters. Novel components enable us to achieve a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using GneissWeb outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average scores on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage points gain over those trained on FineWeb-V1.1.0.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
2999. Distilling and Adapting: A Topology-Aware Framework for Zero-Shot Interaction Prediction in Multiplex Biological Networks
๐ฌ 16
โญ 5.33
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Graph representation learning contrastive learning multiplex networks knowledge distillation zero-shot prediction
Multiplex Biological Networks (MBNs), which represent multiple interaction types between entities, are crucial for understanding complex biological systems. Yet, existing methods often inadequately model multiplexity, struggle to integrate structural and sequence information, and face difficulties in zero-shot prediction for unseen entities with no prior neighbourhood information. To address these limitations, we propose a novel framework for zero-shot interaction prediction in MBNs by leveraging context-aware representation learning and knowledge distillation. Our approach leverages domain-specific foundation models to generate enriched embeddings, introduces a topology-aware graph tokenizer to capture multiplexity and higher-order connectivity, and employs contrastive learning to align embeddings across modalities. A teacher-student distillation strategy further enables robust zero-shot generalization. Experimental results demonstrate that our framework outperforms state-of-the-art methods in interaction prediction for MBNs, providing a powerful tool for exploring various biological interactions and advancing personalized therapeutics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3000. SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding
๐ฌ 16
โญ 5.33
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain Decoding fMRI Evaluation Metric
We present SEED ($\textbf{Se}$mantic $\textbf{E}$valuation for Visual Brain $\textbf{D}$ecoding), a novel metric for evaluating the semantic decoding performance of visual brain decoding models. It integrates three complementary metrics, each capturing a different aspect of semantic similarity between images inspired by neuroscientific findings. Using carefully crowd-sourced human evaluation data, we demonstrate that SEED achieves the highest alignment with human evaluation, outperforming other widely used metrics.
Through the evaluation of existing visual brain decoding models with SEED, we further reveal that crucial information is often lost in translation, even in the state-of-the-art models that achieve near-perfect scores on existing metrics. This finding highlights the limitations of current evaluation practices and provides guidance for future improvements in decoding models.
Finally, to facilitate further research, we open-source the human evaluation data, encouraging the development of more advanced evaluation methods for brain decoding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3001. Grounded Test-Time Adaptation for LLM Agents
๐ฌ 12
โญ 5.33
๐ applications to robotics, autonomy, planning
๐ท๏ธ Large Language Model Agent Test-Time Adaptation
Large language model (LLM)-based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre-training and test-time conditions.
This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment-specific components like observation formats, and a semantic misunderstanding of state-transition dynamics, which are only revealed at test time.
To address these issues, we propose two distinct and complementary strategies for adapting LLM agents by leveraging environment-specific information available during deployment.
First, an online distributional adaptation method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model's output distribution, enabling rapid alignment with an environment response format.
Second, a deployment-time dynamics grounding method employs a persona-driven exploration phase to systematically probe and learn the environment's causal dynamics before task execution, equipping the agent with a non-parametric world model.
We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation.
Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost.
We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM-based agents.
For example, on the WebArena multi-site split, this method increases the agent's success rate from 2\% to 23\%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3002. Matching multiple experts: on the exploitability of multi-agent imitation learning
๐ฌ 8
โญ 5.33
๐ reinforcement learning
๐ท๏ธ imitation learning multi-agent systems behavioral cloning Nash imitation gap
Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations in multi-agent interactive domains. Despite existing guarantees on the performance of the extracted policy, characterizations of its distance to a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results of learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and present challenges associated with the practical case of approximation errors. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium, assuming BC error $\epsilon_{\text{BC}}$. Specifically, for the case of dominant strategy expert equilibria, this provides a Nash imitation gap of $\mathcal{O}\left(n\epsilon_{\text{BC}}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3003. Optimizer Choice Matters For The Emergence of Neural Collapse
๐ฌ 18
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ neural collapse implicit bias deep learning theory classification adaptive optimizers training dynamics adamw sgd weight decay
Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods.
In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics.
Finally, we conduct extensive empirical experiments consisting of 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, confirming our theoretical results. This work provides the first theoretical explanation for the optimizer-dependent emergence of NC and highlights the overlooked role of weight decay coupling in the implicit biases of optimizers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3004. GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning
๐ฌ 13
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Dynamic Text-Attributed Graph Dynamic Graph Generation
Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex real-world systems. However, most existing DyTAG datasets exhibit poor textual quality, which severely limits their utility for generative DyTAG tasks requiring semantically rich inputs. Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation. To address these critical issues, we propose \underline{G}enerative \underline{D}yTA\underline{G} \underline{B}enchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets. Building on GDGB, we define two novel DyTAG generation tasks: Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG). TDGG transductively generates a target DyTAG based on the given source and destination node sets, while the more challenging IDGG introduces new node generation to inductively model the dynamic expansion of real-world graph data. To enable holistic evaluation, we design multifaceted metrics that assess the structural, temporal, and textual quality of the generated DyTAGs. We further propose GAG-General, an LLM-based multi-agent generative framework tailored for reproducible and robust benchmarking of DyTAG generation. Experimental results demonstrate that GDGB enables rigorous evaluation of TDGG and IDGG, with key insights revealing the critical interplay of structural and textual features in DyTAG generation. These findings establish GDGB as a foundational resource for advancing generative DyTAG research and unlocking further practical applications in DyTAG generation. The dataset and source code are available at \url{https://anonymous.4open.science/r/GDGB-3F25}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3005. Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence
๐ฌ 28
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Harmful fine-tuning LLM safety alignment
Fine-tuning-as-a-service introduces a threat to Large Language Models' safety when service providers fine-tune their models on poisoned user-submitted datasets, a process known as harmful fine-tuning attacks. In this work, we show that by regularizing the gradient contribution of harmful samples encountered during fine-tuning, we can effectively mitigate the impact of harmful fine-tuning attacks. To this end, we introduce Antibody, a defense strategy that first ensures robust safety alignment for the model before fine-tuning, and then applies a safety-preservation learning algorithm during fine-tuning. Specifically, in the alignment stage before fine-tuning, we propose optimizing the model to be in a flat loss region with respect to harmful samples, which makes the safety alignment more resilient to subsequent harmful fine-tuning. Then, in the fine-tuning stage, we design a fine-tuning algorithm that applies a weighting scheme to all samples in each training batch to inhibit the model from learning from harmful samples while encouraging learning from benign samples. Experimental results demonstrate that Antibody successfully mitigates harmful fine-tuning attacks while boosting fine-tuning performance on the user-submitted dataset.
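A small sketch of the fine-tuning-stage weighting idea: samples a safety scorer flags as likely harmful are down-weighted in the batch loss so they contribute little gradient, while benign samples keep their influence. The softmax weighting rule, the scorer, and the numbers are our own illustrative assumptions.

```python
import torch

def weighted_batch_loss(per_sample_losses, harm_scores, temperature=0.5):
    """per_sample_losses: (B,) language-modeling losses; harm_scores: (B,) in [0, 1],
    higher means more likely harmful. Weights shrink toward 0 for harmful samples so
    they contribute little gradient, while benign samples keep (or gain) influence."""
    weights = torch.softmax(-harm_scores / temperature, dim=0) * len(harm_scores)
    return (weights * per_sample_losses).mean()

losses = torch.tensor([2.1, 1.8, 2.5, 1.9])
harm = torch.tensor([0.05, 0.90, 0.10, 0.02])    # second sample looks harmful
print(weighted_batch_loss(losses, harm))
```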
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3006. Risk-Sensitive Agent Compositions
๐ฌ 9
โญ 5.33
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Agentic systems agent composition safety risk-sensitive planning planning
From software development to robot control, modern agentic systems decompose complex objectives into a sequence of subtasks and choose a set of specialized AI agents to complete them.
We formalize agentic workflows as directed acyclic graphs, called agent graphs, where edges represent AI agents and paths correspond to feasible compositions of agents.
Real-world deployment requires selecting agent compositions that not only maximize task success but also minimize violations of safety, fairness, and privacy requirements, which demands a careful analysis of the low-probability (tail) behaviors of compositions of agents.
In this work, we consider risk minimization over the set of feasible agent compositions and seek to minimize the value-at-risk and the conditional value-at-risk of the loss distribution of the agent composition where the loss quantifies violations of these requirements.
We introduce an efficient algorithm which traverses the agent graph and finds a near-optimal composition of agents.
It uses a dynamic programming approach to approximate the value-at-risk of agent compositions by exploiting a union bound.
Furthermore, we prove that the approximation is near-optimal asymptotically for a broad class of practical loss functions.
We also show how our algorithm can be used to approximate the conditional value-at-risk as a byproduct.
To evaluate our framework, we consider a suite of video game-like control benchmarks that require composing several agents trained with reinforcement learning and demonstrate our algorithm's effectiveness in approximating the value-at-risk and identifying the optimal agent composition.
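As a rough sketch of the dynamic-programming flavor described above, under our own simplifications: treat each agent edge as having an exceedance probability P(loss > t), use the union bound so a path's exceedance is at most the sum over its edges, and pick the composition minimizing that bound via a shortest-path pass over the DAG. The graph encoding, threshold handling, and toy numbers are illustrative only.

```python
from collections import defaultdict

def best_composition_tail_bound(edges, source, sink, tail_prob):
    """DP over a DAG of agents (edges = agents, paths = compositions).

    edges: list of (u, v, agent_id); tail_prob[agent_id] = P(agent loss exceeds its threshold).
    The union bound P(path loss > t) <= sum of per-edge exceedance probabilities turns
    risk minimization into a shortest-path problem over the agent graph.
    """
    adj, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v, a in edges:
        adj[u].append((v, a))
        indeg[v] += 1
        nodes.update([u, v])
    # Kahn's algorithm for a topological order of the DAG.
    order, frontier = [], [n for n in nodes if indeg[n] == 0]
    while frontier:
        n = frontier.pop()
        order.append(n)
        for v, _ in adj[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    best = {source: (0.0, [])}          # node -> (union-bound so far, agents used)
    for u in order:
        if u not in best:
            continue
        bound, path = best[u]
        for v, a in adj[u]:
            cand = (bound + tail_prob[a], path + [a])
            if v not in best or cand[0] < best[v][0]:
                best[v] = cand
    return best[sink]

# Two candidate coders followed by two candidate testers.
edges = [("start", "mid", "coder_A"), ("start", "mid", "coder_B"),
         ("mid", "end", "tester_A"), ("mid", "end", "tester_B")]
tails = {"coder_A": 0.02, "coder_B": 0.05, "tester_A": 0.01, "tester_B": 0.03}
print(best_composition_tail_bound(edges, "start", "end", tails))
```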
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3007. TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization
๐ฌ 9
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Time-varying timbre Streaming voice conversion Content-synchronous speaker conditioning Speech anonymization Vector-quantized bottleneck
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with $<$80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3008. From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
๐ฌ 19
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ transformers; language models; multi-head self-attention; interpretability
Transformers have achieved state-of-the-art performance across diverse language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing performance and improving behavioral control. Attribution methods help advance interpretability by assigning model outputs associated with a target concept to specific model components. Current attribution research primarily studies multi-layer perceptron (MLP) neurons and addresses relatively simple concepts such as factual associations (e.g., Paris is located in France). This focus tends to overlook the impact of the attention mechanism and lacks a unified approach for analyzing more complex concepts. To fill these gaps, we introduce Scalable Attention Module Discovery (SAMD), a concept-agnostic method for mapping arbitrary, complex concepts to specific attention heads of general transformer models. We accomplish this by representing each concept as a vector, calculating its cosine similarity with each attention head, and selecting the TopK-scoring heads to construct the concept-associated attention module. We then propose Scalar Attention Module Intervention (SAMI), a simple strategy to diminish or amplify the effects of a concept by adjusting the attention module using only a single scalar parameter. Empirically, we demonstrate SAMD on concepts of varying complexity, and visualize the locations of their corresponding modules. Our results demonstrate that module locations remain stable before and after LLM post-training, and confirm prior work on the mechanics of LLM multi-lingualism. Through SAMI, we facilitate jailbreaking on HarmBench (+72.7%) by diminishing "safety" and improve performance on the GSM8K benchmark (+1.6%) by amplifying "reasoning". Lastly, we highlight the domain-agnostic nature of our approach by suppressing the image classification accuracy of vision transformers on ImageNet.
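A hedged sketch of the two steps as we read them: score each attention head by cosine similarity with a concept vector and keep the TopK heads (SAMD-style), then scale the selected heads' outputs by a single scalar to diminish or amplify the concept (SAMI-style). The per-head representation and toy sizes are placeholders.

```python
import numpy as np

def discover_module(head_vectors, concept_vector, k=5):
    """head_vectors: (num_heads, d), one representative vector per attention head.
    Returns indices of the TopK heads most aligned with the concept and all scores."""
    h = head_vectors / np.linalg.norm(head_vectors, axis=1, keepdims=True)
    c = concept_vector / np.linalg.norm(concept_vector)
    sims = h @ c
    return np.argsort(-sims)[:k], sims

def intervene(head_outputs, module_idx, scale):
    """Scale the selected heads' outputs by one scalar: <1 diminishes, >1 amplifies."""
    out = head_outputs.copy()
    out[module_idx] *= scale
    return out

rng = np.random.default_rng(0)
heads = rng.normal(size=(32, 64))          # e.g., 32 heads, 64-dim summaries
concept = rng.normal(size=64)
idx, sims = discover_module(heads, concept, k=4)
print("module heads:", idx, "scores:", np.round(sims[idx], 3))

head_outputs = rng.normal(size=(32, 64))
suppressed = intervene(head_outputs, idx, scale=0.2)   # diminish the concept
```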
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3009. Alignment-Enhanced Integration of Connectivity and Spectral Sparse in Dynamic Sparse Training of LLM
๐ฌ 12
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ dynamic sparse training low-rank factorization spectral sparse training efficient training
With the rapid development of large language models (LLMs), identifying efficient strategies for training such large-scale systems has become increasingly critical. Although LLMs have achieved remarkable success across diverse applications, the necessity of maintaining full dense matrices during pre-training has been questioned, giving rise to parameter-efficient sparse pre-training methods which retain parameter efficiency in both training and inference. These methods can be further divided into connectivity sparse training and spectral sparse training, with dynamic connectivity sparse training and low-rank factorization emerging as representative approaches for the two branches.
However, a unified framework that effectively combines the strengths of both has yet to be established. In this work, we observe that the cancellation effect between the sparse and low-rank branches may limit the expressivity of the model, manifesting as output conflicts when the two components are combined. To address this issue, we propose a novel scheme that integrates dynamic sparse training with low-rank training, introducing a simple yet effective $\textbf{alignment loss}$ to mitigate the disagreement between the two branches and promote better collaboration. We validate this scheme by combining a representative dynamic sparse training method, CHTs, with low-rank training, resulting in a new parameter-efficient training approach termed $\textbf{CHTsL}$. The method is evaluated on LLaMA60M and LLaMA130M using the OpenWebText and C4 datasets, where only 10\%, 20\%, and 30\% of the parameters are preserved compared to dense training. Experimental results demonstrate that our proposed scheme effectively alleviates the cancellation effect and improves training stability and performance compared to the naive combination of sparse and low-rank components. Also, the new scheme enables CHTsL to consistently outperform other parameter-efficient sparse training methods under the same parameter budget, achieving performance closest to dense training.
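A minimal sketch of the alignment idea under our own assumptions: a layer sums a masked sparse branch and a low-rank branch, and an auxiliary loss penalizes disagreement (here, 1 minus cosine similarity) between the branch outputs to counter the cancellation effect. Module names, the mask, and the exact penalty are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparsePlusLowRank(nn.Module):
    def __init__(self, d_in, d_out, rank=4, density=0.1):
        super().__init__()
        self.sparse_weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.register_buffer("mask", (torch.rand(d_out, d_in) < density).float())
        self.low_rank = nn.Sequential(nn.Linear(d_in, rank, bias=False),
                                      nn.Linear(rank, d_out, bias=False))

    def forward(self, x):
        sparse_out = F.linear(x, self.sparse_weight * self.mask)   # connectivity-sparse branch
        lowrank_out = self.low_rank(x)                             # spectral (low-rank) branch
        return sparse_out + lowrank_out, sparse_out, lowrank_out

def alignment_loss(sparse_out, lowrank_out):
    """Penalize output conflict between the two branches (1 - cosine similarity)."""
    return (1 - F.cosine_similarity(sparse_out, lowrank_out, dim=-1)).mean()

layer = SparsePlusLowRank(d_in=32, d_out=32)
x = torch.randn(8, 32)
y, s, l = layer(x)
total = y.pow(2).mean() + 0.1 * alignment_loss(s, l)   # placeholder task loss + alignment term
print(total.item())
```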
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3010. Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
๐ฌ 9
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ subliminal learning hidden bias transfer LLMs finetuning distillation alignment safety
Language models can transfer hidden biases during distillation. For example, a teacher that "likes owls" can make its student "like owls" too, even when the training data consists only of lists of numbers. This surprising phenomenon is called *subliminal learning*. Subliminal learning can be expected under soft distillation, where the student is trained on the teacher's full next-token distribution. But the fact that this also occurs under hard distillationโwhere the student only sees sampled tokensโraises a deeper question: *when and how does subliminal learning actually occur?* We answer this question through controlled experiments and mechanistic analysis. Our results show that subliminal learning does not need (global) token entanglement or logit leakage. Instead, it comes down to a small set of *divergence tokens*โrare cases where teachers with different biases would predict different tokens. Masking out these tokens mostly removes the hidden bias transfer. Mechanistically, divergence tokens reveal that early layers are critical. Surprisingly, finetuning even a single such early layer is sufficient for subliminal learning. Finally, we find that subliminal learning is fragile. Even small changes, like paraphrasing prompts, are usually sufficient to suppress it.
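A minimal sketch of the divergence-token diagnostic: flag positions where a biased and a neutral teacher would emit different greedy tokens, the set the abstract reports can be masked out of the student's loss to largely remove hidden-bias transfer. The two-teacher logits interface and the toy tensors are our own assumptions.

```python
import torch

@torch.no_grad()
def divergence_mask(biased_logits, neutral_logits):
    """Positions where the biased and neutral teachers would emit different greedy tokens.

    biased_logits, neutral_logits: (seq_len, vocab) logits over the same prefix positions.
    Returns a boolean mask; training the student with these positions masked out is the
    intervention described in the abstract.
    """
    return biased_logits.argmax(-1) != neutral_logits.argmax(-1)

seq_len, vocab = 6, 10
biased = torch.randn(seq_len, vocab)
neutral = biased.clone()
neutral[2] = -biased[2]                    # flip preferences at one position
mask = divergence_mask(biased, neutral)
print("divergence tokens at positions:", mask.nonzero(as_tuple=True)[0].tolist())
```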
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3011. Ensemble Prediction of Task Affinity for Efficient Multi-Task Learning
๐ฌ 15
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Machine Learning Multi-Task Learning Task Affinity AutoML
A fundamental problem in multi-task learning (MTL) is identifying groups of tasks that should be learned together. Since training MTL models for all possible combinations of tasks is prohibitively expensive for large task sets, a crucial component of efficient and effective task grouping is predicting whether a group of tasks would benefit from learning together, measured as per-task performance gain over single-task learning. In this paper, we propose ETAP (Ensemble Task Affinity Predictor), a scalable framework that integrates principled and data-driven estimators to predict MTL performance gains. First, we consider the gradient-based updates of shared parameters in an MTL model to measure the affinity between a pair of tasks as the similarity between the parameter updates based on these tasks. This linear estimator, which we call affinity score, naturally extends to estimating affinity within a group of tasks. Second, to refine these estimates, we train predictors that apply non-linear transformations and correct residual errors, capturing complex and non-linear task relationships. We train these predictors on a limited number of task groups for which we obtain ground-truth gain values via multi-task learning for each group. We demonstrate on benchmark datasets that ETAP improves MTL gain prediction and enables more effective task grouping, outperforming state-of-the-art baselines across diverse application domains.
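A hedged sketch of the linear affinity score as described: per-task gradients of the shared parameters are flattened and compared by cosine similarity, giving a pairwise affinity matrix. The toy model and losses are illustrative, and the paper's non-linear residual predictors are not shown.

```python
import torch

def task_gradients(model, losses):
    """One flattened gradient of the shared parameters per task loss."""
    grads = []
    for loss in losses:
        g = torch.autograd.grad(loss, list(model.parameters()), retain_graph=True)
        grads.append(torch.cat([p.reshape(-1) for p in g]))
    return torch.stack(grads)

def affinity_matrix(grads):
    """Pairwise cosine similarity between task gradient updates (the linear affinity score)."""
    g = torch.nn.functional.normalize(grads, dim=1)
    return g @ g.T

# Toy shared model with two tasks pulling in opposite directions.
model = torch.nn.Linear(8, 1)
x = torch.randn(32, 8)
losses = [((model(x) - 1.0) ** 2).mean(), ((model(x) + 1.0) ** 2).mean()]
A = affinity_matrix(task_gradients(model, losses))
print(A)
```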
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3012. Agentic Reinforced Policy Optimization
๐ฌ 15
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Agentic Reinforcement Learning Large Language Model Agentic Reasoning Tool-use Alignment
Large-scale reinforcement learning with verifiable rewards (RLVR) has proven effective in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs often rely on external tools to assist in task-solving processes. However, current RL algorithms typically employ trajectory-level rollout sampling, consistently neglecting the fine-grained exploration of multi-turn tool-call steps. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Our preliminary experiments reveal that LLMs frequently exhibit increased uncertainty after tool-call steps, as evidenced by higher entropy in the distribution of generated tokens. Motivated by this, ARPO incorporates an entropy-based adaptive rollout mechanism, encouraging the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby promoting step-level exploration of latent tool-use behaviors. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Experiments across 13 challenging benchmarks demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3013. Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
๐ฌ 17
โญ 5.33
๐ generative models
๐ท๏ธ Video Diffusion Model Active Noise Selection
The choice of initial noise strongly affects quality and prompt alignment in video diffusion; different seeds for the same prompt can yield drastically different results. While recent methods use externally designed priors (e.g., frequency filtering or inter-frame smoothing), they often overlook internal model signals that indicate inherently preferable seeds.
To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency.
For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that estimates scores from a single diffusion step and a subset of informative attention layers. Experiments across diverse text-to-video backbones demonstrate improved video quality and temporal coherence with marginal inference overhead, providing a principled and generalizable approach to noise selection in video diffusion.
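A rough sketch of an entropy-disagreement acquisition in the spirit of BANSA: for a candidate seed, draw several stochastic attention maps and score the seed by the entropy of the averaged map minus the mean per-sample entropy (a BALD-style disagreement); lower values suggest a more confident, consistent seed. The Dirichlet-sampled maps stand in for the model's stochastic attention samples and are our own assumption.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def bansa_style_score(attention_samples):
    """attention_samples: (num_samples, num_queries, num_keys), rows sum to 1.
    Entropy of the mean map minus mean per-sample entropy: high values mean the
    stochastic samples disagree, i.e., the model is less consistent for this seed."""
    mean_attn = attention_samples.mean(axis=0)
    return float(entropy(mean_attn).mean() - entropy(attention_samples).mean())

rng = np.random.default_rng(0)
def random_attention(n, q, k, concentration):
    # Dirichlet rows as stand-ins for stochastic attention maps.
    return rng.dirichlet(np.full(k, concentration), size=(n, q))

consistent = random_attention(8, 4, 16, concentration=50.0)   # samples agree
noisy = random_attention(8, 4, 16, concentration=0.5)         # samples disagree
print("consistent seed score:", round(bansa_style_score(consistent), 4))
print("noisy seed score:", round(bansa_style_score(noisy), 4))
```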
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3014. Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations
๐ฌ 16
โญ 5.33
๐ applications to robotics, autonomy, planning
๐ท๏ธ Presentation Generation Self-improvement AI for Academic Research Human-Agent Interaction
The promotion of academic papers has become an important means of enhancing research visibility, where the appeal of dissemination largely determines its effectiveness.
However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of these challenges is a simple principle: *there is no way to improve what you cannot evaluate correctly*.
To address this, we introduce **EvoPresent**, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters.
Central to EvoPresent is **PresAesth**, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data.
To systematically evaluate the methods, we introduce **EvoPresent Benchmark**, a comprehensive benchmark comprising: *Presentation Generation Quality*, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and *Aesthetic Awareness*, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) High-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction.
(ii) Automated generation pipelines exhibit a trade-off between visual design and content construction. (iii) Multi-task RL training shows stronger generalization in aesthetic awareness tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3015. Causal Interpretation of Neural Network Computations with Contribution Decomposition (CODEC)
๐ฌ 9
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability neuroscience xai ai safety tool-development
Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC ($\textbf{Co}$ntribution $\textbf{Dec}$omposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of retinal activity, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3016. Clustering by Denoising: Latent plug-and-play diffusion for single-cell embeddings
๐ฌ 14
โญ 5.33
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Diffusion Plug-and-Play (PnP) Single-Cell Genomics
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult.
We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space.
This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space.
This unique ``input-space steering'' ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages:
(1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set.
We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3017. Mesh Splatting for End-to-end Multiview Surface Reconstruction
๐ฌ 16
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multi-View Stereo Geometry Processing Mesh Synthesis
Surfaces are typically represented as meshes, which can be extracted from volumetric fields via meshing or optimized directly as surface parameterizations. Volumetric representations occupy 3D space and have a large effective receptive field along rays, enabling stable and efficient optimization via volumetric rendering; however, subsequent meshing often produces overly dense meshes and introduces accumulated errors. In contrast, pure surface methods avoid meshing but capture only boundary geometry with a single-layer receptive field, making it difficult to learn intricate geometric details and increasing reliance on priors (e.g., shading or normals). We bridge this gap by differentiably turning a surface representation into a volumetric one, enabling end-to-end surface reconstruction via volumetric rendering to model complex geometries. Specifically, we soften a mesh into multiple semi-transparent layers that remain differentiable with respect to the base mesh, endowing it with a controllable 3D receptive field. Combined with a splatting-based renderer and a topology-control strategy, our method can be optimized in about 20 minutes to achieve accurate surface reconstruction while substantially improving mesh quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3018. Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
๐ฌ 14
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Restoration Diffusion Transformer Text-to-Video ControlNet Concept Distillation
We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3019. Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models
๐ฌ 11
โญ 5.33
๐ generative models
๐ท๏ธ Diffusion Model Preference Alignment Text-to-Image Text-to-Video
Recent advances in diffusion models for visual generation have sparked interest in human preference alignment, similar to developments in Large Language Models. While reward model (RM) based approaches enable trajectory-aware optimization by evaluating intermediate timesteps, they face two critical challenges: unreliable reward estimation on noisy latents due to pixel-level models' sensitivity to noise interference, and single-timestep preference evaluation across sampling trajectories where single-timestep evaluations can yield inconsistent preference rankings depending on the selected timestep.
To address these limitations, we propose a comprehensive framework with targeted solutions for each challenge. To achieve noise compatibility for reliable reward estimation, we introduce the Score-based Latent Reward Model (SLRM), which leverages the complete diffusion model as a preference discriminator with learnable task tokens and a score enhancement mechanism that explicitly preserves noise compatibility by augmenting preference logits with the denoising score function. To ensure consistent preference evaluation across trajectories, we develop Trajectory Advantages Preference Optimization (TAPO), which strategically performs Stochastic Differential Equations sampling and reward evaluation at multiple timesteps to dynamically capture trajectory advantages while identifying preference inconsistencies and preventing erroneous trajectory selection.
Extensive experiments on Text-to-Image and Text-to-Video generation tasks demonstrate significant improvements on noisy latent evaluation and alignment performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3020. Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers
๐ฌ 3
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ geometric image editing inpainting diffusion model
Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations, such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) difficulty in achieving accurate geometric editing of object translation, rotation, and scaling; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric editing dataset containing over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in terms of visual quality, geometric accuracy, and realism.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3021. ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
๐ฌ 7
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ scaling laws multilinguality
Scaling laws research has focused overwhelmingly on English, yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws' out-of-sample generalization often by more than 0.3 R². Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 × 38 = 1444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models beyond English-first AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3022. Difficulty-Diversity Collaborative Filtering for Data-Efficient LLM Fine-Tuning
๐ฌ 17
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Large Language Models Reasoning Data Efficiency Supervised Fine-Tuning Collaborative Filtering
The performance of fine-tuned language models is heavily influenced by the quality and quantity of their fine-tuning data. While scaling laws suggest that larger models benefit from more data during pretraining, the Less-is-More hypothesis highlights that downstream fine-tuning often requires only a small but high-quality dataset to effectively elicit a model's pretrained knowledge. However, identifying such premium data, particularly in terms of difficulty and diversity, typically relies on human expertise, and existing methods offer limited guidance for automatic selection from large unannotated corpora. This work presents a novel quantitative framework that formalizes the interplay between question difficulty and diversity, and introduces *Difficulty-Diversity Collaborative Filtering* (DDCF): an automated approach that tailors data selection to the unique characteristics of each language model via collaborative filtering. By leveraging a small seed dataset to predict correctness across a large unannotated corpus, our method reduces the annotation cost by $100-200\times$, while maintaining downstream performance comparable to full-corpus fine-tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3023. Steering Autoregressive Music Generation with Recursive Feature Machines
๐ฌ 10
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ music generation probing interpretability music ai multimodal LLMs steering inference
Controllable music generation remains a significant challenge, with existing methods often requiring model retraining or introducing audible artifacts. We introduce MusicRFM, a framework that adapts Recursive Feature Machines (RFMs) to enable fine-grained, interpretable control over frozen, pre-trained music models by directly steering their internal activations. RFMs analyze a model's internal gradients to produce interpretable "concept directions", or specific axes in the activation space that correspond to musical attributes like notes or chords. We first train lightweight RFM probes to discover these directions within MusicGen's hidden states; then, during inference, we inject them back into the model to guide the generation process in real-time without per-step optimization. We present advanced mechanisms for this control, including dynamic, time-varying schedules and methods for the simultaneous enforcement of multiple musical properties. Our method successfully navigates the trade-off between control and generation quality: we can increase the accuracy of generating a target musical note from 0.23 to 0.82, while text prompt adherence remains within approximately 0.02 of the unsteered baseline, demonstrating effective control with minimal impact on prompt fidelity.
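For readers who want to see the basic mechanism, below is a minimal, hypothetical sketch of activation steering with a precomputed concept direction via a PyTorch forward hook; the layer choice, steering strength, and `note_direction` vector are placeholders, and the RFM probing that discovers the direction is not reproduced here.

```python
import torch

def add_steering_hook(layer_module, direction: torch.Tensor, strength: float = 4.0):
    """Register a forward hook that shifts a layer's hidden states along a unit-norm
    concept direction during generation; the base model's weights stay frozen."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer_module.register_forward_hook(hook)

# usage sketch (layer index and direction are hypothetical):
# handle = add_steering_hook(model.transformer.h[10], note_direction, strength=4.0)
# audio = model.generate(**inputs)
# handle.remove()  # restore unsteered behavior
```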
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3024. Prompt-MII: Meta-Learning Instruction Induction for LLMs
๐ฌ 8
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Instruction Induction Prompt Generation Prompt Optimization Reinforcement Learning Task Adaptation Large Language Models
A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose Prompt-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. Prompt-MII improves downstream model quality by 4-9 F1 points (10-20\% relative), matching ICL performance while requiring 3-13x fewer tokens.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3025. OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
๐ฌ 20
โญ 5.33
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Medical foundation model Vision-language modeling CT VLM
Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both \textbf{slice-driven} local features (e.g., sub-centimeter nodules, lesion boundaries) and \textbf{volume-driven} spatial representations (e.g., tumor infiltration, inter-organ anatomical relations).
However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs.
We present \textbf{OmniCT}, a powerful unified slice-volume LVLM for CT scans, which makes three contributions:
\textbf{(i) Spatial Consistency Enhancement (SCE):} volumetric slice composition combined with tri-axial positional encoding introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation;
\textbf{(ii) Organ-level Semantic Enhancement (OSE):} segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics;
\textbf{(iii) MedEval-CT:} the largest slice-volume CT dataset and hybrid benchmark, integrating multi-level metrics for unified evaluation.
OmniCT consistently outperforms existing methods by a substantial margin across diverse clinical tasks, satisfies both micro-level detail sensitivity and macro-level spatial reasoning, and establishes a new paradigm for cross-dimensional medical imaging modeling.
Our project is available at \href{https://anonymous.4open.science/r/OmniCT}{link}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3026. PAT3D: Physics-Augmented Text-to-3D Scene Generation
๐ฌ 13
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Text-to-3D Generation Vision Language Models Rigid Body Contact Simulation Simulation-In-The-Loop Optimization
We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial simulation conditions. A rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic accuracy, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3027. RAPID$^3$: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer
๐ฌ 9
โญ 5.33
๐ generative models
๐ท๏ธ Diffusion Transformer Acceleration
Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators (step reduction, feature caching, and sparse attention) enhance inference speed but typically rely on a uniform heuristic or manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID$^3$, a Tri-Level Reinforced Acceleration Policies for Diffusion Transformers framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads (Step-Skip, Cache-Reuse, and Sparse-Attention) observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model's distribution. Across state-of-the-art DiT backbones including Stable Diffusion 3 and FLUX, RAPID$^3$ achieves nearly 3$\times$ faster sampling with competitive generation quality.
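For reference, a minimal sketch of the group-relative advantage used by GRPO, the optimizer the policy heads are trained with; the discriminator-augmented reward from the abstract is omitted and the group layout is illustrative.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group Relative Policy Optimization advantage: normalize each rollout's reward
    by the mean and std of its own group, so no learned value baseline is needed.
    rewards: (n_groups, group_size) rewards for rollouts of the same prompt/image."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mean) / std

# toy group of four rollouts for one image
print(grpo_advantages([[1.0, 0.0, 0.5, 0.5]]))
```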
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3028. Secure Inference for Diffusion Models via Unconditional Scores
๐ฌ 13
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ privacy-preserving inference diffusion models
As diffusion model-based services expand across various domains, safeguarding client data privacy has become increasingly critical. While fully homomorphic encryption and secure multi-party computation enable privacy-preserving inference, their high computational overhead poses challenges for large-scale diffusion applications. Recent work alleviates computational costs by substituting non-linear operations with low-degree polynomial approximations. While such relaxations reduce latency, they incur significant degradation in generative fidelity, and inference remains considerably slower than plaintext execution. To further accelerate secure inference while preserving performance, we explore more relaxed approximations and propose a score-correction framework that rectifies the conditional score shift induced by the relaxed approximation, rather than decreasing the approximation error itself. The key insight is that unconditional generation can be executed without approximation and thus provides a high-fidelity score signal. Leveraging this unconditional score as corrective guidance enables more relaxed approximations while preserving semantic and perceptual quality. In experiments, we demonstrate that our method significantly alleviates the performance degradation caused by relaxed approximations across various benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3029. FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
๐ฌ 15
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Text-to-image Reasoning Generation chain-of-thought dataset benchmark
The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and include explicit Generation Chain-of-Thought (GCoT) annotations that provide detailed breakdowns of image generation steps. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3030. Matched Data, Better Models: Target Aligned Data Filtering with Sparse Features
๐ฌ 11
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ data filtering submodular sparse autoencoders
Data filtering plays a central role in improving model performance, particularly for vision language models that are pretrained on large, noisy, and redundant image-caption datasets. Existing filtering techniques assess every sample individually and retain those that exceed a certain quality threshold, but such strategies fail to capture higher-order interactions. In this work, we propose a novel submodular framework for data selection that addresses this limitation. Our method, Submodular Distribution Matching (SDM), selects a subset by: (1) training a type of sparse autoencoder to learn disentangled and \emph{monotone} features; (2) estimating a target feature distribution from a target dataset; and (3) selecting a subset of samples whose feature distribution closely matches the target via submodular maximization. Given the DataComp-medium training set and no external models, SDM achieves state-of-the-art accuracy on ImageNet-1K and state-of-the-art average performance across 38 downstream tasks. On the full DataComp-medium benchmark, SDM delivers performance within 1\% of the state-of-the-art results while using over \textbf{\emph{5×}} fewer GPU hours than the leading approach.
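As a rough illustration of the submodular-maximization step (not the paper's exact SDM objective), the sketch below greedily selects samples under a saturated-coverage surrogate, which is monotone submodular when features are non-negative; the feature matrix and target mass here are synthetic placeholders.

```python
import numpy as np

def greedy_distribution_match(features, target_mass, k):
    """Greedy maximization of F(S) = sum_j min(target_mass[j], sum_{i in S} features[i, j]),
    a monotone submodular surrogate for matching a target feature distribution.
    features: (n, d) non-negative activations; target_mass: (d,) target feature totals."""
    covered = np.zeros(features.shape[1])
    selected = []
    for _ in range(k):
        # marginal gain of adding each candidate sample to the current selection
        gains = np.minimum(target_mass, covered + features).sum(axis=1) \
                - np.minimum(target_mass, covered).sum()
        gains[selected] = -np.inf
        best = int(np.argmax(gains))
        selected.append(best)
        covered += features[best]
    return selected

rng = np.random.default_rng(0)
X = rng.random((100, 8))                 # synthetic non-negative features
target = X[:20].sum(axis=0) * 0.5        # hypothetical target feature mass
print(greedy_distribution_match(X, target, k=10))
```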
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3031. OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
๐ฌ 26
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Multi-Domain Multi-Modal World Modeling
The field of 4D world modeling, which aims to jointly capture spatial geometry and temporal dynamics, has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-controlled video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3032. ORION: Decoupling and Alignment for Unified Autoregressive Understanding and Generation
๐ฌ 12
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ unified generation and understanding
Unified multimodal Large Language Models (MLLMs) hold great promise for seamlessly integrating understanding and generation. However, monolithic autoregressive architectures, despite their elegance and conversational fluency, suffer from a fundamental semantic-structural conflict: optimizing for low-level reconstructability in generation leads to catastrophic forgetting of high-level semantic understanding. We present ORION, a unified framework that resolves this conflict through Decoupling and Alignment. A non-linear vision head decouples structural pressures from shared representations, while a novel Representation Consistency Loss explicitly aligns semantics during generation. Together with a curated progressive training recipe and high-quality multimodal data, our method enables balanced optimization of both capabilities. Built purely on a monolithic autoregressive backbone without task-specific separate parameters, ORION achieves performance on par with or exceeding recent state-of-the-art unified models that rely on more complex designs. These results validate monolithic autoregression as a simple, effective, and competitive path toward truly integrated multimodal intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3033. Anatomy-aware Representation Learning for Medical Ultrasound
๐ฌ 12
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Foundation model medical ultrasound representation learning
Diagnostic accuracy of ultrasound imaging is limited by qualitative variability and its reliance on the expertise of medical professionals. Such challenges increase demand for computer-aided diagnostic systems that enhance diagnostic accuracy and efficiency. However, the unique texture and structural attributes of ultrasound images, and the scarcity of large-scale ultrasound datasets, hinder the effective application of conventional machine learning methodologies. To address these challenges, we propose Anatomy-aware Representation Learning (ARL), a novel self-supervised representation learning framework specifically designed for medical ultrasound imaging. ARL incorporates an anatomy-adaptive Vision Transformer (A-ViT), which is trained on our proposed large-scale medical ultrasound dataset to provide anatomy-aware feature representations. Through extensive experiments across various ultrasound-based diagnostic tasks, including breast and thyroid cancer, cardiac view classification, and gallbladder tumor and COVID-19 identification, we demonstrate that ARL significantly outperforms existing self-supervised learning baselines. The experiments demonstrate the potential of ARL in advancing medical ultrasound diagnostics by providing anatomy-specific feature representations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3034. Revisiting Tree-Sliced Wasserstein Distance Through the Lens of the Fermat-Weber Problem
๐ฌ 11
โญ 5.33
๐ learning on graphs and other geometries & topologies
๐ท๏ธ sliced optimal transport tree-sliced wasserstein distance tree wasserstein distance fermat-weber problem
Tree-Sliced methods have emerged as an efficient and expressive alternative to the traditional Sliced Wasserstein distance, replacing one-dimensional projections with tree-structured metric spaces and leveraging a splitting mechanism to better capture the underlying topological structure of integration domains while maintaining low computational cost. At the core of this framework is the Tree-Sliced Wasserstein (TSW) distance, defined over probability measures in Euclidean spaces, along with several variants designed to enhance its performance. A fundamental distinction between SW and TSW lies in their sampling strategies, a component explored in the context of SW but often overlooked in comparisons. This omission is significant: whereas SW relies exclusively on directional projections, TSW incorporates both directional and positional information through its tree-based construction. This enhanced spatial sensitivity enables TSW to reflect the geometric structure of the underlying data more accurately. Building on this insight, we propose a novel variant of TSW that explicitly leverages positional information in its design. Inspired by the classical Fermat-Weber problem, which seeks a point minimizing the sum of distances to a given set of points, we introduce the Fermat-Weber Tree-Sliced Wasserstein (FW-TSW) distance. By incorporating geometric median principles into the tree construction process, FW-TSW notably further improves the performance of TSW while preserving its low computational cost. These improvements are empirically validated across diverse experiments, including diffusion model training and gradient flow.
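To make the Fermat-Weber intuition concrete, here is a minimal NumPy sketch of Weiszfeld's algorithm for the geometric median, the classical Fermat-Weber solution the abstract builds on; it does not implement the FW-TSW distance itself.

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm: find the point minimizing the sum of Euclidean
    distances to `points` (the classical Fermat-Weber objective)."""
    y = points.mean(axis=0)                  # start from the centroid
    for _ in range(n_iter):
        d = np.linalg.norm(points - y, axis=1)
        d = np.maximum(d, eps)               # avoid division by zero at a data point
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(geometric_median(pts))  # pulled toward the dense cluster, unlike the mean
```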
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3035. Offline Reinforcement Learning with Adaptive Feature Fusion
๐ฌ 14
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Trajectory Stitching
Return-conditioned supervised learning (RCSL) algorithms have demonstrated strong generative capabilities in offline reinforcement learning (RL) by learning action distributions based on both the state and the return. However, many existing approaches treat RL as a conditional sequence modeling task, which can lead to an overreliance on suboptimal past experiences, impairing decision-making and reducing the effectiveness of trajectory synthesis. To address these limitations, we propose a novel approach, the Q-Augmented Dual-Feature Fusion Decision Transformer (QDFFDT), which adaptively combines both global sequence features and local immediate features through a learnable fusion mechanism. This model improves generalization across different tasks without the need for extensive hyperparameter tuning. Experimental results on the D4RL benchmark show that QDFFDT outperforms current methods, establishing new state-of-the-art performance and demonstrating the power of adaptive feature fusion.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3036. Disentangling Length Bias in Preference Learning via Response-Conditioned Modeling
๐ฌ 15
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Preference Modeling Bradley-Terry Model
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model's scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length requirements. Specifically, we introduce a $\textbf{R}$esponse-$\textbf{c}$onditioned $\textbf{B}$radley-$\textbf{T}$erry (Rc-BT) model that enhances the model's capability in mitigating length bias and following length instructions, through training on our augmented dataset. Furthermore, we propose the Rc-RM and Rc-DPO algorithms to leverage the Rc-BT model for reward modeling and direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions. Extensive experiments across various models and datasets demonstrate the effectiveness and generalizability of our approach.
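For background, the sketch below shows the standard Bradley-Terry reward-modeling loss that Rc-BT builds on; the response-conditioning and augmented dataset are specific to the paper and are not reproduced, and the scalar rewards here are toy values standing in for a reward model's outputs.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry objective: maximize P(chosen > rejected) =
    sigmoid(r_chosen - r_rejected) over a batch of preference pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# toy usage with scalar rewards for three preference pairs
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.1, 0.5, 1.0])
print(bradley_terry_loss(r_chosen, r_rejected))
```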
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3037. THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
๐ฌ 20
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Large Language Models Mathematical Problem Solving Tool-Integrated Reasoning Reinforcement Learning
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3038. Shift-Tolerant Allocation via Black-Litterman Using Conditional Diffusion Estimates
๐ฌ 11
โญ 5.33
๐ learning on time series and dynamical systems
๐ท๏ธ portfolio allocation time-series estimation generative modeling
In dynamic financial markets characterized by shifting regimes, how can we make effective investment decisions under (1) changing market regimes and (2) their impact on assets?
Among many research fields in financial AI, portfolio allocation stands out as one of the most practically significant areas.
Consequently, numerous researchers and financial institutions continually seek approaches that improve the risk-reward trade-off and strive to apply them in real-world investment scenarios.
However, achieving robust risk-adjusted performance is extremely challenging, because each asset's return and volatility fluctuate according to the shifting market regime.
In response, modern portfolio theory (MPT) addresses this issue by solving for asset weights that maximize a risk-reward objective, using estimates of the return mean and covariance from historical returns.
Reinforcement learning (RL) frameworks have been introduced to directly decide portfolio allocations by optimizing risk-adjusted objectives using asset prices and macroeconomic indices.
In this work, we propose STABLE (Shift-Tolerant Allocation with Black-Litterman Using Conditional Diffusion Estimates), which combines a diffusion-based generative model that captures regime shifts with an estimation-based portfolio allocation module that maximizes expected risk-adjusted return. STABLE takes macroeconomic context and asset-specific signals as inputs and generates per-stock return trajectories that reflect the prevailing macro regime while preserving firm-specific dynamics.
This yields regime-aware predictive return distributions at the single-stock level together with a coherent covariance structure, which are then incorporated as investor views within a Black-Litterman allocation module to obtain risk-diversified portfolio weights.
Empirically, STABLE delivers superior portfolio outcomes, achieving up to 122.9% higher Sharpe ratios with reduced drawdowns across major equity markets.
It also attains state-of-the-art time-series estimation, lowering MSE by up to 15.7% compared with generative baselines.
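The Black-Litterman step itself is standard; below is a minimal NumPy sketch of the posterior-mean update into which diffusion-generated forecasts could enter as the views q (all variable values here are illustrative, not the paper's).

```python
import numpy as np

def black_litterman_posterior(pi, Sigma, P, q, Omega, tau=0.05):
    """Classic Black-Litterman posterior mean of expected returns.
    pi: (n,) equilibrium prior returns, Sigma: (n,n) return covariance,
    P: (k,n) view pick matrix, q: (k,) view returns, Omega: (k,k) view uncertainty."""
    tS_inv = np.linalg.inv(tau * Sigma)
    A = tS_inv + P.T @ np.linalg.inv(Omega) @ P
    b = tS_inv @ pi + P.T @ np.linalg.inv(Omega) @ q
    return np.linalg.solve(A, b)

# toy example: two assets, one view saying asset 0 outperforms asset 1 by 2%
pi = np.array([0.04, 0.03])
Sigma = np.array([[0.04, 0.01], [0.01, 0.02]])
P = np.array([[1.0, -1.0]])
q = np.array([0.02])
Omega = np.array([[0.001]])
print(black_litterman_posterior(pi, Sigma, P, q, Omega))
```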
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3039. LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing
๐ฌ 13
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MoE Optimization LLM Low resource adaptation
Recent attempts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for multi-task adaptation of Large Language Models (LLMs) often replace whole attention/FFN layers with switch experts or append parallel expert branches, undermining parameter efficiency and limiting task specialization. We introduce **LoRA-Mixer**, a modular MoE framework that routes task-specific LoRA experts into the core projection matrices of the attention module (input/output linear layers), rather than primarily targeting FFN blocks. The design delivers fine-grained token-level specialization by fully exploiting the attention mechanism, while remaining drop-in compatible with Transformers and state-space models (SSMs) as the linear projection layers are ubiquitous. To train robust routers from limited data while promoting stable, selective decisions and high expert reuse, **LoRA-Mixer** employs an adaptive **Routing Specialization Loss (RSL)** that jointly enforces global load balance and input-aware specialization via an entropy-shaping objective. The framework supports two regimes: (i) joint optimization of adapters and router with a differentiable hard-soft top-k routing scheme, and (ii) plug-and-play routing over frozen, pre-trained LoRA modules sourced from public repositories. Across 15 benchmarks, including MedQA, GSM8K, HumanEval, and GLUE, RSL-optimized LoRA-Mixer outperforms state-of-the-art routing and LoRA-MoE baselines while using 48% of their trainable parameters, with gains of +3.79%, +2.90%, and +3.95% on GSM8K, CoLA, and ARC-C, respectively. Cross-model transfer and adapter reuse experiments further demonstrate the approach's versatility and data efficiency.
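A minimal, hypothetical PyTorch sketch of the general idea of routing LoRA experts inside a frozen linear projection is given below; the RSL loss and the hard-soft top-k scheme are the paper's own and are not reproduced, and all module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedLoRALinear(nn.Module):
    """Frozen base projection plus a router-weighted mixture of LoRA experts."""
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):                                     # x: (batch, d_in)
        gates = F.softmax(self.router(x), dim=-1)             # (batch, n_experts)
        low = torch.einsum("erd,bd->ber", self.A, x)          # per-expert down-projection
        delta = torch.einsum("eor,ber->beo", self.B, low)     # per-expert LoRA update
        return self.base(x) + (gates.unsqueeze(-1) * delta).sum(dim=1)

layer = RoutedLoRALinear(d_in=64, d_out=64)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```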
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3040. Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval
๐ฌ 11
โญ 5.33
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Segment Anything Model Efficient Deep Learning Model Acceleration
Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, the heavy computational burden hinders its application in real-time video processing.
Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration into post-training acceleration.
In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities for eliminating redundant computation and acceleration:
i) In the mask decoder, the attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions.
ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant.
With these insights, we propose Efficient-SAM2, which enables SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency.
Specifically, for image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch.
Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection.
With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers 1.68$\times$ speedup on SAM2.1-L model with only 1.0\% accuracy drop on SA-V test set, where SWR and SMR provide 1.83$\times$ and 1.78$\times$ speedups, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3041. Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
๐ฌ 13
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Reasoning Modality-aware Attribution Reinforcement Learning Multimodal Large Language Models
Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models (MLLMs), yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection. Such approaches neglect fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose causal and temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only the union of key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results (42.7% on Video-Holmes, surpassing GPT-4o), with consistent gains on both reasoning-centric and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning.
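A rough sketch of how the three attribution signals could be combined into a key-token mask, assuming per-token log-probabilities under the full, frame-masked, and frame-shuffled inputs are already available; the top-k fraction and the union rule are illustrative choices, not the paper's exact recipe.

```python
import torch

def key_token_mask(logp_full, logp_visual_masked, logp_shuffled, entropy, k=0.2):
    """Union of three per-token attribution signals (each tensor has shape (seq_len,)):
    visual-aware tokens (log-prob drop under frame masking), temporal-aware tokens
    (drop under frame shuffling), and high-entropy tokens."""
    def topk_mask(score):
        n = max(1, int(k * score.numel()))
        mask = torch.zeros_like(score, dtype=torch.bool)
        mask[score.topk(n).indices] = True
        return mask

    visual = topk_mask(logp_full - logp_visual_masked)   # large drop => vision-dependent
    temporal = topk_mask(logp_full - logp_shuffled)       # large drop => temporally sensitive
    uncertain = topk_mask(entropy)                        # high predictive uncertainty
    return visual | temporal | uncertain                  # reinforce only these positions

seq = 12
mask = key_token_mask(torch.randn(seq), torch.randn(seq), torch.randn(seq), torch.rand(seq))
print(mask)
```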
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3042. New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework
๐ฌ 9
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ peft large language models fine tuning
Fine-tuning Large Language Models (LLMs) typically involves either full fine-tuning, which updates all model parameters, or Parameter-Efficient Fine-Tuning (PEFT), which adjusts a small subset of parameters. However, both approaches have inherent limitations: full fine-tuning is computationally expensive, while PEFT often struggles to learn new knowledge and exhibits suboptimal performance. To overcome these issues, we propose a novel *hybrid fine-tuning* approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze our new algorithm, we develop a theoretical framework centered on the concept of a *hybrid smoothness condition*, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We derive a rigorous convergence analysis for a reshuffling-type SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. On the practical side, our results demonstrate consistent performance improvement, making the approach a viable solution for large-scale language model fine-tuning.
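A minimal sketch of what one hybrid step could look like under the abstract's setup: a two-point (SPSA-style) zeroth-order estimate for the large backbone and an exact first-order gradient for the small PEFT module; the loss, perturbation scale, and learning rates are toy placeholders.

```python
import numpy as np

def hybrid_step(theta_llm, theta_peft, loss_fn, grad_peft_fn,
                lr_zo=1e-4, lr_fo=1e-3, mu=1e-3, rng=np.random.default_rng(0)):
    """One hybrid update: zeroth-order (two-point SPSA) gradient estimate for the large
    backbone parameters, exact first-order gradient for the PEFT parameters."""
    # zeroth-order gradient estimate for the backbone (no backprop through it)
    u = rng.standard_normal(theta_llm.shape)
    g_zo = (loss_fn(theta_llm + mu * u, theta_peft)
            - loss_fn(theta_llm - mu * u, theta_peft)) / (2.0 * mu) * u
    # exact first-order gradient for the (small) PEFT module
    g_fo = grad_peft_fn(theta_llm, theta_peft)
    return theta_llm - lr_zo * g_zo, theta_peft - lr_fo * g_fo

# toy quadratic example standing in for the fine-tuning loss
loss = lambda a, b: np.sum(a ** 2) + np.sum((b - 1.0) ** 2)
grad_b = lambda a, b: 2.0 * (b - 1.0)
a, b = np.ones(5), np.zeros(3)
a, b = hybrid_step(a, b, loss, grad_b)
print(a.shape, b.shape)
```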
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3043. Learning to Reason as Action Abstractions with Scalable Mid-Training RL
๐ฌ 11
โญ 5.33
๐ reinforcement learning
๐ท๏ธ reinforcement learning large language model
Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. Intuitively, an effective mid-training stage should both learn a strong policy prior and enable fast learning through online interactions. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it acquires strong policy priors by efficiently pruning the action space and accelerates RL convergence by shortening the effective planning horizon. Moreover, we prove that temporal abstractions simultaneously compress the size of the action set and reduce the decision horizon, thereby improving regret minimization after training. Building on these insights, we introduce Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a temporal variational bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, then fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3044. Bayesian Test-Time Adaptation via Dirichlet feature projection and GMM-Driven Inference for Motor Imagery EEG Decoding
๐ฌ 19
โญ 5.33
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain-computer interface motor imagery test-time adaptation Dirichlet distribution Bayesian inference
Generalization in EEG-based motor imagery (MI) brain-computer interfaces (BCIs) is severely hampered by cross-subject and cross-session variability. Although large-scale EEG pretraining has advanced representation learning, practical deployment of such models is hindered by the need for costly fine-tuning to overcome significant domain shifts. Test-time adaptation (TTA) methods that adapt models during inference offer a promising solution. However, existing EEG-TTA methods either rely on gradient-based fine-tuning (suffering from high computational cost and catastrophic forgetting) or data alignment strategies (failing to capture shifts in temporal predictive embeddings). To address these limitations, we propose BTTA-DG, a novel Bayesian Test-Time Adaptation framework that performs efficient, gradient-free adaptation by directly modeling the distribution of temporal predictive embeddings. Our approach first employs a lightweight SincAdaptNet with learnable filters to extract task-specific frequency bands. We then introduce a novel Dirichlet feature projection that maps temporal embeddings onto a compact and interpretable parameter space, effectively capturing the concentration of time-varying predictive evidence. Adaptation is achieved via a GMM-driven Bayesian inference mechanism, which models the historical distribution of these Dirichlet parameters and fuses this evidence with the model's prior predictions to calibrate outputs for the target domain. Extensive experiments show that BTTA-DG significantly outperforms previous EEG-TTA methods, achieving state-of-the-art accuracy while running at real-time speed. Furthermore, visualizations confirm the physiological interpretability of our learned filters and the robust class separability of our Dirichlet feature space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3045. Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
๐ฌ 28
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Instruction Following Large Language Model
Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3046. CortiLife: A Unified Framework for Cortical Representation Learning across the Lifespan
๐ฌ 22
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision-language Pretraining Cortical surface modeling Lifespan
The human cerebral cortex encodes rich neurobiological information that is essential for understanding brain development, aging, and disease. Although various cortical representation learning methods have been proposed, existing models are typically restricted to stage-specific cohorts and lack generalization across the lifespan. While recent vision-language models offer a promising direction, building a unified framework for cortical representation faces three key challenges: (1) the non-Euclidean manifold structure of cortical surfaces, (2) homogenization of individual folding patterns induced by registration, and (3) distribution shifts of cortical features across the lifespan. To address these issues, we present CortiLife, the first unified vision-language framework for lifespan-aware cortical representation learning. Specifically, CortiLife introduces a surface tokenizer that integrates icosahedron-based surface patchification with multi-level patch encoding to transform complex cortical manifolds into compact token representations. The multi-level encoding incorporates three complementary streams that capture local topology, global interactions, and patch-wise distributional patterns, effectively mitigating the challenges of homogenization and distribution shifts. Furthermore, CortiLife integrates masked self-distillation with metadata language prompting, embedding information such as age, sex, health status, and attribution type into the text encoder to better capture individual-specific cortical representations while enabling both age-aware and modality-aware modeling. Extensive experiments on downstream tasks, including two encoder-frozen tasks (age prediction and cortical parcellation) and four encoder fine-tuning tasks (brain disorder diagnosis), demonstrate that CortiLife consistently outperforms state-of-the-art baselines across different age stages and modality types, underscoring its effectiveness and generalization ability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3047. MCbiF: Measuring Topological Autocorrelation in Multiscale Clusterings via 2-Parameter Persistent Homology
๐ฌ 18
โญ 5.33
๐ learning on graphs and other geometries & topologies
๐ท๏ธ topological data analysis multiparameter persistent homology multiscale clustering non-hierarchical clustering Sankey diagrams
Datasets often possess an intrinsic multiscale structure with meaningful descriptions at different levels of coarseness. Such datasets are naturally described as multi-resolution clusterings, i.e., not necessarily hierarchical sequences of partitions across scales. To analyse and compare such sequences, we use tools from topological data analysis and define the Multiscale Clustering Bifiltration (MCbiF), a 2-parameter filtration of abstract simplicial complexes that encodes cluster intersection patterns across scales. The MCbiF is a complete invariant of (non-hierarchical) sequences of partitions and can be interpreted as a higher-order extension of Sankey diagrams, which reduce to dendrograms for hierarchical sequences. We show that the multiparameter persistent homology (MPH) of the MCbiF yields a finitely presented and block decomposable module, and its stable Hilbert functions characterise the topological autocorrelation of the sequence of partitions. In particular, at dimension zero, the MPH captures violations of the refinement order of partitions, whereas at dimension one, the MPH captures higher-order inconsistencies between clusters across scales. We then demonstrate through experiments the use of MCbiF Hilbert functions as interpretable topological feature maps for downstream machine learning tasks, and show that MCbiF feature maps outperform both baseline features and representation learning methods on regression and classification tasks for non-hierarchical sequences of partitions. We also showcase an application of MCbiF to real-world data of non-hierarchical wild mice social grouping patterns across time.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3048. Boosting Multi-Domain Reasoning of LLMs via Curvature-Guided Policy Optimization
๐ฌ 19
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models policy optimization multi-domain reasoning
Multi-domain reinforcement learning (RL) for large language models (LLMs) involves highly intricate reward surfaces, posing significant challenges in finding parameters that excel across all domains. Recent empirical studies have further highlighted conflicts among domains, where gains in one capability often come at the expense of another. However, approaches to mitigate such conflicts and enhance multi-domain reasoning remain largely underexplored. To address this challenge, we propose **C**urvature-**G**uided **P**olicy **O**ptimization (**CGPO**), a principled and scalable training framework to advance the multi-domain reasoning of LLMs. Inspired by Newton's method, CGPO exploits the geometric structure in the reward surface, while sidestepping the prohibitive cost of Hessian computation. At each update, CGPO processes domains in random order, preconditioning their gradients with curvature information from other domains to foster richer cross-domain interactions. This mechanism further promotes implicit gradient alignment by maximizing inter-domain inner products in expectation, steering the parameters toward regions that jointly enhance multi-domain performance. Extensive experiments on a mixed dataset covering math, coding, science, and creative writing, evaluated across seven widely-used benchmarks, show that CGPO significantly outperforms all baselines in terms of faster reward improvement and stronger multi-domain capability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3049. A Function-Centric Graph Neural Network Approach for Predicting Electron Densities
๐ฌ 13
โญ 5.33
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Graph Neural Network Electron Density Density Functional Theory Message Passing Basis Overlap Equivariance Molecules
Electronic structure predictions are relevant for a wide range of applications, from drug discovery to materials science. Since the cost of purely quantum mechanical methods can be prohibitive, machine learning surrogates are used to predict the result of these calculations. This work introduces the Basis Overlap Architecture (BOA), an equivariant graph neural network architecture based on a novel message passing scheme that utilizes the overlap matrix of the basis functions used to represent the ground state electron density. BOA is evaluated on QM9 and MD density datasets, surpassing the previous state-of-the-art in predicting accurate electron densities. Excellent generalization to larger molecules of up to nearly 200 atoms is demonstrated using a model trained only on QM9 molecules of up to 9 heavy atoms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3050. Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
๐ฌ 13
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Data Contamination Detection; LLMs; Reinforcement Learning; Entropy
Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training.
As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability.
To address this, we conduct the first systematic study of data contamination detection in the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction.
To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario.
Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
Our benchmark and code are available at https://anonymous.4open.science/r/Data-Contamination-in-RL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3051. Dynamic Early Exit in Reasoning Models
๐ฌ 14
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Efficient Reasoning Early Exit
Recent advances in large reasoning language models (LRMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down the efficiency of problem solving, but also risks accuracy loss due to the extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on 10 reasoning benchmarks (e.g., GSM8K, MATH-500, AMC, GPQA, AIME and LiveCodeBench) show that the proposed method is consistently effective on 11 cutting-edge reasoning LLMs of varying series and sizes, reducing the length of CoT sequences by an average of 19.1% to 80.1% while improving accuracy by 0.3% to 5.0%.
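A minimal, hypothetical sketch of the confidence probe such an early-exit rule relies on, assuming a Hugging Face-style causal LM; the probe text, trial-answer length, and threshold are illustrative and not the paper's exact settings.

```python
import torch

def should_early_exit(model, tokenizer, prefix_ids, probe_text="\nThe answer is",
                      trial_len=8, threshold=0.9):
    """At a candidate reasoning transition point, append a short probe to the partially
    generated CoT (prefix_ids, shape (1, seq)) and measure the model's confidence over a
    greedy trial answer; exit early if the average token probability is high enough."""
    probe_ids = tokenizer(probe_text, return_tensors="pt").input_ids.to(prefix_ids.device)
    ids = torch.cat([prefix_ids, probe_ids], dim=-1)
    confs = []
    with torch.no_grad():
        for _ in range(trial_len):
            logits = model(ids).logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            conf, nxt = probs.max(dim=-1)
            confs.append(conf.item())
            ids = torch.cat([ids, nxt.unsqueeze(-1)], dim=-1)
    return sum(confs) / len(confs) >= threshold
```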
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3052. Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information
๐ฌ 10
โญ 5.33
๐ learning theory
๐ท๏ธ stackelberg games bandit learning side information
We study the problem of online learning in Stackelberg games with side information between a leader and a sequence of followers. In every round the leader observes contextual information and commits to a mixed strategy, after which the follower best-responds. We provide learning algorithms for the leader which achieve $O(T^{1/2})$ regret under bandit feedback, an improvement over the previously best-known rate of $O(T^{2/3})$. Our algorithms rely on a reduction to linear contextual bandits in the utility space: in each round, a linear contextual bandit algorithm recommends a utility vector, which our algorithm inverts to determine the leader's mixed strategy. We extend our algorithms to the setting in which the leader's utility function is unknown, and also apply them to the problems of bidding in second-price auctions with side information and online Bayesian persuasion with public and private states. Finally, we observe that our algorithms empirically outperform previous approaches in numerical simulations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3053. Policy Contrastive Decoding for Robotic Foundation Models
๐ฌ 16
โญ 5.33
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotic Foundation Models Contrastive Decoding
Generalist robot policies, or robotic foundation models, hold immense potential to enable flexible, general-purpose and dexterous robotic systems. Despite their advancements, our empirical experiments reveal that existing robot policies are prone to learning spurious correlations from pre-training trajectories, adversely affecting their generalization capabilities during inference. To tackle this, we propose a novel Policy Contrastive Decoding (PCD) approach, which redirects the robot policy's focus toward object-relevant visual clues by contrasting action probability distributions derived from original and object-masked visual inputs. As a training-free method, our PCD can be used as a plugin to improve different types of robot policies without needing to finetune or access model weights. We conduct extensive experiments on top of three open-source robot policies, including the autoregressive policy OpenVLA and the diffusion-based policies Octo and Pi-0. The obtained results in both simulation and real-world environments prove PCD's flexibility and effectiveness, e.g., PCD enhances the state-of-the-art policy $\pi_0$ by 8.9% in the simulation environment and by 108% in the real-world environment. Our code is publicly available at: https://github.com/pcd-robot/PCD.
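A minimal sketch of the contrastive-decoding idea at the heart of PCD, assuming a policy that exposes per-action logits for a given observation; the masking function is left abstract and the mixing weight `alpha` is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def policy_contrastive_logits(policy, obs, masked_obs, alpha=0.5):
    """Contrast action distributions from the original and object-masked observations
    so that object-relevant visual evidence is amplified."""
    logits_full = policy(obs)            # (batch, n_actions), original visual input
    logits_masked = policy(masked_obs)   # same input with task-relevant objects masked out
    return (1.0 + alpha) * logits_full - alpha * logits_masked

# toy usage with a linear "policy" over flattened observations
policy = torch.nn.Linear(16, 7)
obs, masked = torch.randn(2, 16), torch.randn(2, 16)
probs = F.softmax(policy_contrastive_logits(policy, obs, masked), dim=-1)
print(probs.shape)  # torch.Size([2, 7])
```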
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3054. BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training
๐ฌ 14
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Binary Neural Networks Fully binary training Binary error backpropagation Gradient-free optimization Binary Recurrent Neural Networks
Binary Neural Networks (BNNs), which constrain both weights and activations to binary values, offer substantial reductions in computational complexity, memory footprint, and energy consumption. These advantages make them particularly well suited for deployment on resource-constrained devices. However, training BNNs via gradient-based optimization remains challenging due to the discrete nature of their variables. The dominant approach, quantization-aware training, circumvents this issue by employing surrogate gradients. Yet, this method requires maintaining latent full-precision parameters and performing the backward pass with floating-point arithmetic, thereby forfeiting the efficiency of binary operations during training. While alternative approaches based on local learning rules exist, they are unsuitable for global credit assignment and for back-propagating errors in multi-layer architectures. This paper introduces Binary Error Propagation (BEP), the first learning algorithm to establish a principled, discrete analog of the backpropagation chain rule. This mechanism enables error signals, represented as binary vectors, to be propagated backward through multiple layers of a neural network. BEP operates entirely on binary variables, with all forward and backward computations performed using only bitwise operations. Crucially, this makes BEP the first solution to enable end-to-end binary training for recurrent neural network architectures. We validate the effectiveness of BEP on both multi-layer perceptrons and recurrent neural networks, demonstrating gains of up to $+6.89$% and $+10.57$% in test accuracy, respectively. The proposed algorithm is released as an open-source repository.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3055. Incorporating Expert Priors into Bayesian Optimization via Dynamic Mean Decay
๐ฌ 14
โญ 5.33
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian optimization Gaussian processes Expert prior knowledge Hyperparameter optimization
Bayesian optimization (BO) is a powerful approach for black-box optimization, and in many real-world problems, domain experts possess valuable prior knowledge about promising regions of the search space. However, existing prior-informed BO methods are often overly complex, tied to specific acquisition functions, or highly sensitive to inaccurate priors. We propose DynMeanBO, a simple and general framework that incorporates expert priors into the Gaussian process mean function with a dynamic decay mechanism. This design allows BO to exploit expert knowledge in the early stages while gradually reverting to standard BO behavior, ensuring robustness against misleading priors while retaining the exploratory behavior of standard BO. DynMeanBO is broadly compatible with acquisition functions, introduces negligible computational cost, and comes with convergence guarantees under Expected Improvement and Upper Confidence Bound. Experiments on synthetic benchmarks and hyperparameter optimization tasks show that DynMeanBO accelerates convergence with informative priors and remains robust under biased ones.
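A rough sketch of one way to realize an expert prior in the GP mean with a decaying weight, assuming scikit-learn's `GaussianProcessRegressor`; the exponential decay schedule and the residual-fitting trick are illustrative, not necessarily DynMeanBO's exact construction.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def posterior_with_decaying_prior_mean(X, y, X_query, expert_mean, t, decay=0.9):
    """Fit a zero-mean GP to residuals against a decayed expert prior mean
    beta_t * m_expert(x): early on the prior dominates, later the data does."""
    beta_t = decay ** t                                   # illustrative decay schedule
    prior = lambda Z: beta_t * np.array([expert_mean(z) for z in Z])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=False)
    gp.fit(X, y - prior(X))                               # GP models the residual
    mu, std = gp.predict(X_query, return_std=True)
    return prior(X_query) + mu, std

# toy usage: 1-D objective, hypothetical expert belief peaked near x = 0.4
X = np.array([[0.1], [0.4], [0.9]])
y = np.array([0.2, 0.8, 0.1])
mu, std = posterior_with_decaying_prior_mean(
    X, y, np.array([[0.5]]), expert_mean=lambda z: -float((z[0] - 0.4) ** 2), t=3)
print(mu, std)
```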
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3056. TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models
๐ฌ 10
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models hallucination detection
Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world applications. Existing hallucination detection methods are designed for AR-LLMs and rely on signals from \emph{single-step} generation, making them ill-suited for D-LLMs where hallucination signals often emerge throughout the \emph{multi-step} denoising process. To bridge this gap, we propose \textbf{TraceDet}, a novel framework that explicitly leverages the intermediate denoising steps of D-LLMs for hallucination detection. TraceDet models the denoising process as an \emph{action trace}, with each action defined as the model's prediction over the cleaned response, conditioned on the previous intermediate output. By identifying the sub-trace that is maximally informative to the hallucinated responses, TraceDet leverages the key hallucination signals in the multi-step denoising process of D-LLMs for hallucination detection. Extensive experiments on various open source D-LLMs demonstrate that \textbf{TraceDet} consistently improves hallucination detection, achieving an average gain in AUROC of 15.2\% compared to baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3057. A Primer on SO(3) Action Representations in Deep Reinforcement Learning
๐ฌ 14
โญ 5.33
๐ reinforcement learning
๐ท๏ธ RL SO(3) 3D rotations Action Representations Deep RL robotics
Many robotic control tasks require policies to act on orientations, yet the geometry of SO(3) makes this nontrivial. Because SO(3) admits no global, smooth, minimal parameterization, common representations such as Euler angles, quaternions, rotation matrices, and Lie algebra coordinates introduce distinct constraints and failure modes. While these trade-offs are well studied for supervised learning, their implications for actions in reinforcement learning remain unclear. We systematically evaluate SO(3) action representations across three standard continuous control algorithms, PPO, SAC, and TD3, under dense and sparse rewards. We compare how representations shape exploration, interact with entropy regularization, and affect training stability through empirical studies and analyze the implications of different projections for obtaining valid rotations from Euclidean network outputs. Across a suite of robotics benchmarks, we quantify the practical impact of these choices and distill simple, implementation-ready guidelines for selecting and using rotation actions. Our results highlight that representation-induced geometry strongly influences exploration and optimization and show that
representing actions as tangent vectors in the local frame yields the most reliable results across algorithms.
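For readers who want to see what the projections discussed above look like in practice, here is a small sketch of three standard ways to turn unconstrained network outputs into valid rotations (standard constructions; the paper's specific implementation and evaluation are not reproduced):

```python
import numpy as np

def quat_project(v4):
    """R^4 output -> unit quaternion (normalization projection)."""
    return v4 / np.linalg.norm(v4)

def sixd_to_rotation(v6):
    """R^6 output -> rotation matrix via Gram-Schmidt (the common '6D' representation)."""
    a, b = v6[:3], v6[3:]
    x = a / np.linalg.norm(a)
    b = b - np.dot(x, b) * x
    y = b / np.linalg.norm(b)
    return np.stack([x, y, np.cross(x, y)], axis=-1)

def so3_exp(w):
    """R^3 tangent vector -> rotation matrix via the exponential map (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

R = sixd_to_rotation(np.random.randn(6))
print(np.allclose(R.T @ R, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
print(so3_exp(np.array([0.0, 0.0, np.pi / 2])).round(3))   # 90-degree yaw
```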
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3058. Negotiated Reasoning: On Provably Addressing Relative Over-Generalization
๐ฌ 28
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Multi-Agent Reinforcement Learning Relative Over-Generalization Stein variational gradient descent
We focus on the relative over-generalization (RO) issue in fully cooperative multi-agent reinforcement learning (MARL). Existing methods show that endowing agents with reasoning can help mitigate RO empirically, but there is little theoretical insight. We first prove that RO is avoided when agents satisfy a consistent reasoning requirement. We then propose a new negotiated reasoning framework connecting reasoning and RO with theoretical guarantees. Based on it, we develop an algorithm called Stein variational negotiated reasoning (SVNR), which uses Stein variational gradient descent to form a negotiation policy that provably bypasses RO under maximum-entropy policy iteration. SVNR is further parameterized with neural networks for computational efficiency. Experiments demonstrate that SVNR significantly outperforms baselines on RO-challenged tasks, including Multi-Agent Particle World and MaMuJoCo, confirming its advantage in achieving better cooperation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3059. Plug-and-Play Compositionality for Boosting Continual Learning with Foundation Models
๐ฌ 7
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual learning
Vision learners often struggle with catastrophic forgetting due to their reliance on class recognition by comparison, rather than understanding classes as compositions of representative concepts.
This limitation is prevalent even in state-of-the-art continual learners with foundation models and worsens when current tasks contain few classes.
Inspired by the recent success of concept-level understanding in mitigating forgetting, we design a universal framework CompSLOT to guide concept learning across diverse continual learners.
Leveraging the progress of object-centric learning in parsing semantically meaningful slots from images, we tackle the challenge of learning slot extraction from ImageNet-pretrained vision transformers by analyzing meaningful concept properties.
We further introduce a primitive selection and aggregation mechanism to harness concept-level image understanding.
Additionally, we propose a method-agnostic self-supervision approach to distill sample-wise concept-based similarity information into the classifier, reducing reliance on incorrect or partial concepts for classification.
Experiments show CompSLOT significantly enhances various continual learners and provides a universal concept-level module for the community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3060. Memory-Statistics Tradeoff in Continual Learning with Structural Regularization
๐ฌ 10
โญ 5.33
๐ learning theory
๐ท๏ธ continual learning deep learning theory
We study the statistical performance of a continual learning problem with two linear regression tasks in a well-specified random design setting. We consider a structural regularization algorithm that incorporates a generalized $\ell_2$-regularization tailored to the Hessian of the previous task for mitigating catastrophic forgetting. We establish upper and lower bounds on the joint excess risk for this algorithm. Our analysis reveals a fundamental trade-off between memory complexity and statistical efficiency, where memory complexity is measured by the number of vectors needed to define the structural regularization. Specifically, increasing the number of vectors in structural regularization leads to a worse memory complexity but an improved excess risk, and vice versa. Furthermore, our theory suggests that naive continual learning without regularization suffers from catastrophic forgetting, while structural regularization mitigates this issue. Notably, structural regularization achieves comparable performance to joint training with access to both tasks simultaneously. These results highlight the critical role of curvature-aware regularization for continual learning.
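As a concrete illustration of the setting above, the sketch below fits the second of two linear regression tasks with a regularizer $\lambda (w - w_1)^\top \hat{H}_1 (w - w_1)$, where $\hat{H}_1$ keeps only the top-$k$ eigenvectors of the first task's Hessian, so $k$ plays the role of the memory budget (the closed-form solve and top-$k$ truncation are assumptions for illustration, not the paper's exact estimator):

```python
import numpy as np

def fit_second_task(X2, y2, w1, H1_vectors, H1_values, lam=1.0):
    """Ridge-style solution for task 2 with the structural regularizer
    lam * (w - w1)^T H_hat (w - w1), H_hat a low-rank Hessian approximation."""
    H_hat = (H1_vectors * H1_values) @ H1_vectors.T      # sum_i lambda_i v_i v_i^T
    A = X2.T @ X2 / len(X2) + lam * H_hat
    b = X2.T @ y2 / len(X2) + lam * H_hat @ w1
    return np.linalg.solve(A, b)

# Toy usage: memory budget = k vectors from task 1's Hessian X1^T X1 / n1.
rng = np.random.default_rng(0)
d, k = 5, 2
X1, w1 = rng.normal(size=(100, d)), rng.normal(size=d)
vals, vecs = np.linalg.eigh(X1.T @ X1 / 100)
w2 = fit_second_task(rng.normal(size=(100, d)), rng.normal(size=100),
                     w1, vecs[:, -k:], vals[-k:])
print(w2.shape)
```

Increasing k makes the regularizer a better proxy for the first task's loss landscape (better excess risk) at the cost of storing more vectors (worse memory), which is the trade-off the abstract describes.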
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3061. Efficient and Sharp Off-Policy Learning under Unobserved Confounding
๐ฌ 17
โญ 5.33
๐ causal reasoning
๐ท๏ธ policy learning causal inference unobserved confounding partial identification
We develop a novel method for personalized off-policy learning in scenarios with unobserved confounding. Thereby, we address a key limitation of standard policy learning: standard policy learning assumes unconfoundedness, meaning that no unobserved factors influence both treatment assignment and outcomes. However, this assumption is often violated, because of which standard policy learning produces biased estimates and thus leads to policies that can be harmful. To address this limitation, we employ causal sensitivity analysis and derive a semi-parametrically efficient estimator for a sharp bound on the value function under unobserved confounding. Our estimator has three advantages: (1) Unlike existing works, our estimator avoids unstable minimax optimization based on inverse propensity weighted outcomes. (2) Our estimator is semi-parametrically efficient. (3) We prove that our estimator leads to the optimal confounding-robust policy. Finally, we extend our theory to the related task of policy improvement under unobserved confounding, i.e., when a baseline policy such as the standard of care is available. We show in experiments with synthetic and real-world data that our method outperforms simple plug-in approaches and existing baselines. Our method is highly relevant for decision-making where unobserved confounding can be problematic, such as in healthcare and public policy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3062. WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
๐ฌ 16
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ web agent offline web environment benchmark reinforcement learning synthetic data GUI grounding
Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis → knowledge-aware task generation → LLM-powered trajectory collection → decomposed reward RL training → systematic agent evaluation.
Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transferring benchmarks, where our agent also significantly outperforms the base foundation model.
We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3063. ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling
๐ฌ 7
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Process Reward Models
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced with other domains. To address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Centering on \textit{contextual coherence} between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework, which enhances the model's generalization capabilities across diverse domains. For instance, our resulting model, \textbf{ContextPRM}, achieves a notable 6.5\% average accuracy improvement over the majority voting baseline via weighted majority voting across nine non-mathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2\% improvement from VersaPRM and 0.5\% gains from other mathematics-focused PRMs, demonstrating consistent performance across both mathematical and non-mathematical domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3064. Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks
๐ฌ 12
โญ 5.33
๐ learning theory
๐ท๏ธ Classification denoising dimensionality reduction Bayes optimal classifier
The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3065. Benchmarking Multi-Agent Reinforcement Learning in Power Grid Operations
๐ฌ 12
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Multi-agent reinforcement learning benchmark power grids
Improving power grid operations is essential for enhancing flexibility and accelerating grid decarbonization. Reinforcement learning (RL) has shown promise in this domain, most notably through the Learning to Run a Power Network competitions, but prior work has primarily focused on single-agent settings, neglecting the decentralized, multi-agent nature of grid control.
We fill this gap with MARL2Grid, the first benchmark for multi-agent RL (MARL) in power grid operations, developed in collaboration with transmission system operators. Built on RTE France's high-fidelity simulation platform, MARL2Grid supports decentralized control across substations and generators, with configurable agent scopes, observability settings, expert-informed heuristics, and safety-critical constraints.
The benchmark includes a suite of realistic scenarios that expose key challenges, such as coordination under partial information, long-horizon objectives, and adherence to hard physical constraints. Empirical results show that current MARL methods struggle under these real-world conditions. By providing a standardized, extensible platform, MARL2Grid aims to advance the development of scalable, cooperative, and safe learning algorithms for power system operations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3066. VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents
๐ฌ 14
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Web Agent Attack Computer Use-Agent Browser-Use Agent Dataset Benchmark
Computer-Use Agents (CUAs) with full system access enable powerful task automation but pose significant security and privacy risks due to their ability to manipulate files, access user data, and execute arbitrary commands. While prior work has focused on browser-based agents and HTML-level attacks, the vulnerabilities of CUAs remain underexplored. In this paper, we propose an end-to-end threat model where Visual Prompt Injection (VPI) manipulates CUAs in black-box settings to perform unauthorized actions or leak sensitive
information, capturing the entire attack chain from injection to harmful outcomes. Then, we propose VPI-Bench, a benchmark of 306 test cases across five widely used platforms, to evaluate agent robustness under VPI threats. Each test case is a variant of a web platform, designed to be interactive, deployed in a realistic environment, and containing a visually embedded malicious prompt. Our empirical study shows that current CUAs and BUAs can be deceived at rates of up to 51\% and 100\%, respectively, on certain platforms. The experimental results also indicate that existing defense methods offer only limited improvements. These findings highlight the need for robust, context-aware defenses to ensure the safe deployment of multimodal AI agents in real-world environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3067. Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems
๐ฌ 9
โญ 5.33
๐ learning theory
๐ท๏ธ Neural Tangent Kernel Linearization Wide Neural Networks Correlations NTK Weak Correlations
Deep learning models, such as wide neural networks, can be viewed as nonlinear dynamical systems composed of numerous interacting degrees of freedom. When such systems approach the limit of an infinite number of degrees of freedom, their dynamics tend to simplify. This paper investigates gradient descent-based learning algorithms that exhibit linearization in their parameters. We establish that this apparent linearity arises from weak correlations between the first and higher-order derivatives of the hypothesis function with respect to the parameters, at initialization. Our findings indicate that these weak correlations fundamentally underpin the observed linearization phenomenon of wide neural networks. Leveraging this connection, we derive bounds on the deviation from linearity during stochastic gradient descent training. To support our analysis, we introduce a novel technique for characterizing the asymptotic behavior of random tensors. We validate our theoretical insights through empirical studies, comparing the linearized dynamics to the observed correlations.
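As background for the claim above, the linearization in question is the standard first-order expansion of the hypothesis in parameter space (standard notation, not the paper's bound):

```latex
f(x;\theta) \;=\; \underbrace{f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta-\theta_0)}_{\text{linearized (NTK) model}}
\;+\; \underbrace{\tfrac{1}{2}\,(\theta-\theta_0)^{\top}\,\nabla_\theta^{2} f(x;\xi)\,(\theta-\theta_0)}_{\text{remainder}}
```

The paper attributes the smallness of the remainder along training to weak correlations between the first and higher-order derivatives at initialization.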
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3068. Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling
๐ฌ 3
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Saliency Ranking Human Attention Shift Modeling
Salient Object Ranking (SOR) aims to predict human attention shifts across different salient objects in a scene. Although a number of methods have been proposed for the task, they typically rely on modeling the bottom-up influences of image features on attention shifts. In this work, we observe that when free-viewing an image, humans instinctively browse the objects in such a way as to maximize contextual understanding of the image. This implies a cyclical interaction between content (or story) understanding of the image and attention shift over it. Based on this observation, we propose a novel SOR approach that models this explicit top-down cognitive pathway with two novel modules: a story prediction (SP) module and a guided ranking (GR) module. By formulating content understanding as the image caption generation task, the SP module learns to generate and complete the image captions conditioned on the salient object queries of the GR module, while the GR module learns to detect salient objects and their viewing orders guided by the SP module. Extensive experiments on SOR benchmarks demonstrate that our approach outperforms state-of-the-art SOR methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3069. Adaptive Hopfield Network: Rethinking Similarities in Associative Memory
๐ฌ 18
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ similarity measure associative memory Hopfield network
Associative memory models are content-addressable memory systems fundamental to biological intelligence and are notable for their high interpretability.
However, existing models evaluate the quality of retrieval based on proximity, which cannot guarantee that the retrieved pattern has the strongest association with the query and thus fails to ensure correct retrieval.
We reframe this problem by proposing that a query is a generative variant of a stored memory pattern, and define a variant distribution to model this subtle context-dependent generative process.
Consequently, correct retrieval should return the memory pattern with the maximum a posteriori probability of being the query's origin.
This perspective reveals that an ideal similarity measure should approximate the likelihood of each stored pattern generating the query in accordance with the variant distribution, which is impossible for the fixed, pre-defined similarities used by existing associative memories.
To this end, we develop adaptive similarity, a novel mechanism that learns to approximate this unknown likelihood from samples drawn from the context, aiming for correct retrieval.
We theoretically prove that our proposed adaptive similarity achieves optimal correct retrieval under three canonical and widely applicable types of variants: noisy, masked, and biased.
We integrate this mechanism into a novel adaptive Hopfield network (`A-Hop`), and empirical results show that it achieves state-of-the-art performance across diverse tasks, including memory retrieval, tabular classification, image classification, and multiple instance learning.
Our code is publicly available at https://anonymous.4open.science/r/Adaptive-Hopfield-Network-C137/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3070. Differentiable Lifting for Topological Neural Networks
๐ฌ 15
โญ 5.33
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Topological Deep Learning Graph Neural Networks graph classification
Topological neural networks (TNNs) enable leveraging high-order structures on graphs (e.g., cycles and cliques) to boost the expressive power of message-passing neural networks. These structures, however, are typically identified a priori through an unsupervised graph lifting operation. Yet this choice is crucial and may have a drastic impact on a TNN's performance on downstream tasks. To circumvent this issue, we propose
$\partial$lift (DiffLift), a general framework for learning graph liftings to hypergraphs, cellular- and simplicial complexes in an end-to-end fashion. In particular, our approach leverages learned vertex-level latent representations to identify and parameterize distributions over candidate higher-order cells for inclusion. This results in a scalable model which can be readily integrated into any TNN. Our experiments show that $\partial$lift outperforms existing lifting methods on multiple benchmarks for graph and node classification across different TNN architectures. Notably, our approach leads to gains of up to 45% over static liftings, including both connectivity- and feature-based ones.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3071. Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
๐ฌ 15
โญ 5.33
๐ generative models
๐ท๏ธ Subject-Consistent Image Generation Diffusion Model Story Visualization
This paper introduces **Story-Iter**, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external **iterative paradigm**, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free **g**lobal **r**eference **c**ross-**a**ttention (**GRCA**) module, modeling all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step-by-step. Extensive experiments on the official story visualization dataset and our long-story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3072. Sem-MoE: Semantic-aware Model-Data Collaborative Scheduling for Efficient MoE Inference
๐ฌ 31
โญ 5.33
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Mixture of Experts All-to-All Communication Distributed Inference
Prevailing LLM (Large Language Model) serving engines employ expert parallelism (EP) to implement multi-device inference of massive Mixture-of-Experts (MoE) models. However, the efficiency of expert parallel inference is largely bounded by inter-device communication, as EP relies on expensive all-to-all collectives to route tokens to remote experts whenever they are not collocated on the same GPU/NPU device. Nevertheless, state-of-the-art schemes treat expert device-placement and request (or token) device-scheduling as separate concerns, triggering excessive cross-device communication and compromising inference efficiency.
This paper proposes Sem-MoE, a novel \textbf{model-data} collaborative scheduling framework to minimize the steep communication costs in EP-centric MoE serving. Sem-MoE maximally collocates experts and their activating tokens onto the same device using proactively modeled activation likelihood between them and introduces three key techniques: (1) Offline model scheduling, which preliminarily clusters and collocates experts onto devices based on their co-activation tendencies for certain classes of input. (2) Online inter-request data scheduling for Attention-DP setups, which proactively rebatches incoming requests onto the device that hosts experts most likely and frequently activated by the corresponding requests. (3) Online intra-request data scheduling for Attention-TP setups, which seamlessly fuses a token reshuffling procedure into the original inference pipeline and proactively reschedules tokens to devices to reduce dispersed remote routing. We build Sem-MoE into a prevailing LLM serving engine SGLANG. Experiments show our collaborative scheduling approach can effectively reduce the all-to-all communication volume in EP and achieve superior inference throughput compared to existing solutions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3073. Measuring Uncertainty Calibration
๐ฌ 12
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ calibration classification
We make two contributions to the problem of estimating the $L_1$ calibration error of a binary classifier from a finite dataset. First, we provide an upper bound for any classifier where the calibration function has bounded variation. Second, we provide a method of modifying any classifier so that its calibration error can be upper bounded efficiently without significantly impacting classifier performance and without any restrictive assumptions. All our results are non-asymptotic and distribution-free. We conclude by providing advice on how to measure calibration error in practice. Our methods yield practical procedures that can be run on real-world datasets with modest overhead.
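For context, the quantity being bounded is the $L_1$ calibration error; a standard plug-in (binned) estimate of it looks like the following sketch (illustrative only, not the paper's estimator or its classifier modification):

```python
import numpy as np

def binned_l1_calibration_error(probs, labels, n_bins=15):
    """Plug-in estimate of the L1 calibration error E|E[Y|p] - p|
    using equal-width bins (a standard ECE-style estimator)."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err, n = 0.0, len(probs)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs <= hi) if hi == 1.0 else (probs >= lo) & (probs < hi)
        if mask.any():
            err += mask.sum() / n * abs(labels[mask].mean() - probs[mask].mean())
    return err

print(binned_l1_calibration_error([0.1, 0.4, 0.8, 0.9], [0, 1, 1, 1]))
```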
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3074. Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models
๐ฌ 3
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ Backdoor Defense Anomaly Detection Gradient-Based Attribution Attention Mechanisms Explainability Pre-trained Language Models
Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage, but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs, where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
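A toy sketch of the kind of token-level score the abstract describes, combining attention mass and gradient attribution; the normalize-and-sum rule and the array inputs are assumptions for illustration, not the paper's exact scoring function:

```python
import numpy as np

def token_anomaly_scores(attn_to_cls, grad_saliency):
    """Combine per-token attention mass and gradient attribution into an anomaly
    score: z-normalize each signal and sum them, so a token that dominates both
    signals stands out."""
    def z(x):
        x = np.asarray(x, float)
        return (x - x.mean()) / (x.std() + 1e-8)
    return z(attn_to_cls) + z(grad_saliency)

# Toy usage: token 3 (a hypothetical trigger) dominates both signals.
attn = [0.05, 0.04, 0.06, 0.70, 0.05, 0.10]
grad = [0.10, 0.08, 0.12, 0.55, 0.05, 0.10]
scores = token_anomaly_scores(attn, grad)
print(scores.argmax(), scores.round(2))
```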
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3075. OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
๐ฌ 10
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ reasoning llms overthinking underthinking evaluation benchmark
Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. In this work, we introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple general queries in 72 domains along with simple math problems, and UnderthinkingBench, containing 11 challenging reasoning tasks along with tough math problems. Using novel thinking-adjusted accuracy metrics, we perform an extensive evaluation of 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models ``underthink'', often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3076. Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insights
๐ฌ 12
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Large Language Model Task Vector In-context Learning
Large Language Models (LLMs) can perform new tasks from in-context demonstrations, a phenomenon known as in-context learning (ICL). Recent work suggests that these demonstrations are compressed into task vectors (TVs), compact task representations that LLMs exploit for predictions. However, prior studies typically extract TVs from model outputs or hidden states using cumbersome and opaque methods, and they rarely elucidate the mechanisms by which TVs influence computation. In this work, we address both limitations. First, we propose directly training Learned Task Vectors (LTVs), which surpass extracted TVs in accuracy and exhibit superior flexibility, acting effectively at arbitrary layers, positions, and even with ICL prompts. Second, through systematic analysis, we investigate the mechanistic role of TVs, showing that at the low level they steer predictions primarily through attention-head OV circuits, with a small subset of ``key heads'' most decisive. At a higher level, we find that despite Transformer nonlinearities, TV propagation is largely linear: early TVs are rotated toward task-relevant subspaces to improve logits of relevant labels, while later TVs are predominantly scaled in magnitude. Taken together, LTVs not only provide a practical approach for obtaining effective TVs but also offer a principled lens into the mechanistic foundations of ICL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3077. PAC-Bayes bounds for cumulative loss in Continual Learning
๐ฌ 10
โญ 5.33
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning PAC-Bayes Generalization bounds Lifelong Learning
In continual learning, knowledge must be preserved and re-used between tasks, requiring a balance between maintaining
good transfer to future tasks and minimizing forgetting of previously learned ones. As several practical algorithms have been
devised to address the continual learning setting, the natural question of providing reliable risk certificates has also been raised.
Although there are results for specific settings and algorithms on the behavior of memory stability, generally applicable upper bounds on learning plasticity are few and far between.
In this work, we extend existing PAC-Bayes bounds for online learning and time-uniform offline learning to the continual learning
setting. We derive general upper bounds on the cumulative generalization loss applicable for any task distribution and learning
algorithm as well as oracle bounds for Gibbs posteriors and compare their effectiveness for several different
task distributions. We demonstrate empirically that our approach yields non-vacuous bounds for several continual learning
problems in vision, as well as tight oracle bounds on linear regression tasks. To the best of our knowledge, this is the first general upper bound on learning plasticity for continual learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3078. FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization
๐ฌ 6
โญ 5.33
๐ datasets and benchmarks
๐ท๏ธ Combinatorial Optimization Graph Neural Networks Large Language Models
Machine learning (ML) has shown promise for tackling combinatorial optimization (CO), but much of the reported progress relies on small-scale, synthetic benchmarks that fail to capture real-world structure and scale. A core limitation is that ML methods are typically trained and evaluated on synthetic instance generators, leaving open how they perform on irregular, competition-grade, or industrial datasets. We present FrontierCO, a benchmark for evaluating ML-based CO solvers under real-world structure and extreme scale. FrontierCO spans eight CO problems, including routing, scheduling, facility location, and graph problems, with instances drawn from competitions and public repositories (e.g., DIMACS, TSPLib). Each task provides both easy sets (historically challenging but now solvable) and hard sets (open or computationally intensive), alongside standardized training/validation resources. Using FrontierCO, we evaluate 16 representative ML solvers---graph neural approaches, hybrid neural-symbolic methods, and LLM-based agents---against state-of-the-art classical solvers. We find a persistent performance gap that widens under structurally challenging and large instance sizes (e.g., TSP up to 10M nodes; MIS up to 8M), while also identifying cases where ML methods outperform classical solvers. By centering evaluation on real-world structure and orders-of-magnitude larger instances, FrontierCO provides a rigorous basis for advancing ML for CO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3079. MILPnet: A Multi-Scale Architecture with Geometric Feature Sequence Representations for Advancing MILP Problems
๐ฌ 15
โญ 5.33
๐ optimization
๐ท๏ธ MILP;optimal
We propose MILPnet, a multi-scale hybrid attention framework that models Mixed Integer Linear Programming (MILP) problems as geometric sequences rather than graphs. This approach directly addresses the challenge of Foldable MILP instances, a class of problems that graph-based models, specifically Graph Neural Networks (GNNs), fail to distinguish due to expressiveness limits imposed by the Weisfeiler-Lehman test. By representing MILPs through sequences of constraint and objective features, MILPnet captures both local and global geometric structure using a theoretically grounded multi-scale attention mechanism. We theoretically prove that MILPnet can approximate feasibility, optimal objective value, and optimal solution mappings over a measurable topological space with arbitrarily small error. Empirically, MILPnet outperforms graph-based methods by multiple orders of magnitude in feasibility prediction accuracy and convergence speed on Foldable MILPs, while using significantly fewer parameters. It also generalizes effectively across problem scales and demonstrates strong performance on real-world MILP benchmarks when integrated into an end-to-end solver pipeline. Our code is available at https://anonymous.4open.science/r/MILPnet-2BD1/
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3080. Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies
๐ฌ 13
โญ 5.33
๐ causal reasoning
๐ท๏ธ Causal Abstraction Causal Representation Learning Reinforcement Learning Explainable AI
Why do reinforcement learning (RL) policies fail or succeed?
This is a challenging question due to the complex, high-dimensional nature of agent-environment interactions.
We take a causal perspective on explaining the global behavior of RL policies by viewing the states, actions, and rewards as variables in a low-level causal model.
We introduce random perturbations to policy actions during execution and observe their effects on the cumulative reward, learning a simplified high-level causal model that explains these relationships.
To this end, we develop a nonlinear Causal Model Reduction framework that ensures approximate interventional consistency, i.e., the simplified high-level model responds to interventions in a way consistent with the original complex system. We prove that for a class of nonlinear causal models, there exists a unique solution that achieves exact interventional consistency, ensuring learned explanations reflect meaningful causal patterns.
Experiments on both synthetic causal models and practical RL tasks, including pendulum control and robot table tennis, demonstrate that our approach can uncover important behavioral patterns, biases, and failure modes in trained RL policies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3081. Laplacian Multi-scale Flow Matching for Generative Modeling
๐ฌ 19
โญ 5.33
๐ generative models
๐ท๏ธ flow matching multi-scale generative modeling image generation
In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. Unlike previous cascaded approaches that require explicit renoising between scales, our model generates multi-scale representations in parallel, eliminating the need for bridging processes. The proposed multi-scale architecture not only improves generation quality but also accelerates the sampling process and promotes scaling flow matching methods. Through extensive experimentation on CelebA-HQ and ImageNet, we demonstrate that our method achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines. The proposed model scales effectively to high-resolution generation (up to 1024×1024) while maintaining lower computational overhead.
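A small sketch of the Laplacian-pyramid decomposition such a model operates on, with deliberately simple 2x average-pool and nearest-neighbour filters (the paper's filters and the MoT architecture are not reproduced here):

```python
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Decompose an image (H, W) into per-scale Laplacian residuals plus a coarse base."""
    down = lambda x: x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))
    up = lambda x: x.repeat(2, axis=0).repeat(2, axis=1)
    residuals, cur = [], img
    for _ in range(levels):
        low = down(cur)
        residuals.append(cur - up(low))   # high-frequency detail at this scale
        cur = low
    return residuals, cur                 # residuals (fine -> coarse), coarse base

def reconstruct(residuals, base):
    up = lambda x: x.repeat(2, axis=0).repeat(2, axis=1)
    for r in reversed(residuals):
        base = up(base) + r
    return base

img = np.random.rand(64, 64)
res, base = laplacian_pyramid(img)
print(np.allclose(reconstruct(res, base), img))   # exact for these filters
```

Each residual isolates the detail at one scale, which is what allows the scales to be modelled in parallel rather than through an explicit cascade.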
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3082. Tuning the burn-in phase in training recurrent neural networks improves their performance
๐ฌ 15
โญ 5.33
๐ learning theory
๐ท๏ธ RNN training learning theory optimization
Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series forecasting. In all experiments, we observe a strong influence of the burn-in phase on the training process, and proper tuning can lead to a reduction of the prediction error on the training and test data of more than 60% in some cases.
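A minimal PyTorch-style sketch of the truncated-BPTT setup being analysed, where the burn-in prefix only warms up the hidden state and is excluded from the loss; the model, segment length, and burn-in length are illustrative assumptions:

```python
import torch
import torch.nn as nn

def tbptt_step(rnn, head, loss_fn, segment, targets, h0, burn_in):
    """One truncated-BPTT gradient computation on a segment: the first `burn_in`
    steps only warm up the hidden state; loss and gradients use the remaining steps."""
    with torch.no_grad():                       # burn-in: no gradient through the prefix
        _, h = rnn(segment[:, :burn_in], h0)
    out, h = rnn(segment[:, burn_in:], h)
    loss = loss_fn(head(out), targets[:, burn_in:])
    loss.backward()
    return loss.item(), h.detach()              # carry the state to the next segment

# Toy usage (assumed shapes: batch=8, segment length=32, burn-in=8).
rnn, head = nn.GRU(4, 16, batch_first=True), nn.Linear(16, 1)
x, y = torch.randn(8, 32, 4), torch.randn(8, 32, 1)
loss, h = tbptt_step(rnn, head, nn.MSELoss(), x, y, torch.zeros(1, 8, 16), burn_in=8)
print(loss)
```

Tuning `burn_in` trades off how much of each segment is spent stabilizing the hidden state versus contributing gradient signal, which is the knob the paper analyses.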
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3083. MoCa: Modeling Object Consistency for 3D Camera Control in Video Generation
๐ฌ 13
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Text-To-Video Camera-Control Video Generation Generative Model
Camera control is important in text-to-video generation for achieving realistic scene navigation and view synthesis. This control is defined by parameters that describe movement through 3D space, thereby introducing a 3D consistency into the generation process. A core challenge for existing methods is achieving 3D consistency within the 2D pixel domain. Strategies that directly integrate camera conditions into text-to-video models often produce artifacts, while those relying on explicit 3D supervision face generalization issues. Both limitations originate from the gap between the 2D pixel space and the underlying 3D world. The key insight is that the projection of a smooth 3D camera movement produces consistency in object view, appearance, and motion across 2D frames. Inspired by this insight, we propose MoCa, a dual-branch framework that bridges this gap by modeling object consistency to implicitly learn 3D relationships between camera and scene. To ensure view consistency, we design a Spatial-Temporal Camera encoder with Plücker embedding, which encodes camera trajectories into a geometrically grounded latent representation. For appearance consistency, we introduce a semantic guidance strategy that leverages persistent vision-language features to maintain object identity and texture across frames. To address motion consistency, we propose an object-aware motion disentanglement mechanism that separates object dynamics from global camera movement, ensuring precise camera control and natural object motion. Experiments show that MoCa achieves accurate camera control while preserving video quality, offering a practical and effective solution for camera-controllable video synthesis.
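For reference, a Plücker ray embedding of the kind mentioned above can be computed per pixel from intrinsics and a camera-to-world pose as in this sketch (standard construction; not MoCa's encoder):

```python
import numpy as np

def plucker_rays(K, c2w, H, W):
    """Per-pixel Pluecker ray embedding (d, o x d) for a pinhole camera,
    given intrinsics K (3x3) and camera-to-world pose c2w (4x4)."""
    i, j = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([i, j, np.ones_like(i)], -1).reshape(-1, 3)   # homogeneous pixel coords
    dirs = (pix @ np.linalg.inv(K).T) @ c2w[:3, :3].T            # rays in world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origins = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moment = np.cross(origins, dirs)                             # o x d
    return np.concatenate([dirs, moment], -1).reshape(H, W, 6)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
emb = plucker_rays(K, np.eye(4), H=64, W=64)
print(emb.shape)   # (64, 64, 6)
```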
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3084. Emergent Misalignment is Easy, Narrow Misalignment is Hard
๐ฌ 8
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ Emergent Misalignment Interpretability Safety Alignment Model Organisms
Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically `evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases, and find that although models can learn the narrow dataset task, the general solution is measurably more stable and more efficient. To establish this, we first demonstrate that EM is a robust phenomenon by introducing new datasets which induce misalignment more consistently and coherently than prior work. We show that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. However, a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and metrics for understanding how inductive biases shape generalisation in LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3085. Ghost in the Cloud: Your Geo-Distributed Large Language Models Training is Easily Manipulated
๐ฌ 24
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreak attack Geo-distributed LLM Training Federated Learning Large Language Models
Geo-distributed training and Federated Learning (FL) provide viable solutions to address the substantial data and computational resource needs associated with training large language models (LLMs). However, we empirically demonstrate that a single adversarial participant can significantly compromise the safety alignment of LLMs through malicious training, exposing serious security risks.
We identify two existing server-side defense strategies that effectively counter naive jailbreak attacks: Task Performance Check (TPC), which filters out model updates with low downstream performance, and Malicious Output Scrutiny (MOS), which detects harmful outputs by prompting uploaded models with malicious queries.
To evade both defenses, we design a trigger-based jailbreak variant that preserves downstream performance using a novel regularization method to limit the excessive model updates on jailbreak datasets. We further conceal malicious triggers by mixing the malicious dataset with pseudo-contrastive safety-aligned answers to maintain the original safety alignment.
Experiments on three widely-used safety-aligned LLMs show that a single adversarial participant can implant triggers into the global model without degrading downstream performance, achieving an 80\% attack success rate (ASR) while keeping the detection true rate (DTR) as low as 7\%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3086. Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning
๐ฌ 3
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Optimal Control Deep Reachability
Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists, and learns a policy to solve the reachability problem over this set of initial conditions. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method for challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3087. OccDriver: Future Occupancy Guided Dual-branch Trajectory Planner in Autonomous Driving
๐ฌ 4
โญ 5.33
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous Driving Trajectory Planning
Trajectory planning for autonomous driving is challenging due to agents' behavioral uncertainty and intricate multi-agent interaction modeling. Most existing studies generate trajectories without explicitly exploiting possible scene evolution, while world models predict consequences from ego behavior, enabling more informed planning decisions. Inspired by the world model, we propose OccDriver, a novel rasterized-to-vectorized dual-branch framework for trajectory planning. This pipeline performs a coarse-to-fine trajectory decoding process: the vectorized branch first generates multimodal coarse trajectories; the rasterized branch then predicts future scene evolutions conditioned on each coarse trajectory via occupancy flow prediction; lastly, the vectorized branch leverages the intuitive future interaction evolution of each modality from the rasterized branch and produces refined trajectories. Several cross-modality (occupancy and trajectory) losses are further introduced to improve the consistency between trajectory and occupancy prediction. Additionally, we apply a contingency objective in occupancy space, considering both marginal and joint occupancy distributions over different planning scopes. Our model is assessed on the large-scale real-world nuPlan dataset and its associated planning benchmark. Experiments show that OccDriver achieves state-of-the-art results in both Non-Reactive and Reactive closed-loop performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3088. AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
๐ฌ 10
โญ 5.33
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Multi-agent framework Bayesian inference LLM AI-for-Science
Large Language Models (LLMs) demonstrate potential for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3089. Programming with Pixels: Can Computer-Use Agents do Software Engineering?
๐ฌ 21
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ computer-use agents evaluation benchmark code-generation multimodal
Computer-use agents (CUAs) hold the promise of performing a wide variety of general tasks, but current evaluations have primarily focused on simple scenarios.
It therefore remains unclear whether such generalist agents can automate more sophisticated and specialized work such as software engineering (SWE).
To investigate this, we introduce Programming with Pixels (PwP), the first comprehensive computer-use environment for software engineering, where agents visually control an IDE to perform diverse software engineering tasks.
To enable holistic evaluation, we also introduce PwP-Bench, a benchmark of 15 existing and new software-engineering tasks spanning multiple modalities, programming languages, and skillsets.
We perform an extensive evaluation of state-of-the-art open-weight and closed-weight CUAs and find that when interacting purely visually, they perform significantly worse than specialized coding agents.
However, when the same CUAs are given direct access to just two APIs (file editing and bash operations), performance jumps, often reaching the levels of specialized agents despite having a task-agnostic design.
Furthermore, when given access to additional IDE tools via text APIs, all models show further gains.
Our analysis shows that current CUAs fall short mainly due to limited visual grounding and the inability to take full advantage of the rich environment, leaving clear room for future improvements.
PwP establishes software engineering as a natural domain for benchmarking whether generalist computer-use agents can reach specialist-level performance on sophisticated tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3090. Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization
๐ฌ 13
โญ 5.33
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Sound Source Localization Geometry-Invariant Grid-Flexible Representation Learning Physics-Informed Design Learnable Non-uniform DFT Relative Microphone Positional Encoding
Sound source localization (SSL) is a fundamental task for spatial audio understanding, yet most deep neural network-based methods are constrained by fixed array geometries and predefined directional grids, limiting generalizability and scalability. To address these issues, we propose _audio-geometry-grid representation learning_ (AGG-RL), a novel framework that jointly learns audio-geometry and grid representations in a shared latent space, enabling both geometry-invariant and grid-flexible SSL. Moreover, to enhance generalizability and interpretability, we introduce two physics-informed components: a _learnable non-uniform discrete Fourier transform_ (LNuDFT), which optimizes the dense allocation of frequency bins in a non-uniform manner to emphasize informative phase regions, and a _relative microphone positional encoding_ (rMPE), which encodes relative microphone coordinates in accordance with the nature of inter-channel time differences. Experiments on synthetic and real datasets demonstrate that AGG-RL achieves superior performance, particularly under unseen conditions. The results highlight the potential of representation learning with physics-informed design towards a universal solution for spatial acoustic scene understanding across diverse scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3091. Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion
๐ฌ 17
โญ 5.33
๐ generative models
๐ท๏ธ Diffusion models Conditional generation Tabular diffusion Manifold learning
Generating tabular data under conditions is critical to applications requiring precise control over the generative process. Existing methods rely on training-time strategies that do not generalise to unseen constraints during inference, and struggle to handle conditional tasks beyond tabular imputation. While manifold theory offers a principled way to guide generation, current formulations are tied to specific inference-time objectives and are limited to continuous domains. We extend manifold theory to tabular data and expand its scope to handle diverse inference-time objectives. On this foundation, we introduce Harpoon, a tabular diffusion method that guides unconstrained samples along the manifold geometry to satisfy diverse tabular conditions at inference. We validate our theoretical contributions empirically on tasks such as imputation and enforcing inequality constraints, demonstrating Harpoon's strong performance across diverse datasets and the practical benefits of manifold-aware guidance for tabular data. Code URL: https://anonymous.4open.science/r/ManifoldTabularImputation-44E4/
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3092. Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts
๐ฌ 3
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning; Cross-domain Policy Adaptation
Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative with respect to out-of-distribution dynamics transitions, thus guaranteeing train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical Dual-RObust Cross-domain Offline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3093. Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift
๐ฌ 8
โญ 5.33
๐ generative models
๐ท๏ธ Text-to-Image Diffusion Models Personalization Overfitting Distributional Drift Regularization Lipschitz Constraints
Personalizing text-to-image diffusion models involves integrating novel visual concepts from a small set of reference images while retaining the model's original generative capabilities. However, this process often leads to overfitting, where the model ignores the user's prompt and merely replicates the reference images. We attribute this issue to a fundamental misalignment between the true goals of personalization, which are subject fidelity and text alignment, and the training objectives of existing methods that fail to enforce both objectives simultaneously. Specifically, prior approaches often overlook the need to explicitly preserve the pretrained model's output distribution, resulting in distributional drift that undermines diversity and coherence. To resolve these challenges, we introduce a Lipschitz-based regularization objective that constrains parameter updates during personalization, ensuring bounded deviation from the original distribution. This promotes consistency with the pretrained model's behavior while enabling accurate adaptation to new concepts. Furthermore, our method offers a computationally efficient alternative to commonly used, resource-intensive sampling techniques. Through extensive experiments across diverse diffusion model architectures, we demonstrate that our approach achieves superior performance in both quantitative metrics and qualitative evaluations, consistently excelling in visual fidelity and prompt adherence. We further support these findings with comprehensive analyses, including ablation studies and visualizations.
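The core idea lends itself to a compact illustration. Below is a minimal sketch assuming a plain L2 penalty on per-parameter deviation from the pretrained weights as a stand-in for the Lipschitz-based constraint; the penalty weight and where the term is added are illustrative choices, not the authors' exact objective.

```python
# Minimal sketch (assumed form): penalize squared deviation from the pretrained weights
# so the personalized model cannot drift arbitrarily far from its original behavior.
import torch
import torch.nn as nn

def deviation_penalty(model, pretrained_state, lam=1e-2):
    """lam * sum over trainable parameters of ||theta - theta_pretrained||^2."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if p.requires_grad:
            penalty = penalty + (p - pretrained_state[name]).pow(2).sum()
    return lam * penalty

model = nn.Linear(8, 8)  # stand-in for a diffusion backbone
pretrained_state = {k: v.detach().clone() for k, v in model.named_parameters()}
# inside a fine-tuning step: loss = personalization_loss + deviation_penalty(model, pretrained_state)
print(float(deviation_penalty(model, pretrained_state)))  # 0.0 before any update
```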
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3094. Algorithmic Guarantees for Distilling Supervised and Offline RL Datasets
๐ฌ 11
โญ 5.33
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ dataset distillation supervised learning offline RL learning theory
Given a training dataset, the goal of dataset distillation is to derive a synthetic dataset such that models trained on the latter perform as well as those trained on the training dataset. In this work, we develop and analyze an efficient dataset distillation algorithm for supervised learning, specifically regression in $\mathbb{R}^d$, based on matching the losses on the training and synthetic datasets with respect to a fixed set of randomly sampled regressors without any model training. Our first key contribution is a novel performance guarantee proving that our algorithm needs only $\tilde{O}(d^2)$ sampled regressors to derive a synthetic dataset on which the MSE loss of any bounded linear model is approximately the same as its MSE loss on the given training data. In particular, the model optimized on the synthetic data has close to minimum loss on the training data, thus performing nearly as well as the model optimized on the latter. Complementing this, we also prove a matching lower bound of $\Omega(d^2)$ for the number of sampled regressors showing the tightness of our analysis.
Our second contribution is to extend our algorithm to offline RL dataset distillation by matching the Bellman loss, unlike previous works which used a behavioral cloning objective. This is the first such method that leverages both the rewards and the next-state information available in offline RL datasets, without any policy model optimization. We show similar guarantees: our algorithm generates a synthetic dataset whose Bellman loss with respect to any linear action-value predictor is close to the latter's Bellman loss on the offline RL training dataset. Therefore, a policy associated with an action-value predictor optimized on the synthetic dataset performs nearly as well as that derived from the one optimized on the training data. We conduct extensive experiments to validate our theoretical guarantees and observe performance gains on real-world RL environments with offline training datasets and supervised regression datasets.
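The supervised variant of the loss-matching idea is easy to sketch. The snippet below is a toy illustration, not the authors' implementation: the regressor sampling distribution, the synthetic-set size, and the use of Adam are assumptions made for the demo.

```python
# Toy sketch: optimize a small synthetic dataset so that a fixed set of randomly
# sampled linear regressors sees (approximately) the same MSE on synthetic and real data.
import torch

torch.manual_seed(0)
d, n, n_syn, m = 8, 512, 32, 256
X = torch.randn(n, d)
w_true = torch.randn(d)
y = X @ w_true + 0.1 * torch.randn(n)

W = torch.randn(m, d)                                  # fixed, randomly sampled regressors
target = ((X @ W.T - y[:, None]) ** 2).mean(dim=0)     # each regressor's MSE on real data

Xs = torch.randn(n_syn, d, requires_grad=True)
ys = torch.randn(n_syn, requires_grad=True)
opt = torch.optim.Adam([Xs, ys], lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    syn = ((Xs @ W.T - ys[:, None]) ** 2).mean(dim=0)  # each regressor's MSE on synthetic data
    loss = ((syn - target) ** 2).mean()                # match the two loss profiles
    loss.backward()
    opt.step()
print(float(loss))  # small value: the synthetic set mimics the training losses
```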
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3095. DirMoE: Dirichlet-Routed Mixture of Experts
๐ฌ 17
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ MoE Mixture of experts sparsity interpretability
Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.
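The selection half of the router relies on a standard relaxation. Below is a minimal sketch of a Gumbel-Sigmoid (binary-concrete) gate with a straight-through option; the temperature and the expected-activation term are illustrative, and the Dirichlet contribution component is omitted.

```python
# Sketch of a differentiable per-expert selection gate via the Gumbel-Sigmoid relaxation.
import torch

def gumbel_sigmoid(logits, tau=1.0, hard=False):
    """Relaxed per-expert Bernoulli sample, differentiable w.r.t. the logits."""
    u = torch.rand_like(logits).clamp_(1e-6, 1 - 1e-6)
    g = torch.log(u) - torch.log1p(-u)          # Logistic(0, 1) noise
    y = torch.sigmoid((logits + g) / tau)       # soft selection in (0, 1)
    if hard:                                    # straight-through: hard forward, soft backward
        y = (y > 0.5).float() + y - y.detach()
    return y

logits = torch.randn(2, 8, requires_grad=True)  # (batch, num_experts)
gate = gumbel_sigmoid(logits, tau=0.5, hard=True)
expected_active = torch.sigmoid(logits).sum(dim=-1).mean()  # target of a sparsity penalty
print(gate, float(expected_active))
```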
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3096. Single-Loop Byzantine-Resilient Federated Bilevel Optimization
๐ฌ 14
โญ 5.33
๐ optimization
๐ท๏ธ Bilevel optimization Federated learning Byzantine robustness
Federated bilevel optimization plays a crucial role in solving complex problems with nested optimization structures. However, its distributed nature makes it highly susceptible to faulty or Byzantine behaviors. Existing Byzantine-resilient approaches are either restricted to simple single-level optimization problems or rely on sub-loop updates that introduce significant computational and communication overhead. To address these limitations, we propose a family of Byzantine-resilient federated bilevel algorithms, which (i) operate within a single-loop structure, (ii) achieve optimal Byzantine resilience, and (iii) ensure computational and communication efficiency. The core of the proposed method, BR-FedBi, leverages an auxiliary variable that facilitates efficient hypergradient estimation while simultaneously solving the lower- and upper-level problems. Building on BR-FedBi, we further integrate the algorithm with Polyak's momentum and the probabilistic gradient estimator (PAGE) (Li et al., 2021), resulting in provably optimal Byzantine resilience and optimal sample complexity. Both theoretical analysis and empirical results demonstrate the superior performance of the proposed algorithms. Our code repository is available at https://anonymous.4open.science/r/code_ICLR_11941/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3097. Deep-ICE: The first globally optimal algorithm for empirical risk minimization of two-layer maxout and ReLU networks
๐ฌ 8
โญ 5.33
๐ learning theory
๐ท๏ธ Neural network Global optimal Algorithm design Combinatorial optimization
This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a heuristic method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance in both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3098. Beyond Outliers: A Study of Optimizers Under Quantization
๐ฌ 13
โญ 5.33
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs Quantization Optimizers post-training quantization quantization-aware training
As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer-quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ) and quantization-aware training (QAT). We first train full-precision models, ranging from 50M to 1.5B parameters, with six optimizers to explore the hyperparameter landscape and establish well-tuned baselines.
We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and Kurtosis, fail to predict the PTQ performance across different optimizers. We show analytically that this is due to the MMR capturing only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study the QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers performing well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.
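The two outlier statistics mentioned above are simple to compute per weight tensor; the sketch below uses one common normalization, which may differ in detail from the paper's definitions.

```python
# Illustrative per-tensor outlier metrics: max-to-mean ratio (MMR) and kurtosis.
import torch

def max_to_mean_ratio(w):
    a = w.detach().abs().flatten()
    return float(a.max() / a.mean())

def kurtosis(w):
    x = w.detach().flatten().float()
    x = x - x.mean()
    return float((x ** 4).mean() / (x ** 2).mean() ** 2)

w = torch.randn(1024, 1024)                 # stand-in for a trained layer's weight matrix
print(max_to_mean_ratio(w), kurtosis(w))    # kurtosis is ~3.0 for Gaussian weights
```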
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3099. Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability
๐ฌ 9
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ In-context Learning Curriculum Learning Interpretability Transformers Pseudo-Random Number Generators
We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, on tasks beyond the reach of published classical attacks. In our experiments we scale moduli up to $2^{22}$, using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find that even when the output is truncated to a single bit, the model can still reliably predict it.
When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations.
We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$.
For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning.
Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the top principal components spontaneously group the integers into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
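For reference, the widely used PCG-XSH-RR 64/32 member of this family ("PCG32") looks as follows; the paper studies scaled-down moduli, but the structure is the same: an LCG state update followed by an output permutation (xorshift plus data-dependent rotation).

```python
# Reference PCG32 (PCG-XSH-RR 64/32) generator, usable to produce training sequences.
MASK64 = (1 << 64) - 1
MULT = 6364136223846793005

def pcg32_stream(seed, seq, n):
    state = seed & MASK64
    inc = ((seq << 1) | 1) & MASK64                       # increment must be odd
    out = []
    for _ in range(n):
        old = state
        state = (old * MULT + inc) & MASK64               # LCG state update
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF   # XSH: xorshift of high bits
        rot = old >> 59                                   # RR: rotate by top 5 bits
        out.append(((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF)
    return out

print(pcg32_stream(seed=42, seq=54, n=5))
```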
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3100. Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
๐ฌ 14
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Model Alignment Direct Preference Optimization
Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefits of on-policy sampling in relation to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3101. Who Matters Matters: Agent-Specific Conservative Offline MARL
๐ฌ 11
โญ 5.33
๐ reinforcement learning
๐ท๏ธ Offline reinforcement learning Reinforcement learning
Offline Multi-Agent Reinforcement Learning (MARL) enables policy learning from static datasets in multi-agent systems, eliminating the need for risky or costly environment interactions during training. A central challenge in offline MARL lies in achieving effective collaboration among heterogeneous agents under the constraints of fixed datasets, where **conservatism** is introduced to restrict behaviors to data-supported distributions. Agents with distinct roles and capabilities require individualized conservatism, yet must maintain cohesive team performance. However, existing approaches often apply uniform conservatism across all agents, leading to over-constraining critical agents and under-constraining others, which hampers effective collaboration.
To address this issue, a novel framework, **OMCDA**, is proposed, where the degree of conservatism is dynamically adjusted for individual agents based on their impact on overall system performance. The framework is characterized by two key innovations: (1) A decomposed Q-function architecture is introduced to disentangle return computation from policy deviation assessment, allowing precise evaluations of each agent's contribution; and (2) An adaptive conservatism mechanism is developed to scale constraint strength according to both behavior policy divergence and the estimated importance of agents to the system.
Experiments on MuJoCo and SMAC show OMCDA outperforms existing offline MARL methods, effectively balancing the flexibility and conservatism across agents while ensuring fair credit assignment and better collaboration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3102. WRING Out The Bias: A Rotation-Based Alternative To Projection Debiasing
๐ฌ 7
โญ 5.33
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Vision Language Models VLMs bias debias
Vision-Language models (VLMs), including CLIP, are known to encode biases such as learning spurious correlations that falsely associate background attributes with particular labels. Debiasing approaches typically aim to isolate and remove subspaces corresponding to a target concept by projecting its embedding away from the concept. This strategy succeeds in debiasing VLM embeddings with respect to the concepts considered but can amplify biased shortcuts in unconsidered concepts. In practice, it is impossible to enumerate all possible biases, meaning that an increase in bias can go unobserved during evaluation. We propose a debiasing approach for a set of known concepts such that the relation to the remaining, unconsidered, concepts is minimally changed. We achieve this by rotating the VLM's embeddings within only a relevant subspace, rather than removing these subspaces, which mitigates unintended bias amplification.
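The contrast between projecting a concept out and rotating within a concept subspace can be seen in a small numeric example. The sketch below is illustrative only: it assumes a single 2D concept plane and a hand-picked rotation angle, whereas the paper learns the relevant subspace and transformation from data.

```python
# Projection debiasing vs. an in-subspace rotation on random embeddings.
import numpy as np

rng = np.random.default_rng(0)
d = 16
c1, c2 = rng.normal(size=d), rng.normal(size=d)

u1 = c1 / np.linalg.norm(c1)              # orthonormal basis (u1, u2) of the concept plane
u2 = c2 - (c2 @ u1) * u1
u2 /= np.linalg.norm(u2)

def project_out(x, u):
    """Classical debiasing: remove the component along concept direction u."""
    return x - (x @ u) * u

def rotate_in_plane(x, u1, u2, theta):
    """Rotate x by theta inside span{u1, u2}; the orthogonal complement is untouched."""
    a, b = x @ u1, x @ u2
    x_perp = x - a * u1 - b * u2
    a_new = np.cos(theta) * a - np.sin(theta) * b
    b_new = np.sin(theta) * a + np.cos(theta) * b
    return x_perp + a_new * u1 + b_new * u2

x = rng.normal(size=d)
print(np.linalg.norm(project_out(x, u1)),
      np.linalg.norm(rotate_in_plane(x, u1, u2, 0.3)),
      np.linalg.norm(x))
# The rotation preserves the embedding norm; the projection does not.
```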
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3103. Attribution-Guided Decoding
๐ฌ 10
โญ 5.33
๐ interpretability and explainable AI
๐ท๏ธ decoding steering feature attribution mechanistic interpretability explainable AI instruction following factuality language model generation
The capacity of Large Language Models (LLMs) to follow complex instructions and generate factually accurate text is critical for their real-world application. However, standard decoding methods often fail to robustly satisfy these requirements, while existing control techniques frequently degrade general output quality.
In this work, we introduce Attribution-Guided Decoding (AGD), an interpretability-based decoding strategy.
Instead of directly manipulating model activations, AGD considers a set of high-probability output token candidates and selects the one that exhibits the highest attribution to a user-defined Region of Interest (ROI).
This ROI can be flexibly defined over different parts of the model's input or internal components, allowing AGD to steer generation towards various desirable behaviors.
We demonstrate AGD's efficacy across three challenging domains.
For instruction following, we show that AGD significantly boosts adherence (e.g., improving the overall success rate on Llama 3.1 from 66.0% to 79.1%).
For knowledge-intensive tasks, we show that guiding generation towards usage of internal knowledge components or contextual sources can reduce hallucinations and improve factual accuracy in both closed-book and open-book settings.
Furthermore, we propose an adaptive, entropy-based variant of AGD that mitigates quality degradation and reduces computational overhead by applying guidance only when the model is uncertain.
Our work presents a versatile, more interpretable, and effective method for enhancing the reliability of modern LLMs.
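A toy version of the candidate-selection loop is sketched below. Everything here is an assumption made for illustration: the tiny bag-of-embeddings "model", gradient-times-input as the attribution method, and the ROI being a prompt span; AGD itself operates on full LLMs with more elaborate attribution methods.

```python
# Toy attribution-guided decoding step: among top-k next-token candidates, pick the one
# whose logit is most attributable (gradient x input) to the ROI tokens of the prompt.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 50, 16
emb = nn.Embedding(vocab, dim)
head = nn.Linear(dim, vocab)

def forward(input_ids):
    e = emb(input_ids)                    # (seq_len, dim)
    return head(e.mean(dim=0)), e         # next-token logits from mean-pooled context

input_ids = torch.randint(0, vocab, (10,))
roi = slice(0, 4)                          # assume the ROI is the first 4 prompt tokens

with torch.no_grad():
    base_logits, _ = forward(input_ids)
candidates = torch.topk(base_logits, k=5).indices.tolist()

best_tok, best_attr = None, float("-inf")
for tok in candidates:
    logits, e = forward(input_ids)
    e.retain_grad()
    logits[tok].backward()
    attr = (e.grad[roi] * e.detach()[roi]).sum().item()   # gradient-x-input over the ROI
    if attr > best_attr:
        best_tok, best_attr = tok, attr
print("selected next-token id:", best_tok)
```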
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3104. Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling
๐ฌ 17
โญ 5.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ diffusion_models deep_learning reference-based super-resolution
Recent works have explored reference-based super-resolution (RefSR) to mitigate hallucinations in diffusion-based image restoration. A key challenge is that real-world degradations make correspondences between low-quality (LQ) inputs and reference (Ref) images unreliable, requiring adaptive control of reference usage. Existing methods either ignore LQ-Ref correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues. To address this, we propose Ada-RefSR, a single-step diffusion framework guided by a "Trust but Verify" principle: reference information is leveraged when reliable and suppressed otherwise. Its core component, Adaptive Implicit Correlation Gating (AICG), employs learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into the attention backbone, AICG provides lightweight, adaptive regulation of reference guidance, serving as a built-in safeguard against erroneous fusion. Experiments on multiple datasets demonstrate that Ada-RefSR achieves a strong balance of fidelity, naturalness, and efficiency, while remaining robust under varying reference alignment. Code and models will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3105. A tale of two tails: Preferred and anti-preferred natural stimuli in visual cortex
๐ฌ 25
โญ 5.20
๐ applications to neuroscience & cognitive science
๐ท๏ธ computational neuroscience neuronal tuning stimulus selectivity higher-order visual cortex human psychophysics stimulus optimization deep neural networks
A fundamental quest in neuroscience is to find the preferred stimulus of a sensory neuron. This search lays the foundation for understanding how selectivity emerges in the primate visual stream---from simple edge-detecting neurons to highly-selective "grandmother" neurons---as well as for the architectures and activation functions of deep neural networks. The prevailing notion is that a visual neuron primarily responds to a single preferred visual feature like an oriented edge or object identity, resulting in a "one-tailed" distribution of responses to natural images. However, surprisingly, we instead find "two-tailed" response distributions of neurons in higher-order visual cortex (macaque V4), suggesting that V4 neurons have both preferred and anti-preferred stimuli. We ran further experiments to validate the existence of anti-preferred stimuli in V4. We find that these anti-preferred stimuli help to shape a neuron's tuning: Only a small number of preferred and anti-preferred images are needed to estimate the rest of a neuron's tuning. Moreover, in a psychophysics task, humans rely on anti-preferred images to interpret and predict V4 stimulus tuning; this was not the case for hidden units from a deep neural network. We find that the preferred and anti-preferred visual features, while clearly distinguishable for individual neurons, are not easily distinguishable across neurons. Thus, the V4 population seemingly encodes anti-preferred stimuli to double its capacity for feature selectivity. To encourage future experiments searching for anti-preferred stimuli, we release a tool called ImageBeagle to efficiently "hunt" for a neuron's preferred and anti-preferred stimuli by traversing the nearest neighbor graph of 30 million natural images. Overall, we establish anti-preferred stimuli as an important encoding property of V4 neurons. Our work embarks on a new quest in neuroscience to search for anti-preferred stimuli along the visual stream as well as to update our deep neural network models of visual cortex to account for the two-tailed response distributions of neurons.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3106. Autonomous Play with Correspondence-Driven Trajectory Warping
๐ฌ 29
โญ 5.20
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot Manipulation Autonomous Play Robot Data Generation
The ability to conduct and learn from self-directed interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful, task-directed robot experience. To address these challenges, we introduce Tether, a method for autonomous play with two key contributions. First, we design a novel non-parametric policy that leverages strong visual priors for extreme generalization: given two-view images, it identifies semantic correspondences to warp demonstration trajectories into new scenes. We show that this design is robust to significant spatial and semantic variations of the environment, such as dramatic positional differences and unseen objects. We then deploy this policy for autonomous multi-task play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is among the first to perform many hours of autonomous real-world play, producing a stream of data that consistently improves downstream policy performance over time. Ultimately, Tether yields over 1000 expert-level trajectories and trains policies competitive with those learned from human-collected demonstrations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3107. Solving Football by Exploiting Equilibrium Structure of 2p0s Differential Games with One-Sided Information
๐ฌ 25
โญ 5.20
๐ reinforcement learning
๐ท๏ธ Differential Game Incomplete-Information Game Game Theory
For a two-player imperfect-information extensive-form game (IIEFG) with $K$ time steps and a player action space of size $U$, the game tree complexity is $U^{2K}$, causing existing IIEFG solvers to struggle with large or infinite $(U,K)$, e.g., differential games with continuous action spaces. To partially address this scalability challenge, we focus on an important class of 2p0s games where the informed player (P1) knows the payoff while the uninformed player (P2) only has a belief over the set of $I$ possible payoffs. Such games encompass a wide range of scenarios in sports, defense, cybersecurity, and finance.
We prove that under mild conditions, P1's (resp. P2's) equilibrium strategy at any infostate concentrates on at most $I$ (resp. $I+1$) action prototypes. When $I\ll U$, this equilibrium structure causes the game tree complexity to collapse to $I^K$ for P1 when P2 plays pure best responses, and $(I+1)^K$ for P2 in a dual game where P1 plays pure best responses. We then show that exploiting this structure in standard learning modes, i.e., model-free multiagent reinforcement learning and model predictive control, is straightforward, leading to significant improvements in learning accuracy and efficiency from SOTA IIEFG solvers. Our demonstration solves a 22-player football game ($K=10$, $U=\infty$) where the attacking team has to strategically conceal their intention until a critical moment in order to exploit information advantage. Code is available [here](https://anonymous.4open.science/r/iclr_2026).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3108. Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
๐ฌ 23
โญ 5.20
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ model rebasin fine-tuning transfer transfer learning model editing model patching compositionality
When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.
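A sketch of the sign-masking step is shown below, under the assumption that masking means zeroing task-vector coordinates whose sign disagrees with the target model's few-shot descent direction; the actual method may differ in detail.

```python
# Sketch: transport a task vector to a new base model by gradient-sign masking.
import torch

def gradfix_mask(task_vector, target_grads):
    """Zero the coordinates of the task vector whose sign disagrees with the
    descent direction (-gradient) computed on the new pre-trained model."""
    masked = {}
    for name, tau in task_vector.items():
        agree = tau.sign() == (-target_grads[name]).sign()
        masked[name] = tau * agree.float()
    return masked

task_vector = {"layer.weight": torch.randn(4, 4)}     # theta_finetuned - theta_old_base
few_shot_grads = {"layer.weight": torch.randn(4, 4)}  # grads of the loss at the new base model
patched = gradfix_mask(task_vector, few_shot_grads)
# the masked vector is then added to the new base model's weights, with no further fine-tuning
print(patched["layer.weight"])
```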
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3109. Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
๐ฌ 30
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Doctor Agent Clinical Inquiry Agentic Reinforcement Learning
The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultations, which are essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making. Our framework introduces three key components: a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and an experience repository to ground policy learning in high-quality prior trajectories. We evaluate Doctor-R1 on OpenAI's HealthBench and MAQuE, assessing it across multi-faceted metrics such as communication quality, user experience, and task accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, human evaluations show a strong preference for the clinical dialogue generated by Doctor-R1, demonstrating the effectiveness of the framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3110. Certifying the Full YOLO Pipeline: A Probabilistic Verification Approach
๐ฌ 30
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Probabilistic Verification Formal Verification Object Detection Safety Guarantee
Object detection systems are essential in safety-critical applications, but they are vulnerable to the object disappearance (OD) threat, in which valid objects become undetected under small input perturbations, creating serious risks. This paper addresses the problem of verifying the robustness of YOLO networks against OD by proposing a three-step probabilistic verification framework: (1) estimating output ranges under a distribution of input perturbations, (2) formally verifying the Non-Maximum Suppression (NMS) process within these ranges, and (3) iteratively refining the results to reduce over-approximation. The framework scales to practical YOLO models. Both theoretical analysis and experimental results demonstrate that our method achieves comparable probabilistic guarantees and provides tighter Intersection-over-Union (IoU) lower bounds while requiring significantly fewer samples than existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3111. AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification
๐ฌ 16
โญ 5.20
๐ reinforcement learning
๐ท๏ธ unsupervised reinforcement learning skill discovery self-supervised learning multi-joint robot locomotion
Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both: during pre-training, a gradient-surgery projection balances the exploration and diversity gradients, and during fine-tuning, a skill selector exploits the learned diversity by choosing skills suited to downstream tasks. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. Through an extensive ablation study, we identify the role of each component and demonstrate that each element in AMPED is contributing to performance. We further provide theoretical evidence that, with a greedy skill selector, greater skill diversity reduces fine-tuning sample complexity. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning.
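The gradient-surgery projection can be illustrated with a PCGrad-style sketch; treating the exploration/diversity balance as a pairwise conflict projection is an assumption for this demo, not necessarily the exact adaptive rule used in AMPED.

```python
# Sketch of conflict-aware gradient surgery between two objectives.
import torch

def project_conflict(g_explore, g_diverse, eps=1e-12):
    """If the objectives conflict (negative dot product), remove from g_explore
    its component along g_diverse."""
    dot = torch.dot(g_explore, g_diverse)
    if dot < 0:
        g_explore = g_explore - (dot / (g_diverse.norm() ** 2 + eps)) * g_diverse
    return g_explore

g_e, g_d = torch.randn(256), torch.randn(256)
g_e = project_conflict(g_e, g_d)
update = g_e + g_d                             # combined pre-training update after surgery
print(float(torch.dot(g_e, g_d)) >= -1e-5)     # conflict removed: dot product is non-negative
```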
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3112. TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
๐ฌ 20
โญ 5.20
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ spoken language modeling speech tokenization
Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human-LLM interaction. Joint text-speech modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains under-explored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We achieve this through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We have conducted extensive experiments to demonstrate that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. Moreover, TASTE enables straightforward joint spoken language modeling by using Low-Rank Adaptation on the pre-trained text LLM. Our experimental results show that joint modeling with TASTE outperforms other pre-trained SLMs in tasks such as speech continuation and likelihood-based next-speech selection, showcasing its effectiveness. To the best of our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to learn a joint tokenization and embedding tailored for text-speech spoken language modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3113. Vid2World: Crafting Video Diffusion Models to Interactive World Models
๐ฌ 22
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ World Models Video Diffusion Models
World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present _Vid2World_, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores _video diffusion causalization_, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a _causal action guidance_ mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3114. Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?
๐ฌ 17
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ multi-modal language models memorization safety unified representations
We present *modal aphasia*, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously.
For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions.
We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures.
Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact.
In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities.
We demonstrate this risk by showing how a model aligned solely on text remains capable of generating harmful images.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3115. Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
๐ฌ 22
โญ 5.20
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ agents evaluation infrastructure reproducibility standardization
AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work (Figure 1). We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3116. AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models
๐ฌ 31
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial Learning; Prompt Injection Attacks; Adversarial Defending
Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. **Robustness** means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. **Retention** means unrelated concepts are preserved so the model's overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging, as existing works typically improve one factor while sacrificing the other: mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces AEGIS (Adversarial Erasure with Gradient-Informed Synergy), a retention-data-free framework that advances both robustness and retention. First, AEGIS replaces handpicked targets with an Adversarial Erasure Target (AET) optimized to approximate the semantic center of the erased concept class. By aligning the model's prediction on the erased prompt to an AET-derived target in the shared text-image space, AEGIS increases predicted-noise distances not just for the instance but for semantically related variants, substantially hardening the DMs against state-of-the-art adversarial prompt attacks. Second, AEGIS preserves utility without auxiliary data via Gradient Regularization Projection (GRP), a conflict-aware gradient rectification that selectively projects away the destructive component of the retention update only when it opposes the erasure direction. This directional, data-free projection mitigates interference between erasure and retention, avoiding dataset bias and accidental relearning. Extensive experiments show that AEGIS markedly reduces attack success rates across various concepts while maintaining or improving FID/CLIP versus advanced baselines, effectively pushing beyond the prevailing robustness-retention trade-off. The source code is in the supplementary.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3117. TopoFormer: Topology Meets Attention for Graph Learning
๐ฌ 27
โญ 5.20
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Topological Data Analysis Transformers Graph Representation Learning Graph Classification Molecular Property Prediction
We introduce *TopoFormer*, a lightweight and scalable framework for graph representation learning that encodes topological structure into attention-friendly sequences. At the core of our method is *Topo-Scan*, a novel module that decomposes a graph into a short, ordered sequence of topological tokens by slicing over node or edge filtrations. These sequences capture multi-scale structural patterns, from local motifs to global organization, and are processed by a Transformer to produce expressive graph-level embeddings. Unlike traditional persistent homology pipelines, *Topo-Scan* is parallelizable, avoids costly diagram computations, and integrates seamlessly with standard deep learning architectures. We provide theoretical guarantees on the stability of our topological encodings and demonstrate state-of-the-art performance across graph classification and molecular property prediction benchmarks. Our results show that *TopoFormer* matches or exceeds strong GNN and topology-based baselines while offering predictable and efficient compute. This work opens a new path for parallelizable and unifying approaches to graph representation learning that integrate topological inductive biases into attention frameworks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3118. IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring
๐ฌ 22
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Low-Rank Adaptation Integrated Gradients Parameter-Efficient Fine-Tuning Uncertainty-Aware Scoring
As large language models (LLMs) scale to billions of parameters, full-parameter fine-tuning becomes compute- and memory-prohibitive. Parameter-efficient fine-tuning (PEFT) mitigates this issue by updating only a small set of task-specific parameters while keeping the base model frozen. Among PEFT approaches, low-rank adaptation (LoRA) is widely adopted; however, it enforces a uniform rank across layers despite substantial variation in layer importance, motivating layerwise rank allocation. Recent adaptive-rank variants (e.g., AdaLoRA) allocate ranks based on importance scores, yet typically rely on instantaneous gradients that capture only local sensitivity, overlooking non-local, pathwise effects within the same layer, which yields unstable and biased scores. To address this limitation, we introduce IGU-LoRA, an adaptive-rank LoRA that (i) computes within-layer Integrated Gradients (IG) sensitivities and aggregates them into a layer-level score for rank allocation, and (ii) applies an uncertainty-aware scheme using exponential moving averages with deviation tracking to suppress noisy updates and calibrate rank selection. Theoretically, we prove an upper bound on the composite trapezoidal rule approximation error for parameter-space IG under a pathwise Hessian-Lipschitz condition, which informs the quadrature budget. Across diverse tasks and architectures, IGU-LoRA consistently outperforms strong PEFT baselines at matched parameter budgets, improving downstream accuracy and robustness. Ablations confirm the contributions of pathwise within-layer sensitivity estimates and uncertainty-aware selection to effective rank allocation.
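A toy illustration of parameter-space integrated gradients with the composite trapezoidal rule, plus an EMA-with-deviation tracker, is given below; the quadratic toy loss, quadrature budget, and calibration rule are assumptions, not the paper's exact formulation.

```python
# Sketch: IG sensitivity along the path theta0 -> theta1 via the trapezoidal rule,
# followed by an EMA score with deviation tracking to damp noisy layer scores.
import torch

def ig_trapezoid(loss_fn, theta0, theta1, steps=8):
    """IG_i ~ (theta1 - theta0)_i * trapezoid-average of dL/dtheta_i along the path."""
    alphas = torch.linspace(0, 1, steps + 1)
    weights = torch.full((steps + 1,), 1.0)
    weights[0] = weights[-1] = 0.5
    acc = torch.zeros_like(theta0)
    for a, w in zip(alphas, weights):
        theta = (theta0 + a * (theta1 - theta0)).detach().requires_grad_(True)
        loss_fn(theta).backward()
        acc += w * theta.grad
    return (theta1 - theta0) * acc / steps

loss_fn = lambda t: (t ** 2).sum()                 # toy loss; exact IG equals the loss gap here
theta0, theta1 = torch.zeros(6), torch.randn(6)
score = float(ig_trapezoid(loss_fn, theta0, theta1).abs().sum())  # layer-level importance

ema, dev, beta = 0.0, 0.0, 0.9                     # uncertainty-aware smoothing of the score
for t in range(10):
    s = score * (1 + 0.05 * torch.randn(()).item())   # noisy per-step observation
    ema = beta * ema + (1 - beta) * s
    dev = beta * dev + (1 - beta) * abs(s - ema)
calibrated = ema - dev                              # down-weight layers with noisy scores
print(score, calibrated)
```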
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3119. Shrinking Proteins with Diffusion
๐ฌ 18
โญ 5.20
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Proteins Generative model diffusion discrete diffusion
Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body because their sequences are too long.
Shortening these sequences typically involves costly, time-consuming experimental campaigns.
Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature.
Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete.
To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature.
To do so, SCISOR trains a de-noiser to reverse a forward noising process that adds random insertions to natural sequences.
As a generative model, SCISOR fits evolutionary sequence data competitively with previous large models.
In evaluation, SCISOR achieves state-of-the-art predictions of the functional effects of deletions on ProteinGym.
Finally, we use the SCISOR de-noiser to shrink long protein sequences, and show that its suggested deletions result in significantly more realistic proteins and more often preserve functional motifs than previous models of evolutionary sequences.
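The forward corruption is straightforward to sketch: random letters inserted at random positions, which the de-noiser then learns to delete. The alphabet and insertion rate below are illustrative assumptions.

```python
# Sketch of the insertion-based forward noising whose reverse is learned by the de-noiser.
import random

AA = "ACDEFGHIKLMNPQRSTVWY"   # standard amino-acid alphabet

def add_random_insertions(seq, rate=0.15, seed=0):
    """Insert random residues; training pairs (noisy -> clean) teach which letters to delete."""
    rng = random.Random(seed)
    out = []
    for ch in seq:
        while rng.random() < rate:   # geometric number of insertions before each residue
            out.append(rng.choice(AA))
        out.append(ch)
    return "".join(out)

clean = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
noisy = add_random_insertions(clean)
print(len(clean), len(noisy))
# At inference, running the learned de-noiser on a long natural protein proposes deletions
# that shrink the sequence while staying close to the natural sequence distribution.
```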
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3120. No outlier channels but with outlier blocks
๐ฌ 18
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ outliers Quantization
With the rapid scaling of large language models, achieving efficient compression while maintaining model performance has become a critical challenge. To address the limitations of existing non-uniform quantization methods, which typically rely on fixed codebooks and require costly optimization, we propose a novel arbitrary-bit-width non-uniform quantization framework (NuBitQ). The framework enables flexible, layer-specific quantization strategies, significantly enhancing adaptability and efficiency. Notably, traditional outlier compensation methods used in uniform quantization are ill-suited for the anomalous distribution characteristics encountered in our context. To address this, we design a novel outlier evaluation metric that integrates weight perturbation, activation distribution, and perturbation propagation. Based on this metric, we further develop an Outlier Compensation Plugin (OCP) that implements multi-level, fine-grained outlier compensation strategies, effectively mitigating performance degradation caused by outliers. Our approach avoids direct complex Hessian computation and fine-tuning, offering strong applicability and scalability. Extensive experiments on multiple tasks and across various model series demonstrate the effectiveness of the proposed approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3121. IncVGGT: Incremental VGGT for Memory-Bounded Long-Range 3D Reconstruction
๐ฌ 18
โญ 5.20
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Reconstruction memory efficient long range point cloud
We present IncVGGT, a training-free incremental variant of VGGT that makes transformer-based 3D reconstruction feasible for long sequences in real-world applications. Vanilla VGGT relies on dense global attention, which causes memory to grow quadratically and requires excessive computation, making it impractical for long-sequence scenarios. Even evolved streaming variants, such as StreamVGGT, still suffer from rapidly growing cache and latency. IncVGGT addresses these challenges from two orthogonal directions: (1) overlapping frames are registered and fused into composite views, reducing duplicate tokens, and (2) history-side pruning retains only the top-$k$ most relevant (highest-scoring) slots together with the most recent one, bounding cache growth. This incremental and memory-efficient design minimizes computation and memory occupation across arbitrarily long sequences. Compared to StreamVGGT, IncVGGT sustains arbitrarily long sequences with large efficiency gains (e.g., on 500-frame sequences, 58.5$\times$ fewer operators, 9$\times$ lower memory, 25.7$\times$ less energy, and 4.9$\times$ faster inference) while maintaining comparable accuracy. More importantly, unlike existing baselines that directly run out of memory beyond 300 (VGGT) to 500 (StreamVGGT) frames, IncVGGT continues to operate smoothly even on 10k-frame inputs under an 80GB GPU, showing that our design truly scales to ultra-long sequences without hitting memory limits. These results highlight IncVGGT's potential for deployment in resource-constrained edge devices for long-range 3D scenarios.
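The history-side pruning rule admits a compact sketch. The relevance score below (un-normalized query-key dot products averaged over tokens) and the cache layout are assumptions for illustration, not IncVGGT's exact scoring.

```python
# Sketch: bound the cache by keeping the top-k most relevant cached frames plus the newest one.
import torch

def prune_cache(keys, values, query, k=4):
    """keys/values: (num_cached, tokens, dim); query: (tokens, dim)."""
    # Relevance of each cached frame = average query-key dot product over tokens.
    scores = torch.einsum("td,ntd->n", query, keys) / keys.shape[1]
    keep = torch.topk(scores[:-1], k=min(k, len(scores) - 1)).indices  # best historical slots
    keep = torch.cat([keep, torch.tensor([len(scores) - 1])])          # always keep the newest
    keep = torch.sort(keep).values
    return keys[keep], values[keep]

keys = torch.randn(12, 64, 32)
values = torch.randn(12, 64, 32)
query = torch.randn(64, 32)
k2, v2 = prune_cache(keys, values, query)
print(k2.shape)   # bounded cache: (5, 64, 32) regardless of how many frames came before
```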
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3122. Newton Method Revisited: Global Convergence Rates up to $O(1/k^3)$ for Stepsize Schedules and Linesearch Procedures
๐ฌ 17
โญ 5.20
๐ optimization
๐ท๏ธ Damped Newton Methods Tensor Methods Linesearch Procedures Global Convergence Guarantees
This paper investigates the global convergence of stepsized Newton methods for convex functions with Hölder continuous Hessians or third derivatives. We propose several simple stepsize schedules with fast global convergence guarantees, up to $\mathcal{O}(1/k^3)$. For cases with multiple plausible smoothness parameterizations or an unknown smoothness constant, we introduce a stepsize linesearch and a backtracking procedure with provable convergence as if the optimal smoothness parameters were known in advance. Additionally, we present strong convergence guarantees for the practically popular Newton method with exact linesearch.
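The flavor of a damped Newton iteration with a stepsize schedule can be shown on a small convex test problem; the particular schedule below is illustrative and is not one of the schedules analyzed in the paper.

```python
# Damped Newton iteration x_{k+1} = x_k - alpha_k * H(x_k)^{-1} grad(x_k) on a convex function.
import numpy as np

def f(x):    return np.log(np.cosh(x)).sum() + 0.05 * (x ** 2).sum()
def grad(x): return np.tanh(x) + 0.1 * x
def hess(x): return np.diag(1.0 / np.cosh(x) ** 2 + 0.1)

x = 5.0 * np.ones(3)
for k in range(20):
    step = np.linalg.solve(hess(x), grad(x))
    alpha = min(1.0, (k + 1) / (k + 4))   # damped early on, full Newton steps later
    x = x - alpha * step
print("final value:", f(x), "final iterate:", x)
```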
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3123. An efficient, provably optimal, practical algorithm for the 0-1 loss linear classification problem
๐ฌ 14
โญ 5.20
๐ learning theory
๐ท๏ธ Classification Global optimal algorithm Hyperplane arrangement Interpretable machine learning
Algorithms for solving the linear classification problem have a long history, dating back at least to 1936 with linear discriminant analysis. For linearly separable data, many algorithms can obtain the exact solution to the corresponding 0-1 loss classification problem efficiently, but for data which is not linearly separable, it has been shown that this problem, in full generality, is NP-hard. Alternative approaches all involve approximations of some kind, such as the use of surrogates for the 0-1 loss (for example, the hinge or logistic loss), none of which can be guaranteed to solve the problem exactly. Finding an efficient, rigorously proven algorithm for obtaining an exact (i.e., globally optimal) solution to the 0-1 loss linear classification problem remains an open problem.
By analyzing the combinatorial and incidence relations between hyperplanes and data points, we derive a rigorous construction algorithm, incremental cell enumeration (ICE), that can solve the 0-1 loss classification problem exactly in $O\left(N^{D+1}\right)$---exponential in the data dimension $D$. To the best of our knowledge, this is the first standalone algorithm---one that does not rely on general-purpose solvers---with rigorously proven guarantees for this problem. Moreover, we further generalize ICE to address the polynomial hypersurface classification problem in $O\left(N^{G+1}\right)$ time, where $G$ is determined by both the data dimension $D$ and the polynomial degree $K$ defining the hypersurface. The correctness of our algorithm is proved by the use of tools from the theory of hyperplane arrangements and oriented matroids.
We demonstrate the effectiveness of our algorithm on real-world datasets, achieving optimal training accuracy for small-scale datasets and higher test accuracy on most datasets. Furthermore, our complexity analysis shows that the ICE algorithm offers superior computational efficiency compared with the state-of-the-art branch-and-bound algorithm.
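The combinatorial fact underlying cell enumeration can be illustrated with a naive $O(N^{D+1})$ brute force in $D=2$: for data in general position, some optimal hyperplane can be perturbed to pass through $D$ data points, so enumerating point pairs and both orientations recovers the exact minimum. This is only an illustration of the search space, not the authors' incremental cell enumeration.

```python
# Naive exact 0-1 loss linear classification in 2D by enumerating lines through point pairs.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=40))   # noisy, non-separable labels

best_err = len(y)
for i, j in itertools.combinations(range(len(y)), 2):
    d = X[j] - X[i]
    w = np.array([-d[1], d[0]])          # normal of the line through points i and j
    b = -w @ X[i]
    margins = X @ w + b
    on_line = np.abs(margins) < 1e-9
    for s in (+1, -1):                   # both orientations of the separating line
        wrong = (np.sign(s * margins) != y) & ~on_line   # boundary points resolved favorably
        best_err = min(best_err, int(wrong.sum()))
print("minimum number of misclassifications:", best_err)
```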
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3124. TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
๐ฌ 23
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreaking Attacks Large Language Models
Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3125. Human Behavior Atlas: Benchmarking Unified Psychological And Social Behavior Understanding
๐ฌ 39
โญ 5.20
๐ datasets and benchmarks
๐ท๏ธ Multimodal Learning Unified Models Benchmarking Transfer Learning Human Behavior
Using intelligent systems to perceive psychological and social behaviors, that is, the underlying affective, cognitive, and pathological states that are manifested through observable behaviors and social interactions, remains a challenge due to their complex, multifaceted, and personalized nature. Existing work tackling these dimensions through specialized datasets and single-task systems often misses opportunities for scalability, cross-task transfer, and broader generalization. To address this gap, we curate Human Behavior Atlas, a unified benchmark of diverse behavioral tasks designed to support the development of unified models for understanding psychological and social behaviors. Human Behavior Atlas comprises over 100,000 samples spanning text, audio, and visual modalities, covering tasks on *affective states, cognitive states, pathologies,* and *social processes*. Our unification efforts can reduce redundancy and cost, enable training to scale efficiently across tasks, and enhance generalization of behavioral features across domains. On Human Behavior Atlas, we train three models: OmniSapiens-7B SFT, OmniSapiens-7B BAM, and OmniSapiens-7B RL. We show that training on Human Behavior Atlas enables models to consistently outperform existing multimodal LLMs across diverse behavioral tasks. Pretraining on Human Behavior Atlas also improves transfer to novel behavioral datasets, with the targeted use of behavioral descriptors yielding meaningful performance gains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3126. Align Your Structures: Generating Trajectories with Structure Pretraining for Molecular Dynamics
๐ฌ 23
โญ 5.20
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ geometric diffusion models molecular dynamics
Generating molecular dynamics (MD) trajectories using deep generative models has attracted increasing attention, yet remains inherently challenging due to the limited availability of MD data and the complexities involved in modeling high-dimensional MD distributions. To overcome these challenges, we propose a novel framework that leverages structure pre-training for MD trajectory generation. Specifically, we first train a diffusion-based structure generation model on a large-scale conformer dataset, on top of which we introduce an interpolator module trained on MD trajectory data, designed to enforce temporal consistency among generated structures. Our approach effectively harnesses abundant structural data to mitigate the scarcity of MD trajectory data and effectively decomposes the intricate MD modeling task into two manageable subproblems: structural generation and temporal alignment. We comprehensively evaluate our method on the QM9 and DRUGS small-molecule datasets across unconditional generation, forward simulation, and interpolation tasks, and further extend our framework and analysis to tetrapeptide and protein monomer systems. Experimental results confirm that our approach excels in generating chemically realistic MD trajectories, as evidenced by remarkable improvements of accuracy in geometric, dynamical, and energetic measurements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3127. No Prior, No Leakage: Revisiting Reconstruction Attacks in Trained Neural Networks
๐ฌ 15
โญ 5.20
๐ learning theory
๐ท๏ธ Deep learning theory Implicit bias Training reconstruction attack Data privacy Data protection Deep learning security
The memorization of training data by neural networks raises pressing concerns for privacy and security. Recent work has shown that, under certain conditions, portions of the training set can be reconstructed directly from model parameters. Some of these methods exploit implicit bias toward margin maximization, suggesting that properties often regarded as beneficial for generalization may actually compromise privacy. Yet despite striking empirical demonstrations, the reliability of these attacks remains poorly understood and lacks a solid theoretical foundation. In this work, we take a complementary perspective: rather than designing stronger attacks, we analyze the inherent weaknesses and limitations of existing reconstruction methods and identify conditions under which they fail. We rigorously prove that, without incorporating prior knowledge about the data, there exist infinitely many alternative solutions that may lie arbitrarily far from the true training set, rendering reconstruction fundamentally unreliable. Empirically, we further demonstrate that exact duplication of training examples occurs only by chance. Our results refine the theoretical understanding of when training set leakage is possible and offer new insights into mitigating reconstruction attacks. Remarkably, we demonstrate that networks trained more extensively, and therefore satisfying implicit bias conditions more strongly, are in fact less susceptible to reconstruction attacks, reconciling privacy with the need for strong generalization in this setting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3128. Maximizing Incremental Information Entropy for Contrastive Learning
๐ฌ 20
โญ 5.20
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Self-supervised Learning; Contrastive Learning;
Contrastive learning has achieved remarkable success in self-supervised representation learning, often guided by information-theoretic objectives such as mutual information maximization. Motivated by the limitations of static augmentations and rigid invariance constraints, we propose IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly optimizes the entropy gain between augmented views while preserving semantic consistency. Our theoretical framework reframes the challenge by identifying the encoder as an information bottleneck and proposes a joint optimization of two components: a learnable transformation for entropy generation and an encoder regularizer for its preservation. Experiments on CIFAR-10/100, STL-10, and ImageNet demonstrate that IE-CL consistently improves performance under small-batch settings. Moreover, our core modules can be seamlessly integrated into existing frameworks. This work bridges theoretical principles and practice, offering a new perspective in contrastive learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3129. Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights
๐ฌ 32
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Privacy Generalizability Weights Rewinding Fine-Tuning
Prior approaches for membership privacy preservation usually update or retrain all weights in neural networks, which is costly and can lead to unnecessary utility loss or even more serious misalignment in predictions between training data and non-training data. In this paper, we empirically show that only a very small number of weights are liable to membership privacy vulnerability. However, we also identify that those neurons are not only liable to membership privacy breach but also contribute to generalizability. Based on these insights, to preserve privacy, instead of discarding those neurons, we rewind only their weights for fine-tuning. Through extensive experiments, we show that this mechanism, plugged into other approaches, enhances resilience against Membership Inference Attacks while maintaining utility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3130. DELTA-Code: How RL Unlocks and Transfers New Programming Algorithms in LLMs
๐ฌ 11
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reinforcement Learning Generalization Learnability
It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To address this question, we introduce DELTA-Code (Distributional Evaluation of Learnability and Transferability in Algorithmic Coding), a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability (can LLMs, through reinforcement learning (RL), solve problem families where pretrained models fail even with a large number of attempts, i.e., pass@K=0?) and transferability (if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets?). Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop.
Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3131. When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
๐ฌ 21
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model Preference alignment
Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. Surprisingly, we find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose **C**onfidence-**W**eighted **P**reference **O**ptimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20\% of human annotations outperforms the model trained with 100\% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
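The abstract does not specify the exact weighting scheme, so the sketch below shows one plausible instantiation: a standard DPO objective whose per-sample loss is scaled by the weak annotator's confidence. Function and argument names are assumptions for illustration, not CW-PO's actual implementation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                 ref_chosen_logps, ref_rejected_logps,
                                 annotator_confidence, beta=0.1):
    """Illustrative confidence-weighted DPO loss.

    annotator_confidence: per-sample confidence in [0, 1] from the weak LLM
    annotator (e.g., the normalized probability it assigns to its own
    preference label). Multiplying the per-sample DPO loss by this confidence
    is one simple way to down-weight uncertain annotations.
    """
    pi_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    per_sample = -F.logsigmoid(beta * (pi_logratio - ref_logratio))
    return (annotator_confidence * per_sample).mean()
```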
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3132. Mathesis: Towards Formal Theorem Proving from Natural Languages
๐ฌ 28
โญ 5.20
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ autoformalization AI for math AI for science Lean 4 formal reasoning parallel corpus large language model LLM
Recent advances in large language models (LLMs) show strong promise for formal reasoning. However, most LLM-based theorem provers remain constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We address this gap by focusing on autoformalization, the task of translating informal problems into formal statements. We propose Mathesis, the first pipeline for the systematic study of formal theorem proving from natural language. It contributes the first autoformalizer trained with reinforcement learning, which integrates syntactic, semantic, and prover feedback as reward signals to yield accurate and verifiable formalizations. This is further supported by our novel LeanScorer framework for evaluating semantic correctness. To assess real-world applicability, we introduce Gaokao-Formal, a benchmark of 495 complex proof problems from the college entrance exams. Experiments demonstrate that our autoformalizer improves pass rates by 45% on Gaokao-Formal and 6% on MiniF2F compared to state-of-the-art baselines. Paired with provers, our autoformalizer consistently enhances proving accuracy, including a 42% gain for DeepSeek-Prover-V2 on Gaokao-Formal. Our code is available at https://anonymous.4open.science/r/Mathesis-2D14.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3133. LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
๐ฌ 26
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning LLM Reasoning Self-Rewarding
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time after RLVR, prior studies incorporate the training of model's self-verification capabilities into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which doubles the inference cost per sample and significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification training can be approximately reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a Mean Squared Error (MSE) loss that aligns the last-token self-rewarding scores with the verifier-based reasoning rewards, and jointly optimizes the reasoning and self-rewarding capabilities of LLMs.
The optimized self-rewarding scores serve as auxiliary reward signals in both training and testing to enhance model performance.
Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last solution token immediately after solution generation, thereby incurring only the minimal extra cost of at most one additional token inference.
Experimental results show that our method not only improves the reasoning performance of the model but also equips it with remarkable self-rewarding capability, thereby further boosting its inference-time scaling performance.
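The self-rewarding score and auxiliary loss described above admit a compact rendering; the sketch below follows the abstract's wording, with tensor shapes, the pre-specified token, and the pre-calculated constant treated as assumptions.

```python
import torch
import torch.nn.functional as F

def last_token_self_reward(last_token_logprobs, special_token_id, constant, kl_coef):
    """Self-rewarding score as described in the abstract: the policy's
    next-token log-probability of a pre-specified token at the final solution
    token, minus a pre-calculated constant, scaled by the KL coefficient.
    last_token_logprobs: (batch, vocab) log-probabilities at the last token."""
    return kl_coef * (last_token_logprobs[:, special_token_id] - constant)

def laser_auxiliary_loss(self_reward_scores, verifier_rewards):
    # MSE alignment between self-rewarding scores and verifier-based rewards,
    # added on top of the original RLVR loss.
    return F.mse_loss(self_reward_scores, verifier_rewards)
```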
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3134. FMIP: Joint Continuous-Integer Flow For Mixed-Integer Linear Programming
๐ฌ 20
โญ 5.20
๐ optimization
๐ท๏ธ Mixed Integer-Linear Programming Learning to Optimize Flow Matching
Mixed-Integer Linear Programming (MILP) is a foundational tool for complex decision-making problems.
However, the NP-hard nature of MILP presents a significant computational challenge, motivating the development of machine learning-based heuristic solutions to accelerate downstream solvers.
While recent generative models have shown promise in learning powerful heuristics, they suffer from a critical limitation.
That is, they model the distribution of only the integer variables and fail to capture the intricate coupling between integer and continuous variables, creating an information bottleneck and ultimately leading to suboptimal solutions.
To this end, we propose Joint Continuous-Integer Flow for Mixed-Integer Linear Programming (FMIP), which is the first generative framework that models the joint distribution of both integer and continuous variables for MILP solutions.
Built upon the joint modeling paradigm, a holistic guidance mechanism is designed to steer the generative trajectory, actively refining solutions toward optimality and feasibility during the inference process.
Extensive experiments on eight standard MILP benchmarks demonstrate the superior performance of FMIP against existing baselines, reducing the primal gap by 41.34% on average.
Moreover, we show that FMIP is fully compatible with arbitrary backbone networks and various downstream solvers, making it well-suited for a broad range of real-world MILP applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3135. Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
๐ฌ 13
โญ 5.20
๐ generative models
๐ท๏ธ Function calling Tool-augmented LMs Tool-use
As language models gain access to external tools through structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of accessible functions, task complexity, and input size, and remain vulnerable to data contamination.
We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks to stress-test TaLMs.
The key idea is to cast tool use as traversal over a hidden function-dependency DAG where nodes are function calls and an edge between nodes represents one function consuming the output of another.
Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding pretraining/test-time leakage.
We apply our FuncBenchGen framework to evaluate seven open and closed LLMs on tool use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models with GPT-5 significantly outperforming other available models. Performance declines sharply as dependency depth increases. Furthermore, connected distractors---irrelevant functions sharing type-compatible variables with relevant functions---prove especially difficult to handle.
We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use.
Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., raising the success rate from 62.5\% to 81.3\% for GPT-5, without modifying the underlying architectures or training.
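The restating mitigation lends itself to a very small helper. The sketch below is one possible form, assuming a chat-style message list and a dictionary of resolved variables; the paper's exact prompt format is not given, so the message wording is an assumption.

```python
def restate_variable_values(messages, variable_state):
    """Before each tool-calling step, append a message that restates every
    variable resolved so far, so the model need not recover stale values from
    earlier turns. The message format here is illustrative."""
    summary = "Known variable values so far:\n" + "\n".join(
        f"- {name} = {value}" for name, value in variable_state.items()
    )
    return messages + [{"role": "user", "content": summary}]
```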
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3136. BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models
๐ฌ 41
โญ 5.20
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Parameter-Efficient Fine-Tuning PEFT LoRA Bias Mitigation Catastrophic Inheritance Representation Learning Robustness Large Language Models
Parameter-efficient fine-tuning (PEFT) has become a de facto standard for adapting Large Language Models (LLMs). However, we identify a critical vulnerability within popular low-rank adaptation methods like LoRA: they can exacerbate "Catastrophic Inheritance", the unchecked propagation of biases, noise, and data imbalances from pre-training. This phenomenon can degrade model robustness and fairness, undermining the benefits of efficient adaptation. To address this, we introduce Bias-Alleviating Low-Rank Adaptation (BA-LoRA). Our approach is founded on a principled decomposition of Catastrophic Inheritance into three core challenges: Knowledge Drift, Representation Collapse, and Overfitting to Noise. BA-LoRA systematically mitigates these issues by incorporating a trio of targeted regularizers (consistency, diversity, and SVD) designed to preserve core knowledge, enforce representational richness, and stabilize the low-rank updates. We conduct comprehensive evaluations on a suite of natural language understanding (NLU) and generation (NLG) tasks using diverse, prominent open-source language models (e.g., LLaMA-2-7B and DeBERTa-v3-base). Our results show that BA-LoRA not only outperforms state-of-the-art LoRA variants in terms of performance and stability, but also demonstrates quantitatively superior robustness and bias mitigation on targeted evaluations. This confirms its ability to counteract the adverse effects of Catastrophic Inheritance. The code is available at https://anonymous.4open.science/r/p5q9a1z8.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3137. How Base Frequency Shapes RoPE: An Analytical Study of Frequency-Band Formation
๐ฌ 15
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Rotary Position Embedding Position Interpolation Extrapolation Large Language Model
Rotary Position Embeddings (RoPE) are widely adopted in LLMs, and it is commonly believed that larger base frequencies $\theta$ yield better long-context performance.
In this paper, we show that a high-norm RoPE dimension, referred to as the "frequency band," consistently emerges across multiple models, and we focus on this band to reveal the trade-offs inherent in RoPE.
We find that replacing the RoPE dimensions below the frequency band with NoPE during inference has little effect on performance, indicating that these lower-frequency dimensions are only weakly utilized.
We further find that the location of the frequency band depends on the RoPE base $\theta$ and the training sequence length. Moreover, the band forms early during pre-training and persists even after context extension via position interpolation.
Notably, we show that aligning $\theta$ with the training length shifts the band toward lower frequencies and improves extrapolation, whereas increasing $\theta$ enhances interpolation but reduces extrapolation, revealing a clear trade-off between interpolation and extrapolation.
We believe this work is a step toward a sharper understanding of positional embeddings in LLMs, with falsifiable diagnostics and practical guidance for choosing $\theta$ that support scaling to longer contexts.
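For context, the standard RoPE frequency relations below (general facts about RoPE, not results from this paper) make the joint dependence on the base $\theta$ and the training length concrete: a dimension whose wavelength exceeds the training length never completes a full rotation during training, which is one way to see why the band's location moves with both quantities.

```latex
% Standard RoPE per-dimension frequency and wavelength (pair index i = 0, ..., d/2 - 1):
\[
  \omega_i = \theta^{-2i/d}, \qquad
  \lambda_i = \frac{2\pi}{\omega_i} = 2\pi\,\theta^{2i/d} .
\]
% A dimension completes less than one full rotation over the training context whenever
% \lambda_i > L_{\mathrm{train}}, i.e.
\[
  i \;>\; \frac{d}{2}\cdot\frac{\ln\!\left(L_{\mathrm{train}}/2\pi\right)}{\ln\theta},
\]
% so the boundary between strongly and weakly utilized dimensions shifts with both
% the base \theta and the training sequence length.
```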
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3138. Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence
๐ฌ 24
โญ 5.20
๐ learning theory
๐ท๏ธ Model Collapse Synthetic Data Verifier-guided retraining
Synthetic data has been increasingly used to train frontier generative models. However, recent studies raise key concerns that iteratively retraining a generative model on its self-generated synthetic data may keep deteriorating model performance, a phenomenon often coined model collapse. In this paper, we investigate ways to modify the synthetic retraining process to avoid model collapse, and even possibly help reverse the trend from collapse to improvement. Our key finding is that by injecting information through an external synthetic data verifier, whether a human or a better model, synthetic retraining will not cause model collapse. Specifically, we situate our theoretical analysis in the fundamental linear regression problem, showing that verifier-guided retraining yields early improvements when the verifier is accurate, and in the long run, the parameter estimate converges to the verifier's knowledge center. Our theory predicts that the performance of synthetic retraining will show early gains but eventually plateau or even reverse, unless the verifier is perfectly reliable. Indeed, our experiments on both linear regression and a Conditional Variational Autoencoder (CVAE) trained on MNIST confirm these theoretical insights.
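A toy simulation of the linear-regression setting makes the claim easy to probe. The script below is an illustrative sketch rather than the paper's protocol: the acceptance rule, noise levels, and the choice to model the verifier as a perturbed copy of the true parameter are all assumptions. Under these assumptions, the refit estimate tends to drift toward the verifier's parameter instead of collapsing.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, generations = 5, 200, 20
w_true = rng.normal(size=d)
w_verifier = w_true + 0.1 * rng.normal(size=d)   # accurate but imperfect verifier

# Initial fit on real data.
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
w = np.linalg.lstsq(X, y, rcond=None)[0]

for g in range(generations):
    X_syn = rng.normal(size=(n, d))
    y_syn = X_syn @ w + 0.1 * rng.normal(size=n)        # self-generated labels
    keep = np.abs(y_syn - X_syn @ w_verifier) < 0.5      # verifier filtering
    if keep.sum() >= d:                                   # refit on accepted synthetic data
        w = np.linalg.lstsq(X_syn[keep], y_syn[keep], rcond=None)[0]
    print(g, np.linalg.norm(w - w_true), np.linalg.norm(w - w_verifier))
```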
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3139. Robust Adversarial Quantification via Conflict-Aware Evidential Deep Learning
๐ฌ 16
โญ 5.20
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Uncertainty Quantification Adversarial Attack Detection Evidential Deep Learning
Reliability of deep learning models is critical for deployment in high-stakes applications, where out-of-distribution or adversarial inputs may lead to detrimental outcomes. Evidential Deep Learning (EDL), an efficient paradigm for uncertainty quantification, models predictions as Dirichlet distributions in a single forward pass. However, EDL is particularly vulnerable to adversarially perturbed inputs, making overconfident errors. Conflict-aware Evidential Deep Learning (C-EDL) is a lightweight post-hoc uncertainty quantification approach that mitigates these issues, enhancing adversarial and OOD robustness without retraining. C-EDL generates diverse, task-preserving transformations per input and quantifies representational disagreement to calibrate uncertainty estimates when needed. C-EDL's conflict-aware prediction adjustment improves detection of OOD and adversarial inputs, maintaining high in-distribution accuracy and low computational overhead. Our experimental evaluation shows that C-EDL significantly outperforms state-of-the-art EDL variants and competitive baselines, achieving substantial reductions in coverage for OOD data (up to $\approx55\%$) and adversarial data (up to $\approx90\%$), across a range of datasets, attack types, and uncertainty metrics.
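A rough sketch of the conflict-aware idea is given below, under explicit assumptions: the conflict measure (per-class standard deviation across transformed views) and the way it deflates the aggregated evidence are illustrative choices, not the paper's exact formulation.

```python
import torch

def conflict_aware_uncertainty(evidence_per_view):
    """evidence_per_view: (num_views, num_classes) non-negative evidence from an
    EDL head applied to task-preserving transformations of one input (two or
    more views). Views that disagree yield a flatter aggregate Dirichlet and
    hence higher predictive uncertainty."""
    alpha = evidence_per_view + 1.0                        # per-view Dirichlet parameters
    probs = alpha / alpha.sum(dim=-1, keepdim=True)        # per-view expected class probabilities
    disagreement = probs.std(dim=0).sum()                  # simple conflict score across views
    agg_alpha = evidence_per_view.mean(dim=0) / (1.0 + disagreement) + 1.0
    num_classes = agg_alpha.numel()
    return num_classes / agg_alpha.sum()                   # standard EDL vacuity (uncertainty)
```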
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3140. KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
๐ฌ 21
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ vector quantization llm Moe
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose significant challenges for deployment in resource-constrained environments.
Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by constructing and leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords within it.
However, its direct application to MoEs suffers from significant performance degradation caused by two critical obstacles: (1) redundant representation among experts leads to VQ repeatedly quantizing similar representations for each expert, resulting in inefficient utilization of the limited codebook capacity; and (2) cumulative output bias, amplified by expert aggregation, leads to distributional shifts in the quantized outputs, resulting in degraded model accuracy.
To this end, we propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs.
KBVQ-MoE introduces two lightweight, offline techniques that add negligible runtime computational and memory overhead:
(1) Input-driven redundancy elimination, where a Karhunen-Loève Transform (KLT) guided singular value decomposition (SVD) extracts and shares dominant weight components across experts.
(2) Bias-corrected output stabilization, where vector quantization is applied to expert-specific (i.e., non-redundant) representations and the quantized outputs are corrected with channel-wise affine compensation.
Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods.
For instance, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99, nearly identical to the FP16 baseline of 68.07, underscoring the potential of KBVQ-MoE for efficient deployment on edge devices and other resource-constrained platforms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3141. Adaptive Concept Discovery for Interpretable Few-Shot Text Classification
๐ฌ 21
โญ 5.20
๐ interpretability and explainable AI
๐ท๏ธ concept bottleneck models; few-shot text classification;
Few-shot text classification is a critical real-world task for which Large Language Models (LLMs) have shown great promise. However, their high inference costs and lack of interpretability limit their practical use. While Concept Bottleneck Models (CBMs) offer an efficient and interpretable alternative, their reliance on training surrogate models makes them incompatible with few-shot scenarios.
To bridge this gap, we introduce a novel CBM paradigm that relies solely on sample-concept similarity to make predictions.
We ensure the effectiveness of our concepts through a prototypical-discriminative dual-level architecture and a dynamic concept refinement mechanism. Extensive experiments show that with as few as 10 training samples, our method surpasses prior CBMs and even achieves performance comparable to LLMs. The code is available at https://anonymous.4open.science/r/StructCBM-EB1E.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3142. FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
๐ฌ 6
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ LVLMs Video Reasoning Reinforcement Learning
While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle to handle visually intensive video tasks.
To overcome these challenges, in this paper, we introduce
the concept of thinking with long videos and propose a novel framework, FrameThinker, within which LVLMs can iteratively interrogate video content.
Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g., frame selection) and designing reward functions that guide LVLMs to adopt the newly introduced actions.
To solve these challenges,
we propose a two-phase training strategy, first employing Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, followed by Reinforcement Learning (RL) to optimize a strategic decision-making policy.
Notably, in this RL phase, we conduct an in-depth and comprehensive exploration of the design of per-action rewards and the format reward.
Extensive experiments on reasoning benchmarks like Video-Holmes, LongVideo-Reason, and long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker gets a significant average improvement of +10.4\% over baselines while drastically reducing the number of processed frames.
Most notably, our 7B model, FrameThinker, establishes a new state-of-the-art on LongVideo-Reason, achieving 76.1\% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0\%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness. Our code is available at:
\url{https://anonymous.4open.science/r/FrameThinker-B5FD}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3143. Path Matters: Unveiling Geometric Implicit Bias via Curvature-Aware Sparse View Optimization
๐ฌ 18
โญ 5.20
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has recently emerged as a powerful approach for novel view synthesis by reconstructing scenes as sets of Gaussian ellipsoids. Despite its success in scenarios with dense input images, 3DGS faces critical challenges in sparse view settings, often resulting in geometric inaccuracies, inconsistencies across views, and degraded rendering quality. In this paper, we uncover and address two key implicit biases of the 3DGS reconstruction algorithm in sparse-view settings: (1) the model has a stronger demand for supervision signals in regions of high curvature, and (2) the model is sensitive to the smoothness of the trajectory of the input views. To tackle these issues, we propose a novel framework that optimizes camera trajectories to maximize curvature coverage while enforcing smooth motion, and we further enhance the informativeness of data through a synthetic view generation process. Extensive experiments on Mip-NeRF 360, DTU, Blender, Tanks & Temples, and LLFF datasets show that our method substantially outperforms state-of-the-art solutions in sparse-view scenarios, both in rendering quality and geometric fidelity. Beyond these empirical gains, our investigation uncovers the subtle ways in which data representation and trajectory planning interact to shape 3DGS performance, offering deeper theoretical insights into the algorithm's inherent biases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3144. GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing
๐ฌ 37
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Generic Object Tracking Model Editing Model Prediction Visual Geometry Null-Space
Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning.
In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance.
To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream.
Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to enable geometric cue inference from only a few 2D images.
To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios.
Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3145. Latent-Guided Reasoning: Empowering Small LLMs with Large-Model Thinking
๐ฌ 5
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Efficient Reasoning
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, but their high computational costs limit their widespread practical application. We argue that this inefficiency arises from the tight coupling of high-level cognitive planning (devising the solution strategy) and low-level linguistic realization (generating step-by-step text). To address this challenge, we propose a novel collaborative framework that decouples these two processes through Latent Guidance. Our approach implements a division of labor: a large model acts as an Implicit Thinker, performing high-level cognitive planning and compressing its solution strategy into a set of compact latent guidance vectors. A small, efficient model then serves as an Explicit Executor, which receives this latent guidance to generate a concise and effective reasoning chain. This process is enabled by a dual-loss training objective, grounded in information-theoretic principles, where a reconstruction loss explicitly compels the latent guidance to become a high-fidelity representation of the full reasoning chain. Extensive experiments on 8 diverse reasoning benchmarks demonstrate that our method substantially enhances the reasoning capabilities of small models across various scales (from 0.5B to 8B), allowing them to outperform strong baselines and exhibit superior generalization. Notably, our framework boosts small model accuracy by up to 13.9% and its speed by 2x over its standalone baseline, while being up to 4x faster than the large model. Our work introduces a new, theoretically-grounded paradigm for empowering small models with large-model thinking, substantially improving the performance-cost trade-off for complex reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3146. Scaling Goal-conditioned Reinforcement Learning with Multistep Quasimetric Distances
๐ฌ 23
โญ 5.20
๐ reinforcement learning
๐ท๏ธ Goal-conditioned reinforcement learning quasimetrics robotics
The problem of learning how to reach goals in an environment has been a long-standing challenge for AI researchers. Effective goal-conditioned reinforcement learning (GCRL) methods promise to enable reaching distant goals without task-specific rewards by stitching together past experiences of different complexity. Mathematically, there is a duality between the notion of optimal goal-reaching value functions (the likelihood of success at reaching a goal) and temporal distances (transit times between states). Recent works have exploited this property by learning quasimetric distance representations that stitch long-horizon behaviors using the inductive bias of their architecture. These methods have shown promise in simulated benchmarks, reducing value learning to a shortest-path problem. But quasimetric, and more generally, goal-conditioned RL methods still struggle in complex environments with stochasticity and high-dimensional (visual) observations. There is a fundamental tension between local dynamic programming (TD backups, temporal distances), which enables optimal shortest-path reasoning in theory, and statistical global MC updates (multistep returns, suboptimal in theory). We show how these approaches can be integrated into a practical GCRL method that fits a quasimetric distance using a multistep Monte-Carlo return. We show our method outperforms existing GCRL methods on long-horizon simulated tasks with up to 4000 steps, even with visual observations. We also demonstrate that our method can enable stitching in the real-world robotic manipulation domain (Bridge setup). Our approach is the first end-to-end GCRL method that enables multistep stitching in this real-world manipulation domain from an unlabeled offline dataset of visual observations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3147. Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
๐ฌ 17
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Vision Language Models Multi-Image Safety Dataset Safety Fine-Tuning
Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks requiring safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Specifically, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3148. CrossPL: Systematic Evaluation of Large Language Models for Cross Programming Language Interoperating Code Generation
๐ฌ 22
โญ 5.20
๐ datasets and benchmarks
๐ท๏ธ cross programming language interactions LLM based workflow benchmark code generation
Large language models (LLMs) have shown strong performance in single-language code generation, but how well they produce cross-programming-language (CPL) interoperating code, which is widely used in cross-platform and complex software systems, remains underexplored. Therefore, a benchmark for evaluating CPL interoperating code generation is essential. However, constructing such a benchmark is challenging owing to sparse interoperating code in real-world multi-programming-language projects, diverse Inter-process Communication (IPC) mechanisms, vast Foreign Function Interface (FFI) language pairs, and the difficulty of evaluation. To address this gap, we introduce CrossPL, the first benchmark for systematically assessing LLM performance on CPL code generation across two primary interoperation modes and 2,534 tasks, specifically 1,982 IPC tasks spanning six languages and 522 Python-C FFI tasks. Its construction involved a review of CPL documentation, 156 finite state machines, and analysis of 19,169 multi-language GitHub repositories. Two LLM-based workflows are designed to automate benchmark construction and evaluation, and we use them to assess 20 state-of-the-art LLMs. Results reveal clear limitations: the best model achieves only 19.5\% Pass@1 and 26.46\% Pass@5 on the FFI subset, in sharp contrast to the strong performance of these models on single-language benchmarks. These findings underscore the urgent need for improving LLMs regarding CPL interoperating code generation. The benchmark and code are available at https://anonymous.4open.science/r/crosspl-2814/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3149. Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis
๐ฌ 17
โญ 5.20
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ text-to-speech synthesis diffusion language model semi-discrete representations voice cloning
Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field towards multi-stage pipelines that rely on pre-trained discrete speech tokenizers, but these create a semantic-acoustic divide, limiting holistic and expressive speech generation. We resolve this dilemma through hierarchical semantic-acoustic modeling with semi-discrete residual representations. Our framework introduces a differentiable quantization bottleneck that induces natural specialization: a Text-Semantic Language Model (TSLM) generates semantic-prosodic plans, while a Residual Acoustic Model (RALM) recovers fine-grained acoustic details. This hierarchical semantic-acoustic representation guides a local diffusion-based decoder to generate high-fidelity speech latents.
Critically, the entire architecture is trained end-to-end under a simple diffusion objective, eliminating dependency on external discrete speech tokenizers. Trained on over 1 million hours of speech, our 0.5B-parameter model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating that our approach delivers expressive and stable synthesis. Audio samples are available at: https://voxcpm.github.io/VoxCPM-demopage/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3150. Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs
๐ฌ 20
โญ 5.20
๐ optimization
๐ท๏ธ Multi-task learning deep reinforcement learning vehicle routing problems
Multi-task Vehicle Routing Problems (VRPs) aim to minimize routing costs while satisfying diverse constraints. Existing solvers typically adopt a unified reinforcement learning (RL) framework to learn generalizable patterns across tasks. However, they often overlook the constraint and node dynamics during the decision process, making the model fail to accurately react to the current context. To address this limitation, we propose Chain-of-Context Learning (CCL), a novel framework that progressively captures the evolving context to guide fine-grained node adaptation. Specifically, CCL constructs step-wise contextual information via a Relevance-Guided Context Reformulation (RGCR) module, which adaptively prioritizes salient constraints. This context then guides node updates through a Trajectory-Shared Node Re-embedding (TSNR) module, which aggregates shared node features from all trajectories' contexts and uses them to update inputs for the next step. By modeling evolving preferences of the RL agent, CCL captures step-by-step dependencies in sequential decision-making. We evaluate CCL on 48 diverse VRP variants, including 16 in-distribution and 32 out-of-distribution (with unseen constraints) tasks. Experimental results show that CCL performs favorably against the state-of-the-art baselines, achieving the best performance on all in-distribution tasks and the majority of out-of-distribution tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3151. One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs
๐ฌ 20
โญ 5.20
๐ generative models
๐ท๏ธ Diffusion distillation; rectified flow; one-step diffusion
Recent advances in diffusion and flow-based generative models have demonstrated remarkable success in image restoration tasks, achieving superior perceptual quality compared to traditional deep learning approaches. However, these methods either require numerous sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on common model distillation, which usually imposes a fixed fidelity-realism trade-off and thus lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as a teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions from our one-step student model for the same input to lie on the same sampling ODE trajectory of the teacher model. This alignment ensures that the student model's single-step predictions from initial states match the teacher's predictions from a closer intermediate state. Through extensive experiments on datasets including FFHQ (256$\times$256), DIV2K, and ImageNet (256$\times$256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off.
Code and pre-trained models will be publicly available.
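One way to read the distillation constraint is sketched below. This is illustrative only: `teacher_ode_solve`, the intermediate time `t_mid`, and the MSE matching are assumptions standing in for the paper's actual training recipe.

```python
import torch
import torch.nn.functional as F

def one_step_trajectory_distill_loss(student, teacher_ode_solve, x_lr, z0, t_mid):
    """The student maps an initial state z0 (conditioned on the low-resolution
    image x_lr) to a high-resolution prediction in one step; the teacher
    integrates its flow ODE from an intermediate state on the same trajectory.
    Matching the two outputs encourages the student's one-step prediction to
    lie on the teacher's sampling trajectory."""
    with torch.no_grad():
        z_mid = teacher_ode_solve(z0, x_lr, t_start=0.0, t_end=t_mid)   # closer intermediate state
        target = teacher_ode_solve(z_mid, x_lr, t_start=t_mid, t_end=1.0)
    pred = student(z0, x_lr)
    return F.mse_loss(pred, target)
```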
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3152. On learning linear dynamical systems in context with attention layers
๐ฌ 5
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ in-context learning; linear attention; linear dynamical systems; kalman filter; time series
This paper studies the expressive power of linear attention layers for in-context learning (ICL) of linear dynamical systems (LDS). We consider training on sequences of inexact observations produced by noise-corrupted LDSs, with all perturbations being Gaussian; importantly, we study the non-i.i.d. setting as it is closer to real-world scenarios. We provide the optimal weight construction for a single linear-attention layer and show its equivalence to one step of Gradient Descent relative to an autoregression objective of window size one. Guided by experiments, we uncover a relation to the Preconditioned Conjugate Gradient method for larger window sizes. We back our findings with numerical evidence. These results add to the existing understanding of transformers' expressivity as in-context learners, and offer plausible hypotheses for experimental observations whereby they compete with Kalman filters --- the optimal model-dependent learners for this setting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3153. Missingness Bias Calibration in Feature Attribution Explanations
๐ฌ 12
โญ 5.20
๐ interpretability and explainable AI
๐ท๏ธ explainability feature attribution calibration missingness bias medical medicine LLMs Machine Learning
Popular explanation methods often produce unreliable feature importance scores due to "missingness bias", a systematic distortion that arises when models are probed with ablated, out-of-distribution inputs.
Existing solutions treat this as a deep representational flaw that requires expensive retraining or architectural modifications.
In this work, we challenge this assumption and show that missingness bias can be effectively treated as a superficial artifact of the model's output space.
We introduce MCal, a lightweight post-hoc method that corrects this bias by fine-tuning a simple linear head on the outputs of a frozen base model.
Surprisingly, we find this simple correction consistently reduces missingness bias and is competitive with, or even outperforms, prior heavyweight approaches across diverse medical benchmarks spanning vision, language, and tabular domains.
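Because the method is described as fine-tuning a simple linear head on top of a frozen base model, a minimal sketch is easy to give; the training signal hinted at in the comment (consistency between original and ablated inputs) is an assumption, since the abstract does not state the objective.

```python
import torch
import torch.nn as nn

class CalibrationHead(nn.Module):
    """Illustrative post-hoc head in the spirit of the approach above: a single
    linear layer applied to the frozen base model's logits, with only the head
    trained."""
    def __init__(self, num_classes):
        super().__init__()
        self.linear = nn.Linear(num_classes, num_classes)

    def forward(self, base_logits):
        return self.linear(base_logits)

# Training sketch (assumption): freeze the base model and optimize only the head
# so that predictions on ablated (feature-masked) inputs remain consistent with
# predictions on the original inputs, reducing missingness bias without retraining
# the backbone.
```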
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3154. Spatially Guided Training for Vision-Language-Action Model
๐ฌ 21
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Spatial Priors Robot Manipulation Instruction Following
Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce SP-VLA, a dual-system **V**ision-**L**anguage-**A**ction framework that leverages **S**patial **P**riors as a bridge between linguistic instructions and embodiment-specific control.
SP-VLA aligns action learning with spatial priors through two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting.
This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, SP-VLA achieves substantial improvements over vanilla VLA, with performance increasing from $66.1{\rightarrow}84.6$ on Google Robot and from $54.7{\rightarrow}73.2$ on WidowX, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. We will release code, data, and model checkpoints to support future research.
See more visualization results at the anonymous page: https://sp-vla-anonymous.vercel.app
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3155. Sparse Attention Adaptation for Long Reasoning
๐ฌ 11
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Sparse Attention Reasoning
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models.
Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding.
With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters.
We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128).
Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90\% sparsity.
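A budgeted block-selection step of the kind described above might look like the sketch below; the gate itself and its self-distillation training are not reproduced, and the function signature is an assumption.

```python
import torch

def select_kv_blocks(gate_scores, token_budget, block_size):
    """gate_scores: (num_blocks,) learned relevance scores for each key-value
    block at the current decoding step. Only the highest-scoring blocks within
    the token budget are attended to; all other blocks are skipped."""
    num_selected = max(1, token_budget // block_size)
    k = min(num_selected, gate_scores.shape[-1])
    return torch.topk(gate_scores, k=k, dim=-1).indices
```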
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3156. PSDNorm: Temporal Normalization for Deep Learning in Sleep Staging
๐ฌ 17
โญ 5.20
๐ applications to neuroscience & cognitive science
๐ท๏ธ Normalization Layer Sleep Staging Optimal Transport
Distribution shift poses a significant challenge in machine learning, particularly in
biomedical applications using data collected across different subjects, institutions, and recording devices, such as sleep data.
While existing normalization layers (BatchNorm, LayerNorm, and InstanceNorm) help mitigate distribution shifts, when applied over the time dimension they ignore the dependencies and auto-correlation inherent in the vector coefficients they normalize.
In this paper, we propose PSDNorm, which leverages Monge mapping and temporal context to normalize feature maps in deep learning models for signals.
Evaluations with architectures based on U-Net or transformer backbones, trained on 10K subjects across 10 datasets, show that PSDNorm achieves state-of-the-art performance on unseen left-out datasets while being 4 times more data-efficient than BatchNorm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3157. LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
๐ฌ 18
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Long-Context Inference Sparse Attention Hybrid-Head Attention
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-$k$ selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times even surpassing, the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference. The implementation code, kernels, and models will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3158. Addressing divergent representations from causal interventions on neural networks
๐ฌ 22
โญ 5.20
๐ interpretability and explainable AI
๐ท๏ธ activation patching mech interp DAS representational divergence faithfulness
A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025) allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3159. EmoPrefer: Can Large Language Models Understand Human Emotion Preferences?
๐ฌ 19
โญ 5.20
๐ datasets and benchmarks
๐ท๏ธ multimodal emotion recognition descriptive emotions EmoPrefer EmoPrefer-Data EmoPrefer-Bench
Descriptive Multimodal Emotion Recognition (DMER) has garnered increasing research attention. Unlike traditional discriminative paradigms that rely on predefined emotion taxonomies, DMER aims to describe human emotional state using free-form natural language, enabling finer-grained and more interpretable emotion representations. However, this free-form prediction paradigm introduces new challenges regarding its evaluation. Previous works depend on ground-truth descriptions, but emotions are inherently tied to diverse human behaviors, and generating a comprehensive and accurate description is inherently demanding. Other researchers reformulate this problem into a more tractable human preference learning task, but pairwise preference annotation involves substantial manual effort. This leads to a question: *can we leverage multimodal LLMs (MLLMs) to achieve more cost-efficient preference annotation?* To answer this, we propose **EmoPrefer**, a pioneering work exploring the potential of LLMs in decoding human emotion preferences. Specifically, we construct the first emotion preference dataset, **EmoPrefer-Data**, featuring high-quality preference annotations from experts. Additionally, we introduce **EmoPrefer-Bench**, which evaluates the performance of various MLLMs and prompting techniques in preference prediction, while also revealing new strategies to enhance their performance. To the best of our knowledge, this is the first work exploring the capabilities of LLMs in understanding human emotion preferences. Our work advances the field of DMER and lays the foundation for more intelligent human-computer interaction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3160. ViPRA: Video Prediction for Robot Actions
๐ฌ 27
โญ 5.20
๐ applications to robotics, autonomy, planning
๐ท๏ธ vision-language-action models robotics video prediction imitation learning
Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present *Video Prediction for Robot Actions* (**ViPRA**), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict *both future visual observations and motion-centric latent actions*, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked *flow-matching decoder* that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We will release models and code [here](https://vipra-robot.github.io/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3161. Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts
๐ฌ 18
โญ 5.20
๐ learning theory
๐ท๏ธ learning to defer selective prediction routing machine learning
Existing _Learning-to-Defer_ (L2D) frameworks are limited to _single-expert deferral_, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for _Top-$k$ Learning-to-Defer_, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the _one-stage_ and _two-stage_ regimes, _selective prediction_, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose _Top-$k(x)$ Learning-to-Defer_, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.
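To make the allocation rule concrete, here is a minimal sketch of a Top-$k$ deferral step under a toy cost model; the scores, costs, and the `topk_defer` helper are illustrative assumptions, not the paper's learned surrogate policy.

```python
import numpy as np

def topk_defer(correctness_prob, costs, k):
    """Toy Top-k deferral: route the query to the k most cost-effective entities.

    correctness_prob: estimated probability each entity answers correctly
                      (entity 0 can be taken to be the model itself).
    costs: consultation cost of each entity.
    """
    utility = correctness_prob - costs          # cost-adjusted usefulness
    return np.argsort(-utility)[:k]             # indices of the k entities to consult

# Example with a model plus three experts, consulting the two best entities.
probs = np.array([0.70, 0.90, 0.85, 0.60])
costs = np.array([0.00, 0.30, 0.10, 0.05])
print(topk_defer(probs, costs, k=2))            # -> [2 0]
```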
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3162. JULI: Jailbreak Large Language Models by Self-Introspection
๐ฌ 19
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language models jailbreaking
Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top-$5$ token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3163. Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
๐ฌ 23
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-Modal Reasoning Reinforcement Learning from Verifiable Rewards
Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, which yields inefficient exploration. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Meanwhile, an asymmetric activation function (LeakyReLU) leverages expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on-policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model's policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid-policy RLVR training. Code and model weights will be released to ensure full reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3164. ECHO: Toward Contextual Seq2Seq Paradigms in Large EEG Models
๐ฌ 31
โญ 5.20
๐ applications to neuroscience & cognitive science
๐ท๏ธ Electroencephalography In-context Learning Large EEG Model
Electroencephalography (EEG), with its broad range of applications, necessitates models that can generalize effectively across various tasks and datasets. Large EEG Models (LEMs) address this by pretraining encoder-centric architectures on large-scale unlabeled data to extract universal representations. While effective, these models lack decoders of comparable capacity, limiting the full utilization of the learned features.
To address this issue, we introduce ECHO, a novel decoder-centric LEM paradigm that reformulates EEG modeling as sequence-to-sequence learning. ECHO captures layered relationships among signals, labels, and tasks within sequence space, while incorporating discrete support samples to construct contextual cues. This design equips ECHO with in-context learning, enabling dynamic adaptation to heterogeneous tasks without parameter updates.
Extensive experiments across multiple datasets demonstrate that, even with basic model components, ECHO consistently outperforms state-of-the-art single-task LEMs in multi-task settings, showing superior generalization and adaptability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3165. Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
๐ฌ 25
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ Memory-efficient Training Zeroth-order Optimization Quantization
As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize memory usage on weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible due to the precision gap between discrete weights and continuous gradients, which would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a simple yet effective approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in 16 bits, QZO can reduce the total memory cost by more than 18$\times$ for 4-bit LLMs, and enables fine-tuning Llama-2-13B within a single 24GB GPU. Code will be released publicly.
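A minimal sketch of the core idea, i.e. estimating gradients with respect to the continuous quantization scales from forward passes alone; `loss_fn`, the clip range, and the single-sample estimator are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def zo_scale_grad(loss_fn, scale, eps=1e-3, clip=1.0):
    """Two-point zeroth-order estimate of d(loss)/d(scale) using only forward passes.

    loss_fn: evaluates the quantized model's loss for a given set of continuous scales.
    scale:   per-channel quantization scales being fine-tuned (weights stay quantized).
    """
    u = np.random.randn(*scale.shape)                              # random perturbation direction
    dd = (loss_fn(scale + eps * u) - loss_fn(scale - eps * u)) / (2 * eps)
    dd = np.clip(dd, -clip, clip)                                  # directional-derivative clipping for stability
    return dd * u                                                  # gradient estimate w.r.t. the scales

# Hypothetical update loop: only the scales move, so no gradients or optimizer
# states for the quantized weights are ever materialized.
# scale -= lr * zo_scale_grad(loss_fn, scale)
```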
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3166. Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization
๐ฌ 23
โญ 5.20
๐ applications to neuroscience & cognitive science
๐ท๏ธ Mixture of Experts Functional Specialization Brain-Inspired AI Interpretability Behavioral Alignment
Human cognitive behavior arises from the interaction of specialized brain networks dedicated to distinct functions, such as language, logic, and social reasoning. Inspired by this organization, we propose Mixture of Cognitive Reasoners (MiCRo): a modular, transformer-based architecture post-trained with a curriculum that induces functional specialization across experts. Concretely, we partition the layers of a pretrained language model into four expert modules aligned with well-studied cognitive networks in the human brain. MiCRo offers three key advantages over standard language models. (1) The specialized experts are interpretable and causally meaningful---ablating a module causes substantial drops on benchmarks requiring its specialized domain. (2) MiCRo's behavior can be dynamically steered at inference time by routing tokens to particular experts (e.g., favoring social over logical reasoning), enabling fine-grained control over outputs. (3) MiCRo outperforms or matches comparable baselines on both machine-learning reasoning benchmarks (e.g., GSM8K, BBH) and alignment to human behavior (CogBench), while maintaining interpretability. Taken together, cognitively grounded functional specialization yields models that are both more human-like and more human-interpretable.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3167. CORDS - Continuous Representations of Discrete Structures
๐ฌ 13
โญ 5.20
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Continuous set representations Neural fields Variable-cardinality prediction Invertible encoding/decoding Diffusion and flow matching Object detection Molecular generation Simulation-based inference
Many learning problems require predicting sets of objects when the number of objects is not known beforehand. Examples include object detection, molecular modeling, and scientific inference tasks such as astrophysical source detection. Existing methods often rely on padded representations or must explicitly infer the set size, which often poses challenges. We present a novel strategy for addressing this challenge by casting prediction of variable-sized sets as a continuous inference problem. Our approach, CORDS (Continuous Representations of Discrete Structures), provides an invertible mapping that transforms a set of spatial objects into continuous fields: a density field that encodes object locations and count, and a feature field that carries their attributes over the same support. Because the mapping is invertible, models operate entirely in field space while remaining exactly decodable to discrete sets. We evaluate CORDS across molecular generation and regression, object detection, simulation-based inference, and a mathematical task involving recovery of local maxima, demonstrating robust handling of unknown set sizes with competitive accuracy.
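A toy one-dimensional illustration of the density-field idea: objects become a sum of narrow Gaussian bumps whose integral equals the set size, so cardinality never has to be predicted separately. The kernel width and grid are arbitrary choices for this sketch, not the paper's construction.

```python
import numpy as np

def encode_set_1d(points, grid, sigma=0.05):
    """Encode a 1D point set as a density field (sum of unit-mass Gaussian bumps)."""
    d2 = (grid[:, None] - points[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)).sum(axis=1) / (np.sqrt(2 * np.pi) * sigma)

grid = np.linspace(0.0, 1.0, 1001)
field = encode_set_1d(np.array([0.2, 0.55, 0.8]), grid)
print(round(float(np.trapz(field, grid)), 2))   # ~3.0: cardinality is read off the field's integral
```

Object locations would likewise be recovered as local maxima of the field, which is what makes the mapping invertible.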
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3168. Beyond Text-Only: Towards Multimodal Table Retrieval in Open-World
๐ฌ 15
โญ 5.20
๐ datasets and benchmarks
๐ท๏ธ Table Retrieval MultiModal Retrieval
Open-domain table retrieval aims to retrieve semantically relevant structured tables from a large-scale corpus in response to natural language queries. Unlike unstructured text, tables store information not only through their textual or numerical content but also through their structural properties, including hierarchical relationships between headers and cells, as well as complex spatial arrangements within the table layout. Existing methods predominantly treat table retrieval as a variant of text retrieval. They struggle to accurately preserve the rich structural semantics of diverse table formats during text serialization. Existing methods typically flatten tables into linear text sequences through row-wise or column-wise serialization, inadvertently discarding structural information. The problem becomes particularly acute when processing complex table layouts containing merged cells or irregular alignments, ultimately compromising retrieval performance. Moreover, existing methods struggle to handle embedded images within table cells. Notably, visual representations inherently preserve both structural and content information while being format-agnostic. This insight motivates our exploration of image-based table retrieval, as it can naturally overcome the challenges faced by existing methods. In this paper, we introduce TaR-ViR (Table Retrieval via Visual Representations), a new benchmark that reformulates table retrieval as a multimodal task by treating tables as images. Experiments on TaR-ViR show that this paradigm shift achieved more effective and efficient retrieval performance. Crucially, it eliminates the need for error-prone text conversion, enabling scalable collection and utilization of open-world tables. Our data are available at \url{https://anonymous.4open.science/r/Anonymization-B115}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3169. Predictive CVaR Q-learning
๐ฌ 10
โญ 5.20
๐ reinforcement learning
๐ท๏ธ CVaR optimization Risk-sensitive RL Q-learning Bellman equation Policy improvement
We propose a sample-efficient Q-learning algorithm for reinforcement learning with the Conditional Value-at-Risk (CVaR) objective. Our algorithm is built upon predictive tail value function, a novel formulation of risk-sensitive action value, that admits a recursive structure as in the conventional risk-neutral Bellman equation. This structure enables the Q-learning algorithm to utilize the entire set of sample trajectories rather than relying only on worst-case outcomes, enhancing the sample efficiency. We further derive a Bellman optimality equation and a policy improvement theorem, which provide theoretical foundations of our algorithm and remedy inconsistencies that have existed in the literature. Empirical results demonstrate that our method consistently improves CVaR performance while maintaining stable and interpretable learning dynamics.
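For reference, the objective itself is simple to state: CVaR at level $\alpha$ is the mean of the worst $\alpha$-fraction of outcomes. The snippet below computes the empirical version; it illustrates the optimization target only, not the paper's predictive tail value function or its Bellman recursion.

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Empirical CVaR_alpha: the average of the worst alpha-fraction of returns."""
    returns = np.sort(np.asarray(returns))            # ascending, so the left tail is the worst
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

rng = np.random.default_rng(0)
episode_returns = rng.normal(loc=1.0, scale=2.0, size=10_000)
print(cvar(episode_returns, alpha=0.1))               # well below the mean return of 1.0
```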
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3170. Strong Correlations Induce Cause Only Predictions in Transformer Training
๐ฌ 29
โญ 5.20
๐ optimization
๐ท๏ธ Implicit bias Transformers Optimization Causal robustness prediction
We revisit when Transformers can prioritize causes over spurious effects by viewing the problem through the lens of data correlation strength and the implicit regularization of gradient descent. We identify a phenomenon called Correlation Crowding-Out (CCO) arising from the training dynamics of Transformers. Specifically, under strongly correlated causal features, gradient descent filters out spurious cues and converges to a predictor that relies almost exclusively on the causes. Theoretically, using a simplified Transformer model trained on data from a minimal causal chain, we introduce a Dominant-coordinate condition that characterizes when CCO arises and explain its mechanism as a coupling of ''occupation'' and ''crowding-out''. ''Occupation'' denotes the rapid growth of weights aligned with the dominant causal direction while non-dominant directions remain small. ''Crowding-out'' denotes the alignment of attention logits with separation directions favoring the causal branch, suppressing descendants. We provide convergence guarantees for both the optimization trajectory and generalization. Our empirical results on simulated and real examples across various tasks, including vision and natural language, demonstrate this phenomenon. Together, these results show that, under suitable conditions, standard training alone can induce cause-only prediction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3171. CubeBench: Diagnosing Interactive, Long-Horizon Physical Intelligence under Partial Observations
๐ฌ 18
โญ 5.20
๐ datasets and benchmarks
๐ท๏ธ Agent Benchmark Spatial Reasoning Long Horizon Tool Calling
Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce \textbf{CubeBench}, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00\% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3172. Constitutional Classifiers++: Production-Grade Defenses against Universal Jailbreaks
๐ฌ 19
โญ 5.20
๐ foundation or frontier models, including LLMs
๐ท๏ธ robustness safeguards
We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. We first identify vulnerabilities in existing systems that evaluate model outputs without regard to the conversational context, and address these vulnerabilities using full exchange classifiers. Building on this, we implement a classifier cascade where lightweight classifiers screen all traffic, escalating only suspicious exchanges to more expensive classifiers. Combining this approach with other optimizations, we develop a new production-grade jailbreak defense system that achieves a 5.4$\times$ computational cost reduction compared to our baseline exchange classifier, while also achieving a 0.036% refusal rate on production traffic. Through extensive red-teaming comprising over 560K queries, we demonstrate protection against universal jailbreaks---no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Finally, we explore efficient classification techniques by training linear activation probes. We show that using logit smoothing and a weighted loss function is crucial for performance, and further that probes can be combined with external classifiers to provide particularly strong performance. Our work establishes Constitutional Classifiers as practical safeguards for large language models.
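The cascade logic itself is straightforward to picture; below is a minimal sketch in which a cheap scorer screens every exchange and only suspicious ones reach the expensive scorer. The scorers and thresholds are placeholders, not the production system's classifiers. The cost saving comes from how rarely the second branch runs.

```python
def cascade_decision(exchange, cheap_score, expensive_score,
                     escalate_at=0.2, block_at=0.5):
    """Screen all traffic with a lightweight classifier; escalate only suspicious exchanges."""
    if cheap_score(exchange) < escalate_at:        # the vast majority of traffic stops here
        return "allow"
    return "block" if expensive_score(exchange) >= block_at else "allow"
```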
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3173. Latent Visual Reasoning
๐ฌ 19
โญ 5.20
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal large language models vision Language Models multimodal reasoning
Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67\% on MMVP compared to 66.67\% with Qwen2.5-VL. Code base and model weights will be released later.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3174. Conformalized Survival Counterfactuals Prediction for General Right-Censored Data
๐ฌ 14
โญ 5.20
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Conformalized survival analysis Counterfactual inference General right-censored data
This paper aims to develop a lower prediction bound (LPB) for survival time across different treatments in the general right-censored setting. Although previous methods have utilized conformal prediction to construct the LPB, their resulting prediction sets provide only probably approximately correct (PAC)-type miscoverage guarantees rather than exact ones. To address this problem, we propose a new calibration procedure under the potential outcome framework. Under the strong ignorability assumption, we propose a reweighting scheme that can transform the problem into a weighted conformal inference problem, allowing an LPB to be obtained via quantile regression with an exact miscoverage guarantee. Furthermore, our procedure is doubly robust against model misspecification. Empirical evaluations on synthetic and real-world clinical data demonstrate the validity and informativeness of our constructed LPBs, which indicate the potential of our analytical benchmark for comparing and selecting personalized treatments.
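As a point of reference, the basic split-conformal calibration step behind any lower prediction bound looks like the sketch below; it assumes uncensored, exchangeable calibration data, whereas the paper's contribution is precisely the reweighting that restores an exact guarantee under right-censoring and the accompanying double robustness.

```python
import numpy as np

def split_conformal_lpb(q_cal, t_cal, q_test, alpha=0.1):
    """Unweighted split-conformal lower prediction bound (uncensored sketch).

    q_cal, q_test: model-predicted lower quantiles of survival time.
    t_cal: observed survival times on the calibration set.
    """
    scores = q_cal - t_cal                                    # positive when the predicted bound was too high
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)      # finite-sample correction
    c = np.quantile(scores, level, method="higher")
    return q_test - c                                         # lower bounds with >= 1 - alpha coverage
```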
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3175. SocialJax: An Evaluation Suite for Multi-agent Reinforcement Learning in Sequential Social Dilemmas
๐ฌ 20
โญ 5.20
๐ datasets and benchmarks
๐ท๏ธ MARL Sequential Social Dilemmas
Sequential social dilemmas pose a significant challenge in the field of multi-agent reinforcement learning (MARL), requiring environments that accurately reflect the tension between individual and collective interests.
Previous benchmarks and environments, such as Melting Pot, provide an evaluation protocol that measures generalization to new social partners in various test scenarios. However, running reinforcement learning algorithms in traditional environments requires substantial computational resources.
In this paper, we introduce SocialJax, a suite of sequential social dilemma environments and algorithms implemented in JAX. JAX is a high-performance numerical computing library for Python that enables significant improvements in operational efficiency. Our experiments demonstrate that the SocialJax training pipeline achieves at least 50\texttimes{} speed-up in real-time performance compared to Melting Pot's RLlib baselines. Additionally, we validate the effectiveness of baseline algorithms within SocialJax environments. Finally, we use Schelling diagrams to verify the social dilemma properties of these environments, ensuring that they accurately capture the dynamics of social dilemmas.
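The reported speed-ups come from the usual JAX recipe of jit-compiling the whole rollout and scanning over timesteps on device. The toy dynamics below only illustrate that pattern; they are not a SocialJax environment.

```python
import jax
import jax.numpy as jnp

def step(positions, key):
    """Placeholder environment step: agents drift toward a shared resource with noise."""
    noise = jax.random.normal(key, positions.shape) * 0.1
    new_positions = positions - 0.05 * positions + noise
    rewards = -jnp.abs(new_positions).sum(axis=-1)       # one payoff per agent
    return new_positions, rewards

@jax.jit
def rollout(key, n_agents=4, horizon=1000):
    keys = jax.random.split(key, horizon)
    init = jnp.ones((n_agents, 2))
    _, rewards = jax.lax.scan(step, init, keys)          # the whole episode runs as one compiled loop
    return rewards

print(rollout(jax.random.PRNGKey(0)).shape)              # (1000, 4)
```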
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3176. Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization
๐ฌ 17
โญ 5.20
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Sliced Wasserstein Distance Wasserstein Distance Bayesian Optimization Bayesian Quadrature Quasi-Monte Carlo
The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, novel approach: learning directions with Bayesian optimization (BO), particularly in settings where SW appears inside an optimization loop (e.g., gradient flows). We introduce a family of drop-in selectors for projection directions: **BOSW**, a one-shot BO scheme on the unit sphere; **RBOSW**, a periodic-refresh variant; **ABOSW**, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and **ARBOSW**, a restarted hybrid that periodically relearns directions during optimization. Our BO approaches can be composed with QSW and its variants (demonstrated by ABOSW/ARBOSW) and require no changes to downstream losses or gradients. We provide numerical experiments where our methods achieve state-of-the-art performance, and on the experimental suite of the original QSW paper, we find that ABOSW and ARBOSW can achieve convergence comparable to the best QSW variants with modest runtime overhead. We release code with fixed seeds and configurations to support faithful replication (see supplementary material).
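For orientation, the quantity being optimized over directions is the standard sliced Wasserstein estimate below; the BO selectors (BOSW, RBOSW, ABOSW, ARBOSW) differ only in how the `directions` argument is chosen, and the random directions here merely stand in for that choice.

```python
import numpy as np

def sliced_wasserstein(X, Y, directions):
    """SW_2 estimate between equal-size samples X, Y for a fixed set of unit directions."""
    proj_x = np.sort(X @ directions.T, axis=0)            # 1D projections, sorted per direction
    proj_y = np.sort(Y @ directions.T, axis=0)
    w2_sq = np.mean((proj_x - proj_y) ** 2, axis=0)       # squared 1D Wasserstein-2 per direction
    return np.sqrt(w2_sq.mean())

rng = np.random.default_rng(0)
d, L = 16, 64
dirs = rng.normal(size=(L, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
X = rng.normal(size=(500, d))
Y = rng.normal(size=(500, d)) + 0.5
print(sliced_wasserstein(X, Y, dirs))
```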
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3177. Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
๐ฌ 20
โญ 5.20
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision Language Model Reasoning Data Synthesis Game Playing Visual Question Answering Data Sets or Data Repositories Benchmarks
Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g., geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find that video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable rewards in video games, we propose Game-RL, which constructs diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, thus obtaining the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, training solely on GameQA helps VLMs achieve better out-of-domain generalization, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, this suggests that RL can lead to generalizable improvements in VLMs' reasoning abilities, and notably, that video games may serve as valuable scenarios and resources for eliciting this generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3178. Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
๐ฌ 23
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ visual autoregressive model concept erasure
The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework **VARE** that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce **S-VARE**, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left by earlier methods in autoregressive text-to-image generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3179. REMem: Reasoning with Episodic Memory in Language Agent
๐ฌ 24
โญ 5.20
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ language agent episodic memory long-term memory
Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current agents are not yet capable of effectively recollecting and reasoning over interaction histories. We identify and formalize the core challenges of episodic recollection and reasoning from this gap, and observe that existing work often overlooks episodicity, lacks explicit event modeling, or overemphasizes simple retrieval rather than complex reasoning. We present REMem, a two-phase framework for constructing and reasoning with episodic memory: 1) Offline indexing, where REMem converts experiences into a hybrid memory graph that flexibly links time-aware gists and facts. 2) Online inference, where REMem employs an agentic retriever with carefully curated tools for iterative retrieval over the memory graph. Comprehensive evaluation across four episodic memory benchmarks shows that REMem substantially outperforms state-of-the-art memory systems such as Mem0 and HippoRAG 2, showing 3.4% and 13.4% absolute improvements on episodic recollection and reasoning tasks, respectively. Moreover, REMem also demonstrates more robust refusal behavior for unanswerable questions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3180. lmgame-Bench: How Good are LLMs at Playing Games?
๐ฌ 24
โญ 5.20
๐ datasets and benchmarks
๐ท๏ธ LLM VLM Agents Benchmark Games
Playing video games requires perception, reasoning, memory, and long-horizon planning: exactly the faculties expected of modern large language and vision-language models (LLMs/VLMs). We introduce LMGame-Bench, a benchmark built on six popular games spanning platformer, puzzle, and narrative games through a unified Gym-style API. Unlike prior game benchmarks that entangle multiple skills, LMGame-Bench employs a modular harness (including perception, memory, and reasoning modules) that can be toggled to selectively probe distinct capabilities. The benchmark further improves robustness through prompt standardization and contamination mitigation. Evaluation of 13 state-of-the-art models demonstrates that LMGame-Bench remains challenging yet effectively discriminates among models. Correlation analysis reveals that individual games align with core LLM capabilities, providing a quantitative framework for interpreting performance. Finally, LMGame-Bench exposes models' limitations in visual state extraction, reflection, spatiotemporal reasoning, and long-context reasoning, pointing to concrete directions for model improvement.
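A Gym-style interface reduces every game to the same reset/step loop that an LLM agent (or its harness) interacts with through text. The placeholder game below only illustrates that contract; the benchmark's six games and harness modules are of course far richer.

```python
class ToyGridGame:
    """Minimal Gym-style game: reach the goal cell within a step budget."""

    def reset(self):
        self.pos, self.goal, self.t = 0, 5, 0
        return self._obs()

    def step(self, action):
        # action: -1 (move left) or +1 (move right), as parsed from the agent's reply
        self.pos += action
        self.t += 1
        done = self.pos == self.goal or self.t >= 20
        reward = 1.0 if self.pos == self.goal else 0.0
        return self._obs(), reward, done, {}

    def _obs(self):
        return f"You are at cell {self.pos}; the goal is at cell {self.goal}."
```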
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3181. Concept-Aware Privacy Mechanisms for Defending Embedding Inversion Attacks
๐ฌ 16
โญ 5.20
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Text Embedding Privacy Defense Inversion Attack
Text embeddings enable numerous NLP applications but face severe privacy risks from embedding inversion attacks, which can expose sensitive attributes or reconstruct raw text. Existing differential privacy defenses assume uniform sensitivity across embedding dimensions, leading to excessive noise and degraded utility. We propose SPARSE, a user-centric framework for concept-specific privacy protection in text embeddings. SPARSE combines (1) differentiable mask learning to identify privacy-sensitive dimensions for user-defined concepts, and (2) the Mahalanobis mechanism that applies elliptical noise calibrated by dimension sensitivity. Unlike traditional spherical noise injection, SPARSE selectively perturbs privacy-sensitive dimensions while preserving non-sensitive semantics. Evaluated across six datasets with three embedding models and attack scenarios, SPARSE consistently reduces privacy leakage while achieving superior downstream performance compared to state-of-the-art DP methods.
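A simplified sketch of the anisotropic-noise idea: dimensions the learned mask marks as privacy-sensitive receive proportionally more noise, while the rest of the embedding is left nearly untouched. The Gaussian stand-in and the `sensitivity / epsilon` scaling are illustrative; the paper's Mahalanobis mechanism has its own calibration.

```python
import numpy as np

def elliptical_perturb(embedding, sensitivity, epsilon=1.0, rng=None):
    """Add per-dimension Gaussian noise scaled by the learned sensitivity weights.

    sensitivity: nonnegative weights from the concept mask (0 means the dimension is untouched).
    epsilon:     privacy budget; smaller values inject more noise.
    """
    rng = rng or np.random.default_rng()
    sigma = sensitivity / epsilon
    return embedding + rng.normal(size=embedding.shape) * sigma
```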
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3182. MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
๐ฌ 18
โญ 5.20
๐ generative models
๐ท๏ธ LLM training FP8 tensor scaling
Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while achieving up to 34\% higher training throughput. Our code is available at: https://anonymous.4open.science/r/MOSS-1CAF.
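The two-level microscaling idea can be pictured with a small numeric toy: one high-precision scale per tensor plus compact power-of-two scales per group, so dequantization on the inner dimension stays cheap. The int8 element format below stands in for FP8 (which NumPy lacks), and the group size is arbitrary; this is a sketch, not the MOSS kernels.

```python
import numpy as np

def two_level_quantize(x, group=32):
    """Toy two-level scaling: a global fp32 scale plus power-of-two per-group scales."""
    x = x.reshape(-1, group)                                   # assumes size divisible by `group`
    g_scale = np.abs(x).max() + 1e-12                          # high-precision global scale
    xn = x / g_scale
    local_amax = np.abs(xn).max(axis=1, keepdims=True) + 1e-12
    l_scale = 2.0 ** np.ceil(np.log2(local_amax))              # compact power-of-two local scales
    q = np.clip(np.round(xn / l_scale * 127), -127, 127)       # int8 stands in for the FP8 element format
    dequant = q / 127 * l_scale * g_scale
    return q.astype(np.int8), dequant

x = np.random.randn(4, 64).astype(np.float32)
_, x_hat = two_level_quantize(x)
print(float(np.abs(x.reshape(-1, 32) - x_hat).max()))          # small reconstruction error
```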
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3183. WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
๐ฌ 28
โญ 5.14
๐ foundation or frontier models, including LLMs
๐ท๏ธ Sparse Activation Efficient Inference
The ever-increasing computational demands of large language models (LLMs) make efficient inference a central challenge. While recent advances leverage specialized architectures or selective activation, they typically require (re)training or architectural modifications, limiting their broad applicability. Training-free sparse activation, in contrast, offers a plug-and-play pathway to efficiency; however, existing methods often rely solely on hidden state magnitudes, leading to significant approximation error and performance degradation. To address this, we introduce WINA (Weight-Informed Neuron Activation): a simple framework for training-free sparse activation that incorporates both hidden state magnitudes and weight matrix structure. By also leveraging the $\ell_2$-norm of the model's weight matrices, WINA yields a principled sparsification strategy with provably optimal approximation error bounds, offering better and tighter theoretical guarantees than prior state-of-the-art approaches. Overall, WINA also empirically outperforms many previous training-free methods across diverse LLM architectures and datasets: not only matching or exceeding their accuracy at comparable sparsity levels, but also sustaining performance better at more extreme sparsity levels. Together, these results position WINA as a practical, theoretically grounded, and broadly deployable solution for efficient inference. Our source code is anonymously available at https://anonymous.4open.science/r/wina-F704/README.md.
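The abstract's criterion suggests a very small inference-time rule: score each neuron by its hidden-state magnitude times the $\ell_2$-norm of the weight column it feeds, then keep only the top fraction. The sketch below assumes an (out, in)-shaped weight matrix and a fixed keep ratio; the exact criterion and thresholds in WINA may differ.

```python
import numpy as np

def wina_sparse_activation(hidden, W_next, keep_ratio=0.3):
    """Keep the neurons with the largest |hidden| * ||column of downstream weight||_2."""
    col_norms = np.linalg.norm(W_next, axis=0)         # per-input-neuron weight magnitude, W_next: (out, in)
    scores = np.abs(hidden) * col_norms                # weight-informed activation score
    k = max(1, int(keep_ratio * hidden.size))
    mask = np.zeros_like(hidden)
    mask[np.argsort(-scores)[:k]] = 1.0
    return hidden * mask                               # sparse hidden state passed to the next layer
```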
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3184. ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents
๐ฌ 20
โญ 5.14
๐ datasets and benchmarks
๐ท๏ธ Benchmarking Travel Planning Neuro-Symbolic Learning LLM Planning
Travel planning stands out among real-world applications of \emph{Language Agents} because it couples significant practical demand with a rigorous constraint-satisfaction challenge. However, existing benchmarks primarily operate on a slot-filling paradigm, restricting agents to synthetic queries with pre-defined constraint menus, which fails to capture the open-ended nature of natural language interaction, where user requirements are compositional, diverse, and often implicitly expressed. To address this gap, we introduce \emph{ChinaTravel}, with four key contributions: 1) a practical sandbox aligned with multi-day, multi-POI travel planning, 2) a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison, 3) an open-ended dataset that integrates diverse travel requirements and implicit intent from 1154 human participants, and 4) fine-grained analyses that reveal the potential of neuro-symbolic agents in travel planning, achieving a 37.0\% constraint satisfaction rate on human queries, a 10$\times$ improvement over purely neural models, yet highlighting significant challenges in compositional generalization. Overall, ChinaTravel provides a foundation for advancing language agents through compositional constraint validation in complex, real-world planning scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3185. Radiometrically Consistent Gaussian Surfels for Inverse Rendering
๐ฌ 27
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Radiometric Consistency Indirect Illumination Gaussian Splatting Inverse Rendering
Inverse rendering with Gaussian Splatting has advanced rapidly, but accurately disentangling material properties from complex global illumination effects, particularly indirect illumination, remains a major challenge. Existing methods often query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. However, these pre-trained Gaussian primitives are supervised only towards limited training viewpoints, thus lack supervision for modeling indirect radiances from unobserved views. To address this issue, we introduce radiometric consistency, a novel physically-based constraint that provides supervision towards unobserved views by minimizing the residual between each Gaussian primitive's learned radiance and its physically-based rendered counterpart. Minimizing the residual for unobserved views establishes a self-correcting feedback loop that provides supervision from both physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection.
We then propose Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework built upon our principle that efficiently integrates radiometric consistency by utilizing Gaussian surfels and 2D Gaussian ray tracing. We further propose a finetuning-based relighting strategy that adapts Gaussian surfel radiances to new illuminations within minutes, achieving low rendering cost ($<$10ms). Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods in inverse rendering, while retaining computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3186. Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
๐ฌ 18
โญ 5.00
๐ generative models
๐ท๏ธ Video generation autoregressive models
Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive (AR) video generation. Existing AR video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an LLM-based unified model for AR video generation with efficient discrete diffusion. Firstly, to fit videos with LLMs, we identify that 1D RoPE is ill-suited for visual spatiotemporal correlation modeling, and while demonstrated to be useful, naive 3D RoPE exhibits imbalanced frequency spectra. Therefore, we propose MM-RoPE, which preserves the original textual RoPE while seamlessly accommodating video data with comprehensive frequency spectra and scaled 3D positions. Secondly, to fit the video data's nature and overcome the inefficiency of next-token decoding, we adopt a parallel and mask-based discrete diffusion with intra-frame bidirectional and inter-frame causal attention masks. Based on this attention mask, we uncover the frame-wise loss imbalance issue caused by spatial information redundancy and propose Autoregressive Discrete Diffusion Forcing, which introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. Despite using only 48 GPUs for pre-training, limited data and a discrete tokenizer, Lumos-1 achieves results surpassing those of Show-o2 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3187. Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting
๐ฌ 12
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ pretrained time series models time series forecasting foundation model combination
Is bigger always better for time series foundation models? With this question in mind, we explore an alternative to training a single, large monolithic model: building a portfolio of smaller, pretrained forecasting models. By applying ensembling or model selection over these portfolios, we achieve competitive performance on large-scale benchmarks using far fewer parameters. We explore strategies for designing such portfolios and find that collections of specialist models consistently outperform portfolios of independently trained generalists. Remarkably, we demonstrate that post-training a base model is a compute-effective approach for creating sufficiently diverse specialists, and provide evidence that ensembling and model selection are more compute-efficient than test-time fine-tuning.
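Both test-time strategies over a portfolio are cheap to state; the sketch below averages members for ensembling and picks the lowest validation error for model selection. The dictionaries and MSE criterion are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

def portfolio_forecast(test_forecasts, val_forecasts, y_val, mode="select"):
    """Combine a portfolio of pretrained forecasters at test time.

    test_forecasts: dict name -> forecast for the test window
    val_forecasts:  dict name -> forecast for a held-out validation window
    y_val:          ground truth for the validation window
    """
    if mode == "ensemble":
        return np.mean(list(test_forecasts.values()), axis=0)           # simple average of members
    errors = {m: np.mean((f - y_val) ** 2) for m, f in val_forecasts.items()}
    best = min(errors, key=errors.get)                                  # member with lowest validation MSE
    return test_forecasts[best]
```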
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3188. Autoregressive-based Progressive Coding for Ultra-Low Bitrate Image Compression
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ lossy image compression autoregressive model
Generative models have demonstrated significant results in ultra-low bitrate image compression, owing to their powerful capabilities for content generation and texture completion. Existing works primarily based on diffusion models still face challenges such as limited bitrate adaptability and high computational complexity for encoding and decoding. Inspired by the success of the Visual AutoRegressive model (VAR), we introduce AutoRegressive-based Progressive Coding (ARPC) for ultra-low bitrate image compression, a progressive image compression framework based on a next-scale prediction visual autoregressive model. Based on a multi-scale residual vector quantizer, ARPC efficiently encodes the image into multi-scale discrete token maps and controls the bitrates by selecting different scales for transmission. For decompression, ARPC leverages the prior knowledge inherent in the visual autoregressive model to predict the unreceived scales, which is naturally the autoregressive generation process. To further increase the compression ratio, we use the VAR as a probability estimator for lossless entropy coding and propose a group-masked bitwise multi-scale residual quantizer to adaptively allocate bits for different scales. Extensive experiments show that ARPC achieves state-of-the-art perceptual fidelity at ultra-low bitrates and high decompression efficiency compared with existing diffusion-based methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3189. Globally aware optimization with resurgence
๐ฌ 14
โญ 5.00
๐ optimization
๐ท๏ธ non-convex optimization optimization landscape
Modern optimization faces a fundamental challenge: local gradient-based methods provide no global information about the landscape of the objective function $L$, often leading to suboptimal convergence and sensitivity to initialization. We introduce a novel optimization framework that leverages resurgence theory from complex analysis to extract global structural information from divergent asymptotic series. Our key insight is that the factorially divergent perturbative expansions of parameter space partition functions encode precise information about all critical objective function values in the landscape through their Borel transform singularities.
The algorithm works by computing the statistical mechanical partition function $Z(g) = \int e^{-L(\theta)/g} d\theta$ for small coupling $g\ll 1$, extracting its asymptotic series coefficients, and identifying Borel plane singularities that correspond one-to-one with critical objective function values. These target values provide global guidance to local optimizers, enabling principled learning rate adaptation and escape from suboptimal regions. Unlike heuristic adaptive methods, targets are theoretically grounded in the geometry of the optimization landscape.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3190. Pixel-Perfect Puppetry: Precision-Guided Enhancement for Face Image and Video Editing
๐ฌ 10
โญ 5.00
๐ generative models
๐ท๏ธ FaceVideo Editing Face Image Editing Precision Guidance
Preserving identity while precisely manipulating attributes is a central challenge in face editing for both images and videos. Existing methods often introduce visual artifacts or fail to maintain temporal consistency. We present **FlowGuide**, a unified framework that achieves fine-grained control over face editing in diffusion models. Our approach is founded on the local linearity of the UNet bottleneck's latent space, which allows us to treat semantic attributes as corresponding to specific linear subspaces, providing a mathematically sound basis for disentanglement. FlowGuide first identifies a set of orthogonal basis vectors that span these semantic subspaces for both the original content and the target edit, a representation that efficiently captures the most salient features of each. We then introduce a novel guidance mechanism that quantifies the geometric alignment between these bases to dynamically steer the denoising trajectory at each step. This approach offers superior control by ensuring edits are confined to the desired attribute's semantic axis while preserving orthogonal components related to identity. Extensive experiments demonstrate that FlowGuide achieves state-of-the-art performance, producing high-quality edits with superior identity preservation and temporal coherence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3191. DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autonomous Driving Vision-Language Models Hybrid Thinking Visual Reasoning Reinforcement Learning
The advent of Vision-Language Models (VLMs) has significantly advanced end-to-end autonomous driving, demonstrating powerful reasoning abilities for high-level behavior planning tasks. However, existing methods are often constrained by a passive perception paradigm, relying solely on text-based reasoning. This passivity restricts the model's capacity to actively seek crucial visual evidence when faced with uncertainty. To address this, we introduce DriveAgent-R1, an autonomous driving agent capable of active perception for planning. In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence, thereby enhancing both interpretability and reliability. Furthermore, we propose a hybrid thinking framework, inspired by human driver cognitive patterns, allowing the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning based on scene complexity. This capability is cultivated through a three-stage progressive training strategy, featuring a core Cascaded Reinforcement Learning (Cascaded RL) phase. Extensive experiments on the Drive-Internal dataset, which is rich in long-tail scenarios, and the public nuScenes dataset show that, with only 3B parameters, DriveAgent-R1 achieves competitive performance comparable to top closed model systems such as GPT-5 and to human driving proficiency while remaining deployment-friendly, offering a proven path toward building more intelligent autonomous driving systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3192. Towards Dynamic Interleaving Optimizers
๐ฌ 13
โญ 5.00
๐ optimization
๐ท๏ธ HPO optimizer
Optimizers are critical for training deep neural networks. Existing training processes rely on a single static optimizer (e.g., SGD) or a simple hybrid of two optimizers, which miss the opportunity to exploit evolving dynamics in different training states, degrading model quality and convergence. In this paper, we propose a novel dynamic optimizer switching method, **D**ynamic **O**ptimizer **I**nterleaving **T**raining (DOIT), which builds surrogate models to predict different optimizers' performance from current parameter states. DOIT uses an acquisition function that combines the results from surrogate models with transferability assessments and process information to select a suitable optimizer for the subsequent training. Experiments on various models and tasks (e.g., image and text classification, machine translation, and object detection) show that DOIT effectively enhances training, achieving faster convergence (i.e., 2\% to 10\% faster) and higher accuracy (i.e., 1\% to 3\% improvement). Additional independent experiments and case studies further validate DOIT's effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3193. Geometric Constraints for Small Language Models to Understand and Expand Scientific Taxonomies
๐ฌ 18
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Small LM Hyperbolic Deep Learning Taxonomy Structure
Recent findings reveal that token embeddings of Large Language Models (LLMs) exhibit strong hyperbolicity. This insight motivates leveraging LLMs for scientific taxonomy tasks, where maintaining and expanding hierarchical knowledge structures is critical. Although potential, generally-trained LLMs face challenges in directly handling domain-specific taxonomies, including computational cost and hallucination. Meanwhile, Small Language Models (SLMs) provide a more economical alternative if empowered with proper knowledge transfer. In this work, we introduce SS-Mono (Structure-Semantic Monotonization), a novel pipeline that combines local taxonomy augmentation from LLMs, self-supervised fine-tuning of SLMs with geometric constraints, and LLM calibration. Our approach enables efficient and accurate taxonomy expansion across root, leaf, and intermediate nodes. Extensive experiments on both leaf and non-leaf expansion benchmarks demonstrate that a fine-tuned SLM (e.g., DistilBERT-base-110M) consistently outperforms frozen LLMs (e.g., GPT-4o, Gemma-2-9B) and domain-specific baselines. These findings highlight the promise of lightweight yet effective models for structured knowledge enrichment in scientific domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3194. Language-Instructed Vision Embeddings for Controllable and Generalizable Perception
๐ฌ 11
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Language Model
Vision foundation models are typically trained as static feature extractors, forcing the burden of task adaptation onto large downstream models. We propose a different paradigm: instead of solely feeding visual features into language, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, without requiring task-specific retraining. This enables the encoder to focus attention on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), outperforms vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks---offering a direct path toward adaptive, instruction-driven visual intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3195. AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models
๐ฌ 4
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large reasoning model safety alignment chain-of-thought
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in complex problem-solving through Chain-of-Thought (CoT) reasoning. However, the multi-step nature of CoT introduces new safety challenges that extend beyond conventional language model alignment. We identify a failure mode in current safety CoT tuning methods: the snowball effect, where minor reasoning deviations progressively amplify throughout the thought process, leading to either harmful compliance or excessive refusal. This effect stems from models being trained to imitate perfect reasoning scripts without learning to self-correct. To address this limitation, we propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our method involves constructing a dataset containing Temptation-Correction and Hesitation-Correction samples, where models learn to recover from harmful reasoning drifts and unnecessary cautions. Extensive experiments show that AdvChain significantly enhances robustness against jailbreak attacks and CoT hijacking while substantially reducing over-refusal on benign prompts, achieving a superior safety-utility balance without compromising reasoning capabilities. Our work establishes a new direction for building more robust and reliable reasoning models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3196. Seeing Through Words: Controlling Visual Retrieval Quality with Language
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Vision-Language Models Query Completion
Text-to-image retrieval is a fundamental task in vision--language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, making them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative large language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a training framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: (1) flexibility, as it is compatible with any pretrained vision--language model without modification; (2) transparency, since enriched queries are explicitly interpretable by users; and (3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern vision--language models and the underspecified nature of short user queries.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3197. CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mixture-of-experts quantization
Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment.
In this work, we tackle this challenge by introducing a unified quantization-and-clustering scheme that smooths activation outliers via learnable rotation and absorbs weight outliers into fine-tuned cluster centroids for MoE. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. We introduce~\textit{CodeQuant}, a unified quantization-and-clustering framework tailored for MoEs. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to $4.15\times$ speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3198. Distributionally Robust Optimization via Generative Ambiguity Modeling
๐ฌ 20
โญ 5.00
๐ optimization
๐ท๏ธ Distributionally Robust Optimization Generative Models OOD Generalization
This paper studies Distributionally Robust Optimization (DRO), a fundamental framework for enhancing the robustness and generalization of statistical learning and optimization. An effective ambiguity set for DRO must contain distributions that remain consistent with the nominal distribution while being diverse enough to account for a variety of potential scenarios. Moreover, it should lead to tractable DRO solutions. To this end, we propose generative model-based ambiguity sets that capture various adversarial distributions beyond the nominal support space while maintaining consistency with the nominal distribution. Building on this generative ambiguity modeling, we propose DRO with Generative Ambiguity Set (GAS-DRO), a tractable DRO algorithm that solves the inner maximization over the parameterized generative model space. We formally establish the stationary convergence performance of GAS-DRO. We implement GAS-DRO with a diffusion model and empirically demonstrate its superior Out-of-Distribution (OOD) generalization performance on an ML prediction task.
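As a toy illustration of taking the inner maximization over a parameterized generator (a sketch only: the generator here is a tiny MLP rather than a diffusion model, the ambiguity-set constraint is approximated by a first- and second-moment penalty, and the loss is a simple location-estimation objective, none of which is specified by the paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nominal = 1.0 + 0.3 * torch.randn(2048)          # nominal data (1-D for clarity)
theta = nn.Parameter(torch.zeros(()))            # decision variable: a location estimate
gen = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))  # generator phi
opt_theta = torch.optim.Adam([theta], lr=5e-3)
opt_phi = torch.optim.Adam(gen.parameters(), lr=5e-3)
lam = 10.0                                       # consistency strength (assumed value)

def sample_gen(n):
    return gen(torch.randn(n, 4)).squeeze(-1)

for step in range(2000):
    xb = nominal[torch.randint(0, nominal.numel(), (256,))]
    # Inner maximization over phi: seek high-loss distributions that stay close
    # (in first two moments) to the nominal distribution.
    xg = sample_gen(256)
    loss_adv = ((theta - xg) ** 2).mean()
    consistency = (xg.mean() - xb.mean()) ** 2 + (xg.std() - xb.std()) ** 2
    opt_phi.zero_grad(); (-loss_adv + lam * consistency).backward(); opt_phi.step()
    # Outer minimization over theta against the current adversarial distribution.
    xg = sample_gen(256).detach()
    opt_theta.zero_grad(); ((theta - xg) ** 2).mean().backward(); opt_theta.step()

print("robust estimate:", float(theta))          # should land near the nominal mean
```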
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3199. LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
๐ฌ 9
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reasoning Benchmark Linguistic reasoning Permutation
Frontier language models appear strong at solving reasoning problems, but their performance is often inflated by shortcuts such as memorisation and knowledge. We introduce LingOLY-TOO, a challenging reasoning benchmark of 6,995 questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems. These obfuscations preserve the underlying solution logic while removing orthographic clues that could trigger patterns from memorisation or knowledge. Our experiments show that models exploit shortcuts on the original questions, as performance drops markedly upon obfuscation. Even the best reasoning models remain highly sensitive, with scores dropping from around $0.60$ on original problems to $0.48$ after obfuscation. LingOLY-TOO disentangles reasoning from knowledge, offering a clear measure of true reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3200. ASTRAEA: A Token-wise Acceleration Framework for Video Diffusion Transformers
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ video diffusion acceleration
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their high computational demands pose a major challenge for practical deployment. While existing studies propose acceleration methods to reduce workload at various granularities, they often rely on heuristics, limiting their applicability.
We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target. At its core, Astraea proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. Meanwhile, to allocate the token budget across timesteps, we design a search framework that leverages a classic evolutionary algorithm to automatically determine an effective budget distribution. Together, these components let Astraea achieve up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality than state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3201. Detect, Decide, Unlearn: A Transfer-Aware Framework for Continual Learning
๐ฌ 18
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning
Continual learning (CL) aims to continuously learn new tasks from data streams. While most CL research focuses on mitigating catastrophic forgetting, memorizing outdated knowledge can cause negative transfer, where irrelevant prior knowledge interferes with new task learning and impairs adaptability. Inspired by how the human brain selectively unlearns unimportant information to prioritize learning and to recall relevant knowledge, we explore the intuition that effective CL should not only preserve but also selectively unlearn prior knowledge that hinders adaptation. We introduce DEtect, Decide, Unlearn in Continual lEarning (DEDUCE), a novel CL framework that dynamically detects negative transfer and mitigates it by a hybrid unlearning mechanism. Specifically, we investigate two complementary negative transfer detection strategies: transferability bound and gradient conflict analysis. Based on this detection, the model decides whether to activate a Local Unlearning Module (LUM) to filter outdated knowledge before learning a new task. Additionally, a Global Unlearning Module (GUM) periodically reclaims model capacity to enhance plasticity. Our experiments demonstrate that DEDUCE effectively mitigates task interference and improves overall accuracy with an average gain of up to 5.5% over state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3202. Rethinking Expressivity and Degradation-Awareness in Attention for All-in-One Blind Image Restoration
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Transformer Image Restoration Representation Learning
All-in-one image restoration (IR) aims to recover high-quality images from diverse degradations, which in real-world settings are often mixed and unknown. Unlike single-task IR, this problem requires a model to approximate a family of heterogeneous inverse functions, making it fundamentally more challenging and practically important. Although recent focus has shifted toward large multimodal models, their robustness still depends on faithful low-level inputs, and the principles that govern effective restoration remain underexplored. We revisit attention mechanisms through the lens of all-in-one IR and identify two overlooked bottlenecks in widely adopted Restormer-style backbones: *(i) the value path remains purely linear*, restricting outputs to the span of inputs and weakening expressivity, and *(ii) the absence of an explicit global slot* prevents attention from encoding degradation context. To address these issues, we propose two minimal, backbone-agnostic primitives: a nonlinear value transform that upgrades attention from a selector to a selector-transformer, and a global spatial token that provides an explicit degradation-aware slot. Together, these additions improve restoration across synthetic, mixed, underwater, and medical benchmarks, with negligible overhead and consistent performance gains. Analyses with foundation model embeddings, spectral statistics, and separability measures further clarify their roles, positioning our study as a step toward rethinking attention primitives for robust all-in-one IR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3203. SpectraLLM: Uncovering the Ability of LLMs for Molecule Structure Elucidation from Multi-Spectra
๐ฌ 27
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ structure elucidation spectral molecular large language model domain specific training
Automated molecular structure elucidation remains challenging, as existing approaches often depend on pre-compiled databases or restrict themselves to single spectroscopic modalities. Here we introduce **SpectraLLM**, a large language model that performs end-to-end structure prediction by reasoning over one or multiple spectra. Unlike conventional spectrum-to-structure pipelines, SpectraLLM represents both continuous (IR, Raman, UV-Vis, NMR) and discrete (MS) modalities in a shared language space, enabling it to capture substructural patterns that are complementary across different spectral types. We pretrain and fine-tune the model on small-molecule domains and evaluate it on four public benchmark datasets. SpectraLLM achieves state-of-the-art performance, substantially surpassing single-modality baselines. Moreover, it demonstrates strong robustness in unimodal settings and further improves prediction accuracy when jointly reasoning over diverse spectra, establishing a scalable paradigm for language-based spectroscopic analysis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3204. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
๐ฌ 34
โญ 5.00
๐ generative models
๐ท๏ธ Interleaved Chain-of-Thought Multimodal Reasoning
Multimodal reasoning is a dynamic process that requires synergistic coordination of language and vision. However, current approaches to multimodal interleaved generation fall short of providing a generalizable recipe that productively engages text and vision to advance reasoning. We introduce ThinkMorph, a unified thinking model capable of effective interleaved reasoning. By constructing a high-quality pipeline for generating interleaved reasoning data for training unified models, we enable ThinkMorph to generate multimodal reasoning traces where language and vision mutually advance each other. ThinkMorph delivers substantial gains on vision-centric reasoning, including +11.53\% on visual search and +38.75\% on jigsaw assembly over the base model. It also reaches 80.33\% on MMVP and 52.67\% on SAT, indicating strong generalization. The improvements are large enough to close the gap with, and in some cases even surpass, leading large-scale or proprietary VLMs.
Moreover, ThinkMorph reveals *emergent properties* indicative of higher-level multimodal intelligence. These include visual manipulation skills unseen during finetuning, such as zoom-in and image inpainting, as well as autonomous reasoning-mode switching, wherein a model trained exclusively on interleaved data chooses, for instance, to engage in text-only reasoning when the nature of the task calls for it. We show that this ability to think in text, vision, and multimodality opens new avenues for test-time scaling, allowing ThinkMorph to *effectively scale and aggregate thoughts across three reasoning modes*.
These findings suggest promising directions for future work to characterize the emergent capabilities of unified models for multimodal reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3205. Accelerating Materials Design via LLM-Guided Evolutionary Search
๐ฌ 25
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Large Language Models Evolutionary Optimization AI for Science Material Discovery
Materials discovery must navigate vast compositional spaces while satisfying multiple, often conflicting, objectives. We present **LL**M-guided **E**volution for **MA**terials design (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on **10** realistic tasks spanning electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit-rates and stronger Pareto fronts than generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA delivers a principled pathway to accelerate practical materials discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3206. Test-time Domain Generalization for Image Super-resolution
๐ฌ 24
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Test-time domain generalization
Test-time domain generalization (TTDG) methods enhance the performance of neural networks on target domains by transferring the feature distribution of target samples to approximate that of the source domain, while avoiding the computational cost associated with fine-tuning on the target domain. However, existing TTDG methods primarily rely on style transfer strategies operating at a coarse granularity, which prove ineffective for pixel-level prediction tasks such as image super-resolution (SR). To address this limitation, we propose a multi-codebook based test-time domain generalization framework (MC-TTDG). Our method leverages both domain-specific and domain-invariant codebooks to achieve fine-grained representation learning on source domains, and performs pixel-level nearest-neighbor feature matching and transfer to accurately adjust target domain features. Furthermore, we introduce a voting-based strategy for optimal domain-specific codebook selection, which improves the precision of feature transfer through multi-party consensus. Extensive experiments across diverse data distributions and network architectures demonstrate that the proposed method effectively transfers feature distributions for SR networks. Our code is available at ***.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3207. Towards Better Branching Policies: Leveraging the Sequential Nature of Branch-and-Bound Tree
๐ฌ 32
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Mixed-Integer Linear Programming Branch-and-Bound Branching Variable Selection Policy Generalization Mamba
The branch-and-bound (B&B) method is a dominant exact algorithm for solving Mixed-Integer Linear Programming problems (MILPs). While recent deep learning approaches have shown promise in learning branching policies using instance-independent features, they often struggle to capture the sequential decision-making nature of B&B, particularly over long horizons with complex inter-step dependencies and intra-step variable interactions. To address these challenges, we propose Mamba-Branching, a novel learning-based branching policy that leverages the Mamba architecture for efficient long-sequence modeling, enabling effective capture of temporal dynamics across B&B steps. Additionally, we introduce a contrastive learning strategy to pre-train discriminative embeddings for candidate branching variables, significantly enhancing Mamba's performance. Experimental results demonstrate that Mamba-Branching outperforms all previous neural branching policies on real-world MILP instances and achieves superior computational efficiency compared to the advanced open-source solver SCIP. The source code can be accessed via an anonymized repository at https://anonymous.4open.science/r/Mamba-Branching-B4B4/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3208. Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances
๐ฌ 20
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Optimal Transport Wasserstein distance Sliced Wasserstein distance Regression
We address the problem of efficiently computing Wasserstein distances for multiple pairs of distributions drawn from a meta-distribution. To this end, we propose a fast estimation method based on regressing Wasserstein distance on sliced Wasserstein (SW) distances. Specifically, we leverage both standard SW distances, which provide lower bounds, and lifted SW distances, which provide upper bounds, as predictors of the true Wasserstein distance. To ensure parsimony, we introduce two linear models: an unconstrained model with a closed-form least-squares solution, and a constrained model that uses only half as many parameters. We show that accurate models can be learned from a small number of distribution pairs. Once estimated, the model can predict the Wasserstein distance for any pair of distributions via a linear combination of SW distances, making it highly efficient. Empirically, we validate our approach on diverse tasks, including Gaussian mixture, point-cloud classification, and Wasserstein-space visualization for 3D point clouds. Across various datasets such as MNIST point clouds, ShapeNetV2, MERFISH Cell Niches, and scRNA-seq, our method consistently provides a better approximation of Wasserstein distance than the state-of-the-art Wasserstein embedding model, Wasserstein Wormhole, particularly in low-data regimes. Finally, we demonstrate that our estimator can also accelerate Wormhole training, yielding \textit{RG-Wormhole}.
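A minimal sketch of the regression idea follows (assumptions: equal-size point clouds so that exact W2 reduces to an optimal assignment, Monte Carlo projections, and, in place of the paper's lifted SW construction, the cost of the coupling induced by sorting along each projection, which is also a valid upper bound; the feature choice and toy dataset are illustrative, not the paper's):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

def exact_w2(a, b):
    """Exact W2 between equal-size point clouds via optimal assignment."""
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    r, c = linear_sum_assignment(cost)
    return np.sqrt(cost[r, c].mean())

def sw_features(a, b, n_proj=64):
    """Return (lower-bound-style, upper-bound-style) sliced features."""
    theta = rng.normal(size=(n_proj, a.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    lower, upper = [], []
    for t in theta:
        ia, ib = np.argsort(a @ t), np.argsort(b @ t)
        # 1-D projected W2: never exceeds the true W2
        lower.append(np.sqrt((((a @ t)[ia] - (b @ t)[ib]) ** 2).mean()))
        # cost of the (suboptimal) coupling induced by this sorting: never below the true W2
        upper.append(np.sqrt(((a[ia] - b[ib]) ** 2).sum(-1).mean()))
    return np.mean(lower), np.min(upper)

# Fit a linear model on a few distribution pairs drawn from a meta-distribution.
pairs = [(rng.normal(rng.uniform(-2, 2, 3), 1.0, (64, 3)),
          rng.normal(rng.uniform(-2, 2, 3), 1.0, (64, 3))) for _ in range(40)]
X = np.array([sw_features(a, b) for a, b in pairs])
y = np.array([exact_w2(a, b) for a, b in pairs])
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

# Predict the Wasserstein distance of a new pair from its sliced features alone.
a, b = rng.normal(1.0, 1.0, (64, 3)), rng.normal(-1.5, 1.0, (64, 3))
pred = np.array([*sw_features(a, b), 1.0]) @ coef
print(f"predicted W2: {pred:.3f}   exact W2: {exact_w2(a, b):.3f}")
```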
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3209. KL-Regularized Reinforcement Learning is Designed to Mode Collapse
๐ฌ 25
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning LLM diversity KL divergence
Classical intuitions cast minimizing reverse KL as "mode seeking" and forward KL as "mass covering". In KL-regularized reinforcement learning, however, the regularizer determines _both_ the target distribution's shape _and_ the divergence being implicitly minimized, making its role more nuanced than simply inducing generic mode-seeking or mass-covering behaviour. Specifically, the target distribution is defined jointly by the reward function, the reference model, the type of regularizer, and the regularization strength. We show that under common settings, such as low regularization strength and equal verifiable rewards, both forward and reverse KL regularization tend to specify target distributions whose mass concentrates on a single high-reward region. Thus, the objective itself _by construction_ induces diversity collapse, regardless of the policy optimization algorithm used.
Building on this perspective, we introduce a simple and scalable modification that rescales rewards to induce target distributions assigning substantial probability across _all_ high-reward regions. This yields a principled objective that maintains high solution quality while achieving broad reward-mode coverage. Empirically, this approach improves post-training diversity and performance for Large Language Models and Chemical Language Models, and is effective with either forward or reverse KL regularization, while using either naively fails.
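The collapse-by-construction argument can be reproduced with a few lines of arithmetic on a toy discrete answer space (the specific rescaling below, subtracting beta times the reference log-probability from the reward of each answer, is one simple illustrative choice and not necessarily the paper's):

```python
import numpy as np

pi_ref = np.array([0.50, 0.45, 0.04, 0.01])   # reference model probs over 4 candidate answers
reward = np.array([0.0, 1.0, 1.0, 1.0])       # answers at indices 1-3 are all verifiably correct
beta = 0.05                                   # low regularization strength

def kl_target(r):
    """Target of KL-regularized RL: pi*(x) proportional to pi_ref(x) * exp(r(x) / beta)."""
    logits = np.log(pi_ref) + r / beta
    p = np.exp(logits - logits.max())
    return p / p.sum()

print("plain reward   :", np.round(kl_target(reward), 3))    # mass piles onto the answer pi_ref already favors
rescaled = reward - beta * np.log(pi_ref)                     # illustrative rescaling (an assumption, not the paper's)
print("rescaled reward:", np.round(kl_target(rescaled), 3))   # mass spread roughly evenly over all correct answers
```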
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3210. WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
๐ฌ 15
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ World Models; Vision-Language-Action Models; Reinforcement Learning
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections.
Reinforcement learning (RL) addresses these limitations through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots.
We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment.
In contrast to widely used latent world models,
WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained with web-scale images.
Crucially, WMPO enables the policy to perform on-policy GRPO that provides stronger performance than the often-used off-policy methods.
Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3211. Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
๐ฌ 22
โญ 5.00
๐ reinforcement learning
๐ท๏ธ LLM Agent Reinforcement Learning Long-Context LLM
Large language models face challenges in long-context question answering, where key evidence of a query may be dispersed across millions of tokens.
Existing works equip large language models with a memory corpus that is dynamically updated during a single-pass document scan, also known as the "memorize while reading" methods.
While this approach scales efficiently, it suffers from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals.
To tackle these challenges, we present ReMemR1, a memory-augmented agent with callback-enhanced memory that allows selective retrieval from the entire memory history, enabling non-linear reasoning and revisiting of early evidence.
To further strengthen training, we propose Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense, step-level signals that guide effective memory use.
Together, these contributions mitigate information degradation, improve supervision, and support multi-hop memory utilization.
Experiments on long-document QA show significant gains over existing memory-based approaches, which validates ReMemR1 as an effective solution for long-context reasoning agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3212. The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Reasoning in Language Models Chain-of-Thought Interpretability Model Behavior Control
Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we show that this understanding translates into measurable improvements on both problem-solving and safety benchmarks. We can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we show that training data format (e.g., free-form vs. multiple-choice) impacts reasoning far more than data domain, highlighting the importance of format-aware model design. In short, the CoT Encyclopedia turns reasoning from a black box into a controllable asset, enabling LLMs that think more clearly, perform more reliably, and act more safely.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3213. AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning
๐ฌ 11
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Multi-Agent Systems Reinforcement Learning
Multi-Agent Systems (MAS) offer a powerful paradigm for solving complex problems through distributed reasoning and collaboration. However, their effectiveness is often hindered by the challenge of optimizing interactions among agents. To address this, we introduce AgentPO, a novel framework that directly optimizes agent collaboration. AgentPO employs reinforcement learning to train a specialized Collaborator agent, which refines its interaction policy to enhance overall system performance within a fixed multi-agent topology. We evaluated AgentPO on multiple mathematical reasoning tasks, where it consistently outperformed strong baselines. With Llama-3.2-3B-Instruct as the actor model, AgentPO achieves accuracy improvements of +1.8\% and +7.2\% over strong baselines like Role Assignment and EvoAgent, respectively. When using the larger Llama-3.1-8B-Instruct model, these gains increase to +5.6\% and +11.3\%. Crucially, AgentPO achieves these results with remarkable efficiency: it requires only 500 training samples and operates at just 7.8\% of EvoAgent's inference cost, highlighting its superior scalability and practicality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3214. UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction as Reasoning
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ GUI Grounding GUI Agents Multimodal Large Language Model
GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3\% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76\% relative performance improvement.
In this paper, we introduce the "Instruction as Reasoning" paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enable the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition.
Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy: 87.3% on UI-I2E-Bench and 84.9% on MMBench-GUI L2. In addition, UI-Ins-7B yields superior agent performance, achieving a 66.1% success rate on AndroidWorld. All code, data, and models will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3215. Enhancing Shortcut Models with Cumulative Self-Consistency Loss for One-Step Diffusion
๐ฌ 4
โญ 5.00
๐ generative models
๐ท๏ธ One-Step Diffusion Optimal Control Shortcut Models
Although iterative denoising (i.e., diffusion/flow) methods offer strong generative performance, they suffer from low generation efficiency, requiring hundreds of network forward passes to simulate a single sample. Mitigating this requires taking larger step-sizes during simulation, thereby allowing one- or few-step generation. The recently proposed shortcut model learns larger step-sizes by enforcing alignment between its direction and the path defined by a base many-step flow-matching model through a self-consistency loss. However, its generation quality is significantly lower than that of the base model. In this paper, we interpret the self-consistency loss through the lens of optimal control by formulating few-step generation as a controlled base generative process. This perspective enables us to develop a general cumulative self-consistency loss that penalizes misalignment at both the current step and future steps along the trajectory. This encourages the model to take larger step-sizes that not only align with the base model at the current time step but also guide subsequent steps towards high-quality generation. Furthermore, we draw a connection between our approach and reinforcement learning, potentially opening the door to a new set of approaches for few-step generation. Extensive experiments show that we significantly improve one- and few-step generation quality under the same training budget.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3216. Feed-forward Human Performance Capture via Progressive Canonical Space Updates
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ human performance capture; monocular human performance capture; feed-forward reconstruction
We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously over time, we take advantage of the fact that more body parts become observable by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively utilize this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on both in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3217. Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Retrieval Vision-Language Models Joint Encoding Efficient Re-ranking Token Compression
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3218. DuPO: Enabling Reliable Self-Verification via Dual Preference Optimization
๐ฌ 11
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Self-Verification Dual Learning Preference Optimization Large Language Model
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances average translation quality by 2.1 COMET over 756 directions, boosts mathematical reasoning accuracy by an average of 6.2 points on three challenging benchmarks, and improves performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3219. EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
๐ฌ 20
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Speech Language Models Empathetic Dialogue Multi-Stage Evaluation Benchmark Voice Cues
Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors.
Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation.
All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript.
EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics.
Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with highly expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy.
These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3220. AFTER: Mitigating the Object Hallucination of LVLM via Adaptive Factual-Guided Activation Editing
๐ฌ 5
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Vision-Language Model Hallucination Mitigation Activation Editing inference-time
Large Vision-Language Models (LVLMs) have achieved substantial progress in cross-modal tasks. However, due to language bias, LVLMs are susceptible to object hallucination, which can be primarily divided into category, attribute, and relation hallucination, significantly impeding trustworthy AI applications. Editing the internal activations of LVLMs has shown promising effectiveness in mitigating hallucinations with minimal cost. However, previous editing approaches neglect the effective guidance offered by factual textual semantics, thereby struggling to explicitly mitigate language bias. To address these issues, we propose Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation (AFTER), which comprises Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), to adaptively guide the original biased activations towards factual semantics. Specifically, FAS is proposed to provide factual and general guidance for activation editing, thereby explicitly modeling precise visual-textual associations. Subsequently, QAO introduces a query-aware offset estimator to establish query-specific editing from the general steering vector, enhancing the diversity and granularity of editing. Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs validate the efficacy of the proposed AFTER, notably achieving up to a 16.3% reduction of hallucination over the baseline on the AMBER benchmark. Our code and data will be released for reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3221. MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Efficient VLM Inference Multimodal Model
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual inputs to vision tokens. However, redundancy in vision tokens results in degraded inference efficiency for VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, they lack a generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by a coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. Afterwards, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens simultaneously. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline with a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87× speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, 87.7% of the original performance is still preserved on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
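A minimal sketch of greedy selection under a facility-location-style coverage objective follows (this illustrates the maximum-coverage formulation in general; the cosine similarity, clipping, and random embeddings are assumptions, not MMTok's implementation):

```python
import numpy as np

def select_tokens(vision, text, budget):
    """Greedily keep `budget` vision tokens that best cover both the text tokens
    and the full original set of vision tokens (coverage = best similarity so far)."""
    v = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    targets = np.concatenate([t, v], axis=0)        # cover both modalities simultaneously
    sim = np.clip(v @ targets.T, 0.0, None)         # nonnegative similarities keep the objective submodular
    covered = np.zeros(targets.shape[0])            # coverage provided by the selected subset so far
    selected = []
    for _ in range(budget):
        gains = np.clip(sim - covered[None, :], 0.0, None).sum(axis=1)  # marginal coverage gain
        gains[selected] = -1.0                       # never re-select a token
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected

rng = np.random.default_rng(0)
vision_tokens, text_tokens = rng.normal(size=(576, 64)), rng.normal(size=(12, 64))
keep = select_tokens(vision_tokens, text_tokens, budget=64)
print("kept", len(keep), "of", len(vision_tokens), "vision tokens")
```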
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3222. Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Self-evolving image quality assessment vision-language model
Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets.
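The pseudo-labeling step can be sketched in a few lines (the `vlm_vote` simulator and the exact reward form are placeholders standing in for the real VLM calls and GRPO update; only the vote-aggregate-reward flow is illustrated):

```python
import random
from collections import Counter

def vlm_vote(pair_id: int) -> str:
    """Placeholder for a VLM call answering which image of a pair looks better ("A" or "B")."""
    truth = "A" if pair_id % 2 == 0 else "B"
    return truth if random.random() < 0.7 else ("B" if truth == "A" else "A")

def pseudo_label(pair_id: int, n_votes: int = 9) -> str:
    """Pairwise majority voting over repeated model outputs -> consensus pseudo-ranking."""
    votes = Counter(vlm_vote(pair_id) for _ in range(n_votes))
    return votes.most_common(1)[0][0]

def fidelity_reward(sampled_answer: str, consensus: str) -> float:
    """Simple agreement reward that a GRPO-style update could optimize."""
    return 1.0 if sampled_answer == consensus else 0.0

random.seed(0)
labels = {pid: pseudo_label(pid) for pid in range(4)}
print(labels, fidelity_reward("A", labels[0]))
```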
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3223. CLoD-GS: Continuous Level-of-Detail via 3D Gaussian Splatting
๐ฌ 16
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Level of Detail 3D Gaussian Splatting Neural Scene Representation
Level of Detail (LoD) is a fundamental technique in real-time computer graphics for managing the rendering costs of complex scenes while preserving visual fidelity. Traditionally, LoD is implemented using discrete levels (DLoD), where multiple, distinct versions of a model are swapped out at different distances. This long-standing paradigm, however, suffers from two major drawbacks: it requires significant storage for multiple model copies and causes jarring visual "popping" artifacts during transitions, degrading the user experience. We argue that the explicit, primitive-based nature of the emerging 3D Gaussian Splatting (3DGS) technique enables a more ideal paradigm: Continuous LoD (CLoD). A CLoD approach facilitates smooth, seamless quality scaling within a single, unified model, thereby circumventing the core problems of DLoD. To this end, we introduce CLoD-GS, a framework that integrates a continuous LoD mechanism directly into a 3DGS representation. Our method introduces a learnable, distance-dependent decay parameter for each Gaussian primitive, which dynamically adjusts its opacity based on viewpoint proximity. This allows for the progressive and smooth filtering of less significant primitives, effectively creating a continuous spectrum of detail within one model. To train this model to be robust across all distances, we introduce a virtual distance scaling mechanism with rendered point count regularization. Our approach not only eliminates the storage overhead and visual artifacts of discrete methods but also reduces the primitive count and memory footprint of the final model. Extensive experiments demonstrate that CLoD-GS achieves smooth, quality-scalable rendering from a single model, delivering high-fidelity results across a wide range of performance targets.
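The core mechanism admits a very small sketch (the exponential decay form, the cull threshold, and the random parameters are assumptions for illustration; the actual method learns the decay jointly with the splatting pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
opacity = rng.uniform(0.05, 1.0, n)     # per-Gaussian base opacity
decay = rng.uniform(0.0, 0.5, n)        # per-Gaussian decay parameter (assumed exponential form)

def effective_opacity(camera_distance: float) -> np.ndarray:
    """Opacity falls off smoothly with viewpoint distance, one value per Gaussian."""
    return opacity * np.exp(-decay * camera_distance)

for dist in [1.0, 5.0, 20.0]:
    kept = (effective_opacity(dist) > 0.01).sum()   # primitives that still contribute at this distance
    print(f"distance {dist:>5.1f}: {kept} / {n} Gaussians survive the opacity cull")
```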
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3224. EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning
๐ฌ 13
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied intelligence multimodal large language models reinforcement learning long-sequence planning
In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework, leading to inconsistencies in multimodal planning. To address this challenge, we present EVLP (Embodied Vision-Language Planner), an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: 1. Unified Multimodal Generation Framework: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete images for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. 2. Dynamic Perception Pretraining: We propose a bidirectional dynamic alignment strategy employing inverse dynamics tasks and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. 3. Reinforced Supervised Fine-Tuning: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforce loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatio-aware multimodal planning capabilities. Comprehensive evaluations on multiple complex tasks demonstrate that EVLP significantly outperforms competitive baselines in both instruction execution accuracy and task success rate, benefiting from its unified multimodal architecture and well-designed training pipeline. Extensive ablation studies further validate the rationality of our framework design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3225. The Hidden Lattice Geometry of LLMs
๐ฌ 17
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Interpretability formal concept analysis language models ontology
We uncover the hidden lattice geometry of large language models (LLMs): a symbolic backbone that grounds conceptual hierarchies and logical operations in embedding space. Our framework unifies the Linear Representation Hypothesis with Formal Concept Analysis (FCA), showing that linear attribute directions with separating thresholds induce a concept lattice via half-space intersections. This geometry enables symbolic reasoning through geometric meet (intersection) and join (union) operations, and admits a canonical form when attribute directions are linearly independent. Experiments on WordNet sub-hierarchies provide empirical evidence that LLM embeddings encode concept lattices and their logical structure, revealing a principled bridge between continuous geometry and symbolic abstraction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3226. Revisiting Node Affinity Prediction In Temporal Graphs
๐ฌ 20
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Node affinity Temporal graph networks dynamic graphs graph neural networks
Node affinity prediction is a common task that is widely used in temporal graph learning with applications in social and financial networks, recommender systems, and more. Recent works have addressed this task by adapting state-of-the-art dynamic link property prediction models to node affinity prediction. However, simple heuristics, such as persistent forecast or moving average, outperform these models.
In this work, we analyze the challenges in training current Temporal Graph Neural Networks for node affinity prediction and suggest appropriate solutions. Combining these solutions, we develop NAVIS (Node Affinity prediction model using VIrtual State) by exploiting the equivalence between heuristics and state space models. While promising, training NAVIS is non-trivial; therefore, we further introduce a novel loss function for node affinity prediction. We evaluate NAVIS on TGB and show that it outperforms the state of the art, including heuristics.
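The equivalence that motivates the virtual state can be seen in a few lines: persistent forecast and an exponential moving average are both one-dimensional linear state space models that differ only in one coefficient (the toy data and parameter values below are illustrative, not from the paper):

```python
import numpy as np

def ssm_forecast(signal: np.ndarray, a: float) -> np.ndarray:
    """Predict x_{t+1} with the linear state s_t = a * s_{t-1} + (1 - a) * x_t."""
    preds, state = [], signal[0]
    for x in signal:
        state = a * state + (1.0 - a) * x
        preds.append(state)
    return np.array(preds[:-1])          # predictions for steps 1..T-1

rng = np.random.default_rng(0)
affinity = np.cumsum(rng.normal(0, 0.1, 200)) + 1.0   # a toy node-affinity time series

for name, a in [("persistent forecast (a=0)", 0.0), ("moving average   (a=0.9)", 0.9)]:
    mse = np.mean((ssm_forecast(affinity, a) - affinity[1:]) ** 2)
    print(f"{name} MSE: {mse:.4f}")
```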
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3227. D-AR: Diffusion via Autoregressive Models
๐ฌ 11
โญ 5.00
๐ generative models
๐ท๏ธ visual generation diffusion models autoregressive models flow matching
This paper introduces Diffusion via Autoregressive (D-AR) models, a new paradigm recasting the pixel diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts an image into the sequence of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion property, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Then, we apply standard next-token prediction to these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step on pixels in a streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 and 2.00 FID using a 775M and 1.4B Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3228. Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning
๐ฌ 10
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ contrastive learning multimodal learning interpretability
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP's dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP's inherent multimodal capabilities, with most learned features remaining unimodal.
We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3229. What Do Large Language Models Know About Opinions?
๐ฌ 15
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language models opinions computational social science interpretability
What large language models (LLMs) know about human opinions has important implications for aligning LLMs to human values, simulating humans with LLMs, and understanding what LLMs learn during training. While prior works have tested LLMs' knowledge of opinions via their next token outputs, we present the first study to probe LLMs' internal knowledge of opinions, evaluating LLMs across 22 demographic groups on a wide range of topics. First, we show that LLMs' internal knowledge of opinions far exceeds what is revealed by their outputs, with a 50-59% improvement in alignment with the human answer distribution; this improvement is competitive with fine-tuning but 278 times less computationally expensive. Second, we find that knowledge of opinions emerges rapidly in the middle layers of the LLM and identify the final unembeddings as the source of the discrepancy between internal knowledge and outputs. Third, using sparse autoencoders, we trace the knowledge of opinions in the LLM's residual stream back to attention heads, and we identify specific attention head features responsible for different demographic groups. These findings open new avenues for building value-aligned and computationally efficient LLMs, with applications in survey research, social simulation, and more broadly, safe and trustworthy AI. We will release our code upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3230. Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Low-level vision image restoration network architecture normalization
This work analyzes the training dynamics of Image Restoration (IR) Transformers and uncovers a critical yet overlooked issue: conventional LayerNorm (LN) drives feature magnitudes to diverge to a million scale and collapses channel-wise entropy. We analyze this from the perspective of networks attempting to bypass LayerNorm's constraints, which conflict with IR tasks. Accordingly, we address two misalignments: 1) per-token normalization that disrupts spatial correlations, and 2) input-independent scaling that discards input-specific statistics. To address these, we propose Image Restoration Transformer Tailored Layer Normalization (i-LN), a simple drop-in replacement that normalizes features holistically and adaptively rescales them per input. We provide theoretical insights and empirical evidence that this design effectively captures important spatial correlations and better preserves input-specific statistics throughout the network. Experimental results verify that the proposed i-LN consistently outperforms vanilla LN on various IR tasks.
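The two misalignments can be contrasted in a short sketch (the holistic variant and the input-adaptive rescaling below use an assumed simple form, mean absolute magnitude, purely to illustrate the design axis; this is not the paper's i-LN):

```python
import torch

def per_token_ln(x, eps=1e-6):
    """Conventional LN: each token normalized independently over channels."""
    mu = x.mean(dim=-1, keepdim=True)
    return (x - mu) / (x.std(dim=-1, keepdim=True) + eps)

def holistic_adaptive_norm(x, eps=1e-6):
    """Normalize over all tokens and channels jointly, then rescale with an
    input-dependent statistic so per-image magnitude information survives."""
    mu = x.mean(dim=(1, 2), keepdim=True)
    sig = x.std(dim=(1, 2), keepdim=True) + eps
    input_scale = x.abs().mean(dim=(1, 2), keepdim=True)   # assumed adaptive rescaling
    return (x - mu) / sig * input_scale

# Two inputs with very different feature magnitudes (e.g., different degradations).
x = torch.randn(2, 64 * 64, 48) * torch.tensor([0.1, 10.0]).view(2, 1, 1)
print(per_token_ln(x).abs().mean(dim=(1, 2)))            # ~same for both inputs: scale info destroyed
print(holistic_adaptive_norm(x).abs().mean(dim=(1, 2)))  # differs per input: scale info retained
```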
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3231. RIVER: Real-time Video Interaction Benchmark
๐ฌ 22
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Online Video Interaction Multimodal Video Understanding
Multimodal large language models have advanced rapidly and demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. To address the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. The code and data will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3232. Hierarchy Decoding: A Training-free Parallel Decoding Strategy for Diffusion Large Language Models
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Large Language Models Inference Acceleration
The utilization of large language models (LLMs) has become increasingly widespread and has attracted considerable attention. Although the emergence of discrete diffusion large language models (dLLMs) mitigates the inference latency inherent in autoregressive LLM decoding, their computational overhead remains substantial. To address this challenge, we propose Hierarchy-dLLM, a hierarchical decoding framework inspired by the divide-and-conquer principle. Our method recursively partitions masked spans into smaller sub-decoding areas and decodes tokens according to their confidence, which substantially increases the number of tokens generated per forward pass and improves information utilization. Extensive experiments conducted on multiple benchmarks demonstrate that Hierarchy-dLLM achieves accuracy comparable to or even surpassing existing baselines. Meanwhile, it is up to 17× faster than vanilla decoding and about 1.5× faster than Fast-dLLM. These results establish hierarchical decoding as a practical solution for efficient dLLM inference.
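A toy sketch of the divide-and-conquer schedule follows (confidences are simulated random numbers instead of real dLLM logits, and the block size and commit rule are assumptions; the point is only that each forward pass commits one token per sub-span instead of one token overall):

```python
import numpy as np

def hierarchical_commit(confidence, mask, lo, hi, min_block=4, committed=None):
    """Recursively split the span [lo, hi) and pick the most confident masked position per leaf block."""
    if committed is None:
        committed = []
    span = [i for i in range(lo, hi) if mask[i]]
    if not span:
        return committed
    if hi - lo <= min_block:
        committed.append(max(span, key=lambda i: confidence[i]))
        return committed
    mid = (lo + hi) // 2
    hierarchical_commit(confidence, mask, lo, mid, min_block, committed)
    hierarchical_commit(confidence, mask, mid, hi, min_block, committed)
    return committed

rng = np.random.default_rng(0)
length = 32
mask = np.ones(length, dtype=bool)                  # everything starts masked
steps = 0
while mask.any():
    confidence = rng.uniform(size=length)           # stand-in for per-position model confidences
    for pos in hierarchical_commit(confidence, mask, 0, length):
        mask[pos] = False                           # several tokens decoded per forward pass
    steps += 1
print("forward passes:", steps, "(vs", length, "for one-token-per-step decoding)")
```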
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3233. Nudging the Boundaries of LLM Reasoning
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Nudging LLM LLM Reasoning GRPO
Current online reinforcement learning (RL) algorithms like GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are "unsolvable" to the model. In other words, they can only improve performance on problems where the model is capable of exploring the correct answer. If a problem is too difficult -- such that even hundreds of attempts never produce a correct solution -- the model cannot learn from it. Consequently, the model's "upper limit" remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. These hard, unsolvable samples -- though potentially rich in learning signal -- cannot contribute to training, as no rollouts yield rewards and thus no gradients are produced. To unlock learning from these hard samples, we propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that help reduce the problem difficulty for the model. Given a question and its gold answer, the model generates a Chain-of-Thought (CoT) and then produces a hint containing the core knowledge needed to solve the problem. During online RL training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the offline-generated hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts pass rates (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated (conditioned on the gold answer), avoiding distributional shift without relying on external models. Compared to standard GRPO, NuRL achieves consistent improvements across six diverse benchmarks and three models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model's upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level -- as revealing gold answers actually hurts performance -- and they are most beneficial when applied only where necessary and after GRPO has converged.
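The gating logic at the heart of nudging is simple enough to sketch (both `generate_rollouts` and `reward` below are placeholders; real training would call the policy and a verifier, and the hint would itself be model-generated as described above):

```python
import random

def generate_rollouts(prompt: str, n: int):
    """Placeholder for sampling n chain-of-thought rollouts from the current policy."""
    return [f"{prompt} :: attempt {i}" for i in range(n)]

def reward(rollout: str) -> float:
    """Placeholder verifier: hinted prompts succeed occasionally, plain hard prompts never do."""
    return 1.0 if "HINT" in rollout and random.random() < 0.3 else 0.0

def rollouts_with_nudging(prompt: str, hint: str, n: int = 8):
    rollouts = generate_rollouts(prompt, n)
    rewards = [reward(r) for r in rollouts]
    if sum(rewards) == 0:                       # 0% pass rate: inject the self-generated hint
        rollouts = generate_rollouts(f"{prompt}\nHINT: {hint}", n)
        rewards = [reward(r) for r in rollouts]
    return rollouts, rewards                    # non-zero rewards now produce GRPO gradients

random.seed(0)
_, rs = rollouts_with_nudging("hard math problem", "consider the parity of the exponent")
print("pass rate after nudging:", sum(rs) / len(rs))
```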
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3234. Beyond Visual Reconstruction Quality: Object Perception-aware 3D Gaussian Splatting for Autonomous Driving
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Scene reconstruction Object detection Autonomous Driving System
Reconstruction techniques, such as 3D Gaussian Splatting (3DGS), are increasingly used for generating scenarios in autonomous driving system (ADS) research. Existing 3DGS-based works for autonomous driving scenario generation have, through various optimizations, achieved high visual similarity in reconstructed scenes. However, this route is built on a strong assumption: that higher scene similarity directly translates into better preservation of ADS behaviour. Unfortunately, this assumption has not been effectively validated, and ADS behaviour is more closely tied to the objects within the field of view than to the global image.
Thus, we focus on the perception moduleโthe entry point of ADS. Preliminary experiments reveal that although current methods can produce reconstructions with high overall similarity, they often fail to ensure that the perception module outputs remain consistent with those obtained from the original images. Such a limitation can significantly harm the applicability of reconstruction in the ADS domain. To address this gap, we propose two complementary solutions: a perception-aligned loss, which directly leverages the output differences between reconstructed and ground truth images during the training process; and an object zone quality loss, which specifically reinforces the training on object locations identified by the perception model on ground-truth images. Experiments demonstrate that both of our methods improve the ability of reconstructed scenes to preserve consistency between the perception module outputs and those from the ground-truth inputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3235. Summaries as Centroids for Interpretable and Scalable Text Clustering
๐ฌ 11
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Text clustering unsupervised learning natural language processing
We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key ideaโsummary-as-centroidโretains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clusteringโwithout extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
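As a rough sketch of the summary-as-centroid loop (assuming user-supplied `embed` and `summarize` callables, e.g. a sentence encoder plus either a cheap extractive summarizer for k-NLPmeans or an LLM for k-LLMmeans; this is not the authors' implementation):

```python
import numpy as np

def summary_centroid_kmeans(X, texts, k, embed, summarize, iters=10, refresh=2, seed=0):
    """Run ordinary k-means assignments in embedding space, but every `refresh`
    iterations replace each numeric centroid with the embedding of a textual
    summary of its cluster, keeping a human-readable prototype per cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for t in range(iters):
        # Standard k-means assignment step.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Standard mean update.
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(0)
        # Periodically swap in a summary-as-centroid prototype.
        if (t + 1) % refresh == 0:
            for c in range(k):
                members = [texts[i] for i in np.where(labels == c)[0]]
                if members:
                    centroids[c] = embed(summarize(members))
    return labels, centroids
```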
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3236. SDErasure: Concept-Specific Trajectory Shifting for Concept Erasure via Adaptive Diffusion Classifier
๐ฌ 23
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Model AIGC Safety Concept Erasure
Concept erasure methods have proven effective in mitigating the potential for textโtoโimage diffusion models to produce harmful content. Nevertheless, prevailing methods based on post fine-tuning introduce substantial disruption to the original modelโs parameter distribution and suffer from excessive model intrusiveness in two dimensions. (1) Images generated under erased concepts are perceptually aberrant. (2) Images generated under unrelated concepts exhibit pronounced quality degradation. We attribute these limitations to applying a uniform strategy to erase diverse concepts, failing to account for concept-specific generative mechanisms. Through rigorous experimentation and analysis, we identify that the generative process of each concept hinges on a narrow subset of critical timesteps. This insight motivates a targeted intervention strategy that enables precise and minimally invasive concept erasure. Therefore, we introduce $\textbf{SDErasure}$, a novel training framework for concept-specific erasure via adaptive trajectory shifting. First, a Step Selection algorithm that utilizes a diffusion classifier is proposed to guide the model in pinpointing the key timesteps associated with the undesired conceptโs generation. Second, a Score Rematching loss is introduced to align the modelโs predicted score function with that of anchor concepts, extending its applicability to both anchor-free erasing and anchor-based altering. Third, a Quality Regulation consisting of early-preserve loss and concept-retain loss is introduced to maintain the model's generative quality along two dimensions. Empirical results demonstrate that SDErasure achieves state-of-the-art concept erasure performance, reducing FID from 9.51 to 6.74 while effectively eliminating the target concept.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3237. AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
๐ฌ 21
โญ 5.00
๐ reinforcement learning
๐ท๏ธ large language models abstract reasoning robustness reinforcement learning
Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL---which promotes abstract reasoning in LLMs using RL on granular abstraction data---significantly mitigates performance degradation on recent GSM perturbation benchmarks. Moreover, improving GSM robustness via AbstRaL is shown to also implicitly benefit LLMs' capabilities on OOD mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3238. Align-SAM: Seeking Flatter Minima for Better Cross-Subset Alignment
๐ฌ 31
โญ 5.00
๐ optimization
๐ท๏ธ Sharpness-aware gradient alignment
Sharpness-Aware Minimization (SAM) has proven effective in enhancing deep neural network training by simultaneously minimizing the training loss and the sharpness of the loss landscape, thereby guiding models toward flatter minima that are empirically linked to improved generalization. From another perspective, generalization can be seen as a modelโs ability to remain stable under distributional variability. In particular, effective learning requires that updates derived from different subsets or resamplings of the same data distribution remain consistent. In this work, we investigate the connection between the flatness induced by SAM and the alignment of gradients across random subsets of the data distribution, and propose \textit{Align-SAM} as a novel strategy to further enhance model generalization. Align-SAM extends the core principles of SAM by promoting optimization toward flatter minima on a primary subset (the training set), while simultaneously enforcing low loss on an auxiliary subset drawn from the same distribution. This dual-objective approach leads to solutions that are not only resilient to local perturbations but also robust against distributional shifts in each training iteration. Empirical evaluations demonstrate that Align-SAM consistently improves generalization across diverse datasets and challenging settings, including scenarios with noisy labels and limited data availability.
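One plausible reading of the dual objective is sketched below in PyTorch: a SAM ascent step is computed on the primary batch, and the descent at the perturbed weights minimizes the primary loss plus a low-loss term on an auxiliary batch from the same distribution. The exact coupling Align-SAM uses may differ; this is an assumption-laden illustration, not the authors' code.

```python
import torch

def align_sam_step(model, loss_fn, batch_primary, batch_aux, opt, rho=0.05, lam=1.0):
    """One training step: SAM ascent on the primary subset, then descent on the
    perturbed weights using primary + auxiliary losses."""
    params = [p for p in model.parameters() if p.requires_grad]
    xp, yp = batch_primary
    xa, ya = batch_aux

    # 1) SAM ascent direction from the primary subset.
    opt.zero_grad()
    loss_fn(model(xp), yp).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)

    # 2) At the perturbed point, enforce low loss on both subsets.
    opt.zero_grad()
    loss = loss_fn(model(xp), yp) + lam * loss_fn(model(xa), ya)
    loss.backward()

    # 3) Undo the perturbation and take the actual update.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    opt.step()
    return float(loss.detach())
```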
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3239. A Statistical Theory of Overfitting for Imbalanced Classification
๐ฌ 13
โญ 5.00
๐ learning theory
๐ท๏ธ Imbalanced classification overfitting margin logistic regression support vector machine overparametrization calibration
Classification with imbalanced data is a common challenge in machine learning, where minority classes form only a small fraction of the training samples. Classical theory, relying on large-sample asymptotics and finite-sample corrections, is often ineffective in high dimensions, leaving many overfitting phenomena unexplained. In this paper, we develop a statistical theory for high-dimensional imbalanced linear classification, showing that dimensionality induces truncation or skewing effects on the logit distribution, which we characterize via a variational problem. For linearly separable Gaussian mixtures, logits follow $\mathsf{N}(0,1)$ on the test set but converge to $\max\{\kappa,\mathsf{N}(0,1)\}$ on the training set---a pervasive phenomenon we confirm on tabular, image, and text data.
This phenomenon explains why the minority class is more severely affected by overfitting. We further show that margin rebalancing mitigates minority accuracy drop and provide theoretical insights into calibration and uncertainty quantification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3240. When LLMs get significantly worse: A statistical approach to detect model degradations
๐ฌ 25
โญ 5.00
๐ generative models
๐ท๏ธ LLM Benchmarking Statistics Accuracy
Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods as well as methods without accuracy guarantees, such as quantization. In all of these cases it is crucial to ensure that the model quality did not degrade. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test that efficiently detects model degradations while guaranteeing a controlled rate of false positives. The crucial insight is that we have to compare model scores on each sample rather than aggregated at the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the widely adopted open-source LM Evaluation Harness and a case study illustrating that the method correctly flags degraded models while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.
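The per-sample idea is easy to make concrete. Below is a generic exact McNemar-style degradation test in SciPy, shown for intuition only (the paper's framework, including the multi-benchmark aggregation and the LM Evaluation Harness integration, goes beyond this): only discordant samples, where exactly one of the two models is correct, carry evidence.

```python
from scipy.stats import binomtest

def mcnemar_degradation_test(correct_ref, correct_opt, alpha=0.05):
    """Paired comparison of per-sample correctness between a reference model and
    an optimized model. Only discordant items (exactly one model correct) matter."""
    b = sum(r and not o for r, o in zip(correct_ref, correct_opt))  # ref right, opt wrong
    c = sum(o and not r for r, o in zip(correct_ref, correct_opt))  # opt right, ref wrong
    if b + c == 0:
        return {"p_value": 1.0, "degraded": False, "discordant": 0}
    # Exact McNemar test: under H0 the b "losses" are Binomial(b + c, 0.5);
    # the one-sided alternative is "the optimized model is worse".
    p = binomtest(b, b + c, 0.5, alternative="greater").pvalue
    return {"p_value": p, "degraded": p < alpha, "discordant": b + c}

# Toy usage: 1000 items; the optimized model flips 12 answers to wrong and 3 to right.
ref = [True] * 800 + [False] * 200
opt = list(ref)
for i in range(12):
    opt[i] = False
for i in range(800, 803):
    opt[i] = True
print(mcnemar_degradation_test(ref, opt))
```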
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3241. DGNet: Learning Spatiotemporal PDEs with Discrete Green Networks
๐ฌ 20
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Partial Differential Equations Graph Neural Networks Physics-Informed Machine Learning Generalization
Spatiotemporal partial differential equations (PDEs) underpin a wide range of scientific and engineering applications, yet classical solvers are computationally costly on large or irregular domains. Neural PDE solvers have emerged as an efficient alternative, but they struggle to generalize to unseen source terms, which represent external forcing such as heat generation, body forces, or reaction rates. Since existing models typically mix the source with the system state, they lack a principled mechanism to capture source responses. We propose DGNet, a discrete Green network that explicitly decouples system evolution from source response. The key idea is to transform the classical Green's function---a cornerstone of PDE theory---into a graph-based discrete formulation, preserving the superposition principle in a computable update rule. To ensure fidelity on irregular meshes, we construct a hybrid operator by combining physics-based discretizations with GNN-based corrections, while a lightweight residual GNN captures dynamics beyond the operator. Across three categories of spatiotemporal PDE scenarios, DGNet consistently achieves state-of-the-art accuracy. In particular, on the most challenging setting with entirely novel source terms, DGNet maintains stable performance while existing approaches collapse.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3242. Variation in Verification: Understanding Verification Dynamics in Large Language Models
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ generative verification large language model test-time scaling
Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions -- problem difficulty, generator capability, and verifier generation capability -- through empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than strong generators; (3) Verification ability is generally correlated with the verifier's own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities for optimizing basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.7%). Second, we identify cases where strong verifiers offer limited advantages over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3243. Topology-Preserved Auto-regressive Mesh Generation in the Manner of Weaving Silk
๐ฌ 27
โญ 5.00
๐ generative models
๐ท๏ธ 3D Generation Auto-regressive Mesh Generation
Existing auto-regressive mesh generation approaches suffer from ineffective topology preservation, which is crucial for practical applications.
This limitation stems from previous mesh tokenization methods treating meshes as simple collections of equivalent triangles, lacking awareness of the overall topological structure during generation. To address this issue, we propose a novel mesh tokenization algorithm that provides a canonical topological framework through vertex layering and ordering, ensuring critical geometric properties including manifoldness, watertightness, face normal consistency, and part awareness in the generated meshes. Measured by Compression Ratio and Bits-per-face, we also achieve state-of-the-art compression efficiency. Furthermore, we introduce an online non-manifold data processing algorithm and a training resampling strategy to expand the scale of the trainable dataset and avoid costly manual data curation.
Experimental results demonstrate the effectiveness of our approach, showcasing not only intricate mesh generation but also significantly improved geometric integrity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3244. No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection
๐ฌ 9
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ AIGI detection High-Resolution detection Feature aggregation
The rapid growth of high-resolution, meticulously crafted AI-generated images poses a significant challenge to existing detection methods, which are often trained and evaluated on low-resolution, automatically generated datasets that do not align with the complexities of high-resolution scenarios. A common practice is to resize or center-crop high-resolution images to fit standard network inputs. However, without full coverage of all pixels, such strategies risk either obscuring subtle, high-frequency artifacts or discarding information from uncovered regions, leading to input information loss. In this paper, we introduce the **H**igh-Resolution **D**etail-**A**ggregation Network (**HiDA-Net**), a novel framework that ensures no pixel is left behind. We use the Feature Aggregation Module (FAM), which fuses features from multiple full-resolution local tiles with a down-sampled global view of the image. These local features are aggregated and fused with global representations for final prediction, ensuring that native-resolution details are preserved and utilized for detection. To enhance robustness against challenges such as localized AI manipulations and compression, we introduce a Token-wise Forgery Localization (TFL) module for fine-grained spatial sensitivity and a JPEG Quality Factor Estimation (QFE) module to explicitly disentangle generative artifacts from compression noise. Furthermore, to facilitate future research, we introduce **HiRes-50K**, a new challenging benchmark consisting of **50,568** images with up to **64 megapixels**. Extensive experiments show that HiDA-Net achieves state-of-the-art performance, increasing accuracy by over **13%** on the challenging Chameleon dataset and **8%** on our HiRes-50K.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3245. Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
๐ฌ 14
โญ 5.00
๐ optimization
๐ท๏ธ LLM Compression Pruning Reasoning Chain-of-Thought
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the modelโs on-policy chain-of-thought traces. This โReasoning-Aware Compressionโ (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Anonymized code can be found at: https://github.com/Anon-ICLR-RAC/ICLR-2025-Anonymous-Submission
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3246. EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models reasoning models reinforcement learning RLVR exploration unlearning
Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, reduced exploratory capacity, and ultimately limited performance gains. Although techniques that add randomness increase policy stochasticity, they frequently fail to escape dominant behavioral modes. The resulting sample-and-reward dynamics amplify these modes, eroding exploration and leading to entropy collapse. We introduce Exploration-Enhanced Policy Optimization (EEPO), a novel framework that promotes exploration through two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight, temporary unlearning step to suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism actively steers the policy away from dominant modes and encourages mode-seeking exploration. Across five reasoning benchmarks, EEPO consistently outperforms baselines, achieving average gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3247. From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics
๐ฌ 21
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Mathematical Reasoning Evaluation
Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce CORE-MATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3248. CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation
๐ฌ 9
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ KV cache Sequential recommendation
Sequential recommendation models are widely used in applications, yet they face stringent latency requirements.
Mainstream models leverage the Transformer attention mechanism to improve performance, but its computational complexity grows with the sequence length, leading to a latency challenge for long sequences.
Consequently, KV cache technology has recently been explored in sequential recommendation systems to reduce inference latency.
However, KV cache introduces substantial storage overhead in sequential recommendation systems, which often have a large user base with potentially very long user history sequences.
In this work, we observe that KV sequences across different users exhibit significant similarities, indicating the existence of collaborative signals in KV.
Furthermore, we analyze the KV using singular value decomposition (SVD) and find that the information in KV can be divided into two parts: the majority of the information is shareable across users, while a small portion is user-specific.
Motivated by this, we propose CollectiveKV, a cross-user KV sharing mechanism.
It captures the information shared across users through a learnable global KV pool. During inference, each user retrieves high-dimensional shared KV from the pool and concatenates them with low-dimensional user-specific KV to obtain the final KV.
Experiments on five sequential recommendation models and three datasets show that our method can compress the KV cache to only 0.8% of its original size, while maintaining or even enhancing model performance.
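A toy NumPy illustration of the SVD observation mentioned above, on synthetic data constructed to have a low-rank cross-user component: most KV energy concentrates in a few directions that could be shared globally, leaving only a small user-specific residual. All shapes and the generative assumptions are made up for illustration; this is not the paper's analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, n_users, r_shared = 64, 128, 32, 8

shared_basis = rng.normal(size=(r_shared, d))                      # collaborative structure
user_kv = [rng.normal(size=(seq_len, r_shared)) @ shared_basis     # shared part
           + 0.1 * rng.normal(size=(seq_len, d))                   # user-specific part
           for _ in range(n_users)]

stacked = np.concatenate(user_kv, axis=0)                          # (n_users*seq_len, d)
_, s, vt = np.linalg.svd(stacked, full_matrices=False)
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
print(f"top-{r_shared} singular directions explain {energy[r_shared - 1]:.1%} of KV energy")

# A CollectiveKV-style split would keep vt[:r_shared] as a global, shareable pool
# and store only the small residual per user.
shared_proj = stacked @ vt[:r_shared].T @ vt[:r_shared]
residual = stacked - shared_proj
print("mean residual norm per token:", np.linalg.norm(residual, axis=1).mean())
```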
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3249. SigLIP-HD by Fine-to-Coarse Supervision
๐ฌ 15
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ vision-language model multi-modality LLM
High-quality visual representation is a long-standing pursuit in computer vision. In the context of multimodal LLMs (MLLMs), feeding higher-resolution images can produce more fine-grained visual tokens. However, it introduces additional computational and design complexity due to multiple forward passes and post-processing of the increased tokens. Before simply adopting a higher resolution, have we truly unlocked the model's full perception capability at a standard resolution? We therefore study an interesting problem: how to achieve fine visual perception at lower cost, without larger images. We present SigLIP-HD in this work. The core is a simple fine-to-coarse supervision design: we enforce the coarse feature of a mid-resolution image to mimic the fine-grained feature of its high-resolution version. We build this framework on the advanced SigLIP 2 model. Our final model produces better visual tokens at exactly the same inference budget. It is validated on extensive MLLM benchmarks and consistently delivers stronger results than our baseline model, especially on OCR-related tasks.
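A hedged sketch of what such a fine-to-coarse supervision loss could look like, assuming both encoders return square grids of patch tokens of shape (B, N, D); the pooling and cosine matching below are illustrative choices, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def fine_to_coarse_loss(student, teacher, img_mid, img_high):
    """Match the student's coarse tokens (mid-resolution input) against the
    teacher's fine-grained tokens (high-resolution input), after pooling the
    denser teacher grid down to the student's grid. Assumes square token grids."""
    with torch.no_grad():
        t = teacher(img_high)                      # (B, Ht*Wt, D), fine tokens
    s = student(img_mid)                           # (B, Hs*Ws, D), coarse tokens

    B, Nt, D = t.shape
    _, Ns, _ = s.shape
    ht, hs = int(Nt ** 0.5), int(Ns ** 0.5)
    # Pool the fine teacher grid to the coarse student grid.
    t = t.transpose(1, 2).reshape(B, D, ht, ht)
    t = F.adaptive_avg_pool2d(t, hs).reshape(B, D, Ns).transpose(1, 2)

    # Cosine-style matching of coarse tokens to their fine-grained targets.
    return (1 - F.cosine_similarity(s, t, dim=-1)).mean()
```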
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3250. MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation
๐ฌ 19
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Deep Learning Explicit modeling Geometric Deep Learning Simulation of Solid Deformation
Deep learning-based approaches, particularly graph neural networks (GNNs), have gained prominence in simulating flexible deformations and contacts of solids, due to their ability to handle unstructured physical fields and nonlinear regression on graph structures. However, existing GNNs commonly represent meshes with graphs built solely from vertices and edges. These approaches tend to overlook higher-dimensional spatial features, e.g. 2D facets and 3D cells, from the original geometry. As a result, it is challenging to accurately capture boundary representations and volumetric characteristics, though this information is critically important for modeling contact interactions and internal physical quantity propagation, particularly under sparse mesh discretization. In this paper, we introduce MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation, which explicitly models geometric mesh elements of higher dimension to achieve a more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated into the model to alleviate the burden of implicitly learning geometric patterns. Experimental results show that MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3251. Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning
๐ฌ 12
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models; Reinforcement Learning; Adaptive Sampling Temperature; Meta-Optimization; GRPO
Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail to adapt to the dynamic demands of reinforcement learning (RL) throughout training, often limiting policy improvement. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy. TAMPO operates through a hierarchical two-loop process. In the inner loop, the LLM policy is updated (e.g., using GRPO) with trajectories sampled at the temperature selected by the meta-policy.
In the outer loop, the meta-policy updates the distribution over candidate temperatures by rewarding those that maximize the likelihood of high-advantage trajectories. This trajectory-guided, reward-driven mechanism enables online adaptation without additional rollouts, directly aligning exploration with policy improvement. On five mathematical reasoning benchmarks, TAMPO outperforms baselines using fixed or heuristic temperatures, establishing temperature as an effective learnable meta-policy for adaptive exploration in LLM reinforcement learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3252. CoMind: Towards Community-Driven Agents for Machine Learning Engineering
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LLM Agent
Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent's ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a multi-agent system designed to actively integrate external knowledge.
CoMind employs an iterative parallel exploration mechanism, developing multiple solutions simultaneously to balance exploratory breadth with implementation depth.
On 75 past Kaggle competitions within our MLE-Live framework, CoMind achieves a 36% medal rate, establishing a new state of the art. Critically, when deployed in eight live, ongoing competitions, CoMind outperforms 92.6% of human competitors on average, placing in the top 5% on three official leaderboards and the top 1% on one.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3253. Online Black-Box Prompt Optimization with Regret Guarantees under Noisy Feedback
๐ฌ 20
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Black-Box Prompt Optimization Online Learning Generative Al
Generative AI excels in various tasks through advanced language modeling techniques, with its performance heavily influenced by input prompts. This has driven significant research into prompt optimization, particularly in commercial generative AI platforms, where prompt optimization is treated as a black-box optimization problem. Most existing research on black-box prompt optimization primarily focuses on offline learning and overlooks the randomness in outputs. However, in real-world applications, black-box prompt optimization typically operates in an online learning setting, which remains largely unexplored, especially given the noisy outputs. To address these challenges, we propose an **A**daptive **O**nline **Z**eroth-order **P**rompt **T**uning (AOZPT) approach, which integrates zeroth-order optimization with online learning in the non-convex setting. Specifically, we develop an uncertainty-scale-adjustment mechanism to mitigate the noise inherent in generative AI and the high variance associated with zeroth-order estimates. We conduct a comprehensive regret analysis of the AOZPT approach, and the results indicate that sublinear regret convergence is achievable. Extensive generative experiments demonstrate that AOZPT outperforms existing black-box prompt tuning methods, particularly in terms of stability in online scenarios.
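For context, black-box prompt tuners of this kind build on the classic two-point zeroth-order gradient estimator sketched below (a textbook estimator applied to a toy noisy objective, not the AOZPT algorithm or its uncertainty-scale adjustment):

```python
import numpy as np

def zo_gradient(f, x, mu=0.05, n_probes=8, rng=None):
    """Two-point zeroth-order gradient estimate: f is a noisy black-box score of
    a continuous prompt representation x (e.g. an averaged task reward from an
    LLM API), queried only at perturbed points."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(n_probes):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_probes

# Toy usage: maximize a noisy quadratic "reward" with online zeroth-order ascent.
rng = np.random.default_rng(0)
target = rng.normal(size=16)
noisy_reward = lambda x: -np.sum((x - target) ** 2) + 0.1 * rng.normal()

x = np.zeros(16)
for step in range(200):
    x += 0.01 * zo_gradient(noisy_reward, x, rng=rng)
print("distance to optimum:", np.linalg.norm(x - target))
```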
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3254. Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
๐ฌ 18
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Visual Document Retrieval Test Time Hybrid Retrieval multimodal RAG
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual tokens directly to image patches and achieving state-of-the-art performance on challenging benchmarks. Recent models relying on this paradigm have massively scaled the dimensionality of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows ColPali-based models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://anonymous.4open.science/r/test-time-hybrid-retrieval-5485.
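One plausible instantiation of the refinement step, shown with single-vector dot-product scores for brevity (ColPali-style late interaction would replace them); the KL-to-auxiliary-distribution objective is an assumption for illustration, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def guided_query_refinement(q_primary, doc_primary, scores_aux, steps=10, lr=0.1, tau=1.0):
    """Nudge the primary retriever's query embedding so that its score
    distribution over candidate documents moves toward the complementary
    retriever's distribution, then re-rank with the refined query.
    q_primary: (D,), doc_primary: (N, D), scores_aux: (N,) from the text retriever."""
    q = q_primary.clone().detach().requires_grad_(True)
    target = F.softmax(scores_aux / tau, dim=0).detach()
    opt = torch.optim.Adam([q], lr=lr)
    for _ in range(steps):
        scores = doc_primary @ q                    # primary (vision-centric) scores
        loss = F.kl_div(F.log_softmax(scores / tau, dim=0), target, reduction="sum")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q.detach()
```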
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3255. ZeroSiam: An Efficient Siamese for Test-Time Entropy Optimization without Collapse
๐ฌ 22
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Test-Time Adaptation Out-of-distribution Generalization Entropy Minimization
Test-time entropy minimization helps adapt a model to novel environments and incentivize its reasoning capability, unleashing the model's potential during inference by allowing it to evolve and improve in real time using its own predictions. However, pure test-time entropy minimization can favor non-generalizable shortcuts, such as inflating the logit norm and driving all predictions to a dominant class to reduce entropy, risking collapsed solutions (e.g., constant one-hot outputs) that trivially minimize the objective without meaningful learning. In this paper, we introduce ZeroSiam, an efficient asymmetric Siamese architecture tailored for test-time entropy minimization. ZeroSiam prevents collapse through asymmetry learning, which is efficiently achieved by a learnable predictor and a stop-gradient operator before the classifier. Beyond this collapse-prevention effect, we provide empirical and theoretical evidence that ZeroSiam also absorbs and regularizes biases at test time, enhancing TTA effectiveness even when no collapse occurs. Despite its simplicity, extensive results show that ZeroSiam avoids collapse and performs more stably than prior methods with negligible overhead, demonstrating efficacy on both vision and language tasks across challenging test scenarios and diverse models, e.g., tiny base models that are particularly collapse-prone.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3256. SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Semi-Structured Data Reinforcement Learning Information Extraction
Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (**SCRI**pt-**B**ased Semi-Structured Content **E**xtraction at Web-**S**cale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3257. RESCHED: Rethinking Flexible Job Shop Scheduling from a Transformer-based Architecture with Simplified States
๐ฌ 6
โญ 5.00
๐ optimization
๐ท๏ธ Flexible Flow Shop Scheduling Problem; Deep Reinforcement Learning; Transformer Architecture
Neural approaches to the Flexible Job Shop Scheduling Problem (FJSP), particularly those based on deep reinforcement learning (DRL), have gained growing attention in recent years. Yet existing methods often rely on cumbersome state representations (i.e. sometimes requiring more than 20 handcrafted features) and suboptimal neural architectures. We introduce \textsc{ReSched}, a minimalist DRL framework that rethinks both the scheduling formulation and model design. First, we revisit the Markov Decision Process (MDP) formulation of FJSP, reducing the state to just four essential features and replacing historical dependencies with a graph structure that directly encodes intra-job operation relationships. Second, we employ Transformer blocks with dot-product attention, augmented by three lightweight but effective architectural modifications tailored to scheduling. Extensive experiments show that \textsc{ReSched} outperforms classical dispatching rules and state-of-the-art DRL methods on FJSP. Moreover, \textsc{ReSched} generalizes well to the Job Shop Scheduling Problem (JSSP) and the Flexible Flow Shop Scheduling Problem (FFSP), achieving competitive performance against neural baselines specifically designed for these variants.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3258. Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
๐ฌ 18
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ adversarial robustness inference-compute scaling VLMs efficiency
Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model's training data better reflects the attacked dataโs components. We empirically support this hypothesis across vision language model and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization. For example, InternVL 3.5 gpt-oss 20B gains little robustness when its test compute is scaled, but such scaling adds significant robustness if we first robustify its vision encoder. This correlation of inference-compute's robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Thus, we advise layering train-time and test-time defenses to obtain their synergistic benefit.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3259. CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
๐ฌ 45
โญ 5.00
๐ generative models
๐ท๏ธ Multi Latent Attention Covariance & Rank aware Singular value decomposition
Converting pretrained attention modules such as *grouped-query attention* (GQA) into *multi-head latent attention* (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, existing conversion methods typically apply naïve singular value decomposition (SVD). They focus on minimizing the difference between weight matrices rather than how those weights affect input activations, ignore the covariance structure of activations, and enforce a uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose **CARE** (**C**ovariance-**A**ware, **R**ank-**E**nhanced), an MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) **Activation-preserving factorization**: align the approximation with the actual input activations rather than just the weights. (ii) **Adjusted-rank allocation**: distribute a fixed KV budget across layers by giving more capacity to layers that need it most. (iii) **KV-parity mapping**: reparameterize the converted $K$ and $V$ to fit the MLA format while keeping the KV-cache size unchanged. Under a matched KV-cache budget, our method consistently outperforms a uniform-rank SVD baseline on Llama-3-8B, delivering up to **331%** relative gains in one-shot evaluation (higher accuracy, lower perplexity). With a brief post-SVD "healing" fine-tune, we fully recover the original model's accuracy.
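The activation-preserving step can be illustrated with a standard whitened (covariance-aware) SVD, sketched below in NumPy. This is a generic construction under the stated idea, not the paper's pipeline; rank allocation and the KV-parity mapping are omitted.

```python
import numpy as np

def activation_aware_lowrank(W, C, rank, eps=1e-6):
    """Instead of minimizing ||W - W_hat||_F, minimize ||(W - W_hat) x|| in
    expectation by whitening with the input activation covariance C = E[x x^T]
    before truncating. Returns A, B with W_hat = A @ B of the given rank."""
    vals, vecs = np.linalg.eigh(C + eps * np.eye(C.shape[0]))
    S = vecs @ np.diag(np.sqrt(vals)) @ vecs.T        # C^{1/2}
    S_inv = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    U, sig, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sig[:rank]                      # (d_out, r)
    B = Vt[:rank] @ S_inv                             # (r, d_in)
    return A, B

# Quick check on random data: activation-aware truncation is never worse than
# plain SVD when the error is measured on the activations themselves.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
X = rng.normal(size=(128, 4096)) * np.linspace(0.1, 3.0, 128)[:, None]  # anisotropic inputs
C = X @ X.T / X.shape[1]

A, B = activation_aware_lowrank(W, C, rank=16)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_plain = (U[:, :16] * s[:16]) @ Vt[:16]
print("activation error (aware):", np.linalg.norm((W - A @ B) @ X))
print("activation error (plain):", np.linalg.norm((W - W_plain) @ X))
```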
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3260. An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems
๐ฌ 25
โญ 5.00
๐ optimization
๐ท๏ธ Vehicle Routing Problems Agent LLM
Complex vehicle routing problems (VRPs) remain a fundamental challenge, demanding substantial expert effort for intent interpretation and algorithm design. While large language models (LLMs) offer a promising path toward automation, current approaches still rely on external intervention, which restricts autonomy and often leads to execution errors and low solution feasibility. To address these challenges, we propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems, achieving full automation from problem instance to solution. AFL directly extracts knowledge from raw inputs and enables self-contained code generation without handcrafted modules or external solvers. To improve trustworthiness, AFL decomposes the overall pipeline into three manageable subtasks and employs four specialized agents whose coordinated interactions enforce cross-functional consistency and logical soundness. Extensive experiments on 20 complex VRPs, ranging from standard benchmarks to practical variants, validate the effectiveness and generality of our framework, showing comparable performance against meticulously designed algorithms. Notably, it substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility, achieving rates close to 100% on the evaluated benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3261. Q&C: When Quantization Meets Cache in Efficient Generation
๐ฌ 13
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ diffusion model visual generation models cache
Quantization and cache mechanisms are typically applied individually in efficient generation tasks, each showing notable potential for acceleration. However, their joint effect on efficiency remains under-explored. Through both empirical investigation and theoretical analysis, we find that combining quantization with caching is non-trivial, as it introduces two major challenges that severely degrade performance: (i) the sample efficacy of calibration datasets in post-training quantization (PTQ) is largely eliminated by the cache operation; (ii) the joint use of the two mechanisms exacerbates exposure bias in the sampling distribution, leading to amplified error accumulation during generation. In this work, we take advantage of these two acceleration mechanisms and propose a hybrid acceleration method that tackles the above challenges, aiming to further improve efficiency while maintaining excellent generation capability. Concretely, a temporal-aware parallel clustering (TAP) scheme is designed to dynamically improve the sample selection efficacy for PTQ calibration across different diffusion steps. A variance compensation (VC) strategy is derived to correct the sampling distribution; it mitigates exposure bias through adaptive generation of a correction factor. Extensive experiments demonstrate that our method is broadly applicable to diverse generation tasks, achieving up to 12.7x acceleration while preserving competitive generation quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3262. ReTrace: Reinforcement Learning-Guided Reconstruction Attacks on Machine Unlearning
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine Unlearning Reinforcement Learning Reconstruction Attack
Machine unlearning has emerged as an inevitable AI mechanism to support GDPR requirements such as revoking user consent through the "right to be forgotten".
However, existing approaches often leave residual traces that make them vulnerable to data reconstruction attacks.
In this work, we propose ReTrace, the first reconstruction attack framework that uniquely formulates unlearned data recovery on large-scale deep architectures as a reinforcement learning (RL) problem.
By treating residual unlearning traces as reward signals, ReTrace guides a generator to actively explore the input space and converge toward the forgotten data distribution.
This RL-guided approach enables both instance-level recovery of individual samples and distribution-level reconstruction of unlearned classes.
We provide a theoretical foundation showing that the RL objective converges to an exponential-tilted distribution that amplifies forgotten regions.
Empirically, ReTrace achieves up to 73.1% instance-level recovery and reduces FID and KL scores below those of the state-of-the-art baselines UIA (IEEE S&P 2024) and HRec (NeurIPS 2024).
Strikingly, on the challenging task of text unlearning, it improves BLEU scores by nearly 100% over black-box baselines while preserving distributional fidelity, demonstrating that RL can recover even high-dimensional and structured modalities. Furthermore, ReTrace demonstrates effectiveness across both convolutional (ResNet) and transformer-based models, with Distil-BERT as the largest architecture attacked to date. These results show that current unlearning methods remain vulnerable, highlighting the need for robust and provably private mechanisms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3263. Learning for Highly Faithful Explainability
๐ฌ 8
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Explainability; Faithfulness; Learning to Explain
Learning to Explain is a forward-looking paradigm recently proposed in the field of explainable AI, which envisions training explainers capable of producing high-quality explanations for target models efficiently. Although existing studies have made attempts through self-supervised optimization or learning from prior explanation methods, the Learning to Explain paradigm still faces three critical challenges: 1) self-supervised objectives often rely on assumptions about the target model or task, restricting their generalizability; 2) methods driven by prior explanations struggle to guarantee the quality of the supervisory signals; and 3) depending exclusively on either approach leads to poor convergence or limited explanation quality. To address these challenges, we propose a faithfulness-guided amortized explainer that 1) theoretically derives a self-supervised objective free from assumptions about the target model or task, 2) practically generates high-quality supervisory signals by deduplicating and filtering prior explanations, and 3) jointly optimizes both objectives via a dynamic weighting strategy, enabling the amortized explainer to produce more faithful explanations for complex, high-dimensional models. We re-formalize multiple well-validated faithfulness evaluation metrics within a unified notation system and theoretically prove that an explanation mapping can simultaneously achieve optimality across all these metrics. We aggregate prior explanation methods to generate high-quality supervised signals through deduplicating and faithfulness-based filtering. Our amortized explainer leverages dynamic weighting to guide optimization, initially emphasizing pattern consistency with the supervised signals for rapid convergence, and subsequently refining explanation quality by approximating the most faithful explanation mapping. Extensive experiments across various target models and image, text, and tabular tasks demonstrate that the proposed explainer consistently outperforms all prior explanation methods across all faithfulness metrics, highlighting its effectiveness and its potential to offer a systematic solution to the fundamental challenges of the Learning to Explain paradigm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3264. Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis
๐ฌ 12
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Computational Pathology Multimodal Learning Cancer Survival Prediction
The integration of histology images and gene profiles has shown great promise for improving survival prediction in cancer. However, current approaches often struggle to model intra- and inter-modal interactions efficiently and effectively due to the high dimensionality and complexity of the inputs. A major challenge is capturing critical prognostic events that, though few, underlie the complexity of the observed inputs and largely determine patient outcomes. These events---manifested as high-level structural signals such as spatial histologic patterns or pathway co-activations---are typically sparse, patient-specific, and unannotated, making them inherently difficult to uncover. To address this, we propose SlotSPE, a slot-based framework for structural prognostic event modeling. Specifically, inspired by the principle of factorial coding, we compress each patientโs multimodal inputs into compact, modality-specific sets of mutually distinctive slots using slot attention. By leveraging these slot representations as encodings for prognostic events, our framework enables both efficient and effective modeling of complex intra- and inter-modal interactions, while also facilitating seamless incorporation of biological priors that enhance prognostic relevance. Extensive experiments on ten cancer benchmarks show that SlotSPE outperforms existing methods in 8 out of 10 cohorts, achieving an overall improvement of 2.9%. It remains robust under missing genomic data and delivers markedly improved interpretability through structured event decomposition.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3265. SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
๐ฌ 19
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Diffusion Language Models Policy Gradient
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3266. Flow Matching with Semidiscrete Couplings
๐ฌ 17
โญ 5.00
๐ generative models
๐ท๏ธ flow matching optimal transport semidiscrete optimal transport
Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE.
These models are often trained using flow matching, i.e. by sampling random pairs of noise and target points $(x_0,x_1)$ and ensuring that the velocity field is aligned, on average, with $x_1-x_0$ when evaluated along a time-indexed segment linking $x_0$ to $x_1$.
While these noise/data pairs are sampled independently by default, they can also be selected more carefully by matching batches of $n$ noise to $n$ target points using an optimal transport (OT) solver.
Although promising in theory, the OT flow matching (OT-FM) approach (Pooladian et al., 2023, Tong et al., 2024) is not widely used in practice.
Zhang et al. (2025) recently pointed out that OT-FM truly starts paying off when the batch size $n$ grows significantly, which only a multi-GPU implementation of the Sinkhorn algorithm can handle.
Unfortunately, the pre-compute costs of running Sinkhorn can quickly balloon, requiring $O(n^2/\varepsilon^2)$ operations for every $n$ pairs used to fit the velocity field, where $\varepsilon$ is a regularization parameter that should be typically small to yield better results.
To fulfill the theoretical promises of OT-FM, we propose to move away from batch-OT and rely instead on a semidiscrete OT (SD-OT) formulation that can leverage the fact that the target dataset distribution is usually of finite size $N$. The SD-OT problem is solved by estimating a dual potential vector of size $N$ using SGD; using that vector, freshly sampled noise vectors at train time can then be matched with data points at the cost of a maximum inner product search (MIPS) over the dataset.
Semidiscrete FM (SD-FM) removes the quadratic dependency on $n/\varepsilon$ that bottlenecks OT-FM. SD-FM beats both FM and OT-FM on all training metrics and under all inference budget constraints, across multiple datasets, for unconditional/conditional generation, and when using mean-flow models.
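A minimal sketch of the semidiscrete machinery described above, for squared Euclidean cost and uniform weight 1/N on the data points (illustrative only; the paper's estimator, cost, and MIPS indexing may differ): the dual potential is fit by SGD, after which matching a fresh noise sample is a single argmin over the shifted dataset.

```python
import numpy as np

def fit_semidiscrete_potential(data, sample_noise, steps=20000, lr=0.1):
    """SGD on the semidiscrete OT dual: for each sampled noise vector, the
    assigned data index gets its potential decreased while all potentials get a
    uniform 1/N increase (stochastic gradient of the concave dual objective)."""
    N = len(data)
    g = np.zeros(N)
    sq = 0.5 * (data ** 2).sum(1)
    for _ in range(steps):
        x = sample_noise()
        # argmin_j [ c(x, y_j) - g_j ] with c = 0.5||x - y||^2; the x-only term drops.
        j = np.argmin(sq - data @ x - g)
        grad = np.full(N, 1.0 / N)
        grad[j] -= 1.0
        g += lr * grad
    return g

def match(x, data, g):
    """Pair a fresh noise vector with its data point under the learned potential."""
    sq = 0.5 * (data ** 2).sum(1)
    return np.argmin(sq - data @ x - g)

# Toy usage: the matched pairs (x0, data[j]) replace independent noise/data pairs
# in the flow-matching regression target data[j] - x0.
rng = np.random.default_rng(0)
data = rng.normal(size=(512, 2)) + np.array([4.0, 0.0])
g = fit_semidiscrete_potential(data, lambda: rng.normal(size=2))
x0 = rng.normal(size=2)
print("noise", x0, "matched to data point", match(x0, data, g))
```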
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3267. Unlocking the Power of Co-Occurrence in CLIP: A DualPrompt-Driven Method for Training-Free Zero-Shot Multi-Label Classification
๐ฌ 4
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Zero-shot multi-label classification co-occurrence
Contrastive Language-Image Pretraining (CLIP) has exhibited powerful zero-shot capacity in various single-label image classification tasks. However, when applied to multi-label scenarios, CLIP suffers from significant performance declines due to the lack of explicit exploitation of co-occurrence information. In pretraining, due to the contrastive property of its training objective, the model focuses on the prominent object in an image while overlooking other objects and their co-occurrence relationships; at inference, it uses a discriminative prompt containing only a target label name to make predictions, which does not introduce any co-occurrence information. An important question then arises: *Do we need label co-occurrence in CLIP for achieving effective zero-shot multi-label learning?* In this paper, we propose to rewrite the original prompt into a correlative form consisting of both the target label and its co-occurring labels. An interesting finding is that such a simple modification can effectively introduce co-occurrence information into CLIP, and it exhibits both good and bad effects. On the one hand, it can enhance the recognition capacity of CLIP by exploiting the correlative pattern activated by the correlative prompt; on the other hand, it leads to object hallucination in CLIP, where the model predicts objects that do not actually exist in the image, due to overfitting to co-occurrence. To address this problem, we propose to calibrate CLIP predictions by keeping the positive effect while removing the negative effect caused by suspicious co-occurrence. This is achieved by using dual prompts consisting of the discriminative and correlative prompts, which introduce label co-occurrence while emphasizing the discriminative pattern of the target object. Experimental results verify that our method achieves better performance than state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3268. Dual Randomized Smoothing: Beyond Global Noise Variance
๐ฌ 9
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ robustness certification randomized smoothing
Randomized Smoothing (RS) is a prominent technique for certifying the robustness of neural networks against adversarial perturbations. With RS, achieving high accuracy at small radii requires a small noise variance, while achieving high accuracy at large radii requires a large noise variance. However, the global noise variance used in the standard RS formulation leads to a fundamental limitation: there exists no global noise variance that simultaneously achieves strong performance at both small and large radii. To break through the global variance limitation, we propose a dual RS framework which enables input-dependent noise variances. To achieve that, we first prove that RS remains valid with input-dependent noise variances, provided the variance is locally constant around each input. Building on this result, we introduce two components which form our dual RS framework: (i) a variance estimator first predicts an optimal noise variance for each input, (ii) this estimated variance is then used by a standard RS classifier. The variance estimator is independently smoothed via RS to ensure local constancy, enabling flexible design. We also introduce training strategies to iteratively optimize the two components involved in the framework. Extensive experiments on the CIFAR-10 dataset demonstrate that our dual RS method provides strong performance for both small and large radii, unattainable with a global noise variance, while incurring only a 60\% computational overhead at inference. Moreover, it consistently outperforms prior input-dependent noise approaches across most radii, with particularly large gains at radii 0.5, 0.75, and 1.0, achieving relative improvements of 19.2\%, 24.2\%, and 20.6\%, respectively. On ImageNet, dual RS remains effective across all radii, with roughly 1.5x performance advantages at radii 0.5, 1.0 and 1.5. Additionally, the proposed dual RS framework naturally provides a routing perspective for certified robustness, improving the accuracy-robustness trade-off with off-the-shelf expert RS models.
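A minimal sketch of how the two components fit together, assuming a Cohen et al.-style Monte-Carlo certification and a separately smoothed variance estimator; the function names, sample counts, and confidence level are illustrative assumptions.

```python
# Dual RS sketch: a per-input sigma from a smoothed estimator feeds a standard RS certifier.
import torch
from scipy.stats import norm
from statsmodels.stats.proportion import proportion_confint

def smoothed_predict_and_certify(f, x, sigma, n=1000, alpha=0.001, num_classes=10):
    """Monte-Carlo RS certification at noise level sigma (held fixed for this input)."""
    noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)
    counts = torch.bincount(f(noisy).argmax(dim=1), minlength=num_classes)
    top = counts.argmax().item()
    # One-sided lower confidence bound on the top-class probability (Clopper-Pearson).
    p_lower = proportion_confint(counts[top].item(), n, alpha=2 * alpha, method="beta")[0]
    if p_lower <= 0.5:
        return None, 0.0                      # abstain
    radius = sigma * norm.ppf(p_lower)        # certified L2 radius
    return top, radius

def dual_rs(variance_estimator, classifier, x):
    sigma = variance_estimator(x)             # itself smoothed so it is locally constant around x
    return smoothed_predict_and_certify(classifier, x, sigma)
```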
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3269. DEAS: DEtached value learning with Action Sequence for Scalable Offline RL
๐ฌ 18
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Offline Reinforcement Learning Vision Language Action Model
Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high returns in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3270. TAMMs: Change Understanding and Forecasting in Satellite Image Time Series with a Temporal-Aware Multimodal Model
๐ฌ 4
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Remote Sensing Satellite Image Time Series Temporal Reasoning Generative models Change-aware Generation Multimodal Large Language Models
Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) are critical, yet historically disjointed tasks in Satellite Image Time Series (SITS) analysis. Both are fundamentally limited by the common challenge of modeling long-range temporal dynamics. To explore how both tasks can be improved simultaneously by enhancing long-range temporal understanding, we introduce \textbf{TAMMs}, the first unified framework designed to jointly perform TCD and FSIF within a single MLLM-diffusion architecture. TAMMs introduces two key innovations: Temporal Adaptation Modules (\textbf{TAM}) enhance the frozen MLLM's ability to comprehend long-range dynamics, and a Semantic-Fused Control Injection (\textbf{SFCI}) mechanism translates this change understanding into fine-grained generative control. This synergistic design enables the understanding gained from the TCD task to directly inform and improve the consistency of the FSIF task. Extensive experiments demonstrate that TAMMs significantly outperforms state-of-the-art specialist baselines on both tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3271. Reassessing Layer Pruning in LLMs: New Insights and Methods
๐ฌ 25
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Layer Pruning Model Compression
Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Adaptation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final layers followed by fine-tuning the lm\_head and the remaining last three layers, yields remarkably strong performance. These pruning strategies are further supported by theoretical analyses based on the gradient flow. Following this guide, our method surpasses existing state-of-the-art pruning methods by $5.62\%$–$17.27\%$ on Llama-3.1-8B-It, by $2.36\%$–$19.45\%$ on Llama-3-8B, and by $4.34\%$–$9.59\%$ on Llama-3-70B. The code is available on GitHub.
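A hedged sketch of the recipe summarized above (drop final transformer blocks, then fine-tune only lm_head and the last three remaining blocks), assuming a Hugging Face Llama-style model; the checkpoint name, number of pruned blocks, and dtype are illustrative.

```python
# Prune final layers, then unfreeze lm_head + last three remaining blocks for fine-tuning.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

n_prune = 8                                                    # illustrative choice
layers = model.model.layers
model.model.layers = layers[: len(layers) - n_prune]           # drop the last n_prune blocks
model.config.num_hidden_layers = len(model.model.layers)

# Freeze everything, then unfreeze lm_head and the last three remaining blocks.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True
for block in model.model.layers[-3:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")
# ... a standard fine-tuning loop (Trainer or manual) would follow here.
```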
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3272. Three Forward, One Backward: Memory-Efficient Full-Rank Fine-Tuning of Large Models via Extra Forward Passes
๐ฌ 32
โญ 5.00
๐ optimization
๐ท๏ธ "LLM tuning" "LoRA" "Zeroth order"
Fine-tuning large language models (LLMs) has achieved significant success in downstream tasks.
However, as the model size continues to grow, traditional fine-tuning methods have become increasingly impractical due to their high computational and memory costs.
This has motivated researchers to explore parameter-efficient and memory-friendly fine-tuning strategies to enable scalable approaches, with Low-Rank Adaptation (LoRA) standing out as a representative work.
However, the LoRA update is restricted to a low-rank subspace, which results in suboptimal performance compared to the full-parameter update.
Recent research has also explored memory-efficient fine-tuning of LLMs using only forward passes, but such methods suffer from high variance in gradient estimation and slow convergence.
To address the issues above, we propose a new alternating optimization framework called LMAO (Low-rank and Memory-efficient Zeroth-Order Alternating Optimization), which combines the advantages of LoRA and MeZO.
This method alternately updates the low-rank components and zeroth-order directions during training.
By performing three forward propagations and one backward propagation, each update is full-rank, thereby reducing feature loss and enabling efficient fine-tuning under strict memory constraints.
We provide theoretical guarantees on the convergence and convergence rate of this method.
Empirical results demonstrate that, in experiments on multiple models (e.g., OPT, RoBERTa-large), LMAO achieves performance comparable to first-order methods.
This presents a practical and scalable solution for fine-tuning large-scale models.
Our source code is available at https://anonymous.4open.science/r/mlao-C2EC/.
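A hedged sketch of one alternating step in the spirit of the method described above: a MeZO-style zeroth-order update of the full weights (two forward passes with a shared random seed) alternated with a standard first-order update of the small LoRA parameter set (one forward and one backward pass). Function names, learning rates, and the exact schedule are illustrative assumptions.

```python
import torch

def zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """MeZO/SPSA-style zeroth-order update using two forward passes.
    The shared seed regenerates the perturbation directions, so nothing extra is stored."""
    def perturb(scale):
        gen = torch.Generator().manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
            p.data.add_(scale * eps * z)
    with torch.no_grad():
        perturb(+1); loss_plus = loss_fn(model, batch)      # forward pass 1
        perturb(-2); loss_minus = loss_fn(model, batch)     # forward pass 2
        perturb(+1)                                         # restore original weights
        grad_est = (loss_plus - loss_minus) / (2 * eps)     # projected-gradient scalar
        gen = torch.Generator().manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
            p.data.add_(-lr * grad_est * z)

def lora_step(model, loss_fn, batch, lora_optimizer):
    """First-order update of the (few) LoRA parameters: one forward + one backward pass."""
    lora_optimizer.zero_grad()
    loss_fn(model, batch).backward()                        # forward pass 3 + the single backward
    lora_optimizer.step()

# One alternating iteration therefore costs three forward passes and one backward pass.
```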
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3273. In-Context Algebra
๐ฌ 15
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Interpretability In-Context Learning ICL Algebra Grokking Symbolic Reasoning
We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While previous work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms the models learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
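An illustrative sketch of the kind of training sequence described above: symbols are randomly assigned to elements of a cyclic group per sequence, so their meaning is determined only by in-context facts. The group choice, vocabulary, and formatting are assumptions, not the paper's exact specification.

```python
import random

def make_sequence(n=5, vocab=("A", "B", "C", "D", "E", "F", "G", "H"), n_facts=12, seed=None):
    rng = random.Random(seed)
    symbols = rng.sample(vocab, n)                    # fresh symbol -> element map each sequence
    elem_of = {s: i for i, s in enumerate(symbols)}
    facts = []
    for _ in range(n_facts):
        a, b = rng.choice(symbols), rng.choice(symbols)
        c = symbols[(elem_of[a] + elem_of[b]) % n]     # group operation in Z_n
        facts.append(f"{a} {b} = {c}")
    query_a, query_b = rng.choice(symbols), rng.choice(symbols)
    answer = symbols[(elem_of[query_a] + elem_of[query_b]) % n]
    return " ; ".join(facts) + f" ; {query_a} {query_b} = ", answer

prompt, answer = make_sequence(seed=0)
print(prompt, "->", answer)
```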
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3274. UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective
๐ฌ 23
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark Urban Change Urban Perception Multimodal Large Language Models
Urban development impacts over half of the global population, making human-centered understanding of its structural and perceptual changes essential for smart city planning. While Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various domains, existing benchmarks that explore their performance in urban environments remain limited, lacking systematic exploration of temporal evolution and subjective perception of the urban environment that aligns with human perception. To address these limitations, we propose UrbanFeel, a comprehensive benchmark designed to evaluate the performance of MLLMs in urban development understanding and subjective environmental perception. UrbanFeel comprises 14.3K carefully constructed visual questions spanning three cognitively progressive dimensions: Static Scene Perception, Temporal Change Perception, and Subjective Environmental Perception. We collect multi-temporal single-view and panoramic street-view images from 11 representative cities worldwide, and generate high-quality question-answer pairs through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation. Through extensive evaluation of 20 state-of-the-art MLLMs, we observe that Gemini-2.5 Pro achieves the best overall performance, with its accuracy approaching human expert levels and narrowing the average gap to just 1.5%. Most models perform well on tasks grounded in scene understanding. In particular, some models even surpass human annotators in pixel-level change detection. However, performance drops notably in tasks requiring temporal reasoning over urban development. Additionally, in the subjective perception dimension, several models reach human-level or even higher consistency in evaluating dimensions such as beauty and safety. Our results suggest that MLLMs are demonstrating rudimentary emotion understanding capabilities. Our UrbanFeel benchmark will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3275. EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
๐ฌ 5
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Egocentric vision; Benchmark; MLLMs; VQA
Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of the state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3276. LANE: Label-Aware Noise Elimination for Fine-Grained Text Classification
๐ฌ 8
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ label noise fine-grained classification label relationships
In this paper, we propose Label-Aware Noise Elimination (LANE), a new approach that improves the robustness of deep learning models trained under increased label noise in fine-grained text classification. LANE leverages the semantic relations between classes and monitors the model's training dynamics on each training example to dynamically lower the importance of examples that appear to have noisy labels. We test the effectiveness of LANE in fine-grained text classification and benchmark our approach on a wide variety of datasets with varying numbers of classes and varying amounts of label noise. LANE considerably outperforms strong baselines on all datasets, with gains ranging from an average of 2.4% F1 on manually annotated datasets to an average of 4.5% F1 on datasets with higher levels of label noise. We carry out comprehensive analyses of LANE and identify the key components that lead to its success.
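A hedged sketch of dynamics-based example reweighting in the spirit of the description above: an exponential moving average of each example's loss flags likely-noisy examples, softened by semantic similarity between the given and predicted labels. The exact weighting rule is an illustrative assumption, and all tensors are assumed to live on the same device.

```python
import torch

class NoiseAwareReweighter:
    def __init__(self, n_examples, label_sim, momentum=0.9, temperature=1.0):
        self.loss_ema = torch.zeros(n_examples)   # per-example loss statistics
        self.label_sim = label_sim                # [C, C] semantic similarity between classes
        self.momentum = momentum
        self.temperature = temperature

    def weights(self, idx, losses, labels, logits):
        # Track how persistently hard each example is for the model.
        self.loss_ema[idx] = (self.momentum * self.loss_ema[idx]
                              + (1 - self.momentum) * losses.detach())
        # Examples whose loss stays high relative to the batch are more likely mislabeled.
        surprise = self.loss_ema[idx] - self.loss_ema[idx].median()
        w = torch.sigmoid(-surprise / self.temperature)
        # Soften the penalty when the predicted class is semantically close to the given label.
        pred = logits.argmax(dim=1)
        return torch.maximum(w, self.label_sim[labels, pred])

# Usage: loss = (reweighter.weights(idx, per_example_loss, labels, logits) * per_example_loss).mean()
```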
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3277. Dynamic Multi-sample Mixup with Gradient Exploration for Open-set Graph Anomaly Detection
๐ฌ 14
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph neural network graph anomaly detection open set mixup energy gradient pseudo labelling
This paper studies the problem of open-set graph anomaly detection, which aims to generalize a graph neural network (GNN) trained with a small number of both normal and abnormal nodes to detect unseen anomalies different from training anomalies during inference. This problem is highly challenging due to both the data scarcity of unseen anomalies and the label scarcity for training nodes. Towards this end, we propose a novel approach named Dynamic Multi-sample Mixup with Gradient Exploration (DEMO) for open-set graph anomaly detection. The core of our proposed DEMO is to leverage a dynamic framework to adapt the optimization procedure with high generalizability. In particular, our DEMO first adaptively fuses multiple seen nodes to simulate the unseen anomalies, which expands the decision boundary for the detection model with enhanced generalizability. Moreover, we dynamically adjust sample weights based on their energy gradients to prioritize uncertain and informative nodes, ensuring a robust optimization procedure. To further address both label scarcity and severe class imbalance, we maintain a memory bank of historical records to guide the pseudo-labeling process of unlabeled nodes. Extensive experiments on various benchmark datasets validate the superiority of the proposed DEMO in comparison to various baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3278. Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs
๐ฌ 10
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ knowledge distillation large language model LLM routing
Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of **Knowledge Purification**, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3279. Multi-Synaptic Cooperation: A Bio-Inspired Framework for Robust and Scalable Continual Learning
๐ฌ 18
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Continual Learning Catastrophic Forgetting Brain-inspired Computing Multi-Synaptic Cooperation
Continual learning aims to acquire new knowledge incrementally while retaining prior information, with catastrophic forgetting (CF) being a central challenge. Existing methods can mitigate CF to some extent but are constrained by limited capacity, which often requires dynamic expansion for long task sequences and makes performance sensitive to task order. Inspired by the richness and plasticity of synaptic connections in biological nervous systems, we propose the Multi-Synaptic Cooperative Network (MSCN), a generalized framework that models cooperative interactions among multiple synapses through multi-synaptic connections modulated by local synaptic activity. This design enhances model representational capacity and enables task-adaptive plasticity by means of multi-synaptic cooperation, providing a new avenue for expanding model capacity while improving robustness to task order. During learning, our MSCN dynamically activates task-relevant synapses while suppressing irrelevant ones, enabling targeted retrieval and minimizing interference. Extensive experiments across four benchmark datasets, involving both spiking and non-spiking neural networks, demonstrate that our method consistently outperforms state-of-the-art continual learning methods with significantly improved robustness to task-order variation. Furthermore, our analysis reveals an optimal trade-off between synaptic richness and learning efficiency, where excessive connectivity can impair circuit performance. These findings highlight the importance of the multi-synaptic cooperation mechanism for achieving efficient continual learning and provide new insights into biologically inspired, robust, and scalable continual learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3280. I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
๐ฌ 14
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ large language model representations human-interpretable concept latent variable model
Recent empirical evidence shows that LLM representations encode human-interpretable concepts. Nevertheless, the mechanisms by which these representations emerge remain largely unexplored. To shed further light on this, we introduce a novel generative model that generates tokens on the basis of such concepts formulated as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish a rigorous identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts given input context, up to an invertible linear transformation. This theoretical finding: 1) provides evidence that LLMs capture essential underlying generative factors, 2) offers a unified and principled perspective for understanding the linear representation hypothesis, and 3) motivates a theoretically grounded approach for evaluating sparse autoencoders. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3281. Multi-Object System Identification from Videos
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Object Property Identification Physics-based Modeling
We introduce the challenging problem of multi-object system identification from videos, for which prior methods are ill-suited due to their focus on single-object scenes or discrete material classification with a fixed set of material prototypes. To address this, we propose MOSIV, a new framework that directly optimizes for continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video. We also present a new synthetic benchmark with contact-rich, multi-object interactions to facilitate evaluation. On this benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, establishing it as a strong baseline for this new task. Our analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in these complex, multi-object settings. The source code and dataset will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3282. Autoregressive Visual Decoding from EEG Signals
๐ฌ 17
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ EEG decoding Visual reconstruction BCI Visual neural decoding
Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications. The code is available at https://anonymous.4open.science/r/avde-783D.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3283. Mitigating Privacy Risk via Forget Set-Free Unlearning
๐ฌ 15
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ machine learning unlearning privacy corrective unlearning deep learning risks approximate unlearning empirical
Training machine learning models requires the storage of large datasets, which often contain sensitive or private data.
Storing data is associated with a number of potential risks which increase over time, such as database breaches and malicious adversaries.
Machine unlearning is the study of methods to efficiently remove the influence of training data subsets from previously-trained models.
Existing unlearning methods typically require direct access to the "forget set"---the data to be forgotten---and organisations must retain this data for unlearning rather than deleting it immediately upon request, increasing risks associated with the forget set.
We introduce partially-blind unlearning---utilizing auxiliary information to unlearn without explicit access to the forget set.
We introduce Reload, a practical partially-blind framework based on gradient optimization and structured weight sparsification that operationalizes partially-blind unlearning.
We show that Reload efficiently unlearns, approximating models retrained from scratch, and outperforms several forget set-dependent approaches. On language models, Reload unlearns entities using <0.025\% of the retain set and <7\% of model weights in <8 minutes on Llama2-7B.
In the corrective case, Reload achieves unlearning even when only 10\% of corrupted data is identified.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3284. Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models
๐ฌ 13
โญ 5.00
๐ generative models
๐ท๏ธ machine unlearning large-scale unlearning diffusion model
While multi-concept unlearning has shown progress, extending to large-scale scenarios remains difficult, as existing methods face three persistent challenges: **(i)** they often introduce conflicting weight updates, making some targets difficult to unlearn or causing degradation of generative capability; **(ii)** they lack precise mechanisms to keep unlearning strictly confined to target concepts, resulting in collateral damage on similar content; **(iii)** many approaches rely on additional data or auxiliary modules, causing scalability and efficiency bottlenecks as the number of concepts grows. To simultaneously address these challenges, we propose **Scalable-Precise Concept Unlearning (ScaPre)**, a unified and lightweight framework tailored for scalable and precise large-scale unlearning. ScaPre introduces a *conflict-aware stable design*, which integrates the spectral trace regularizer and geometry alignment to stabilize the optimization space, suppress conflicting updates, and preserve the pretrained global structure. Furthermore, the *Informax Decoupler* identifies concept-relevant parameters and adaptively reweights updates, ensuring that unlearning is confined to the target subspace without collateral damage. ScaPre yields an efficient closed-form solution, requiring no additional data or auxiliary sub-models, while maintaining both scalability and precision. Comprehensive experiments across large-scale objects, styles, and explicit content benchmarks demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It can forget up to **×5** more concepts than the best baseline within the limits of acceptable generative quality, and outperforms existing multi-concept approaches in precision and efficiency, achieving a new state of the art for large-scale unlearning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3285. Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLM Referring Expression Comprehensions
Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains 5k expressions on real images (1k human authored, 4k human verified), curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3286. MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Mobile-Agent
The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes with rule-based rewards, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution that leverages GPT-4o to generate diverse Q&A pairs from real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual-pretraining models such as OS-ATLAS and UI-TARS, achieving state-of-the-art performance and strong generalization to out-of-domain scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3287. RedacBench: Can AI Erase Your Secrets?
๐ฌ 15
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Redaction Benchmark Security Language Model Privacy Sensitive Information Data Sanitization
The ability of modern language models to easily extract unstructured sensitive information has made redaction, the selective removal of such information, an essential task for data security. However, existing benchmarks and evaluation methods for redaction are often limited to predefined categories of data like personally identifiable information (PII), or particular techniques like masking. To bridge this gap, we introduce RedacBench, a novel benchmark for a comprehensive evaluation of redaction capabilities, independent of specific domains or redaction strategies. Constructed from 514 human-written texts from individuals, corporations, and governments, along with 187 security policies, RedacBench measures a model's ability to selectively remove policy-violating information while preserving the original text's utility. We robustly quantify this performance using metrics derived from 8,053 inferable propositions, assessing both security, through the redaction of sensitive propositions, and utility, through the preservation of non-sensitive ones. Our experiments on various redaction strategies using state-of-the-art language models reveal that while more advanced models and strategies can increase security, maintaining utility remains a significant challenge. To facilitate future research, we publicly release RedacBench along with a web-based playground for custom dataset creation and evaluation at https://redacbench.vercel.app/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3288. TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
๐ฌ 11
โญ 5.00
๐ reinforcement learning
๐ท๏ธ unsupervised environment design autocurricula zero-shot coordination
Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3289. Representing local protein environments with machine learning force fields
๐ฌ 16
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Machine learning force fields structural biology NMR representation learning
The local structure of a protein strongly impacts its function and interactions with other molecules. Representing local biomolecular environments remains a key challenge when applying machine learning approaches over protein structures. The structural and chemical variability of these environments makes them challenging to model, and performing representation learning on these objects remains largely under-explored. In this work, we propose representations for local protein environments that leverage intermediate features from machine learning force fields (MLFFs). We extensively benchmark state-of-the-art MLFFs, comparing their performance across latent spaces and downstream tasks, and show that their embeddings capture local structural (e.g., secondary motifs) and chemical features (e.g., amino acid identity and protonation state), organizing protein environments into a structured manifold. We show that these representations enable zero-shot generalization and transfer across diverse downstream tasks. As a case study, we build a physics-informed, uncertainty-aware chemical shift predictor that achieves state-of-the-art accuracy in biomolecular NMR spectroscopy. Our results establish MLFFs as general-purpose, reusable representation learners for protein modeling, opening new directions in representation learning for structured physical systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3290. Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model reasoning latent reasoning chain of thought
Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
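A hedged sketch of the test-time idea described above: latent "thought" vectors are sampled from a Gaussian policy, scored by a confidence reward computed from the frozen model's own output distribution, and updated with a REINFORCE-style gradient. `model.answer_logits` is a hypothetical hook that runs the frozen LLM with the latent thoughts prepended as soft tokens; dimensions and hyperparameters are illustrative.

```python
import torch

def ltpo(model, prompt_ids, n_thoughts=8, dim=4096, steps=30, samples=4, lr=0.05):
    mu = torch.zeros(n_thoughts, dim, requires_grad=True)            # latent thought parameters
    log_std = torch.full((n_thoughts, dim), -2.0, requires_grad=True)
    opt = torch.optim.Adam([mu, log_std], lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            thoughts = mu + log_std.exp() * torch.randn(samples, n_thoughts, dim)
        rewards, logps = [], []
        for t in thoughts:
            with torch.no_grad():
                logits = model.answer_logits(prompt_ids, latent_thoughts=t)   # frozen LLM (hypothetical hook)
                reward = logits.softmax(-1).max(-1).values.log().mean()       # confidence-based intrinsic reward
            logp = torch.distributions.Normal(mu, log_std.exp()).log_prob(t).sum()
            rewards.append(reward)
            logps.append(logp)
        rewards = torch.stack(rewards)
        advantage = rewards - rewards.mean()                          # simple baseline
        loss = -(torch.stack(logps) * advantage).mean()               # policy gradient
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach()                                                # optimized latent thoughts
```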
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3291. Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference
๐ฌ 15
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Large Language Model Causal inference
Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy \textit{statistical causal inference}. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world.
To address these limitations, we propose \textbf{CausalPitfalls}, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of the rubric-based grading judge by comparing its scoring with assessments from human experts.
Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3292. Reasoning Language Model Inference Serving Unveiled: An Empirical Study
๐ฌ 31
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Reasoning Large Language Model LLM Serving
The reasoning large language model (RLLM) has proven competitive with traditional LLMs in solving complex reasoning tasks such as mathematics and coding.
However, the serving performance and behavior of RLLMs remain \textit{unexplored}, which may undermine the deployment and utilization of RLLMs in real-world scenarios. To close this gap, in this paper, we conduct a comprehensive study of RLLM serving. We first perform a pilot study comparing the serving performance of RLLMs and traditional LLMs and reveal several distinct differences in serving behavior: (1) \textit{significant memory usage and fluctuations}; (2) \textit{straggler requests}; (3) \textit{adaptive running time}; (4) \textit{domain preference}. Then we further investigate whether existing inference optimization techniques remain valid for RLLMs.
Our main takeaways are that model quantization methods and speculative decoding can improve serving efficiency with a small compromise to RLLM accuracy, while prefix caching and KV cache quantization may even degrade accuracy or serving performance for small RLLMs. Lastly, we conduct an evaluation under real-world workloads modeled by a Gamma distribution to verify our findings.
Empirical results for the real-world workload evaluation across different datasets are \textit{aligned} with our main findings on RLLM serving. We hope our work provides the research community and industry with insights to advance RLLM inference serving.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3293. One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning
๐ฌ 7
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning multi-task reinforcement learning world model MCTS latent space planning
In heterogeneous multi-task decision-making, tasks not only exhibit diverse observation and action spaces but also vary substantially in their underlying complexities. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling a broad and diverse suite of tasks, gradient conflicts and the loss of model plasticity often constrain their sample efficiency.
In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process. First, to mitigate the gradient conflicts, we systematically investigate key architectural designs for extending UniZero. Our investigation identifies a Mixture-of-Experts (MoE) architecture as the most effective approach. We demonstrate, both theoretically and empirically, that this architecture alleviates gradient conflicts by routing task-specific representations to specialized sub-networks. This finding leads to our proposed model, \textit{ScaleZero}. Second, to dynamically allocate model capacity throughout the learning process, we introduce an online Dynamic Parameter Scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Evaluations on a diverse set of standard benchmarks (Atari, DMC, Jericho) demonstrate that ScaleZero, utilizing solely online reinforcement learning with one model, performs on par with specialized single-task agents. With the DPS strategy, it remains competitive while using just 71.5\% of the environment interactions. These findings underscore the potential of ScaleZero for effective multi-task planning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3294. BIRD: Behavior Induction via Representation-structure Distillation
๐ฌ 10
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Knowledge Distillation AI Alignment Weak-to-strong generalization
Human-aligned deep learning models exhibit behaviors consistent with human values, such as robustness, safety, and fairness. Transferring these behavioral properties to models trained on different tasks or data distributions remains challenging: aligned behavior is easily forgotten during fine-tuning, and collecting task-specific data that preserves this behavior can be prohibitively costly. We introduce BIRD, a flexible framework for transferring aligned behavior by matching the internal representation structure of a student model to that of a teacher. Applied to out-of-distribution robustness in image classification, BIRD outperforms fine-tuning, transfer learning, and continual learning methods, improving robust accuracy by up to 18\% over the next strongest baseline. It remains effective even when the teacher is trained on a much simpler dataset and is $25\times$ smaller in parameter count than the student. In a large-scale study of over 400 teacher-student pairs, we show that three interpretable and computable properties of the teacher's representations explain up to 85\% of the variance in transfer success, offering practical guidance for teacher selection and design. We further show that BIRD generalizes beyond applications in vision by enhancing safety alignment in language models when paired with Direct Preference Optimization and improving weak-to-strong generalization when combined with soft-label distillation. BIRD turns small, well-aligned models into scalable alignment seeds, mitigating challenges from key bottlenecks in deploying safe AI systems.
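A minimal sketch of representation-structure distillation in the spirit of the description above: the batch-wise similarity structure of the student's features is matched to that of a (possibly much smaller) aligned teacher. The Gram-matrix loss and the model interfaces (`student(x)` returning features, `teacher.features(x)`) are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def structure_loss(student_feats, teacher_feats):
    """Both inputs: [batch, dim_*]; feature dimensions need not match across models."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    gram_s = s @ s.T                       # [batch, batch] pairwise cosine similarities
    gram_t = t @ t.T
    return F.mse_loss(gram_s, gram_t)      # match the similarity structure, not the features

def training_step(student, teacher, x, y, alpha=1.0):
    logits, feats_s = student(x)                        # assumed: student also returns features
    with torch.no_grad():
        feats_t = teacher.features(x)                   # frozen, behavior-aligned teacher
    return F.cross_entropy(logits, y) + alpha * structure_loss(feats_s, feats_t)
```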
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3295. Extending Fourier Neural Operators for Modeling Parameterized and Coupled PDEs
๐ฌ 6
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ neural operators Fourier neural operators parameterized dynamics couple
Parameterized and coupled partial differential equations (PDEs) are central to modeling phenomena in science and engineering, yet neural operator methods that address both aspects remain limited. We extend Fourier neural operators (FNOs) with minimal architectural modifications along two directions. For parameterized dynamics, we propose a hypernetwork-based modulation that conditions the operator on physical parameters. For coupled systems, we conduct a systematic exploration of architectural choices, examining how operator components can be adapted to balance shared structure with cross-variable interactions while retaining the efficiency of standard FNOs. Evaluations on benchmark PDEs, including the one-dimensional capacitively coupled plasma equations and the Gray-Scott system, show that our methods achieve 55-72% lower errors than strong baselines, demonstrating the effectiveness of principled modulation and systematic design exploration.
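A hedged sketch of hypernetwork-based parameter conditioning for an FNO block: a small MLP maps the PDE parameters to per-channel scale and shift factors that modulate the output of a 1D spectral convolution. This FiLM-style variant is an illustrative assumption, not necessarily the paper's exact modulation.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        self.weight = nn.Parameter(torch.randn(channels, channels, modes, dtype=torch.cfloat) * 0.02)

    def forward(self, x):                        # x: [batch, channels, grid]
        x_ft = torch.fft.rfft(x)
        out_ft = torch.zeros_like(x_ft)
        out_ft[..., :self.modes] = torch.einsum("bim,iom->bom", x_ft[..., :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.size(-1))

class ModulatedFNOBlock(nn.Module):
    def __init__(self, channels=32, modes=16, n_params=2):
        super().__init__()
        self.spectral = SpectralConv1d(channels, modes)
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.hyper = nn.Sequential(nn.Linear(n_params, 64), nn.GELU(), nn.Linear(64, 2 * channels))

    def forward(self, x, pde_params):            # pde_params: [batch, n_params]
        scale, shift = self.hyper(pde_params).chunk(2, dim=-1)
        h = self.spectral(x) + self.pointwise(x)
        h = h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)   # parameter-conditioned modulation
        return torch.nn.functional.gelu(h)
```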
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3296. Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems
๐ฌ 5
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM-based Agent Multi-agent System Misinformation
Large Language Model-based Multi-Agent Systems (MASs) have demonstrated strong advantages in addressing complex real-world tasks. However, due to the introduction of additional attack surfaces, MASs are particularly vulnerable to misinformation injection. To facilitate a deeper understanding of misinformation propagation dynamics within these systems, we introduce **MisinfoTask**, a novel dataset featuring complex, realistic tasks designed to evaluate MAS robustness against such threats. Building upon this, we propose **ARGUS**, a two-stage, training-free defense framework leveraging goal-aware reasoning for precise misinformation rectification within information flows. Our experiments demonstrate that in challenging misinformation scenarios, ARGUS exhibits significant efficacy across various injection attacks, achieving an average reduction in misinformation toxicity of approximately 28.17% and improving task success rates under attack by approximately 10.33%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3297. Constant Degree Matrix-Driven Incomplete Multi-View Clustering via Connectivity-Structure and Embedding Tensor Learning
๐ฌ 14
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multi-view clustering
Tensor-based incomplete multi-view clustering has attracted significant research attention due to its capability to exploit high-order correlations across different views for revealing underlying cluster structures from partially observed multi-view data. However, most existing approaches construct tensors from adjacency matrices, which necessitate post-processing operations (e.g., singular value decomposition, SVD) and thereby introduce additional computational overhead and potential errors. Some approaches instead employ latent embedding tensors to avoid post-processing, but they often fail to capture the geometric structure of the underlying graph. To address these limitations, we propose **C**onst**A**nt degree **M**atrix-driv**E**n incomp**L**ete multi-view clustering via connectivity-structure and embedding tensor learning (**CAMEL**). Specifically, CAMEL jointly learns view-specific latent embeddings under structured constraints and organizes them into a tensor with an ${\ell_{\delta}}$ low-rank constraint, thereby enabling coordinated optimization of graph connectivity and high-order correlations. To further mitigate the $\mathcal{O}(n^2)$ or even higher complexity associated with conventional connectivity constraints, CAMEL approximates the variable Laplacian degree matrix with a constant-degree matrix, reducing the computational cost to $\mathcal{O}(1)$. Clustering assignments are subsequently derived via $k$-means on the concatenated embeddings, eliminating the need for post-processing operations on adjacency matrices such as SVD. Extensive experiments on nine benchmark datasets demonstrate the superior effectiveness and efficiency of CAMEL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3298. Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts
๐ฌ 24
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large language models; Algorithmic fairness; Social bias
Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine-tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi-turn interactions. To address these challenges, we propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in LLMs, without the need for fine-tuning or prompt modification. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3299. SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Place Recognition Geo-Localization Graph Mining Global Descriptor
Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation.
Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training.
We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, sample organization during training, and hard sample mining.
We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues.
During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape.
To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler.
Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9\%, 95.8\%, 94.5\%, and 96.0\% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100\% Recall@10 on SPED using only 4096-D global descriptors.
Code and pretrained weights will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3300. OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation
๐ฌ 19
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Object-Goal Navigation Instruct-Goal Navigation Active Exploration Policy-Diffusion
Embodied navigation is a foundational challenge for intelligent robots, demanding the ability to comprehend visual environments, follow natural language instructions, and explore autonomously. However, existing models struggle to provide a unified solution across heterogeneous navigation paradigms, often yielding low success rates and limited generalization. We present OmniNav, a unified framework that handles instruct-goal, object-goal, point-goal navigation, and frontier-based exploration within a single architecture. First, we introduce a lightweight, low-latency policy that predicts continuous-space waypoints (coordinates and orientations) with high accuracy, outperforming action-chunk methods in precision and supporting real-world deployment with control frequencies up to 5 Hz. Second, at the architectural level, OmniNav proposes a fast-slow system design: a fast module performs waypoint generation from relatively short-horizon visual context and subtasks, while a slow module conducts deliberative planning using long-horizon observations and candidate frontiers to select the next subgoal and subtask. This collaboration improves path efficiency and maintains trajectory coherence in exploration and memory-intensive settings. Notably, we find that the primary bottleneck lies not in navigation policy learning per se, but in robust understanding of general instructions and objects. To enhance generalization, we incorporate large-scale general-purpose training data, including datasets used for image captioning and other visual understanding tasks, into a joint multi-task regimen, which substantially boosts success rates and robustness. Extensive experiments demonstrate state-of-the-art performance across diverse navigation benchmarks, and real-world deployment further validates the approach. OmniNav offers practical insights for embodied navigation and points to a scalable path toward versatile, highly generalizable robotic intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3301. FlowSymm: PhysicsโAware, SymmetryโPreserving Graph Attention for Network Flow Completion
๐ฌ 16
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ graphs networks flow graphs graph attention networks group action bilevel-optimization physics-aware graph neural networks
Recovering missing flows on the edges of a network, while exactly respecting local conservation laws, is a fundamental inverse problem that arises in many systems such as transportation, energy, and mobility. We introduce FlowSymm, a novel architecture that combines (i) a group-action on divergence-free flows, (ii) a graph-attention encoder to learn feature-conditioned weights over these symmetry-preserving actions, and (iii) a lightweight Tikhonov refinement solved via implicit bilevel optimization. The method first anchors the given observation on a minimum-norm divergence-free completion. We then compute an orthonormal basis for all admissible group actions that leave the observed flows invariant and parameterize the valid solution subspace, which shows an Abelian group structure under vector addition. A stack of GATv2 layers then encodes the graph and its edge features into per-edge embeddings, which are pooled over the missing edges and produce per-basis attention weights. This attention-guided process selects a set of physics-aware group actions that preserve the observed flows. Finally, a scalar Tikhonov penalty refines the missing entries via a convex least-squares solver, with gradients propagated implicitly through Cholesky factorization. Across three real-world flow benchmarks (traffic, power, bike), FlowSymm substantially outperforms state-of-the-art baselines in RMSE, MAE and correlation metrics.
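A minimal sketch of the anchoring step described above, assuming a node-edge incidence matrix and exact conservation ($Bf = 0$): the missing edge flows are filled with the minimum-norm values that absorb the divergence induced by the observed flows. Variable names are illustrative.

```python
import numpy as np

def min_norm_divfree_completion(B, f_obs, obs_idx, mis_idx):
    """B: [n_nodes, n_edges] incidence matrix; f_obs: flows on the observed edges."""
    rhs = -B[:, obs_idx] @ f_obs                            # divergence the missing edges must absorb
    B_mis = B[:, mis_idx]
    f_mis, *_ = np.linalg.lstsq(B_mis, rhs, rcond=None)     # minimum-norm least-squares solution
    f = np.zeros(B.shape[1])
    f[obs_idx], f[mis_idx] = f_obs, f_mis
    return f

# A basis for the admissible symmetry actions (missing-edge flows with zero divergence)
# can then be obtained from the null space of B_mis, e.g. via scipy.linalg.null_space(B_mis).
```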
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3302. Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?
๐ฌ 24
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models; language model reasoning; multi-model collaboration; off-trajectory reasoning
Reasoning LLMs are trained to verbalize their thinking process, yielding strong gains on complex tasks. This transparency also opens a promising direction: multiple reasoners can directly collaborate on each other's thinking on a shared trajectory, yielding better inference efficiency and exploration. A key prerequisite, however, is the ability to assess usefulness and build on another model's partial thinking - we call this *off-trajectory reasoning*. Our paper investigates a critical question: can standard *solo-reasoning* training pipelines yield desired *off-trajectory* behaviors? We propose twin tests that capture the two extremes of the off-trajectory spectrum, namely
**Recoverability**, which tests whether LLMs can backtrack from "distractions" induced by misleading reasoning traces, and **Guidability**, which tests their ability to build upon correct reasoning from stronger collaborators. Our study evaluates 15 open-weight LLMs (1.5B-32B) and reveals a counterintuitive finding: "stronger" LLMs on benchmarks are often more fragile under distraction. Moreover, all models tested fail to effectively leverage guiding steps from collaborators on problems beyond their inherent capabilities, with solve rates remaining under 9.2%. Finally, we conduct control studies to isolate the effects of three factors in post-training on these behaviors: the choice of distillation teacher, the use of RL, and data selection strategy. Our results provide actionable insights for training natively strong reasoning collaborators; e.g., we find that sub-optimal recoverability behaviors of teacher models are transferred to distilled students even if the distilled data trajectories are correct. Taken together, this work lays the groundwork for evaluating multi-model collaborations under shared reasoning, while revealing limitations of off-the-shelf reasoning LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3303. Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Empirical Study Large Vision-Language Model Benchmark Evaluation
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasksโfundamental to computer visionโremain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.28 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://anonymous.4open.science/r/FG-BMK-7B51.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3304. Learning Brain Representation with Hierarchical Visual Embeddings
๐ฌ 19
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Visual Decoding Brain-Computer Interface EEG Contrastive Learning
Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system.
In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multiscale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3305. VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
๐ฌ 32
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Models Visualization Assessment Data Visualization
Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.553 and a correlation with human ratings of only 0.428. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.421 (a 23.9% reduction) and increasing the consistency with human experts to 0.687 (a 60.5% improvement) compared to GPT-5.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3306. MoSA: Mosaic Shared Adaptation of Large Language Models
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Parameter-efficient fine-tuning Large Language Model Adaptation Transformer
We introduce MoSA, a new parameter-efficient fine-tuning (PEFT) method that replaces low-rank factorization with randomized, fine-grained sharing of weight updates. Each adapted weight matrix is constructed by broadcasting a small set of learned scalars over a fixed tessellation, i.e., a pre-defined group assignment of the entries of the weight matrix, producing expressive changes under the same parameter budget as low-rank adaptation (LoRA). MoSA requires no architectural changes and can be merged into the base model for zero-overhead inference. Across diverse language understanding and generation tasks, MoSA matches or surpasses strong PEFT baselines under strictly matched budgets. Analyses and ablations indicate that non-local parameter sharing acts as an effective regularizer, and that grouping design and budget allocation govern the expressivity-efficiency trade-off. These results position MoSA as a simple, scalable alternative to LoRA.
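A minimal sketch of the mosaic-sharing construction described above, assuming a random fixed tessellation over a plain linear layer; the class name and hyperparameters are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class MoSALinear(nn.Module):
    def __init__(self, base: nn.Linear, n_groups: int = 256, seed: int = 0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # frozen pretrained weight
        g = torch.Generator().manual_seed(seed)
        tess = torch.randint(0, n_groups, base.weight.shape, generator=g)
        self.register_buffer("tessellation", tess)         # fixed group assignment of entries
        self.scalars = nn.Parameter(torch.zeros(n_groups)) # the only trainable parameters

    def delta(self) -> torch.Tensor:
        # Broadcast each group's scalar over the weight entries assigned to that group.
        return self.scalars[self.tessellation]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.base.weight + self.delta(), self.base.bias)

    def merge_(self):
        # Zero-overhead inference: fold the learned update into the base weight.
        self.base.weight.data.add_(self.delta())
        self.scalars.data.zero_()

layer = MoSALinear(nn.Linear(768, 768), n_groups=256)
y = layer(torch.randn(4, 768))
```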
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3307. FM4NPP: A Scaling Foundation Model for Nuclear and Particle Physics
๐ฌ 15
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Foundation Model State Space Model Neural Scaling Particle Tracking Nuclear Physics Particle Physics
Large language models have revolutionized artificial intelligence by enabling large, generalizable models trained through self-supervision. This paradigm has inspired the development of scientific foundation models (FMs). However, applying this capability to experimental particle physics is challenging due to the sparse, spatially distributed nature of detector data, which differs dramatically from natural language. This work addresses whether an FM for particle physics can scale and generalize across diverse tasks. We introduce a new dataset with more than 11 million particle collision events and a suite of downstream tasks and labeled data for evaluation. We propose a novel self-supervised training method for detector data and demonstrate its neural scalability with models that feature up to 188 million parameters. With frozen weights and task-specific adapters, this FM consistently outperforms baseline models across all downstream tasks. Performance also exhibits robust, data-efficient adaptation. Further analysis reveals that the representations extracted by the FM are task-agnostic but can be specialized via a single linear mapping for different downstream tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3308. Reliable Poisoned Sample Detection against Backdoor Attacks Enhanced by Sharpness Aware Minimization
๐ฌ 23
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Backdoor Defense Poisoned Sample Detection AI security
This work investigates Poisoned Sample Detection (PSD), a promising defense approach against backdoor attacks. However, we observe that the effectiveness of many advanced PSD methods degrades significantly under weak backdoor attacks (e.g., low poisoning ratios or weak trigger patterns). To substantiate this observation, we conduct a statistical analysis across various attacks and PSD methods, revealing a strong correlation between the strength of the backdoor effect and the detection performance. Inspired by this, we propose amplifying the backdoor effect through training with Sharpness-Aware Minimization (SAM). Both theoretical insights and empirical evidence validate that SAM enhances the activations of top Trigger Activation Change (TAC) neurons while suppressing others. Based on this, we introduce SAM-enhanced PSD, a simple yet effective framework that seamlessly improves existing PSD methods by extracting detection features from the SAM-trained model rather than the conventionally trained model. Extensive experiments across multiple benchmarks demonstrate that our approach significantly improves detection performance under both strong and weak backdoor attacks, achieving an average True Positive Rate (TPR) gain of +34.3% over conventional PSD methods. Overall, we believe that the revealed correlation between the backdoor effect and detection performance could inspire future research advancements.
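Because the framework's core ingredient is simply training the inspected model with SAM before extracting detection features, a standard two-pass SAM update step is worth sketching; the perturbation radius rho and everything around the step are assumptions, not the paper's pipeline.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho: float = 0.05):
    # 1) ascent step: perturb weights towards the local worst case
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()

    # 2) descent step: gradient at the perturbed point, applied to the original weights
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                      # restore original weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```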
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3309. CryoSplat: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction
๐ฌ 13
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Cryo-EM 3D Reconstruction Gaussian Mixture Model
As a critical modality for structural biology, cryogenic electron microscopy (cryo-EM) facilitates the determination of macromolecular structures at near-atomic resolution. The core computational task in single-particle cryo-EM is to reconstruct the 3D electrostatic potential of a molecule from noisy 2D projections acquired at unknown orientations. Gaussian mixture models (GMMs) provide a continuous, compact, and physically interpretable representation for molecular density and have recently gained interest in cryo-EM reconstruction. However, existing methods rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. In parallel, differentiable rendering techniques such as Gaussian splatting have demonstrated remarkable scalability and efficiency for volumetric representations, suggesting a natural fit for GMM-based cryo-EM reconstruction. However, off-the-shelf Gaussian splatting methods are designed for photorealistic view synthesis and remain incompatible with cryo-EM due to mismatches in the image formation physics, reconstruction objectives, and coordinate systems. Addressing these issues, we propose cryoSplat, a GMM-based method that integrates Gaussian splatting with the physics of cryo-EM image formation. In particular, we develop an orthogonal projection-aware Gaussian splatting, with adaptations such as a view-dependent normalization term and FFT-aligned coordinate system tailored for cryo-EM imaging. These innovations enable stable and efficient homogeneous reconstruction directly from raw cryo-EM particle images using random initialization. Experimental results on real datasets validate the effectiveness and robustness of cryoSplat over representative baselines. The code will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3310. Flow Straight and Fast in Hilbert Space: Functional Rectified Flow
๐ฌ 10
โญ 5.00
๐ generative models
๐ท๏ธ Hilbert space superposition principle
Many generative models originally developed in finite-dimensional Euclidean space have functional generalizations in infinite-dimensional settings. However, the extension of rectified flow to infinite-dimensional spaces remains unexplored. In this work, we establish a rigorous functional formulation of rectified flow in an infinite-dimensional Hilbert space. Our approach builds upon the superposition principle for continuity equations in an infinite-dimensional space. We further show that this framework extends naturally to functional flow matching and functional probability flow ODEs, interpreting them as nonlinear generalizations of rectified flow. Notably, our extension to functional flow matching removes the restrictive measure-theoretic assumptions in the existing theory of Kerrigan et al. (2024). Furthermore, we demonstrate experimentally that our method achieves superior performance compared to existing functional generative models.
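For reference, the finite-dimensional rectified flow objective that the paper lifts to Hilbert space looks as follows; representing functions by their values on a fixed grid is an assumed discretization, not the authors' construction.

```python
import torch
import torch.nn as nn

grid = 128                                     # discretized function values
velocity = nn.Sequential(nn.Linear(grid + 1, 256), nn.SiLU(), nn.Linear(256, grid))

def rectified_flow_loss(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    """x0 ~ reference sample (e.g. Gaussian), x1 ~ data; both of shape (batch, grid)."""
    t = torch.rand(x0.size(0), 1)
    xt = (1.0 - t) * x0 + t * x1               # straight-line interpolation
    target = x1 - x0                           # constant velocity of the straight path
    pred = velocity(torch.cat([xt, t], dim=-1))
    return ((pred - target) ** 2).mean()

loss = rectified_flow_loss(torch.randn(32, grid), torch.randn(32, grid))
loss.backward()
```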
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3311. The Tutor-Pupil Augmentation: Enhancing Learning and Interpretability via Input Corrections
๐ฌ 13
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ model augmentation machine learning for physical sciences
State-of-the-art machine learning models often incorporate prior knowledge or structural information about the task or data distribution. In some tasks, such knowledge may arise from first principles or emerge as simplified, learned functions that distill essential aspects of the data distribution. Model augmentation has emerged as a strategy to leverage this structured knowledge by coupling it with an auxiliary model to improve predictive performance, while preserving the interpretability offered by the simpler component. In this work, we present a new augmentation framework called the Tutor-Pupil scheme, which is designed to enhance both performance and interpretability. The Pupil is a fixed model, structurally designed for the core task, while the Tutor is a more flexible model trained to apply minimal input-level corrections to improve the Pupil's performance on the modified input. This strict separation of roles enables the Tutor not only to compensate for the Pupil's limitations but also to act as a diagnostic instrument. By examining the Tutor's targeted interventions, we can identify failure modes, detect regions where the Pupil struggles to generalize, and uncover residual patterns or higher-order structures in the data not captured by the original model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3312. Dr.LLM: Dynamic Layer Routing in LLMs
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Efficient Inference Adaptive Computation Test-time Optimization Monte Carlo Tree Search Dynamic Layer Routing
Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design choices (windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers) ensure robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only a 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
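A minimal sketch of what such a per-layer router could look like; the pooling window, bottleneck width, and greedy routing below are assumptions, and the paper trains its routers on MCTS-derived layer configurations rather than end to end.

```python
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    SKIP, EXECUTE, REPEAT = 0, 1, 2

    def __init__(self, d_model: int, window: int = 32, bottleneck: int = 64):
        super().__init__()
        self.window = window
        self.mlp = nn.Sequential(nn.Linear(d_model, bottleneck), nn.GELU(),
                                 nn.Linear(bottleneck, 3))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Windowed mean pooling over the sequence keeps routing stable on long inputs.
        b, t, d = h.shape
        pad = (-t) % self.window
        if pad:
            h = torch.cat([h, h.new_zeros(b, pad, d)], dim=1)
        pooled = h.view(b, -1, self.window, d).mean(dim=2).mean(dim=1)
        return self.mlp(pooled)                        # (batch, 3) routing logits

def route_forward(blocks, routers, h):
    for block, router in zip(blocks, routers):
        decision = router(h).argmax(dim=-1)[0].item()  # greedy; per-example in practice
        if decision == LayerRouter.SKIP:
            continue
        h = block(h)
        if decision == LayerRouter.REPEAT:
            h = block(h)
    return h
```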
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3313. Scale-wise Distillation of Diffusion Models
๐ฌ 13
โญ 5.00
๐ generative models
๐ท๏ธ diffusion distillation few-step models image generation video generation
Recent diffusion distillation methods have achieved remarkable progress, enabling high-quality ${\sim}4$-step sampling for large-scale text-conditional image and video diffusion models (DMs).
However, further reducing the number of sampling steps becomes more and more challenging, suggesting that efficiency gains may be better mined along other model axes.
Motivated by this perspective, we introduce SwD, a scale-wise diffusion distillation framework that equips few-step models with progressive generation, avoiding redundant computations at intermediate diffusion timesteps.
Beyond efficiency, SwD enriches the family of distribution matching distillation approaches by introducing a simple distillation objective based on kernel Maximum Mean Discrepancy (MMD).
This loss significantly improves the convergence of existing distillation methods and performs surprisingly well in isolation, offering a competitive baseline for diffusion distillation.
Applied to state-of-the-art text-to-image/video diffusion models, SwD approaches the sampling speed of two full-resolution steps and largely outperforms alternatives under the same compute budget, as evidenced by automatic metrics and human preference studies.
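The kernel MMD objective mentioned above is compact enough to sketch directly; the RBF kernel, its bandwidth, and the idea of comparing batches of feature vectors are assumed details rather than the paper's exact setup.

```python
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x: student/generated features (n, d), y: reference features (m, d)."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    # Biased estimate of MMD^2 between the two feature distributions.
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

loss = mmd_rbf(torch.randn(64, 512), torch.randn(64, 512))
```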
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3314. One-Step Video Restoration via Diffusion Adversarial Post-Training
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ One-step video restoration adversarial training
Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference.
While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings.
In this work, we propose a one-step diffusion-based VR model, termed as AnonymousVR, which performs adversarial VR training against real data.
To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures.
Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolution, avoiding the window inconsistency observed when window attention with a predefined window size is applied to high-resolution VR.
To stabilize and improve adversarial post-training for VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency.
Extensive experiments show that AnonymousVR can achieve comparable or even better performance compared with existing VR approaches in a single step.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3315. Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation
๐ฌ 29
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Parameter-Efficient Fine-Tuning Efficiency Orthogonal Fine-Tuning Large Language Models
Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations with higher rank, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3316. Faithfulness Under the Distribution: A New Look at Attribution Evaluation
๐ฌ 24
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Attribution Evaluation Model Faithfulness In-Distribution Perturbation Out-of-Distribution Bias
Evaluating the faithfulness of attribution methods remains an open challenge. Standard metrics such as Insertion and Deletion Scores rely on heuristic input perturbations (e.g., zeroing pixels), which often push samples out of the data distribution (OOD). This can distort model behavior and lead to unreliable evaluations. We propose FUD, a novel evaluation framework that reconstructs masked regions using score-based diffusion models to produce in-distribution, semantically coherent inputs. This distribution-aware approach avoids the common pitfalls of existing Attribution Evaluation Methods (AEMs) and yields assessments that more accurately reflect attribution faithfulness. Experiments across models show that FUD produces significantly differentโand more reliableโjudgments than prior approaches. Our implementation is available at: https://anonymous.4open.science/r/FUD-CCD5/
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3317. Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
๐ฌ 22
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ synthetic data synthetic caption scene graph text-to-image generation
Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce **Generate Any Scene**, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. SDv1.5 achieves an average ***4%*** improvement over baselines and surpasses fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune SDv1.5 and achieve a ***10%*** increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by ***+5%*** on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation, where we train models to identify challenging cases by learning from synthetic data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3318. Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
๐ฌ 9
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Explainable AI Attribution LLMs
Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose **Hessian-Enhanced Token Attribution (HETA)**, a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a **curated benchmark dataset** for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3319. Efficient Test-Time Scaling for Small Vision-Language Models
๐ฌ 21
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ VLMs test-time scaling test-time augmentation test-time adaptation test-time compute multimodal learning
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
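A minimal sketch of the token-level aggregation in TTAug and the consensus-pseudolabel step in TTAdapt; the vlm call signature, the augment function, and the single-step adaptation are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def ttaug_logits(vlm, image, prompt_ids, augment, n_views: int = 8):
    """Assumed signature: vlm(image_batch, prompt_ids) -> logits of shape (1, seq, vocab)."""
    per_view = [vlm(augment(image).unsqueeze(0), prompt_ids) for _ in range(n_views)]
    return torch.stack(per_view).mean(dim=0)   # token-level aggregation, no parameter updates

def ttadapt_step(vlm, optimizer, image, prompt_ids, augment, loss_fn):
    # TTAdapt: use the aggregated (consensus) prediction as a pseudolabel for one update.
    pseudo = ttaug_logits(vlm, image, prompt_ids, augment).argmax(dim=-1)
    logits = vlm(augment(image).unsqueeze(0), prompt_ids)
    loss = loss_fn(logits.flatten(0, 1), pseudo.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```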
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3320. Scaling Synthetic Task Generation for Agents via Exploration
๐ฌ 9
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ data science embodied ai agents computer use agents
Post-Training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web navigation, and robotics. A key challenge in scaling such post-training is the lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or prompting MLLMs with limited downstream environment information, which is either costly or poorly scalable as it yields tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates $20$k tasks across $20$ Android applications and $10$k tasks across 13 Ubuntu applications to train mobile-use and computer-use agents. AutoPlay-generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates by up to $20.0\%$ on mobile-use and $10.9\%$ on computer-use scenarios. In addition, AutoPlay-generated tasks combined with MLLM verifier-based rewards enable scaling reinforcement learning training of UI agents, leading to an additional $5.7\%$ gain. These results establish AutoPlay as a scalable approach for post-training capable MLLM agents while reducing reliance on human annotation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3321. Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis
๐ฌ 23
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Large Language Models LLM Agent Reasoning Forecast
Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce **Analytica**, a novel agent architecture built on the principle of **Soft Propositional Reasoning (SPR)**. SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance.
Analytica operationalizes this through a parallel, divide-and-conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool-equipped LLM *grounder agents*, including a novel Jupyter Notebook agent for data-driven analysis, are employed to help validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise with superior efficiency and scalability, and enable interactive "what-if" scenario analysis.
Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves accuracy by 15.84\% on average over diverse base models, achieving 71.06\% accuracy with the lowest variance of 6.02\% when working with a Deep Research grounder. Our Jupyter Notebook grounder shows strong cost-effectiveness, achieving a close 70.11\% accuracy with 90.35\% less cost and 52.85\% less time. Analytica also exhibits highly noise-resilient and stable performance growth as the analysis depth increases, with near-linear time complexity, as well as good adaptivity to open-weight LLMs and scientific domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3322. Unveiling the Potential of Diffusion Large Language Model in Controllable Generation
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ diffusion large language models controllable generation structured output
Controllable generation is a fundamental task in NLP with many applications, providing a basis for everything from function calling to agentic communication. However, even state-of-the-art autoregressive Large Language Models (LLMs) today exhibit unreliability when required to generate structured output. Inspired by the recent emergence of diffusion-based large language models (dLLMs), we observe that their architectural differences, especially the global information-sharing mechanism for language modeling, may be the key to unlocking a new level of controllable generation. To explore this possibility, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel framework that enables a dLLM to stably generate reliable structured outputs (e.g., JSON) by utilizing its innate reverse reasoning capability and global context awareness. $S^3$ instantiates a schematic template directly in the output context as the starting state for the dLLM, offering a more robust and general method than intricate prompt optimization. Experiments demonstrate that our method substantially unlocks the dLLM's potential in controllable generation in terms of structure adherence, content fidelity, and faithfulness. These results establish new perspectives and practical pathways for deploying language models in controllable generation tasks.
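A minimal sketch of the scaffolding step: the target JSON structure is written into the output context with mask tokens left only where values should go, so the sampler fills slots rather than generating the structure left to right. The mask token, field names, and slot length below are illustrative assumptions.

```python
MASK = "<|mask|>"   # assumed mask token of the dLLM's tokenizer

def build_scaffold(schema: dict, slot_len: int = 8) -> str:
    """Render the target JSON structure with masked value slots."""
    slots = [f'"{key}": ' + MASK * slot_len for key in schema]
    return "{" + ", ".join(slots) + "}"

scaffold = build_scaffold({"name": "string", "age": "integer", "city": "string"})
# The scaffold is tokenized and used as the initial state of the diffusion sampler:
# structural tokens stay fixed while only the masked value slots are denoised, so the
# output is structurally valid by construction.
```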
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3323. Text-Aware Image Restoration with Diffusion Models
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Model Image Restoration Text-spotting Scene-Text Image Super Resolution
While diffusion models have achieved remarkable success in natural image restoration, they often fail to faithfully recover textual regions, frequently producing plausible yet incorrect text-like patterns, a phenomenon we term text-image hallucination. To address this limitation, we propose Text-Aware Image Restoration (TAIR), a task requiring simultaneous recovery of visual content and textual fidelity. For this purpose, we introduce SA-Text, a large-scale benchmark of 100K high-quality scene images with dense annotations of diverse and complex text instances. We further present a multi-task diffusion framework, TeReDiff, which leverages internal features of diffusion models to jointly train a text-spotting module with the restoration module. This design allows intermediate text predictions from the text-spotting module to condition the diffusion-based restoration process during denoising, thereby enhancing text recovery. Extensive experiments demonstrate that our approach faithfully restores textual regions, outperforms existing diffusion-based methods, and achieves new state-of-the-art results on TextZoom, an STISR benchmark considered a subtask of TAIR. The code, weights, and dataset will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3324. Beyond Linear Processing: Dendritic Bilinear Integration in Spiking Neural Networks
๐ฌ 11
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ spiking neuron models spiking neural networks dendritic integration brain-inspired computing
As a widely used neuron model in Spiking Neural Networks (SNNs), the Leaky Integrate-and-Fire (LIF) model assumes the linear summation of injected currents. However, recent studies have revealed that a biological neuron can integrate inputs nonlinearly and perform computations such as XOR, which an LIF neuron cannot. To bridge this gap, we propose the Dendritic LIF (DLIF) model, which incorporates a bilinear dendritic integration rule derived from neurophysiological experiments. At the single-neuron level, we theoretically demonstrate that a DLIF neuron can capture input correlations, enabling it to perform nonlinear classification tasks. At the network level, we prove that DLIF neurons can preserve and propagate correlation structures from the input layer to the readout layer. These theoretical findings are further confirmed by our numerical experiments. Extensive experiments across diverse architectures, including ResNet, VGG, and Transformer, demonstrate that DLIF achieves state-of-the-art performance on static (CIFAR-10/100, ImageNet) and neuromorphic (DVS-Gesture, DVS-CIFAR10) benchmarks, surpassing LIF and other advanced alternatives while maintaining comparable computational cost. This work provides a biologically plausible and computationally powerful spiking neuron model, paving the way for next-generation brain-inspired computing.
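A minimal sketch of a bilinear integrate-and-fire update consistent with the description above; the exact form of the bilinear term, the reset rule, and the absence of a surrogate gradient are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DLIFNeuron(nn.Module):
    def __init__(self, n_in: int, tau: float = 2.0, threshold: float = 1.0):
        super().__init__()
        self.w = nn.Parameter(torch.randn(n_in) * 0.1)      # linear weights
        self.k = nn.Parameter(torch.zeros(n_in, n_in))      # bilinear coefficients
        self.tau, self.threshold = tau, threshold

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        """x_seq: (time, batch, n_in) input currents -> (time, batch) output spikes."""
        v = x_seq.new_zeros(x_seq.size(1))
        spikes = []
        for x in x_seq:
            linear = x @ self.w
            bilinear = torch.einsum("bi,ij,bj->b", x, self.k, x)   # sum_ij k_ij x_i x_j
            v = v + (-v + linear + bilinear) / self.tau            # leaky integration
            s = (v >= self.threshold).float()                      # surrogate gradient omitted
            v = v * (1.0 - s)                                      # hard reset after spiking
            spikes.append(s)
        return torch.stack(spikes)
```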
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3325. Libra: Effective yet Efficient Load Balancing for Large-scale MoE Inference
๐ฌ 13
โญ 5.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Mixture-of-Experts Model inference Expert load imbalance Dynamic load balancing
Distributed inference of large-scale Mixture-of-Experts (MoE) models faces a critical challenge: expert load imbalance. Numerous system-level approaches have been proposed for load balancing, but they either fail to achieve a satisfactory level of balance or introduce new bottlenecks due to the overhead of the load balancing mechanism itself. To this end, we propose Libra, a system that achieves near-optimal load balancing with minimal overhead. Libra adopts sophisticated mechanisms that accurately predict future expert activations and, based on these predictions, systematically perform load balancing. At the same time, it effectively hides the associated overhead by reconstructing the execution flow so that these costs are overlapped with MoE computation. Evaluations with two large-scale state-of-the-art MoE models on 8 H200 GPUs demonstrate that Libra improves throughput by up to 19.2\%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3326. station2radar: query-conditioned gaussian splatting for precipitation field
๐ฌ 31
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ climate change precipitation
Precipitation forecasting relies on heterogeneous data sets. Weather radar is accurate, but coverage is geographically limited and costly to maintain. Weather stations provide accurate but sparse point measurements, while satellites offer dense, high-resolution coverage without direct rainfall retrieval. To overcome these limitations, we propose Query-Conditioned Gaussian Splatting (QCGS), the first framework to fuse automatic weather station (AWS) observations with satellite imagery for generating radar-like rainfall fields. Unlike conventional 2D Gaussian splatting, which renders the entire image plane, QCGS selectively renders only queried rainfall regions, avoiding unnecessary computation in non-precipitating areas while preserving sharp precipitation structures. The framework combines a radar point proposal network that identifies rainfall-support locations with an implicit neural representation (INR) network that predicts Gaussian parameters for each point. QCGS enables efficient, resolution-flexible rainfall field generation in real time. Through extensive evaluation with benchmark precipitation products, QCGS demonstrates over 50\% improvement in RMSE compared to conventional gridded rainfall products, and consistently maintains high performance across multiple spatiotemporal scales.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3327. ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
๐ฌ 17
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Model Prompt Injection Attack LLM Agent
The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction-following tendencies. Building on this foundation, we develop a template-based Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3328. Learnable Sparsity for Vision Generative Models
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ efficiency diffusion model pruning flow matching
Generative models have achieved impressive advancements in various vision tasks. However, these gains often rely on increasing model size, which raises computational complexity and memory demands. The increased computational demand poses challenges for deployment, elevates inference costs, and impacts the environment. While some studies have explored pruning techniques to improve the memory efficiency of diffusion models, most existing methods require extensive retraining to maintain model performance. Retraining a large model is extremely costly and resource-intensive, which limits the practicality of pruning methods. In this work, we achieve low-cost pruning by proposing a general pruning framework for vision generative models that learns a differentiable mask to sparsify the model. To learn a mask that minimally deteriorates the model, we design a novel end-to-end pruning objective that spans the entire generation process over all steps. Since end-to-end pruning is memory-intensive, we further design a time-step gradient checkpointing technique that significantly reduces memory usage during optimization, enabling end-to-end pruning within a limited memory budget. Results on the state-of-the-art U-Net diffusion model Stable Diffusion XL (SDXL) and the DiT flow model FLUX show that our method efficiently prunes 20% of parameters in just 10 A100 GPU hours, outperforming previous pruning approaches.
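A minimal sketch of the differentiable-mask idea: each prunable weight gets a learnable logit, the forward pass applies a hard 0/1 mask, and gradients flow through a sigmoid relaxation. The straight-through estimator, temperature, and sparsity penalty below are assumptions, and the paper's time-step gradient checkpointing is omitted.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, base: nn.Linear, temperature: float = 1.0):
        super().__init__()
        self.base = base
        self.logits = nn.Parameter(torch.full_like(base.weight, 2.0))  # start ~unpruned
        self.temperature = temperature

    def forward(self, x):
        soft = torch.sigmoid(self.logits / self.temperature)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()               # straight-through estimator
        return nn.functional.linear(x, self.base.weight * mask, self.base.bias)

def sparsity_penalty(layers, target: float = 0.2):
    # Push the expected fraction of kept weights towards (1 - target), e.g. prune ~20%.
    kept = torch.stack([torch.sigmoid(l.logits).mean() for l in layers])
    return (kept.mean() - (1.0 - target)).abs()
```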
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3329. Improving Set Function Approximation with Quasi-Arithmetic Neural Networks
๐ฌ 15
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ representation learning set function learning
Sets represent a fundamental abstraction across many types of data. To handle the unordered nature of set-structured data, models such as DeepSets and PointNet rely on fixed, non-learnable pooling operations (e.g., sum or max) -- a design choice that can hinder the transferability of learned embeddings and limits model expressivity. More recently, learnable aggregation functions have been proposed as more expressive alternatives. In this work, we advance this line of research by introducing the Neuralized Kolmogorov Mean (NKM) -- a novel, trainable framework for learning a generalized measure of central tendency through an invertible neural function. We further propose quasi-arithmetic neural networks (QUANNs), which incorporate the NKM as a learnable aggregation function. We provide a theoretical analysis showing that QUANNs are universal approximators for a broad class of common set-function decompositions and, thanks to their invertible neural components, learn more structured latent representations. Empirically, QUANNs outperform state-of-the-art baselines across diverse benchmarks, while learning embeddings that transfer effectively even to tasks that do not involve sets.
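The quasi-arithmetic pooling at the heart of the NKM is $f^{-1}\big(\tfrac{1}{n}\sum_i f(x_i)\big)$ for an invertible $f$. The sketch below keeps $f$ elementwise and strictly increasing so its inverse is available in closed form; this parameterization is an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralKolmogorovMean(nn.Module):
    def __init__(self):
        super().__init__()
        self.a1 = nn.Parameter(torch.zeros(1))   # slope = softplus(a1) > 0
        self.b1 = nn.Parameter(torch.zeros(1))
        self.a2 = nn.Parameter(torch.zeros(1))
        self.b2 = nn.Parameter(torch.zeros(1))
        self.alpha = 0.1                          # leaky-ReLU slope, invertible

    def f(self, x):
        h = F.softplus(self.a1) * x + self.b1
        h = F.leaky_relu(h, self.alpha)
        return F.softplus(self.a2) * h + self.b2

    def f_inv(self, y):
        h = (y - self.b2) / F.softplus(self.a2)
        h = torch.where(h >= 0, h, h / self.alpha)   # inverse of leaky ReLU
        return (h - self.b1) / F.softplus(self.a1)

    def forward(self, x: torch.Tensor, dim: int = 1) -> torch.Tensor:
        """Quasi-arithmetic pooling: f^{-1}( mean_i f(x_i) ) along the set dimension."""
        return self.f_inv(self.f(x).mean(dim=dim))

pool = NeuralKolmogorovMean()
out = pool(torch.randn(4, 10, 32))               # (batch, set, features) -> (batch, features)
```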
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3330. Latent-to-Data Cascaded Diffusion Models for Unconditional Time Series Generation
๐ฌ 15
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ time series unconditional synthetic
Synthetic time series generation (TSG) is crucial for applications such as privacy preservation, data augmentation, and anomaly detection. A key challenge in TSG lies in modeling the multi-modal distributions of time series, which requires simultaneously capturing diverse high-level representation distributions and preserving local temporal fidelity. Most existing diffusion models, however, are constrained by their single-space focus: latent-space models capture representation distributions but often compromise local fidelity, while data-space models preserve local details but struggle to learn the high-level representations essential for multi-modal time series.
To address these limitations, we propose L2D-Diff, a dual-space diffusion framework for synthetic time series generation. Specifically, L2D-Diff first compresses input sequences into a latent space to efficiently model the distribution of time series representations. The distribution then guides a data-space diffusion model to refine local data details, enabling faithful generation of time series distribution without relying on external conditions.
Experiments on both single-modal and multi-modal datasets demonstrate the effectiveness of L2D-Diff in tackling unconditional TSG tasks. Ablation studies further highlight the necessity and impact of its dual-space design, showcasing its capability to achieve representation coherence and local fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3331. Towards Interpretable Visual Decoding with Attention to Brain Representations
๐ฌ 13
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain Decoding Visual Reconstruction fMRI Stable Diffusion Interpretability
Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, helping brain science researchers interpret how the brain represents real-world scenes. However, most current approaches leverage mapping brain signals into intermediate image or text feature spaces before guiding the generative process, masking the effect of contributions from different brain areas on the final reconstruction output. In this work, we propose *NeuroAdapter*, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals shape the generation process. To this end, we contribute an Image-Brain Bi-directional interpretability framework (*IBBI*) which investigates cross-attention mechanisms across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our results highlight the potential of end-to-end brain-to-image decoding and establish a path toward interpreting diffusion models through the lens of visual neuroscience.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3332. AutoDV: An End-to-End Deep Learning Model for High-Dimensional Data Visualization
๐ฌ 17
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ High-dimensional Data Visualization Deep Learning Cross-dimension Generalization
High-dimensional data visualization (HDV) plays an important role in data science and engineering applications. Traditional HDV methods, such as Autoencoder and t-SNE, require hyperparameter tuning and iterative optimization on every dataset and cannot effectively utilize the knowledge from historical datasets, which lowers the efficiency, convenience, and accuracy in real applications. In this paper, we present AutoDV, an end-to-end deep learning model, for high-dimensional data visualization. AutoDV is built upon a graph transformer network and an invariant loss function and is trained on a number of diverse datasets converted into multi-weight graphs. Given a new dataset, AutoDV outputs the 2D or 3D embeddings of all data points directly. AutoDV has the following merits: 1) There is no hyperparameter selection during the data visualization stage; 2) The end-to-end model avoids re-training or iterative optimization when visualizing data; 3) The input dataset can have any number of features and can be from any domain. Our experiments show that AutoDV successfully generalizes to unseen datasets without retraining, reaching 89.37\% of the precision of t-SNE and 91.05\% of the precision of UMAP on the unseen CIFAR10 dataset. Compared with existing parametric deep models for data visualization, our method obtains a significant improvement, with an 86.65\% precision gain. AutoDV can perform even better than t-SNE and UMAP on gene and UCI tabular datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3333. Bi-Lipschitz Autoencoder With Injectivity Guarantee
๐ฌ 16
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Autoencoder Injectivity
Autoencoders are widely used for dimensionality reduction, based on the assumption that high-dimensional data lies on low-dimensional manifolds. Regularized autoencoders aim to preserve manifold geometry during dimensionality reduction, but existing approaches often suffer from non-injective mappings and overly rigid constraints that limit their effectiveness and robustness. In this work, we identify encoder non-injectivity as a core bottleneck that leads to poor convergence and distorted latent representations. To ensure robustness across data distributions, we formalize the concept of admissible regularization and provide sufficient conditions for its satisfaction. Building on this analysis, we propose the Bi-Lipschitz Autoencoder (BLAE), which introduces two key innovations: (1) an injective regularization scheme based on a separation criterion to eliminate pathological local minima, and (2) a bi-Lipschitz relaxation that preserves geometry and exhibits robustness to data distribution drift. Empirical results on diverse datasets show that BLAE consistently outperforms existing methods in preserving manifold structure while remaining resilient to sampling sparsity and distribution shifts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3334. Beyond Match Maximization and Fairness: Retention-Objectified Two-Sided Matching
๐ฌ 23
โญ 5.00
๐ optimization
๐ท๏ธ Two-Sided Matching User Retention Learning-to-Rank
On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. One might turn to fairness objectives to mitigate the problems caused by match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, relying on fairness alone leaves the optimization of retention to chance.
In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user's profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention, since conventional methods optimize matches or fairness rather than retention.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3335. NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
๐ฌ 20
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ large language models benchmark virtual environment generalization agent scientific law discovery
Large language models (LLMs) are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science.
However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce **NewtonBench**, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant.
Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive evaluation of 11 state-of-the-art LLMs reveals a clear but fragile capability for discovery in frontier models: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge for the future of automated science. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3336. In-Context Compositional Q-Learning for Offline Reinforcement Learning
๐ฌ 15
โญ 5.00
๐ reinforcement learning
๐ท๏ธ In-context Learning Reinforcement Learning
Accurately estimating the Q-function is a central challenge in offline reinforcement learning. However, existing approaches often rely on a single global Q-function, which struggles to capture the compositional nature of tasks involving diverse subtasks. We propose In-context Compositional Q-Learning ($\texttt{ICQL}$), the first offline RL framework that formulates Q-learning as a contextual inference problem, using linear Transformers to adaptively infer local Q-functions from retrieved transitions without explicit subtask labels. Theoretically, we show that under two assumptions, linear approximability of the local Q-function and accurate weight inference from the retrieved context, $\texttt{ICQL}$ achieves bounded Q-function approximation error and supports near-optimal policy extraction. Empirically, $\texttt{ICQL}$ substantially improves performance in offline settings: by up to 29.46\% in Kitchen tasks, and by up to 6\% in Gym and Adroit tasks. These results highlight the underexplored potential of in-context learning for robust and compositional value estimation, positioning $\texttt{ICQL}$ as a principled and effective framework for offline RL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3337. Improving Classifier-Free Guidance in Masked Diffusion: Low-Dim Theoretical Insights with High-Dim Impact
๐ฌ 9
โญ 5.00
๐ generative models
๐ท๏ธ discrete diffusion conditional generation guidance
Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and its extensions to discrete diffusion have recently started to be investigated. In order to improve the algorithms in a principled way, this paper starts by analyzing the exact effect of CFG in the context of a low-dimensional masked diffusion model, with a special emphasis on the guidance schedule. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance has a larger effect. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of current CFG implementations: they can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism. Intuitively, our method smooths the transport between the data distribution and the initial (masked) distribution, which results in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. Experiments on conditional image and text generation empirically confirm the efficacy of our method.
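For readers unfamiliar with how a guidance schedule enters the sampler, the following minimal Python sketch shows generic classifier-free guidance applied to per-token categorical logits with a time-dependent weight that stays small early (heavily masked inputs) and grows late, in the spirit of the finding above. The linear schedule, the combination rule, and all names are illustrative assumptions, not the paper's proposed mechanism.

```python
import numpy as np

def guidance_weight(t, w_max=3.0):
    """Illustrative guidance schedule: t runs from 1 (fully masked, early)
    to 0 (fully unmasked, late). The weight is small early and large late,
    reflecting the finding that high early guidance harms quality.
    The linear form is an assumption for illustration."""
    return w_max * (1.0 - t)

def guided_logits(cond_logits, uncond_logits, t):
    """Standard CFG-style combination applied to per-token categorical logits."""
    w = guidance_weight(t)
    return uncond_logits + (1.0 + w) * (cond_logits - uncond_logits)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cond = rng.normal(size=(4, 16))    # (tokens, vocab) conditional logits
    uncond = rng.normal(size=(4, 16))  # unconditional logits
    for t in (0.9, 0.5, 0.1):          # early -> late in sampling
        probs = np.exp(guided_logits(cond, uncond, t))
        probs /= probs.sum(axis=-1, keepdims=True)
        print(f"t={t:.1f}, max prob={probs.max():.3f}")
```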
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3338. Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness
๐ฌ 16
โญ 5.00
๐ optimization
๐ท๏ธ Implicit Bias Implicit Regularization Loss of Plasticity Matrix Completion Depth Low-Rank Theory Dynamics
We study matrix completion via deep matrix factorization (a.k.a. deep linear neural networks) as a simplified testbed to examine how network depth influences training dynamics. Despite the simplicity and importance of the problem, prior theory largely focuses on shallow (depth-2) models and does not fully explain the implicit low-rank bias observed in deeper networks. We identify coupled dynamics as a key mechanism behind this bias and show that it intensifies with increasing depth. Focusing on gradient flow under diagonal observations, we prove: (a) networks of depth $\geq 3$ exhibit coupling unless initialized diagonally, and (b) convergence to rank-1 occurs if and only if the dynamics is coupled, resolving an open question by Menon (2024) for a family of initializations. We also revisit the loss of plasticity phenomenon in matrix completion (Kleinman et al., 2024), where pre-training on few observations and resuming with more degrades performance. We show that deep models avoid plasticity loss due to their low-rank bias, whereas depth-2 networks pre-trained under decoupled dynamics fail to converge to low-rank, even when resumed training (with additional data) satisfies the coupling condition, shedding light on the mechanism behind this phenomenon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3339. VERIFY: A Novel Multi-Domain Dataset Grounding LTL in Contextual Natural Language via Provable Intermediate Logic
๐ฌ 12
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ LLM Linear Temporal Logic model checking formal verification
Bridging the gap between the formal precision of system specifications and the nuances of human language is critical for reliable engineering, robotics, and AI safety, but it remains a major bottleneck. Prior efforts in grounding formal logic remain fragmented, resulting in datasets that are very small-scale (~2-5k examples), domain-specific, or translate logic into overly technical forms rather than context-rich natural language (NL), and thus fail to adequately bridge formal methods and practical NLP. To address this gap, we introduce VERIFY, the first large-scale dataset meticulously designed to unify these elements. This dataset contains more than 200k rigorously generated triplets, each comprising a Linear Temporal Logic (LTL) formula, a structured, human-readable 'Intermediate Technical Language' (ITL) representation designed as a bridge between logic and text, and a domain-specific NL description contextualized across 13 diverse domains. VERIFY's construction pipeline ensures high fidelity: LTL formulas are enumerated and verified via model checking, mapped to the novel ITL representation using a provably complete formal grammar, and then translated into context-aware NL via LLM-driven generation. We guarantee data quality through extensive validation protocols, including manual expert verification of 10,000 diverse samples. Furthermore, automated semantic consistency checks judged by Llama 3.3 confirmed an estimated >97% semantic correctness. In initial experiments, we demonstrate VERIFY's scalability, logical complexity, and contextual diversity, significantly challenging standard models such as T5 and Llama 3.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3340. Knowledge Distillation as Decontamination? Revisiting the "Data Laundering" Concern
๐ฌ 14
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Knowledge Distillation Data Contamination Benchmark Integrity Data Decontamination
Concerns have been raised that knowledge distillation may transfer test-set knowledge from a contaminated teacher to a clean student - a "data laundering" effect that potentially threatens evaluation integrity. In this paper, we assess the severity of this phenomenon. If these concerns regarding data laundering are minor, then distillation could be used to mitigate risks of direct data exposure. Across eight classification benchmarks, we find that substantial laundering is the exception rather than the rule: unlike the large performance gains from direct contamination, any accuracy inflation from laundering is consistently smaller and statistically insignificant in all but two cases. More broadly, we find that the two phenomena are weakly correlated, suggesting that laundering is not simply a diluted form of contamination but a distinct effect that arises primarily when benchmarks exhibit large train-test distribution gaps. Motivated by this, we conduct controlled experiments that systematically enlarge the train-test distance on two benchmarks where laundering was initially negligible, and observe that laundering becomes more significant as the gap widens. Taken together, our results indicate that knowledge distillation, despite rare benchmark-specific residues, can be expected to function as an effective decontamination technique that largely mitigates test-data leakage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3341. EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
๐ฌ 28
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Editing Reinforcement Learning Reward Model
Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce $\textbf{EditReward-Bench}$, a comprehensive benchmark to systematically evaluate reward models on editing quality. Guided by this benchmark, we develop $\textbf{EditScore}$, an efficient model to evaluate the quality of instruction-guided editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain. Our code, models, and benchmark will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3342. Theoretical Modeling of Large Language Model Self-Improvement Training Dynamics Through Solver-Verifier Gap
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Training Dynamics Self-Improvement
Self-improvement is a significant technique for large language models (LLMs), aiming to enhance performance without relying on external data. Despite its significance, how LLM performance generally evolves during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of the solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between the LLM's solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experiment results. We validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3343. Don't Pass@$k$: A Bayesian Framework for Large Language Model Evaluation
๐ฌ 11
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Evaluation Metrics Bayesian Methods Uncertainty Quantification Model Ranking Reasoning Statistical Significance
Pass@$k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass@$k$ and average accuracy over $N$ trials (avg@$N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@$1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT, and BrUMO, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass@$k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass@$k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit.
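To make the proposed protocol concrete in its simplest (binary) case, here is a small Python sketch of a Beta posterior mean and equal-tailed credible interval under a uniform prior, compared with avg@$N$; the helper names and the use of scipy are my own choices, not code from the paper.

```python
from scipy.stats import beta

def posterior_summary(successes, trials, a0=1.0, b0=1.0, level=0.95):
    """Beta posterior for a binary success probability under a Beta(a0, b0)
    prior (uniform by default). Returns the posterior mean and an
    equal-tailed credible interval."""
    a, b = a0 + successes, b0 + (trials - successes)
    mean = a / (a + b)
    lo, hi = beta.ppf([(1 - level) / 2, (1 + level) / 2], a, b)
    return mean, (lo, hi)

if __name__ == "__main__":
    # Two models with few trials: average accuracy alone hides the uncertainty.
    for name, k, n in [("model_a", 7, 10), ("model_b", 9, 10)]:
        mean, (lo, hi) = posterior_summary(k, n)
        print(f"{name}: avg@{n}={k/n:.2f}, posterior mean={mean:.2f}, "
              f"95% CI=({lo:.2f}, {hi:.2f})")
    # Overlapping credible intervals suggest the observed gap may be noise.
```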
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3344. Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
๐ฌ 22
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language models jailbreak attacks meta-optimization
Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty in the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined with fine-grained, dense feedback from a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3345. TaCo: A Benchmark for Lossless and Lossy Codecs of Heterogeneous Tactile Data
๐ฌ 9
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Tactile Dataset Lossless Compression Lossy Compression Heterogeneous Tactile Data
Tactile sensing is crucial for embodied intelligence, providing fine-grained perception and control in complex environments. However, efficient tactile data compression, which is essential for real-time robotic applications under strict bandwidth constraints, remains underexplored. The inherent heterogeneity and spatiotemporal complexity of tactile data further complicate this challenge. To bridge this gap, we introduce TaCo, the first comprehensive benchmark for Tactile data Codecs. TaCo evaluates 30 compression methods, including off-the-shelf compression algorithms and neural codecs, across five diverse datasets from various sensor types. We systematically assess both lossless and lossy compression schemes on four key tasks: lossless storage, human visualization, material and object classification, and dexterous robotic grasping. Notably, we pioneer the development of data-driven codecs explicitly trained on tactile data, TaCo-LL (lossless) and TaCo-L (lossy). Results have validated the superior performance of our TaCo-LL and TaCo-L. This benchmark provides a foundational framework for understanding the critical trade-offs between compression efficiency and task performance, paving the way for future advances in tactile perception.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3346. Bird's-eye-view Informed Reasoning Driver
๐ฌ 13
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous driving Key Intent Points
Motion planning in complex environments remains a core challenge for autonomous driving. While existing rule-based or imitation learning-based motion planning methods perform well in common scenarios, they often struggle with complex, long-tail scenarios. To address this problem, we introduce the Bird's-eye-view Informed Reasoning Driver (BIRDriver), a hierarchical framework that combines a Vision-Language Model (VLM) with a motion planner. BIRDriver leverages the common sense reasoning capabilities of the VLM to effectively handle these challenging long-tail scenarios. Unlike prior methods that require domain-specific encoders and costly alignment, our approach compresses the environment into a single-frame bird's-eye-view (BEV) map, a paradigm that enables the model to fully leverage its knowledge from internet-scale pre-training. It then generates high-level key points, which are encoded and passed to the motion planner to produce the final trajectory. However, a major challenge is that standard VLMs struggle to generate the precise numerical coordinates required for such key points. We address this limitation by fine-tuning the VLM on a composite dataset of three auxiliary data types to enhance spatial localization, scene understanding, and key-point generation, complemented by a token-level weighting mechanism for improved numerical precision. Experiments on the nuPlan dataset demonstrate that BIRDriver outperforms the base motion planner in most cases on both Test14-hard and Test14-random benchmarks, and achieves state-of-the-art (SOTA) performance on the InterPlan long-tail benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3347. MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning
๐ฌ 5
โญ 5.00
๐ generative models
๐ท๏ธ Image Generation;Next-Scale Prediction;Markovian Conditioning
Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that only takes as input the features of the adjacent preceding scale for next-scale prediction, enabling the adoption of a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size $k$ at corresponding positions on adjacent scales, rather than attending to every token across these scales, for the pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from $\mathcal{O}(N^{2})$ to $\mathcal{O}(Nk)$, enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small models trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
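A minimal sketch of the spatial-Markov idea in one dimension: each query token may attend only to a window of size k around its corresponding position on the adjacent preceding scale, so each attention row has at most k nonzeros. The 1-D layout, linear position rescaling, and names are simplifying assumptions (the paper operates on 2-D multi-scale token maps).

```python
import numpy as np

def local_window_mask(num_queries, num_keys, k):
    """Boolean attention mask: query i may attend only to keys within a
    window of size k around its (rescaled) corresponding position on the
    previous scale. 1-D layout and linear rescaling are simplifications."""
    mask = np.zeros((num_queries, num_keys), dtype=bool)
    for i in range(num_queries):
        center = int(round(i * (num_keys - 1) / max(num_queries - 1, 1)))
        lo = max(center - k // 2, 0)
        hi = min(lo + k, num_keys)
        mask[i, lo:hi] = True
    return mask

if __name__ == "__main__":
    m = local_window_mask(num_queries=8, num_keys=4, k=3)
    print(m.astype(int))
    # Each row has at most k ones, so attention cost scales as O(N * k)
    # instead of O(N^2).
```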
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3348. Learning under Quantization for High-Dimensional Linear Regression
๐ฌ 27
โญ 5.00
๐ learning theory
๐ท๏ธ quantization generalization linear regression
The use of low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization: data, labels, parameters, activations, and gradients. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how different quantization affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum; and data and label quantization introduce an additional bias error. Crucially, we prove that for multiplicative quantization, this spectral distortion can be eliminated, and for additive quantization, a beneficial scaling effect with batch size emerges. Furthermore, under common polynomial-decay data spectrum scenarios, we quantitatively compare FP and Integer quantization methods, identifying the settings where each is more suitable. Our theory provides a powerful lens to characterize how quantization shapes the learning dynamics of optimization algorithms, paving the way to further explore learning theory under practical hardware constraints.
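As a toy illustration of one of the regimes analyzed (gradient quantization), the sketch below runs finite-step SGD for linear regression where each stochastic gradient passes through a simple uniform quantizer before the update; the quantizer, step size, and data model are illustrative assumptions rather than the paper's exact setting.

```python
import numpy as np

def quantize_uniform(x, num_bits=4, scale=1.0):
    """Simple uniform (additive-error) quantizer clipped to [-scale, scale]."""
    levels = 2 ** num_bits - 1
    step = 2 * scale / levels
    return np.clip(np.round(x / step) * step, -scale, scale)

def sgd_quantized_gradients(X, y, lr=0.01, num_bits=4, steps=500, seed=0):
    """Finite-step SGD for linear regression where each stochastic gradient
    is quantized before the update (one of the regimes the paper studies)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(X))
        grad = (X[i] @ w - y[i]) * X[i]  # per-sample squared-loss gradient
        w -= lr * quantize_uniform(grad, num_bits=num_bits)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w_star = rng.normal(size=10)
    X = rng.normal(size=(2000, 10))
    y = X @ w_star + 0.1 * rng.normal(size=2000)
    w_hat = sgd_quantized_gradients(X, y)
    print("excess risk proxy:", float(np.mean((X @ (w_hat - w_star)) ** 2)))
```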
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3349. Capability-Based Scaling Laws for LLM-Based Red-Teaming
๐ฌ 23
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ jailbreaks red-teaming ai safety
As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a \emph{weak-to-strong} problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the \emph{capability gap} between attacker and target. We evaluate more than 600 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these observations, we derive a \emph{jailbreaking scaling curve} that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3350. Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering
๐ฌ 28
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Audio-visual question answering Multimodal Music scene understanding
Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio-visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial-Temporal-Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio-visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments conducted on several AVQA benchmarks demonstrate the effectiveness of our proposed method, achieving significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3351. Understanding Routing Mechanism in Mixture-of-Experts Language Models
๐ฌ 18
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Large Language Model Mixture-of-Experts Routing Mechanism
Mixture-of-Experts (MoE) has been a prevalent method for scaling up large language models at a reduced computational cost. Despite its effectiveness, the routing mechanism of MoE still lacks a clear understanding from the perspective of cross-layer mechanistic interpretability. We propose a lightweight methodology with which we can recursively break down MoE routing decisions into the contributions of model components. We use our methodology to dissect the routing mechanism by decomposing the input of routers into model components, and study how different model components contribute to routing in widely used open models. Our findings on four different LLMs reveal common patterns such as: a) MoE layer outputs contribute more than attention layer outputs to the routing decisions of later layers, b) \emph{MoE entanglement}, in which MoE activations in earlier layers consistently correlate with MoE activations in later layers, and c) some components can persistently influence the routing in many following layers. Our study also includes findings on how models differ in the long-range and short-range inhibiting/promoting effects that components can have on MoE in later layers. Our results indicate the importance of quantifying the impact of components across different layers on MoE to understand the mechanism of routing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3352. Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
๐ฌ 29
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal reasoning visual question answering vision-language model information-intensive images speculative decoding
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements.
The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence.
We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model.
In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers.
To further improve both efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict.
Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines.
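A tiny sketch of the consensus-selection idea: each draft expert returns an (answer, reasoning path) pair, and only the paths whose answers agree with the most common answer are forwarded to the verdict model. The majority rule and data layout here are my assumptions about one reasonable instantiation, not code from the paper.

```python
from collections import Counter

def consensus_filter(drafts):
    """drafts: list of (answer, reasoning_path) pairs from small draft VLMs.
    Keep only the reasoning paths whose answer matches the most common
    (highest-agreement) answer; these are forwarded to the verdict model."""
    counts = Counter(answer for answer, _ in drafts)
    top_answer, _ = counts.most_common(1)[0]
    kept = [path for answer, path in drafts if answer == top_answer]
    return top_answer, kept

if __name__ == "__main__":
    drafts = [
        ("42%", "read the 2021 bar, then the legend ..."),
        ("42%", "located the highlighted sector ..."),
        ("37%", "misread the adjacent column ..."),
    ]
    answer, paths = consensus_filter(drafts)
    print(answer, len(paths), "paths forwarded to the verdict model")
```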
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3353. DR-GGAD: Dual Residual Centering for Mitigating Anomaly Non-Discriminativity in Generalist Graph Anomaly Detection
๐ฌ 29
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Mining Social Network Analysis Graph Anomaly Detection
Generalist Graph Anomaly Detection (GGAD) seeks a unified representation learning model to detect anomalies in unseen graphs, but cross-domain transfer often entangles the learned anomalous and normal representations. We formalize this degradation as Anomaly non-Discriminativity (AnD) and define a normalized score to quantify it. We present DR-GGAD, which avoids direct comparison between anomalous and normal nodes via two residual modules: 1) a multi-scale Hyper Residual (HR) Center measuring node-to-center distances, yielding a compact normal residual structure with margin-pushed anomalies; 2) an Affinity-Residual (AR) module enforcing local residual directional consistency to recover structural separability. With frozen parameters (no target fine-tuning), DR-GGAD fuses both signals into a unified score. On 8 benchmark target graphs, it achieves new SOTA: mean AUROC +5.14% over the best prior GGAD, with large gains on high-AnD datasets (Facebook +12.59%, ACM +9.96%, Amazon +7.48%) and strong AUPRC boosts (Amazon +17.12%, CiteSeer +17.77%). Ablations confirm complementary roles of the two modules. DR-GGAD thus establishes AnD as a measurable bottleneck and delivers robust cross-domain anomaly detection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3354. BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
๐ฌ 17
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM-as-a-Judge bias
LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically discovering, at scale, potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. Moreover, based on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-judge. Strikingly, even powerful LLMs as evaluators show error rates above 50% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to mitigate potential biases further.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3355. Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
๐ฌ 29
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Information Extraction Document Analysis Small Language Models Reinforcement Learning
Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template - a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it (optionally with an LLM-as-judge), yielding auditable supervision. Pillar 2: SLM fine-tuning. We then train compact models on the LLM-generated CoST traces/structured data in two phases - Supervised Fine-Tuning for structure/format/steps, followed by Group Relative Policy Optimization with dual rewards for answer/format quality and process consistency - transferring structure-first behavior to SLMs for low-latency deployment. This approach achieves LLM-comparable quality on finance and legal long-document QA (Loong) with 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3356. APT: Towards Universal Scene Graph Generation via Plug-in Adaptive Prompt Tuning
๐ฌ 15
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Prompt Tuning Scene Graph Generation Open Vocabulary
Scene Graph Generation (SGG) is pivotal for structured visual understanding, yet it remains hindered by a fundamental limitation: the reliance on fixed, frozen semantic representations from pre-trained language models. These semantic priors, while beneficial in other domains, are inherently misaligned with the dynamic, context-sensitive nature of visual relationships, leading to biased and suboptimal performance. In this paper, we transcend the traditional one-stage vs. two-stage architectural debate and identify this representational bottleneck as the core issue. We introduce Adaptive Prompt Tuning (APT), a universal paradigm that converts frozen semantic features into dynamic, context-aware representations through lightweight, learnable prompts. APT acts as a plug-in module that can be seamlessly integrated into existing SGG frameworks. Extensive experiments demonstrate that APT achieves a +2.7 improvement in mR@100 on PredCls, a +3.6 gain in F@100, and up to a +6.0 gain in mR@50 on open-vocabulary novel splits. Notably, it achieves this with less than 0.5M additional parameters (<1.5\% overhead) and 7.8\%-25\% reduced training time, establishing a new state-of-the-art while offering a unified, efficient, and scalable solution for future SGG research. The source code of APT is available at .
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3357. Expressive and Invariant Graph Learning via Canonical Tree Cover Neural Networks
๐ฌ 20
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph neural networks canonicalization invariance tree molecule graph
While message-passing NNs (MPNNs) are naturally invariant on graphs, they are fundamentally limited in expressive power. Canonicalization offers a powerful alternative by mapping each graph to a unique, invariant representation on which expressive encoders can operate. However, existing approaches rely on a single canonical sequence, which flattens the structure, distorts graph distances, and restricts expressivity. To address these limitations, we introduce Canonical Tree Cover Neural Networks (CTNNs), which represent the graph with a canonical spanning tree cover, i.e., a small collection of canonical trees covering all edges. Each tree is then processed with an existing expressive tree encoder. Theoretically, tree covers better preserve graph distances than sequences, and on sparse graphs, the cover recovers all edges with a logarithmic number of trees in the graph size, making CTNNs strictly more expressive than sequence-based canonicalization pipelines. Empirically, CTNNs consistently outperform invariant GNNs, random samplers, and sequence canonicalizations across graph classification benchmarks. Overall, CTNNs advance graph learning by providing an efficient, invariant, and expressive representation learning framework via tree cover-based canonicalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3358. A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Model Calibration Angular Diversity Uniformity Vision-Language Models CLIP Prompt Tuning Test-Time Adaptation
Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
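The core quantity is easy to state in code: normalize the class-wise textual features onto the unit hypersphere and push up the minimum pairwise angular distance, for instance by penalizing the maximum off-diagonal cosine similarity. The PyTorch sketch below illustrates this; treating the max-cosine penalty as the training signal is my simplification, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def max_pairwise_cosine(text_features):
    """text_features: (num_classes, dim) class-wise textual features produced
    by the learnable prompts. Normalize onto the unit hypersphere and return
    the largest off-diagonal cosine similarity; minimizing it maximizes the
    minimum pairwise angular distance between class features."""
    z = F.normalize(text_features, dim=-1)
    cos = z @ z.t()
    off_diag = cos[~torch.eye(len(cos), dtype=torch.bool)]
    return off_diag.max()

if __name__ == "__main__":
    feats = torch.randn(5, 64, requires_grad=True)  # stand-in for prompt-induced features
    loss = max_pairwise_cosine(feats)
    loss.backward()  # in TPT, gradients would flow into the learnable prompts
    angle_deg = torch.rad2deg(torch.arccos(loss.detach().clamp(-1, 1)))
    print(f"max cosine {loss.item():.3f} -> min pairwise angle {angle_deg:.1f} deg")
```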
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3359. Riemannian Zeroth-Order Gradient Estimation with Structure-Preserving Metrics for Geodesically Incomplete Manifolds
๐ฌ 16
โญ 5.00
๐ optimization
๐ท๏ธ zeroth-order optimization riemannian optimization stochastic gradient descent
In this paper, we study Riemannian zeroth-order optimization in settings where the underlying Riemannian metric $g$ is geodesically incomplete, and the goal is to approximate stationary points with respect to this incomplete metric. To address this challenge, we construct structure-preserving metrics that are geodesically complete while ensuring that every stationary point under the new metric remains stationary under the original one. Building on this foundation, we revisit the classical symmetric two-point zeroth-order estimator and analyze its mean-squared error from an intrinsic perspective, depending only on the manifold's geometry rather than any ambient embedding. Leveraging this intrinsic analysis, we establish convergence guarantees for stochastic gradient descent (SGD) with this intrinsic estimator. Under additional suitable conditions, an $\epsilon$-stationary point under the constructed metric $g'$ also corresponds to an $\epsilon$-stationary point under the original metric $g$, thereby matching the best-known complexity in the geodesically complete setting. Empirical studies on synthetic problems confirm our theoretical findings, and experiments on a practical mesh optimization task demonstrate that our framework maintains stable convergence even in the absence of geodesic completeness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3360. Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning
๐ฌ 16
โญ 5.00
๐ causal reasoning
๐ท๏ธ Causal Discovery Bayesian Networks DAGs Structure Learning
We present FLOP (Fast Learning of Order and Parents), a score-based causal discovery algorithm for linear models. It pairs fast parent selection with iterative Cholesky-based score updates, cutting run-times over prior algorithms. This makes it feasible to fully embrace discrete search, enabling iterated local search with principled order initialization to find graphs with scores at or close to the global optimum. The resulting structures are highly accurate across benchmarks, with near-perfect recovery in standard settings. This performance calls for revisiting discrete search over graphs as a reasonable approach to causal discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3361. RL for Reasoning by Adaptively Revealing Rationales
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reasoning reinforcement learning curriculum learning sequence modeling
Learning in the combinatorially large output space of sequence generation problems is challenging as providing expert demonstrations scales poorly with sequence length, and RL struggles with sparse rewards. Between dense demonstrations in supervised training and no demonstrations in reinforcement learning lies an underexplored regime: partial supervision. We ask whether some classes of sequence learning problems become efficiently learnable by exploiting this gap.
We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals a partial prefix of the target output.
The supervision length is adjusted dynamically for each sample based on the modelโs past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions.
We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality - it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize.
Using a synthetic task with latent parity constraints, we show that AdaBack reliably solves problems that are otherwise intractable. On two mathematical reasoning benchmarks, MATH and GSM8k, we find that AdaBack enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
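A minimal sketch of a per-sample curriculum in this spirit: each training example tracks what fraction of its target rationale is revealed as a prefix, and that fraction shrinks after high-reward attempts and grows after failures. The threshold, step size, and token-level cut are illustrative assumptions, not the paper's exact update rule.

```python
def update_revealed_fraction(frac, reward, success_threshold=0.8,
                             step=0.1, min_frac=0.0, max_frac=1.0):
    """Per-sample curriculum update: if the model's recent reward on this
    example is high, reveal less of the target rationale next time
    (backtrack toward pure RL); otherwise reveal more (closer to SFT).
    Threshold and step size are illustrative, not the paper's values."""
    if reward >= success_threshold:
        return max(min_frac, frac - step)
    return min(max_frac, frac + step)

def revealed_prefix(rationale_tokens, frac):
    """Return the partial prefix of the target rationale to condition on."""
    cut = int(round(frac * len(rationale_tokens)))
    return rationale_tokens[:cut]

if __name__ == "__main__":
    tokens = ["step1", "step2", "step3", "step4", "step5"]
    frac = 0.6
    for reward in (0.0, 0.0, 1.0, 1.0):   # simulated per-sample rewards
        frac = update_revealed_fraction(frac, reward)
        print(f"reward={reward}, reveal {revealed_prefix(tokens, frac)}")
```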
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3362. Navigating the Accuracy-Size Trade-Off with Flexible Model Merging
๐ฌ 14
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Model Merging Model Fusion Task Arithmetic Multi-task Learning
Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to individual fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high storage costs. We propose FlexMerge, a novel data-free model merging framework that: (a) flexibly generates merged models of varying sizes, spanning the full spectrum from a single merged model to retaining all individual fine-tuned models; and (b) supports multiple merging algorithms in a unified framework. Using FlexMerge, we systematically characterize the accuracy-size trade-off of different algorithms. Our study reveals two key findings: first, even modestly larger merged models can yield steep accuracy gains (up to 13.5% when just doubling the size); second, algorithm rankings are not consistent as size increases, with some methods overtaking others beyond the one-model regime. These results uncover a new design dimension for model merging: developing and comparing algorithms across the full spectrum of sizes rather than only at the single-model limit. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, confirm the generality and practicality of FlexMerge.
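For context, the single-model end of the accuracy-size spectrum can be realized with plain task arithmetic, sketched below: task vectors (finetuned minus pretrained) are summed, scaled, and added back onto the pretrained weights. The dict-of-arrays representation and the scaling coefficient are assumptions for illustration; FlexMerge itself generalizes beyond this one-model limit.

```python
import numpy as np

def task_arithmetic_merge(pretrained, finetuned_models, alpha=0.3):
    """Single-model end of the accuracy-size spectrum: add the scaled sum of
    task vectors (finetuned - pretrained) back onto the pretrained weights.
    Models are represented as {name: ndarray} dicts; alpha is illustrative."""
    merged = {}
    for name, base in pretrained.items():
        task_vectors = [ft[name] - base for ft in finetuned_models]
        merged[name] = base + alpha * np.sum(task_vectors, axis=0)
    return merged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = {"linear.weight": rng.normal(size=(4, 4))}
    ft_a = {"linear.weight": base["linear.weight"] + 0.1 * rng.normal(size=(4, 4))}
    ft_b = {"linear.weight": base["linear.weight"] + 0.1 * rng.normal(size=(4, 4))}
    merged = task_arithmetic_merge(base, [ft_a, ft_b])
    print(merged["linear.weight"].shape)
```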
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3363. Scaling up Memory for Robotic Control via Experience Retrieval
๐ฌ 19
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot Learning Memory Vision-Language-Action Models
Humans rely on memory to perform tasks; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous task-relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we fine-tune Qwen2.5-VL-7B-Instruct and $\pi_{0.5}$ as the high-level and low-level policies respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at https://memer-policy.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3364. MVR: Multi-view Video Reward Shaping for Reinforcement Learning
๐ฌ 19
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Reward Shaping; Reinforcement Learning; VLM Reward
Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3365. Same Content, Different Representations: A Controlled Study for Table QA
๐ฌ 11
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Table Question Answering Semi-structured Table Structured Table
Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance.
We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality.
Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries.
Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3366. OVID: Open-Vocabulary Intrusion Detection
๐ฌ 27
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Open-Vocabulary Intrusion Detection Datasets Framework Strategy
Various vision intrusion detection models have achieved great success in many scenarios, e.g., autonomous driving, intelligent monitoring, and security. However, their reliance on pre-defined classes limits their applicability in open-world intrusion detection scenarios. To remedy this, we introduce the Open-Vocabulary Intrusion Detection (OVID) project for the first time. Specifically, we first develop a novel dataset for OVID, Cityintrusion-OpenV, with more diverse intrusion categories and corresponding text prompts. Then, we design a multi-modal, multi-task, and end-to-end open-vocabulary intrusion detection framework named OVIDNet. It achieves open-world intrusion detection by aligning visual features with language embeddings. Further, two simple yet effective strategies are proposed to improve the generalization and performance of this specific task: (1) A Multi-Distributed Noise Mixing strategy is introduced to enhance location information of unknown and unseen categories. (2) A Dynamic Memory-Gated module is designed to capture the contextual information under complex scenarios. Finally, comprehensive experiments and comparisons are conducted on multiple dominant datasets, e.g., COCO, Cityscape, Foggy-Cityscape, and Cityintrusion-OpenV. Besides, we also evaluate the universal applicability of our model in real scenarios. The results show that our method can outperform other classic and promising methods, and reach strong performance even under task-specific transfer and zero-shot settings, demonstrating its high practicality. All the source codes and datasets will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3367. Projected Coupled Diffusion for Test-Time Constrained Joint Generation
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ diffusion model constrained diffusion
Modifications to test-time sampling have emerged as an important extension to diffusion algorithms, with the goal of biasing the generative process to achieve a given objective without having to retrain the entire diffusion model. However, generating jointly correlated samples from multiple pre-trained diffusion models while simultaneously enforcing task-specific constraints without costly retraining has remained challenging. To this end, we propose Projected Coupled Diffusion (PCD), a novel test-time framework for constrained joint generation. PCD introduces a coupled guidance term into the generative dynamics to encourage coordination between diffusion models and incorporates a projection step at each diffusion step to enforce hard constraints. Empirically, we demonstrate the effectiveness of PCD in application scenarios of image-pair generation, object manipulation, and multi-robot motion planning. Our results show improved coupling effects and guaranteed constraint satisfaction without incurring excessive computational costs.
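The general pattern described (a coupled guidance term plus a per-step projection) can be sketched in a few lines: two samples are updated by their own denoisers, nudged toward each other by a coupling term, and projected onto the constraint set after every step. The placeholder denoisers, box constraint, and step sizes below are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    """Projection onto a simple box constraint (a stand-in for the
    task-specific hard constraint set)."""
    return np.clip(x, lo, hi)

def coupled_step(x1, x2, denoise1, denoise2, step=0.1, couple=0.5):
    """One illustrative test-time update: each sample follows its own
    pre-trained denoiser, a coupling term encourages the two samples to
    agree, and a projection enforces the hard constraint after every step."""
    g1 = denoise1(x1) + couple * (x2 - x1)   # coupling pulls x1 toward x2
    g2 = denoise2(x2) + couple * (x1 - x2)
    return project_box(x1 + step * g1), project_box(x2 + step * g2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=8), rng.normal(size=8)
    # Placeholder "denoisers" that drift toward different fixed points.
    d1 = lambda x: 0.5 - x
    d2 = lambda x: -0.5 - x
    for _ in range(50):
        x1, x2 = coupled_step(x1, x2, d1, d2)
    print("constraint satisfied:", bool(np.all(np.abs(x1) <= 1.0)))
```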
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3368. NGS-Marker: Robust Native Watermarking for 3D Gaussian Splatting
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting digital asset watermarking copyright protection
With the rapid development and adoption of 3D Gaussian Splatting (3DGS), the need for effective copyright protection has become increasingly critical. Existing watermarking techniques for 3DGS mainly focus on protecting rendered images via pre-trained decoders, leaving the underlying 3D Gaussian primitives vulnerable to misuse. In particular, they are ineffective against **Partial Infringement**, where an adversary extracts and reuses only a subset of Gaussians. In this paper, we propose **NGS-Marker**, a novel native watermarking framework for 3DGS. It integrates a jointly trained watermark injector and message decoder, and employs a gradient-based progressive injection strategy to ensure full-scene coverage. This enables robust ownership decoding from any local region. We further extend NGS-Marker with hybrid protection (combining native and indirect watermarks) and support for multimodal watermarking. Extensive experiments demonstrate that NGS-Marker effectively defends against partial infringement while offering practical flexibility for real-world deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3369. Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning
๐ฌ 26
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ MLLM multi-modal reasoning
Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. On the other hand, Multi-modal Large Language Models (MLLMs) still lag behind, hindered by their outdated internal LLMs. Upgrading these is often prohibitively expensive, as it requires complete vision-language alignment retraining. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role to convert multi-modal inputs into detailed textual outputs that can be processed by any powerful, external, text-only LLM reasoner. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM based on the correctness of answers generated by the external reasoner to produce faithful and query-relevant captions. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: Once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining. The implementation of our method is available at: https://anonymous.4open.science/r/RAPID2-80CD/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3370. Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
๐ฌ 21
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Speculative Decoding Test-Time Scaling
Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured and repetition-rich context remains unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods in LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to benefit both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
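To see why simple n-gram methods capture repetitive reasoning traces, consider this toy sketch: previously generated n-grams are stored in a table, and when the last n-1 tokens of the current context have been seen before, the stored continuation is replayed as a cheap draft for the target model to verify. The table layout and greedy replay are simplified assumptions, not any specific method from the benchmark.

```python
from collections import defaultdict

def build_ngram_table(tokens, n=3):
    """Map each (n-1)-token context seen so far to its observed continuations."""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context].append(tokens[i + n - 1])
    return table

def propose_draft(tokens, table, n=3, max_draft=4):
    """If the current context has been seen before, greedily replay stored
    continuations as a speculative draft; the target model would then
    verify (accept or reject) these tokens in a single forward pass."""
    draft = []
    context = tuple(tokens[-(n - 1):])
    while context in table and len(draft) < max_draft:
        nxt = table[context][-1]          # most recent continuation
        draft.append(nxt)
        context = context[1:] + (nxt,)
    return draft

if __name__ == "__main__":
    # Repetitive reasoning traces (common in test-time scaling) make the
    # table hit often, so long drafts get proposed.
    history = "let me check the sum again let me check".split()
    table = build_ngram_table(history)
    print(propose_draft(history, table))
```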
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3371. Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
๐ฌ 15
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Alignment Large Language Models Fine-tuning Attack
Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3372. Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts
๐ฌ 23
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Video-Text Retrieval; Test-time Adaptation
Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world *query shifts*, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity degrees. Analysis on this benchmark reveals that query shifts amplify the *hubness phenomenon*, where a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), as our baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a *Hubness Suppression Memory* to refine similarity scores, and *multi-granular losses* to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios, and enhancing model reliability for real-world applications.
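To make the hubness issue concrete, here is a generic querybank-normalization-style correction under our own simplifying assumptions: each gallery item's score is discounted by how strongly it attracts a bank of held-out queries, so chronic "hubs" lose their advantage. This illustrates the general idea of refining similarity scores with a memory of queries, not the paper's Hubness Suppression Memory.

```python
import numpy as np

def hubness_corrected_scores(sim: np.ndarray, query_bank_sim: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Penalize gallery items that act as hubs.

    sim:            (num_queries, num_gallery) similarities for the test queries.
    query_bank_sim: (num_bank, num_gallery) similarities from a bank of past queries.
    """
    hubness = query_bank_sim.mean(axis=0, keepdims=True)   # how attractive each gallery item is on average
    return sim - beta * hubness

rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 6))
bank = rng.normal(size=(32, 6))
bank[:, 2] += 2.0                         # gallery item 2 behaves like a hub for the bank queries
print(sim.argmax(axis=1), hubness_corrected_scores(sim, bank).argmax(axis=1))
```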
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3373. From atom to space: A region-based readout function for spatial properties of materials
๐ฌ 19
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ porous material graph neural network
The message passing-readout framework has become the de facto standard for material property prediction. However, most existing readout functions are built on an atom-decomposable inductive bias, i.e., the material-level property or feature can be reasonably assigned to contributions of individual atoms. This is a strong bias that may not hold for all properties, limiting the application scenarios. In this work, we propose a region-based decomposition perspective, reformulating material properties as integrals over space and pooling contributions from spatial regions rather than atoms. Specifically, we propose a novel readout function named SpatialRead. SpatialRead introduces additional spatial nodes to represent a voxelized space, transforming the atomic isomorphic graph into a heterogeneous atom-space graph with unidirectional message flow from atoms to spatial nodes. To combine the two types of inductive bias, multimodal methods can be used to fuse the features of atoms and the spatial nodes. Such a region-based readout function is especially suited for spatial properties such as gas adsorption capacity and separation ratio. Extensive experiments demonstrate that a simple PaiNN-Transformer-based SpatialRead trained from scratch outperforms state-of-the-art pre-trained foundation models on these special tasks. Our results highlight the importance of designing physically grounded readout functions tailored to the target property. The code can be found in the anonymous GitHub repository https://anonymous.4open.science/r/SpatialRead-8E92 and the dataset will be released after the double-blind review.
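The region-based inductive bias can be pictured in a few lines: atom features are scattered into a coarse voxel grid (playing the role of spatial nodes, with information flowing only from atoms to voxels), and the material-level readout pools over regions rather than atoms. This is a simplified illustration with made-up dimensions, not the SpatialRead architecture.

```python
import numpy as np

def region_readout(atom_pos: np.ndarray, atom_feat: np.ndarray, grid=(4, 4, 4)) -> np.ndarray:
    """Scatter per-atom features into voxels and pool per-region contributions.

    atom_pos:  (N, 3) fractional coordinates in [0, 1).
    atom_feat: (N, F) per-atom features from a message-passing encoder.
    """
    g = np.asarray(grid)
    idx = np.minimum((atom_pos * g).astype(int), g - 1)          # voxel index of each atom
    flat = np.ravel_multi_index((idx[:, 0], idx[:, 1], idx[:, 2]), grid)
    voxel_feat = np.zeros((int(np.prod(grid)), atom_feat.shape[1]))
    np.add.at(voxel_feat, flat, atom_feat)                       # atoms -> spatial nodes (one-way flow)
    return voxel_feat.mean(axis=0)                               # pool over regions, not atoms

rng = np.random.default_rng(0)
print(region_readout(rng.random((30, 3)), rng.normal(size=(30, 8))).shape)  # (8,)
```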
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3374. Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
๐ฌ 16
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Spatiotemporal Forecasting Extreme Heavy Rain Video Transformer
Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high-resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine-scale rainfall structures, and variability in forecasting horizons.
While recent diffusion-based generative ensembles show promising results, they are computationally expensive and unsuitable for real-time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed--either dominated by ordinary rainfall events or restricted to extreme rainfall episodes--thereby hindering general applicability in real-world settings.
In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture-preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state-of-the-art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3375. FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Federated fine-tuning low-rank Gram matrix Procrustes alignment
Parameter-efficient fine-tuning techniques such as Low-rank Adaptation (LoRA) enable large language models (LLMs) to adapt to downstream tasks efficiently. Federated learning (FL) further facilitates this process by enabling collaborative fine-tuning across distributed clients without sharing private data. However, the use of two separate low-rank matrices in LoRA for federated fine-tuning introduces two types of challenges. The first challenge arises from the error induced by separately aggregating those two low-rank matrices. The second challenge occurs even when the product of two low-rank matrices is aggregated. The server needs to recover factors via matrix decomposition, which is non-unique and can introduce decomposition drift. To tackle the aforementioned challenges, we propose FLoRG, a federated fine-tuning framework which employs a single low-rank matrix for fine-tuning and aggregates its Gram matrix (i.e., the matrix of inner products of its column vectors), eliminating the aggregation error while also reducing the communication overhead. FLoRG minimizes the decomposition drift by introducing a Procrustes alignment approach which aligns the decomposed matrix between consecutive fine-tuning rounds for consistent updates. We theoretically analyze the convergence of FLoRG and prove that adopting the Procrustes alignment results in a tighter convergence bound. Experimental results across multiple LLM fine-tuning benchmarks demonstrate that FLoRG outperforms four state-of-the-art baseline schemes in the downstream task accuracy and can reduce the communication overhead by up to 2041$\times$.
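The two ingredients named in the abstract can be sketched with dense linear algebra, under our own simplified assumptions: clients communicate only the small Gram matrix of their low-rank factor, the server averages the Grams and recovers a factor via a (non-unique) square root, and an orthogonal Procrustes step aligns the recovered factor with the previous round's factor to limit decomposition drift. Names and shapes are illustrative, not FLoRG's implementation.

```python
import numpy as np

def aggregate_gram(factors):
    """Average the clients' Gram matrices (inner products of each factor's columns)."""
    return np.mean([B.T @ B for B in factors], axis=0)           # each Gram is only (r, r)

def factor_from_gram(gram):
    """Return one matrix M with M.T @ M = gram; this square root is non-unique."""
    eigvals, eigvecs = np.linalg.eigh(gram)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

def procrustes_align(current, previous):
    """Apply the orthogonal R minimizing ||current @ R - previous||_F."""
    U, _, Vt = np.linalg.svd(current.T @ previous)
    return current @ (U @ Vt)

rng = np.random.default_rng(0)
clients = [rng.normal(size=(8, 4)) for _ in range(3)]            # toy per-client low-rank factors, r = 4
gram = aggregate_gram(clients)
recovered = factor_from_gram(gram)
prev_round = factor_from_gram(gram) @ np.linalg.qr(rng.normal(size=(4, 4)))[0]   # same Gram, rotated
aligned = procrustes_align(recovered, prev_round)
print(np.linalg.norm(recovered - prev_round).round(3),
      np.linalg.norm(aligned - prev_round).round(3))             # alignment removes the drift
```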
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3376. MLP Memory: A Retriever-Pretrained Memory for Large Language Models
๐ฌ 5
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ external memory parametric memory
Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a $k$NN retriever's behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability interpolation, achieving 12.3\% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval. Moreover, MLP Memory delivers 2.5$\times$ faster inference than RAG with superior accuracy. Our findings show that learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.
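The probability interpolation mentioned above follows the familiar kNN-LM recipe: the memory module's next-token distribution is mixed with the decoder's by a scalar weight. Below is a minimal sketch with toy distributions; the weight `lam` and the shapes are assumptions for illustration.

```python
import numpy as np

def interpolate_next_token_probs(p_lm: np.ndarray, p_memory: np.ndarray, lam: float = 0.25) -> np.ndarray:
    """Mix the decoder's distribution with the memory module's distribution."""
    mixed = (1.0 - lam) * p_lm + lam * p_memory
    return mixed / mixed.sum(axis=-1, keepdims=True)   # guard against numerical drift

p_lm = np.array([0.10, 0.60, 0.10, 0.10, 0.10])        # decoder alone
p_memory = np.array([0.05, 0.05, 0.80, 0.05, 0.05])    # memory recalls a factual token
print(interpolate_next_token_probs(p_lm, p_memory))
```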
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3377. Pseudo-Non-Linear Data Augmentation: A Constrained Energy Minimization Viewpoint
๐ฌ 23
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ data augmentation information geometry energy-based model
We propose a simple yet novel data augmentation method for general data modalities based on energy-based modeling and principles from information geometry. Unlike most existing learning-based data augmentation methods, which rely on learning latent representations with generative models, our proposed framework enables an intuitive construction of a geometrically aware latent space that represents the structure of the data itself, supporting efficient and explicit encoding and decoding procedures. We then present and discuss how to design latent spaces that will subsequently control the augmentation with the proposed algorithm. Empirical results demonstrate that our data augmentation method achieves competitive performance in downstream tasks compared to other baselines, while offering fine-grained controllability that is lacking in the existing literature.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3378. Sharing State Between Prompts and Programs
๐ฌ 21
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ natural language programming large language model programming languages
The rise of large language models (LLMs) has introduced a new type of program: natural language programs. By writing prompts that direct LLMs to perform natural language processing, code generation, reasoning, and so on, LLM users are writing code in natural language for the LLM to execute.
An emerging trend of active research enables interoperability between natural language and formal languages such as Python.
We present a novel programming abstraction: Shared program state.
Shared program state removes the work of enabling interoperability between natural code---natural language prompts---and the program state from the programmer.
With shared program state, programmers can write natural code that directly writes program variables, computes with objects, and implements control flow in the program.
We present a template for designing natural function interfaces to extend system support for natural code.
We specified shared program state as a natural function interface and implemented it as the Nightjar programming system.
Nightjar enables programmers to write executable Python programs containing natural code using the shared program state abstraction.
Our evaluation shows that programs with shared program state achieve the same program accuracy as implementations written manually by programmers (66-75%), while decreasing the lines of code by 23.8% to 82.1%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3379. BigMac3D: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ surface tracking action recognition benchmark animals
The recognition of dynamic and social behavior in animals is fundamental for advancing several areas of the life sciences, including ethology, ecology, medicine and neuroscience. Recent progress in deep learning has enabled an automated recognition of such behavior from video data. However, an accurate reconstruction of the three-dimensional (3D) pose and shape has not been integrated into this process. Especially for non-human primates, the animals phylogenetically closest to humans, mesh-based tracking efforts lag behind those for other species, leaving pose descriptions restricted to sparse keypoints that are unable to fully capture the richness of action dynamics. To address this gap, we introduce the $\textit{Big Macaque 3D Motion and Animation Dataset}$ ($\texttt{BigMac3D}$), a large-scale dataset comprising more than 750 scenes of interacting rhesus macaques with detailed 3D pose descriptions of skeletal joint rotations. Recordings were obtained from 16 calibrated cameras and paired with action labels derived from a curated ethogram. Extending previous surface-based animal tracking methods, we construct subject-specific textured avatars by adapting a high-quality macaque template mesh to individual monkeys. This allows us to provide pose descriptions that are more accurate than previous state-of-the-art surface-based animal tracking methods. From the original dataset, we derive BigMac500, an action recognition benchmark that links surface-based pose vectors to single frames across multiple individual monkeys. By pairing features extracted from established image and video encoders with and without our pose descriptors, we demonstrate substantial improvements in mean average precision (mAP) when pose information is included. With these contributions, $\texttt{BigMac3D}$ establishes the first dataset that both integrates dynamic 3D pose-shape representations into the learning task of animal action recognition and provides a rich resource to advance the study of visual appearance, posture, and social interaction in non-human primates.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3380. Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ visual programming spatial reasoning tool abstraction
The composition of specialized tools offers a powerful approach for complex visual reasoning, particularly for tasks involving 3D spatial understanding. However, existing visual programming methods are often constrained by fixed toolsets or offline tool induction, which leads to suboptimal solutions and poor tool reuse. We introduce Transductive Visual Programming (TVP), a novel framework that dynamically evolves a library of reusable tools by learning from its problem-solving experience.
TVP abstracts recurring solution patterns into new, higher-level tools, which are then used to construct simpler and more effective programs for new tasks. On the challenging Omni3D-Bench, TVP establishes a new state of the art, outperforming both specialized vision-language models and prior visual programming systems. The evolved tools also exhibit strong generalization to out-of-domain queries on 3DSRBench, SpatialSense, and VGBench. Our work demonstrates that transductive tool evolution is a powerful and generalizable paradigm for building robust visual reasoning systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3381. ContextNav: Towards Agentic Multimodal In-Context Learning
๐ฌ 30
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ agent system in-context learning
Recent advances demonstrate that multimodal large language models (MLLMs) exhibit strong multimodal in-context learning (ICL) capabilities, enabling them to adapt to novel vision-language tasks from a few contextual examples. However, existing ICL approaches face challenges in reconciling generalization with robustness across diverse tasks and noisy contextual examples: manually selecting examples produces clean contexts but is labor-intensive and task-specific, while similarity-based retrieval improves scalability but could introduce irrelevant or structurally inconsistent samples that degrade ICL performance. To address these limitations, we propose ContextNav, the first agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation, enabling noise-robust and dynamically optimized contextualization for multimodal ICL. ContextNav unifies context management and noise-robust contextualization within a closed-loop workflow driven by graph-based tool orchestration. Specifically, it builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. An Operational Grammar Graph (OGG) further supports adaptive toolchain planning and optimization, enabling the agent to refine its strategies based on downstream feedback. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets, underscoring the promise of agentic workflows for advancing scalable and robust contextualization in multimodal ICL. The code will be made publicly available on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3382. VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ visual-spatial reasoning sparse subspace clustering
Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language alignment, yet they remain limited in visual-spatial reasoning. We first identify that this limitation arises from the attention mechanism: visual tokens are overshadowed by language tokens, preventing the model from consistently recognizing the same visual cues across frames. To address this challenge, we draw a novel connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers. Building on this insight, we propose VideoAnchor, a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining, effectively anchoring attention to shared visual structures. Extensive experiments across benchmarks and backbone models show consistent performance gains, e.g., 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks) with InternVL2-8B and Qwen2.5VL-72B, while qualitative analyses demonstrate more coherent subspace partitions and stronger visual grounding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3383. PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video LLM Prompt-guided Pooling PPLLaVA
The past year has witnessed the significant advancement of video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods customized for long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: the CLIP-based visual-prompt alignment that extracts visual information relevant to the user's instructions, the prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and the clip context extension designed for the lengthy prompts common in visual dialogue. Extensive experiments have validated the performance of our model. With superior throughput, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3384. The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning
๐ฌ 17
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning non-stationary data distributions plasticity loss
Deep reinforcement learning (RL) suffers severely from plasticity loss due to its inherently non-stationary nature, which impairs the ability to adapt to new data and learn continually. Unfortunately, our understanding of how plasticity loss arises, dissipates, and can be resolved remains limited to empirical findings, leaving the theoretical side underexplored. To address this gap, we study the plasticity loss problem from the theoretical perspective of network optimization. By formally characterizing the two culprit factors in the online RL process, the non-stationarity of data distributions and the non-stationarity of targets induced by bootstrapping, our theory attributes the loss of plasticity to two mechanisms: the rank collapse of the Neural Tangent Kernel (NTK) Gram matrix and the $\Theta(1/k)$ decay of gradient magnitude. The first mechanism echoes prior empirical findings from the theoretical perspective and sheds light on the effects of existing methods, e.g., network reset, neuron recycling, and noise injection. Against this backdrop, we focus primarily on the second mechanism and aim to alleviate plasticity loss by addressing the gradient attenuation issue, which is orthogonal to existing methods. We propose Sample Weight Decay (SWD) --- a lightweight method to restore gradient magnitude, as a general remedy to plasticity loss for deep RL methods based on experience replay. In experiments, we evaluate the efficacy of SWD upon TD3 and SAC with the SimBa architecture on MuJoCo and DeepMind Control Suite tasks. The results demonstrate that SWD effectively alleviates plasticity loss and consistently improves learning performance across various configurations of deep RL algorithms, UTD ratios, network architectures, and environments, achieving SOTA performance on challenging DMC Humanoid tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3385. SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
๐ฌ 17
โญ 5.00
๐ generative models
๐ท๏ธ ambient diffusion diffusion models generative modeling density deconvolution
In many real-world scenarios, obtaining fully observed samples is prohibitively expensive or even infeasible, while partial and noisy observations are comparatively easy to collect. In this work, we study distribution restoration with abundant noisy samples, assuming the corruption process is available as a black-box generator. We show that this task can be formulated as a one-sided entropic optimal transport problem and solved via an alternative minimization algorithm. We further provide a test criterion to determine whether the true underlying distribution is recoverable under per-sample information loss, and show that in otherwise unrecoverable cases, a small number of clean samples can render the distribution largely recoverable. Building on these insights, we introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to the ground-truth distribution. Our method generalizes Stochastic Forward-Backward Deconvolution (SFBD; Lu et al., 2025) to handle arbitrary measurement models beyond Gaussian corruption. Empirical studies validate our theory and demonstrate that SFBD-OMNI substantially improves recovery in the challenging regime of non-identifiable corruption processes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3386. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
๐ฌ 13
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLMs JailBreak
Large language models remain vulnerable to jailbreak attacks, inputs crafted to bypass safety mechanisms and elicit harmful responses, despite advances in alignment and instruction tuning. Existing attacks often rely on prompt rewrites, dense optimization, or ad hoc heuristics, and lack interpretability and robustness. We propose **Head-Masked Nullspace Steering (HMNS)**, a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. This geometry-aware intervention preserves fluency while steering the model toward completions that differ from baseline routing. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.
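The nullspace-constrained injection step reduces to basic linear algebra: given an orthonormal basis U for the subspace written by the muted heads, the steering vector is projected onto the orthogonal complement so it cannot re-excite that subspace. The dimensions and variable names below are illustrative, not the HMNS code.

```python
import numpy as np

def project_to_orthogonal_complement(v: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Remove from v any component lying in the column space of U (assumed orthonormal)."""
    return v - U @ (U.T @ v)

rng = np.random.default_rng(0)
d, k = 16, 3
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # basis of the muted heads' write subspace
steer = rng.normal(size=d)
steer_perp = project_to_orthogonal_complement(steer, U)
print(np.abs(U.T @ steer_perp).max())          # ~0: the perturbation avoids the muted subspace
```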
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3387. Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ AI slop slop constrained generation delve patterns sampling dpo preference optimization fine-tuning fine tuning creativity AI writing Creative AI
Repetitive lexical patterns in LLM output, termed "slop," degrade writing quality through over-use and make AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) the Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying the vocabulary; (2) an automated pipeline that profiles model-specific slop against human baselines and generates training data; and (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates in logit-space on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000 times more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results datasets under MIT license.
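The backtracking idea can be illustrated with a toy character-level greedy decoder: whenever the newly generated suffix completes a banned string, the decoder rewinds to where that string began and bans the offending token at that position before continuing. The stand-in scoring function below is a made-up toy, not the released sampler.

```python
def antislop_greedy_decode(next_char_scores, banned, max_len=24):
    """Character-level greedy decoding that backtracks to suppress banned substrings."""
    out, blocked = [], {}                        # blocked: position -> characters banned there
    while len(out) < max_len:
        pos = len(out)
        scores = {c: s for c, s in next_char_scores(out).items()
                  if c not in blocked.get(pos, set())}
        if not scores:
            break
        out.append(max(scores, key=scores.get))
        text = "".join(out)
        for b in banned:
            if text.endswith(b):                 # a banned string was just completed
                start = len(text) - len(b)
                blocked.setdefault(start, set()).add(out[start])
                del out[start:]                  # rewind and retry with that token banned
                break
    return "".join(out)

def toy_scores(prefix):
    """Stand-in 'model' that strongly prefers to continue a slop-heavy phrase."""
    target = "we delve into the data"
    scores = {c: 0.1 for c in "abcdefghijklmnopqrstuvwxyz "}
    if len(prefix) < len(target):
        scores[target[len(prefix)]] = 1.0
    return scores

print(antislop_greedy_decode(toy_scores, banned={"delve"}))   # the output avoids the banned word
```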
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3388. CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
๐ฌ 23
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human Motion Synthesis Hand motion synthesis LLM Motion in-the-wild
Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce "3D Hands in the Wild" (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3389. LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
๐ฌ 25
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multimodal Large Language Models Energy-latency Attack
Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose \textbf{LingoLoop}, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a \textbf{POS-Aware Delay Mechanism} to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a \textbf{Generative Path Pruning Mechanism} that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments on models like Qwen2.5-VL-3B demonstrate LingoLoop's powerful ability to trap them in generative loops; it consistently drives them to their generation limits and, when those limits are relaxed, can induce outputs with up to \textbf{367$\times$} more tokens than clean inputs, triggering a commensurate surge in energy consumption. These findings expose significant MLLMs' vulnerabilities, posing challenges for their reliable deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3390. Hallucination-aware Intermediate Representation Editing in Large Vision-Language Models
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ mitigating hallucination feature editing LVLMs
Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Our code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3391. Process-Verified Reinforcement Learning for Theorem Proving via Lean
๐ฌ 17
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Formal Reasoning Large Language Models Theorem Proving with LLMs Lean4
While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound.
In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages.
Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
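One way to picture the tactic-level credit, under our own simplifying assumptions: tactics before the first Lean-reported failure receive a small process reward, the failing tactic and everything after it receive none, and an outcome bonus is spread over the trajectory only when the proof closes. The weights are illustrative, not the paper's exact objective.

```python
from typing import List

def tactic_rewards(tactic_ok: List[bool], proof_closed: bool,
                   process_w: float = 0.1, outcome_w: float = 1.0) -> List[float]:
    """Dense per-tactic rewards with first-error propagation from verifier feedback."""
    first_fail = next((i for i, ok in enumerate(tactic_ok) if not ok), len(tactic_ok))
    rewards = []
    for i in range(len(tactic_ok)):
        r = process_w if i < first_fail else 0.0       # credit only locally sound steps
        if proof_closed:
            r += outcome_w / len(tactic_ok)            # outcome reward spread over the proof
        rewards.append(r)
    return rewards

print(tactic_rewards([True, True, False, True], proof_closed=False))  # [0.1, 0.1, 0.0, 0.0]
print(tactic_rewards([True, True, True], proof_closed=True))          # ~[0.43, 0.43, 0.43]
```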
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3392. WebArbiter: A Generative Reasoning Process Reward Model for Web Agents
๐ฌ 4
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Process Reward Modeling Multi-step Reasoning Autonomous Agent WebAgent Generative Reasoning
Web agents hold great potential for automating complex computer tasks, yet their interactions involve long horizons, multi-step decisions, and actions that can be irreversible. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, Gemini Flash, by 10.9%. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2%, underscoring its robustness and practical value in real-world complex web tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3393. MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure
๐ฌ 26
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ PEFT LLM LoRA
Low-Rank Adaptation (LoRA) is a widely adopted technique for parameter-efficient fine-tuning, but its slow convergence has spurred the development of numerous variants. Nevertheless, current approaches struggle to achieve simultaneous improvements in performance, memory footprint, and computational efficiency. To address this challenge, we revisit the causes of LoRA's slow convergence and, based on these insights, propose \textbf{M}atr\textbf{i}x \textbf{S}hard \textbf{S}haring (MiSS) that shards the original weight matrix and updates by sharing a single trainable matrix $\boldsymbol{D}$ initialized to zero. To simultaneously ensure computational efficiency, low memory footprint, and scalable serving, we introduce MiSS$^e$. Through theoretical analyses and empirical results, our method reduces optimization complexity while maintaining strong performance, striking a favorable balance between performance, memory, and efficiency. Furthermore, we provide a comprehensive analysis of different PEFT methods with respect to memory usage, initialization time, and computational efficiency. By mapping the Pareto frontier, we show that MiSS achieves a favorable balance across these dimensions, integrating the strengths of prior approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3394. G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Scene Reconstruction Sparse View Reconstruction Generative Prior
Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry.
In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction.
We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions.
Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion.
Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions.
Moreover, our method naturally supports single-view inputs and unposed videos, generalizing well to both indoor and outdoor scenarios with practical real-world applicability. See more results at https://g4splat.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3395. Hierarchical Multi-Scale Molecular Conformer Generation with Structural Awareness
๐ฌ 15
โญ 5.00
๐ generative models
๐ท๏ธ Molecular conformer generation Generative models
Molecular conformer generation is a fundamental task for drug discovery and material design. Although deep generative models have progressed in this area, existing methods often overlook the hierarchical structural organization inherent to molecules, leading to poor-quality generated conformers. To address this challenge, we demonstrate that capturing the spatial arrangement of key substructures, such as scaffolds, is essential, as they serve as anchors that define the overall molecular distribution. In this paper, we propose a hierarchical multi-scale molecular conformer generation framework (MSGEN), designed to enhance key substructure awareness by leveraging spatially informed guidance. Our framework initiates the generation process from coarse-grained key substructures, progressively refining the conformer by utilizing these coarser-scale structures as conditional guidance for subsequent finer-scale stages. To bridge scale discrepancies between stages, we introduce a molecular upsampling technique that aligns the structural scales, ensuring smooth propagation of geometric guidance. Extensive experiments on standard benchmarks demonstrate that our framework integrates seamlessly with a wide range of existing molecular generative models and consistently generates more stable and chemically plausible molecular conformers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3396. Multi-Resolution Score-Based Variational Graphical Diffusion for Causal Inference on Latent Systems
๐ฌ 36
โญ 5.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Variational causal inference Probabilistic graphical models Multi-resolution observations Score-based diffusion models
Complex causal systems with interdependent variables require inference from heterogeneous observations that vary in spatial resolution, temporal frequency, and noise characteristics due to data acquisition constraints. Existing multi-modal fusion approaches assume uniform data quality or complete observability -- assumptions often violated in real-world applications. Current methods face three limitations: they treat causally-related variables independently, failing to exploit causal relationships; they cannot integrate multi-resolution observations effectively; and they lack theoretical frameworks for cascaded approximation errors. We introduce the Score-based Variational Graphical Diffusion Model (SVGDM), which integrates score-based diffusion within causal graphical structures for inference under heterogeneous incomplete observations. SVGDM introduces causal score decomposition enabling information propagation across causally-connected variables while preserving original observation characteristics. Diffusion provides a natural way to model scale-dependent sensing noise, which is common in remote-sensing, climate, and physical measurement systems, while the causal graph encodes well-established mechanistic dependencies between latent processes. We provide theoretical analysis and demonstrate superior performance on both synthetic and real-world datasets compared to relevant baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3397. MoMa: A Simple Modular Learning Framework for Material Property Prediction
๐ฌ 4
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ material property prediction modular deep learning AI4Materials
Deep learning methods for material property prediction have been widely explored to advance materials discovery. However, the prevailing pre-train paradigm often fails to address the inherent diversity and disparity of material tasks. To overcome these challenges, we introduce MoMa, a simple Modular framework for Materials that first trains specialized modules across a wide range of tasks and then adaptively composes synergistic modules tailored to each downstream scenario. Evaluation across 17 datasets demonstrates the superiority of MoMa, with a substantial 14% average improvement over the strongest baseline. Few-shot and module scaling experiments further highlight MoMa's potential for real-world applications. Pioneering a new paradigm of modular material learning, MoMa will be open-sourced to foster broader community collaboration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3398. Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
๐ฌ 23
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Off-policy RL; LLM; Reasoning
Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a *prosperity-before-collapse* phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Proxy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.006% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six model scales (1.7B-32B) and eight reasoning benchmarks shows that M2PO delivers stable off-policy training even with data stale by *at least 256 model updates* and matches on-policy performance.
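The second-moment constraint can be illustrated as follows: compute token-level importance ratios between the current policy and the stale behavior policy, then mask only the largest ratios until the mean squared ratio of the remaining tokens falls below a threshold, so ordinary off-policy tokens keep contributing. The threshold and greedy masking rule are our own illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def second_moment_mask(logp_new: np.ndarray, logp_old: np.ndarray, tau: float = 4.0) -> np.ndarray:
    """Keep-mask so the second moment of token importance weights stays below tau."""
    ratios = np.exp(logp_new - logp_old)
    keep = np.ones_like(ratios, dtype=bool)
    for idx in np.argsort(ratios)[::-1]:               # drop extreme outliers first
        if (ratios[keep] ** 2).mean() <= tau or keep.sum() == 1:
            break
        keep[idx] = False
    return keep

rng = np.random.default_rng(0)
logp_old = rng.normal(size=256)
logp_new = logp_old + rng.normal(scale=0.3, size=256)
logp_new[:3] += 3.0                                    # a few highly off-policy tokens
mask = second_moment_mask(logp_new, logp_old)
print(int(mask.sum()), "of", mask.size, "tokens kept")
```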
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3399. Loc$^{2}$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Cross-view localization ground-to-aerial image matching visual localization computer vision
We propose an accurate and interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom (DoF) pose of a ground-level image by matching its local features with a reference aerial image. Unlike prior approaches that rely on global descriptors or bird's-eye-view (BEV) transformations, our method directly learns ground-aerial image-plane correspondences using weak supervision from camera poses. The matched ground points are lifted into BEV space with monocular depth predictions, and scale-aware Procrustes alignment is then applied to estimate camera rotation, translation, and optionally the scale between relative depth and the aerial metric space. This formulation is lightweight, end-to-end trainable, and requires no pixel-level annotations. Experiments show state-of-the-art accuracy in challenging scenarios such as cross-area testing and unknown orientation. Furthermore, our method offers strong interpretability: correspondence quality directly reflects localization accuracy and enables outlier rejection via RANSAC, while overlaying the re-scaled ground layout on the aerial image provides an intuitive visual cue of localization accuracy.
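The scale-aware Procrustes step is a classical closed-form (Umeyama-style) similarity fit between matched point sets; the sketch below recovers rotation, translation, and scale for 2-D BEV points. It is a generic implementation of that textbook step, not the paper's code.

```python
import numpy as np

def scale_aware_procrustes(src: np.ndarray, dst: np.ndarray):
    """Closed-form fit of s, R, t minimizing ||s * R @ src_i + t - dst_i||^2 (src, dst: (N, d))."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(cov.shape[0])
    if np.linalg.det(U @ Vt) < 0:                  # keep a proper rotation (no reflection)
        D[-1, -1] = -1.0
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / (xs ** 2).sum(axis=1).mean()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

# Synthetic check: recover a known 2-D similarity transform.
rng = np.random.default_rng(0)
theta, s_true, t_true = 0.4, 1.7, np.array([2.0, -1.0])
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
pts = rng.normal(size=(50, 2))
obs = s_true * pts @ R_true.T + t_true
s, R, t = scale_aware_procrustes(pts, obs)
print(round(s, 3), np.round(t, 3))                 # ~1.7 and ~[2, -1]
```

In the setting described above, RANSAC over the matched correspondences would wrap a fit like this to reject outliers.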
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3400. MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
๐ฌ 41
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Unify Model; Multi-Modal Language Model; Benchmark
Unified Multimodal Large Language Models (U-MLLMs) have garnered considerable interest for their ability to seamlessly integrate generation and comprehension tasks. However, existing research lacks a unified evaluation standard, often relying on isolated benchmarks to assess these capabilities. Moreover, current work highlights the potential of "mixed-modality generation capabilities" through case studies, such as generating auxiliary lines in images to solve geometric problems, or reasoning through a problem before generating a corresponding image. Despite this, there is no standardized benchmark to assess models on such unified tasks. To address this gap, we introduce MME-Unify, also termed as MME-U, the first open and reproducible benchmark designed to evaluate multimodal comprehension, generation, and mixed-modality generation capabilities. For comprehension and generation tasks, we curate a diverse set of tasks from 12 datasets, aligning their formats and metrics to develop a standardized evaluation framework. For unified tasks, we design five subtasks to rigorously assess how models' understanding and generation capabilities can mutually enhance each other. Evaluation of 12 U-MLLMs, including Janus-Pro, EMU3, and Gemini2-Flash, reveals significant room for improvement, particularly in areas such as instruction following and image generation quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3401. Rethinking Causal Mask Attention for Vision-Language Inference
๐ฌ 21
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language Inference Causal Mask Attention
Causal attention has become a foundational mechanism in autoregressive Vision-Language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference.
In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting.
We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies.
We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3402. Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reinforced sparse attention token sparsity
Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model's inherent sparsity and thus plateau at moderate budgets (about 50\% token reduction), with little headroom to push the budget lower without hurting accuracy.
Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets.
In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named $\textit{Sparsity Forcing}$.
Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards.
By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective.
Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20\% to 75\% with minimal accuracy decline,
significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
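The joint reward described above can be sketched as a group-relative (GRPO-style) advantage computation: each rollout is scored by answer correctness plus a bonus for the tokens it saved, and advantages are the group-normalized scores. The weighting and normalization below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sparsity_forcing_advantages(correct, kept_ratio, alpha=0.5):
    """Group-relative advantages from joint accuracy/efficiency rewards.

    correct:    0/1 answer-correctness indicators for rollouts of one query.
    kept_ratio: fraction of tokens each rollout kept (lower = more efficient).
    """
    correct = np.asarray(correct, dtype=float)
    kept_ratio = np.asarray(kept_ratio, dtype=float)
    rewards = correct + alpha * (1.0 - kept_ratio)          # reward correctness and token saving
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four rollouts of the same query under different token budgets.
adv = sparsity_forcing_advantages(correct=[1, 1, 0, 1], kept_ratio=[0.8, 0.3, 0.2, 0.5])
print(np.round(adv, 2))   # the most token-frugal correct rollout gets the largest advantage
```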
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3403. ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code
๐ฌ 29
โญ 5.00
๐ generative models
๐ท๏ธ Virtual Machine Protection
Large language models (LLMs) have achieved remarkable progress in code generation, yet their potential for software protection remains largely untapped. Reverse engineering continues to threaten software security, while traditional virtual machine protection (VMP) relies on rigid, rule-based transformations that are costly to design and vulnerable to automated analysis. In this work, we present ShieldedCode, the first protection-aware framework that learns robust representations of VMP-protected code. Our approach builds large-scale paired datasets of source code and normalized VM implementations, and introduces hierarchical dependency modeling at intra-, preceding-, and inter-instruction levels. We jointly optimize language modeling with functionality-aware and protection-aware contrastive objectives to capture both semantic equivalence and protection strength. To further assess resilience, we propose a protection effectiveness optimization task that quantifies and ranks different VM variants derived from the same source. Coupled with a two-stage continual pre-training and fine-tuning pipeline, our method enables models to generate, compare, and reason over protected code. Extensive experiments show that our framework significantly improves robustness across diverse protection levels, opening a new research direction for learning-based software defense. ShieldedCode achieves 26.95% Pass@1 on L0 VM code generation, compared to 22.58% for GPT-4o, and improves binary similarity detection Recall@1 by 10\% over state-of-the-art methods such as jTrans.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3404. CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing
๐ฌ 19
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Multi-modal Large Language Model Socioeconomic Prediction Urban Imagery Urban Science Benchmark
Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of Large Vision-Language Models (LVLMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize 3 evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. These make CityLens the most extensive socioeconomic benchmark to date in terms of geographic coverage, indicator diversity, and model scale. Our results reveal that while LVLMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LVLMs to understand and predict urban socioeconomic patterns.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3405. Tucker-FNO: Tensor Tucker-Fourier Neural Operator and its Universal Approximation Theory
๐ฌ 42
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Neural Operator Implicit Neural Representation Functional Tensor Decomposition
Fourier neural operator (FNO) has demonstrated substantial potential in learning mappings between function spaces, such as numerical partial differential equations (PDEs). However, FNO may suffer from inefficiencies when applied to large-scale, high-dimensional function spaces due to the computational overhead associated with high-dimensional Fourier and convolution operators. In this work, we introduce the Tucker-FNO, an efficient neural operator that decomposes the high-dimensional FNO into a series of 1-dimensional FNOs through Tucker decomposition, thereby significantly reducing computational complexity while maintaining expressiveness. Especially, by using the theoretical tools of functional decomposition in Sobolev space, we rigorously establish the universal approximation theorem of Tucker-FNO. Experiments on high-dimensional numerical PDEs such as Navier-Stokes, Plasticity, and Burgers' equations show that Tucker-FNO achieves substantial improvement in execution time and performance over FNO. Moreover, by virtue of the compact Tucker decomposition, Tucker-FNO generalizes seamlessly to high-dimensional visual signals by learning mappings from the positional encoding space to the signal's implicit neural representations (INRs). Under this operator INR framework, Tucker-FNO gains consistent improvements on continuous signal restoration over traditional INR methods in terms of efficiency and accuracy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3406. Language Model Planning from an Information Theoretic Perspective
๐ฌ 25
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ language model LLM deep learning planning explainability interpretability information theory
The extent to which decoder-only language models (LMs) engage in planning, that is, organizing intermediate computations to support coherent long-range generation, remains an open and important question, with implications for interpretability, reliability, and principled model design. Planning involves structuring computations over long horizons, considering multiple possible continuations, and selectively reusing past information, but how effectively transformer-based LMs realize these capabilities is still unclear. We address these questions by analyzing the hidden states at the core of transformer computations, which capture intermediate results and act as carriers of information. Since these hidden representations are often redundant and burdened with fine-grained details, we develop a pipeline based on vector-quantized variational autoencoders that compresses them into compact summary codes. These codes enable measuring mutual information, allowing systematic analysis of the computational structure underlying model behavior. Using this framework, we study planning in LMs across synthetic grammar, path-finding tasks, and natural language datasets, focusing on three key aspects: (i) the planning horizon of pre-output computations, (ii) the extent to which the model considers alternative valid continuations, and (iii) the reliance of new predictions on earlier computations. By answering these questions, we advance the understanding of how planning is realized in LMs and contribute a general-purpose pipeline for probing the internal dynamics of LMs and deep learning systems. Our results reveal that the effective planning horizon is task-dependent, that models implicitly preserve information about unused correct continuations, and that predictions draw most on recent computations, though earlier blocks remain informative.
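Once hidden states are compressed into discrete summary codes, the mutual-information analysis the abstract describes reduces to counting co-occurrences of codes. The following sketch shows a plug-in MI estimate between two aligned code sequences; it is a generic estimator under stated assumptions, not the paper's pipeline.

```python
# Hedged sketch: plug-in mutual information (in nats) between two aligned sequences
# of discrete VQ codes, estimated from empirical co-occurrence counts.
# Purely illustrative; the paper's VQ-VAE compression step is assumed to have run already.
import numpy as np

def mutual_information(codes_a: np.ndarray, codes_b: np.ndarray) -> float:
    assert codes_a.shape == codes_b.shape
    joint = np.zeros((codes_a.max() + 1, codes_b.max() + 1))
    for a, b in zip(codes_a, codes_b):
        joint[a, b] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

# e.g., MI between the code at layer l, position t and the code of the token emitted
# k steps later quantifies how far ahead pre-output computations "plan".
```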
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3407. CDBridge: A Cross-omics Post-training Bridge Strategy for Context-aware Biological Modeling
๐ฌ 18
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ AI4S Cross-omics Central Dogma modeling Foundation models
Linking genomic DNA to quantitative, context-specific expression remains a central challenge in computational biology. Current foundation models capture either tissue context or sequence features, but not both. Cross-omics systems, in turn, often overlook critical mechanisms such as alternative splicing and isoform reuse. We present CDBridge, a post-training strategy that unifies pretrained DNA and protein models into a context-aware framework without full retraining. CDBridge operates in two stages: (a) Seq-context learning, where a splicing-inspired token merge compresses long genomic regions into isoform-aware representations, and (b) Env-context learning, where a conditional decoder injects tissue embeddings to model expression under diverse biological contexts. To benchmark this setting, we introduce GTEx-Benchmark, derived from GTEx and Ensembl, which requires models to capture long-range exon dependencies, resolve isoform reuse, and predict tissue-specific expression levels. Across qualitative and quantitative tasks, CDBridge consistently outperforms prior methods that ignore central dogma constraints or context dependence, offering a scalable and biologically faithful solution for DNA-to-expression modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3408. DeepPrim: a Physics-Driven 3D Short-term Weather Forecaster via Primitive Equation Learning
๐ฌ 17
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Weather forecasting Physics-informed neural networks Primitive equations Earth atmospheric dynamics Deep learning.
Solving primitive equations is essential for accurate weather forecasting. However, traditional numerical weather prediction (NWP) methods often incorporate various simplifications that limit their effectiveness in parameterizing unresolved physical processes. Meanwhile, existing deep learning-based models mostly focus on pure data-driven paradigms, overlooking the fundamental physical principles that govern atmospheric dynamics. To address these challenges, we present DeepPrim, a novel 3D \underline{deep} weather forecaster designed to learn \underline{prim}itive equations of the Earthโs atmosphere. Specifically, DeepPrim aims at accurately modeling 3D atmospheric motion through Navier-Stokes equation in pressure coordinates, and effectively capturing the interactions between the solved advection and key weather variables (e.g., temperature and water vapor) through corresponding equations. By seamlessly integrating fundamental atmospheric physics with advanced data-driven techniques, our model effectively approximates complicated physical processes without relying on empirical simplifications. Experimentally, DeepPrim achieves impressive performance in both short-term global and regional weather forecasting tasks, and exhibits the superior capacity to capture 3D atmospheric dynamics. The code is available at https://anonymous.4open.science/r/DeepPrim.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3409. SLM-MUX: Orchestrating Small Language Models for Reasoning
๐ฌ 28
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model small language models
With the rapid development of language models, the number of small language models (SLMs) has grown significantly. Although they do not achieve state-of-the-art accuracy, they are more efficient and often excel at specific tasks. This raises a natural question: can multiple SLMs be orchestrated into a system where each contributes effectively, achieving higher accuracy than any individual model? Existing orchestration methods have primarily targeted frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To address this gap, we propose a three-stage approach for orchestrating SLMs. First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs. Building on this, we develop two optimization strategies: (i) a model selection search that identifies the most complementary SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Our approach delivers strong results: Compared to existing orchestration methods, our approach achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMs, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH. We further provide theoretical analyses to substantiate the advantages of our method. Additional experiments show that the core principle of SLM-MUX extends to open-ended generation tasks (e.g., HumanEval) and benefits other model classes, including frontier LLMs and domain-specific fine-tuned SLMs. In summary, we demonstrate that SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach.
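As a rough illustration of the orchestration idea (several small models answering the same prompt, with disagreement resolved by how consistently each model repeats its own answer), here is a toy multiplexer. It is an assumption-laden sketch of the general concept, not the SLM-MUX architecture or its selection search.

```python
# Hedged sketch of multi-SLM orchestration: sample answers from several small models
# and resolve disagreement via each model's self-consistency, breaking ties toward
# answers proposed by more models. Illustrative only; not the authors' SLM-MUX.
from collections import Counter

def slm_mux(models, prompt, n_samples=4):
    votes = {}
    for model in models:                            # each `model`: callable prompt -> answer str
        answers = [model(prompt) for _ in range(n_samples)]
        top, freq = Counter(answers).most_common(1)[0]
        votes[model] = (top, freq / n_samples)      # candidate answer + self-consistency score
    support = Counter(ans for ans, _ in votes.values())
    best = max(votes.values(), key=lambda v: (v[1], support[v[0]]))
    return best[0]
```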
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3410. Soft Equivariance Regularization for Invariant Self-Supervised Learning
๐ฌ 21
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Soft Equivariance Self-Supervised Learning Invariant Representation Vision Transformer
A central principle in self-supervised learning (SSL) is to learn data representations that are invariant to semantic-preserving transformations, e.g., image representations should remain unchanged under augmentations like cropping or color jitter. While effective for classification, such invariance can suppress transformation-relevant information that is valuable for other tasks. To address this, recent works explore equivariant representation learning, which encourages representations to retain information about the applied transformations. However, how to effectively incorporate equivariance as an explicit regularizer on top of strong invariance-based SSL backbones at ImageNet scale remains underexplored. We conjecture that enforcing invariance and equivariance on the same layer is inherently difficult and, if handled naively, may even hinder learning. To overcome this, we propose soft equivariance regularization (SER), a simple yet scalable method that decouples the two objectives: learning invariant representations via standard SSL, while softly regularizing intermediate features with an equivariance loss. Our approach necessitates neither a transformation label nor its predictive objectives, but operates directly with group actions applied to the intermediate feature maps. We show that this soft equivariance regularization significantly improves the generalization performance of ImageNet-1k pre-training of vision transformers (ViT), leading to stronger downstream classification accuracy on ImageNet and its variants, including natural distribution shifts as well as the common corruptions and perturbations of ImageNet-C and ImageNet-P. Our code is available at https://anonymous.4open.science/r/erl-B5CE
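The abstract's key mechanism is a soft penalty that compares the intermediate features of a transformed image with the same group action applied to the features of the original image, while the final representation keeps the usual invariance loss. A minimal sketch, assuming square patch-token maps without a CLS token and 90-degree rotations as the group action; this is not the paper's exact regularizer.

```python
# Hedged sketch of a soft equivariance penalty on intermediate ViT features.
# Assumes feat tensors are (B, N, C) patch tokens with N = H*W and no CLS token,
# and that the group action g is a rotation by k*90 degrees of the input image.
import torch
import torch.nn.functional as F

def equivariance_penalty(feat_of_tx: torch.Tensor, feat_of_x: torch.Tensor, k: int):
    b, n, c = feat_of_x.shape
    h = w = int(n ** 0.5)
    grid = feat_of_x.transpose(1, 2).reshape(b, c, h, w)
    acted = torch.rot90(grid, k, dims=(2, 3))      # apply the same group action to the features
    acted = acted.reshape(b, c, n).transpose(1, 2)
    return F.mse_loss(feat_of_tx, acted)

# Conceptually: total loss = L_invariance(final outputs) + lambda * equivariance_penalty(...),
# with the penalty attached to an intermediate layer rather than the final representation.
```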
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3411. A Near-Optimal Best-of-Both-Worlds Algorithm for Federated Bandits
๐ฌ 10
โญ 5.00
๐ learning theory
🏷️ Federated Bandits Multi-armed Bandits Best-of-both-worlds
This paper studies federated multi-armed bandit (MAB) problems where multiple agents work together to solve a common MAB problem through a communication network. We focus on the heterogeneous setting in which no single agent can identify the global best arm using only local biased observations. In this setting, different agents may select the same arm at the same time step but receive varying rewards. We propose a novel algorithm called \textsc{FedFTRL} for this problem, which is the first work to achieve near-optimal regret guarantees in both stochastic and adversarial environments. Notably, in the adversarial regime, our algorithm achieves $O(\sqrt{T})$ regret which is a significant improvement over the state-of-the-art regret of $O(T^{\frac{2}{3}})$ \citep{yi2023doubly}. We also provide numerical evaluations comparing our algorithm with baseline methods, demonstrating the effectiveness of our approach on both synthetic and real-world datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3412. Towards Improved Sentence Representations using Token Graphs
๐ฌ 21
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Graph-based token pooling; Sentence embeddings
Obtaining a single-vector representation from a Large Language Model's (LLM) token-level outputs is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model's self-attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token-similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. Furthermore, it is competitive with state-of-the-art techniques on benchmarks like GLUE and MTEB with 20x fewer trainable parameters and speeds up the training time by over 100x compared with parameter-efficient fine-tuning methods. Supported by a theoretical analysis of its expressive power, our work shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs.
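The described pipeline (similarity graph over frozen-LLM tokens, graph refinement, readout) can be sketched compactly. The single-layer design, top-k sparsification, and mean readout below are assumptions chosen for brevity, not GLOT's actual architecture.

```python
# Hedged sketch of structure-aware pooling over frozen-LLM token outputs:
# build a latent token-similarity graph, run one message-passing step, mean-readout.
# Illustrative only; layer count and graph construction are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphPool(nn.Module):
    def __init__(self, dim: int, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.msg = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:    # (B, T, D) frozen LLM outputs
        z = F.normalize(tokens, dim=-1)
        sim = z @ z.transpose(1, 2)                              # token-similarity graph
        topk = sim.topk(min(self.top_k, sim.shape[-1]), dim=-1)  # keep top-k neighbours per token
        adj = torch.zeros_like(sim).scatter_(-1, topk.indices, topk.values)
        adj = adj / adj.sum(-1, keepdim=True).clamp_min(1e-6)    # row-normalize
        refined = tokens + F.relu(self.msg(adj @ tokens))        # one message-passing step
        return refined.mean(dim=1)                               # readout -> sentence vector
```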
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3413. Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning
๐ฌ 13
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Self-Supervised Learning Representation Learning Computer Vision
Prototypical self-supervised learning methods consistently suffer from partial prototype collapse, where multiple prototypes converge to nearly identical representations. This undermines their central purpose of providing diverse and informative targets to guide encoders toward rich representations, and has led practitioners to over-parameterize prototype sets or add ad-hoc regularizers, which mitigate symptoms rather than address the root cause. We empirically trace the collapse to the joint optimization of encoders and prototypes, which encourages a type of shortcut learning: early in training prototypes drift toward redundant representations that minimize loss without necessarily enhancing representation diversity. To break the joint optimization, we introduce a fully decoupled training strategy that learns prototypes and encoders under separate objectives. Concretely, we model prototypes as a Gaussian mixture updated with an online EM-style procedure, independent of the encoder's loss. This simple yet principled decoupling eliminates prototype collapse without explicit regularization and yields consistently diverse prototypes, which in several settings translate to improved downstream performance.
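The decoupling described above amounts to updating the prototypes from encoder embeddings with an EM-style step that never receives gradients from the encoder's loss. A minimal sketch of such an update, with momentum blending and a softmax E-step as assumed details rather than the paper's exact procedure:

```python
# Hedged sketch of decoupled prototype learning: prototypes are mixture means updated
# by an online EM-style step from detached encoder embeddings, independent of the SSL loss.
import torch
import torch.nn.functional as F

@torch.no_grad()
def em_update_prototypes(protos, z, momentum=0.99, temperature=0.1):
    """protos: (K, D) mixture means; z: (B, D) batch of detached embeddings."""
    z = F.normalize(z, dim=-1)
    protos_n = F.normalize(protos, dim=-1)
    # E-step: soft responsibilities of each prototype for each embedding
    resp = F.softmax(z @ protos_n.t() / temperature, dim=-1)              # (B, K)
    # M-step: responsibility-weighted means, blended into the old prototypes
    new_means = (resp.t() @ z) / resp.sum(0).unsqueeze(-1).clamp_min(1e-6)
    return momentum * protos + (1.0 - momentum) * new_means

# usage inside the training loop (encoder updated by its own SSL objective):
#   protos = em_update_prototypes(protos, embeddings.detach())
```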
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3414. Controllable Sequence Editing for Biological and Clinical Trajectories
๐ฌ 25
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ conditional generation sequence editing time series forecasting counterfactual prediction multivariate sequences concept-based learning longitudinal modeling
Conditional generation models for longitudinal sequences can produce new or modified trajectories given a conditioning input. However, they often lack control over when the condition should take effect (timing) and which variables it should influence (scope). Most methods either operate only on univariate sequences or assume that the condition alters all variables and time steps. In scientific and clinical settings, interventions instead begin at a specific moment, such as the time of drug administration or surgery, and influence only a subset of measurements while the rest of the trajectory remains unchanged. To address this, we introduce CLEF, a controllable sequence editing method that learns temporal concepts encoding how and when a condition alters future sequence evolution. These concepts allow CLEF to apply targeted edits to the affected time steps and variables while preserving the rest of the sequence. We evaluate CLEF on 8 datasets spanning cellular reprogramming, patient health, and sales, comparing against 9 state-of-the-art baselines. CLEF improves immediate sequence editing accuracy by 16.28% (MAE) on average over its non-CLEF counterparts. Unlike prior models, CLEF enables one-step conditional generation at arbitrary future times, outperforming its non-CLEF counterparts in delayed sequence editing by 26.73% (MAE) on average. We test CLEF under counterfactual inference assumptions and show up to 63.19% (MAE) improvement on zero-shot conditional generation of counterfactual trajectories. In a case study of patients with type 1 diabetes mellitus, CLEF identifies clinical interventions that generate realistic counterfactual trajectories shifted toward healthier outcomes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3415. Hippoformer: Integrating Hippocampus-inspired Spatial Memory with Transformers
๐ฌ 26
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Hippocampus grid cell spatial reasoning relational memory transformer
Transformers form the foundation of modern generative AI, yet their key-value memory lacks inherent spatial priors, constraining their capacity for spatial reasoning. In contrast, neuroscience points to the hippocampal-entorhinal system, where the medial entorhinal cortex provides structural codes and the hippocampus binds them with sensory codes to enable flexible spatial inference. However, existing hippocampus models such as the Tolman-Eichenbaum Machine (TEM) suffer from inefficiencies due to outer-product operations or context-length bottlenecks in self-attention, limiting their scalability and integration into modern deep learning frameworks. To bridge this gap, we propose mm-TEM, an efficient and scalable structural spatial memory model that leverages meta-MLP relational memory to improve training efficiency, form grid-like representations, and reveal a novel link between prediction horizon and grid scales. Extensive evaluation shows its strong generalization on long sequences, large-scale environments, and multi-step prediction, with analyses confirming that its advantages stem from explicit understanding of spatial structures. Building on this, we introduce Hippoformer, which integrates mm-TEM with Transformer to combine structural spatial memory with precise working memory and abstraction, achieving superior generalization in both 2D and 3D prediction tasks and highlighting the potential of hippocampal-inspired architectures for complex domains. Overall, Hippoformer represents an initial step toward seamlessly embedding structured spatial memory into foundation architectures, offering a potential scalable path to endow deep learning models with spatial intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3416. Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
๐ฌ 18
โญ 5.00
๐ generative models
๐ท๏ธ discrete diffusion models masked diffusion models reinforcement learning
Masked diffusion models (MDMs) have recently emerged as a novel framework for language modeling. MDMs generate sentences by iteratively denoising masked sequences, filling in [MASK] tokens step by step. Although MDMs support any-order sampling, performance is highly sensitive to the choice of which position to unmask next. Prior work typically relies on rule-based schedules (e.g., max-confidence, max-margin), which provide ad hoc improvements. In contrast, we replace these heuristics with a learned scheduler. Specifically, we cast denoising as a KL-regularized Markov decision process (MDP) with an explicit reference policy and optimize a regularized objective that admits policy-improvement and convergence guarantees under standard assumptions. We prove that the optimized policy under this framework generates samples that more closely match the data distribution than heuristic schedules. Empirically, across four benchmarks, our learned policy consistently outperforms max-confidence: for example, on SUDOKU, where unmasking order is critical, it yields a 22% gain over random and a 12% gain over max-confidence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3417. SmellNet: A Large-scale Dataset for Real-world Smell Recognition
๐ฌ 22
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Smell sensing multimodal AI AI for smell smell recognition chemistry physical sensing
The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., smelling gluten or peanuts in a cake), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are virtually no large-scale benchmarks, and therefore little progress, for training and evaluating AI systems' ability to smell in the real world. In this paper, we use portable gas and chemical sensors to create SMELLNET, the first large-scale database that digitizes a diverse range of smells in the natural world. SMELLNET contains about 828,000 data points across 50 substances, spanning nuts, spices, herbs, fruits, and vegetables, and 43 mixtures among them, with 68 hours of data collected. Using SMELLNET, we developed SCENTFORMER, a Transformer-based architecture combining temporal differencing and sliding-window augmentation for smell data. For the SMELLNET-BASE classification task, SCENTFORMER achieves 58.5% Top-1 accuracy, and for the SMELLNET-MIXTURE distribution prediction task, SCENTFORMER achieves 50.2% Top-1@0.1 on the test-seen split. SCENTFORMER's ability to generalize across conditions and capture transient chemical dynamics demonstrates the promise of temporal modeling in olfactory AI. SMELLNET and SCENTFORMER lay the groundwork for real-world olfactory applications across healthcare, food and beverage, environmental monitoring, manufacturing, and entertainment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3418. Fused-Planes: Why Train a Thousand Tri-Planes When You Can Share?
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Tri-Planes NeRF 3D Latent
Tri-Planar NeRFs enable the application of powerful 2D vision models for 3D tasks, by representing 3D objects using 2D planar structures.
This has made them the prevailing choice to model large collections of 3D objects.
However, training Tri-Planes to model such large collections is computationally intensive and remains largely inefficient.
This is because the current approaches independently train one Tri-Plane per object, hence overlooking structural similarities in large classes of objects.
In response to this issue, we introduce Fused-Planes, a novel object representation that improves the resource efficiency of Tri-Planes when reconstructing object classes, all while retaining the same planar structure.
Our approach explicitly captures structural similarities across objects through a latent space and a set of globally shared base planes.
Each individual Fused-Planes is then represented as a decomposition over these base planes, augmented with object-specific features.
Fused-Planes showcase state-of-the-art efficiency among planar representations, demonstrating $7.2 \times$ faster training and $3.2 \times$ lower memory footprint than Tri-Planes while maintaining rendering quality.
An ultra-lightweight variant further cuts per-object memory usage by $1875 \times$ with minimal quality loss.
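The sharing mechanism described above (each object's plane expressed as a combination of globally shared base planes plus a small object-specific term) can be sketched as follows. Shapes, the low-rank residual, and all names are illustrative assumptions, not the Fused-Planes implementation.

```python
# Hedged sketch of the shared-basis idea behind Fused-Planes: a per-object feature plane
# is a learned linear combination of shared base planes plus a low-resolution residual.
import torch
import torch.nn as nn

class FusedPlane(nn.Module):
    def __init__(self, n_objects, n_bases=64, channels=32, res=128, res_lowrank=32):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(n_bases, channels, res, res) * 0.01)   # shared
        self.coeffs = nn.Parameter(torch.zeros(n_objects, n_bases))                   # per object
        self.residual = nn.Parameter(torch.zeros(n_objects, channels, res_lowrank, res_lowrank))

    def forward(self, obj_id: int) -> torch.Tensor:
        plane = torch.einsum("k,kchw->chw", self.coeffs[obj_id], self.bases)
        extra = torch.nn.functional.interpolate(
            self.residual[obj_id:obj_id + 1], size=plane.shape[-2:],
            mode="bilinear", align_corners=False)[0]
        return plane + extra    # feature plane queried by the NeRF-style decoder
```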
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3419. LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Fine-tuning Agents Decision-Making Exploration Analysis
The success of LLMs has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as $\epsilon$-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3420. DUET: DISTILLED LLM UNLEARNING FROM AN EFFICIENTLY CONTEXTUALIZED TEACHER
๐ฌ 20
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
🏷️ LLM Unlearning Knowledge Distillation Teacher-Student Learning Utility Preservation Data Efficiency Robustness Safety and Alignment
LLM unlearning is a technique to remove the impacts of undesirable knowledge from the model without retraining from scratch, which is indispensable towards trustworthy AI. Existing unlearning methods face significant limitations: conventional tuning-based unlearning is computationally heavy and prone to catastrophic forgetting. In contrast, in-contextualized unlearning is lightweight for precise unlearning but vulnerable to prompt removal or reverse engineering attacks. In response, we propose Distilled Unlearning from an Efficient Teacher (DUET), a novel distillation-based unlearning method that combines the merits of these two lines of work. It learns a student model to imitate the behavior of a prompt-steered teacher that effectively refuses undesirable knowledge generation while preserving general domain knowledge. Extensive evaluations on existing benchmarks with our enriched evaluation protocols demonstrated that DUET achieves significantly higher performance in both forgetting and utility preservation, while being orders of magnitude more data-efficient than state-of-the-art unlearning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3421. Multi-state Protein Design with DynamicMPNN
๐ฌ 9
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein Design AI BioML Protein Dynamics Multi-state proteins GNN
Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes, from enzyme catalysis to membrane transport, depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using Alphafold 3, DynamicMPNN outperforms ProteinMPNN by up to 25% on decoy-normalized RMSD and by 12% on sequence recovery across our challenging multi-state protein benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3422. Edit-Based Flow Matching for Temporal Point Processes
๐ฌ 15
โญ 5.00
๐ generative models
๐ท๏ธ Generative Modelling Forecasting Events Sequences Sets Continuous Time CTMC
Temporal point processes (TPPs) are a fundamental tool for modeling event sequences in continuous time, but most existing approaches rely on autoregressive parameterizations that are limited by their sequential sampling. Recent non-autoregressive, diffusion-style models mitigate these issues by jointly interpolating between noise and data through event insertions and deletions in a discrete Markov chain. In this work, we generalize this perspective and introduce an Edit Flow process for TPPs that transports noise to data via insert, delete, and substitute edit operations. By learning the instantaneous edit rates within a continuous-time Markov chain framework, we attain a flexible and efficient model that effectively reduces the total number of necessary edit operations during generation. Empirical results demonstrate the generative flexibility of our unconditionally trained model in a wide range of unconditional and conditional generation tasks on benchmark TPPs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3423. SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
๐ฌ 27
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ speech naturalness human dataset RLHF generative reward model AudioLLM
Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce ***SpeechJudge***, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present ***SpeechJudge-Data***, a large-scale human feedback corpus of 99k speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish ***SpeechJudge-Eval***, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the best-performing model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop ***SpeechJudge-GRM***, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3424. AQER: A Scalable and Efficient Data Loader for Digital Quantum Computers
๐ฌ 39
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ quantum data loading
Digital quantum computing promises to offer computational capabilities beyond the reach of classical systems, yet it is often constrained by scarce quantum resources. A critical bottleneck in this context is how to load classical or quantum data into quantum circuits efficiently. Approximate quantum loaders (AQLs) provide a viable solution to this problem by balancing fidelity and circuit complexity. However, most existing AQL methods are either heuristic or provide guarantees only for specific input types, and a general theoretical framework is still lacking. To address this gap, here we reformulate most AQL methods into a unified framework and establish information-theoretic bounds on their approximation error. Our analysis reveals that the achievable infidelity between the prepared state and target state scales linearly with the total entanglement entropy across subsystems when the loading circuit is applied to the target state. In light of this, we develop AQER, a scalable AQL method that constructs the loading circuit by systematically reducing entanglement in target states. We conduct systematic experiments to evaluate the effectiveness of AQER, using synthetic datasets, classical image and language datasets, and quantum many-body state datasets with up to 50 qubits. The results show that AQER consistently outperforms existing methods in both accuracy and gate efficiency. Our work paves the way for scalable quantum data processing and real-world quantum computing applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3425. Measure Twice, Cut Once: A Semantic-Oriented Approach to Video Temporal Localization with Video LLMs
๐ฌ 21
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video LLMs Video Temporal Localization Contrastive Learning
Temporally localizing user-queried events through natural language is crucial for video understanding models. Recent methods predominantly adapt video LLMs to generate event boundary timestamps for temporal localization tasks, which struggle to leverage LLMs' pre-trained semantic understanding capabilities due to the uninformative nature of timestamp outputs. In this work, we propose a timestamp-free, semantic-oriented framework that fine-tunes video LLMs using two generative learning tasks and one discriminative learning task. We first introduce a structural token generation task that enables the video LLM to recognize the temporal structure of input videos based on the input query. Through this task, the video LLM generates a sequence of special tokens, called structural tokens, which partition the video into consecutive segments and categorize them as either target events or background transitions. To enhance precise recognition of event segments, we further propose a query-focused captioning task that enables the video LLM to extract fine-grained event semantics that can be effectively utilized by the structural tokens. Finally, we introduce a structural token grounding module driven by contrastive learning to associate each structural token with its corresponding video segment, achieving holistic temporal segmentation of the input video and readily yielding the target event segments for localization. Extensive experiments across diverse temporal localization tasks demonstrate that our proposed framework, MeCo, consistently outperforms methods relying on boundary timestamp generation, highlighting the potential of a semantic-driven approach for temporal localization with video LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3426. MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents
๐ฌ 17
โญ 5.00
๐ datasets and benchmarks
🏷️ Model Context Protocol Security LLM Agent Attack Benchmark Evaluation
The Model Context Protocol (MCP) standardizes how large language model (LLM) agents discover, describe, and call external tools. While MCP unlocks broad interoperability, it also enlarges the attack surface by making tools first-class, composable objects with natural-language metadata, and standardized I/O. We present MSB (MCP Security Benchmark), the first end-to-end evaluation suite that systematically measures how well LLM agents resist MCP-specific attacks throughout the full tool-use pipeline: task planning, tool invocation, and response handling. MSB contributes: (1) a taxonomy of 12 attacks including name-collision, preference manipulation, prompt injections embedded in tool descriptions, out-of-scope parameter requests, user-impersonating responses, false-error escalation, tool-transfer, retrieval injection, and mixed attacks; (2) an evaluation harness that executes attacks by running real tools (both benign and malicious) via MCP rather than simulation; and (3) a robustness metric that quantifies the trade-off between security and performance: Net Resilient Performance (NRP). We evaluate nine popular LLM agents across 10 domains and 405 tools, producing 2,000 attack instances. Results reveal the effectiveness of attacks against each stage of MCP. Models with stronger performance are more vulnerable to attacks due to their outstanding tool calling and instruction following capabilities. MSB provides a practical baseline for researchers and practitioners to study, compare, and harden MCP agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3427. Membrane Potential Perturbation Dynamic Is Total Variation
๐ฌ 15
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Membrane potential perturbation dynamic spiking neural network total variation
Membrane potential perturbation dynamic (MPPD) is an emerging approach to capture perturbation intensity and stabilize the performance of spiking neural networks (SNNs). It discards the neuronal reset part to intuitively reduce fluctuations of dynamics, but this treatment may be insufficient for perturbation characterization. In this study, we prove that MPPD is total variation (TV), which is a widely-used methodology for robust signal reconstruction. Moreover, we propose a novel TV-$\ell_1$ framework for MPPD, based on the coarea formula, which allows for a wider range of network functions and offers a stronger denoising advantage than the existing TV-$\ell_2$ framework. Experiments show that MPPD-TV-$\ell_1$ achieves robust performance in both Gaussian noise training and adversarial training for image classification tasks. This finding may provide a new insight into the essence of perturbation characterization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3428. EffiVMT: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning
๐ฌ 13
โญ 5.00
๐ generative models
๐ท๏ธ Video diffusion transfer; Video motion transfer; Efficiency;
Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generation. For the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptation (LoRA) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from **motion inconsistency** and **tuning inefficiency** when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. In addition, it requires time-consuming fine-tuning in both stages. To tackle these issues, we propose EffiVMT, an efficient **three-stage** video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. In **stage 1**, we propose a spatial-temporal head classification technique to decouple the heads of 3D attention into distinct groups for spatial-appearance and temporal motion processing. We then finetune the spatial heads in **stage 2**. In **stage 3**, temporal head tuning, we design sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of EffiVMT.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3429. Post-training Large Language Models for Diverse High-Quality Responses
๐ฌ 12
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language models Diversity Reinforcement learning Post-training
Reinforcement learning has emerged as a popular method for post-training large language models (LLMs). While improving the model's performance on downstream tasks, it often reduces the model's output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on lexical differences. We propose a novel training method named DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
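The diversity term the abstract describes, the log-volume spanned by the embeddings of a group of responses, is the log-determinant of a kernel similarity matrix. A minimal sketch under stated assumptions (cosine kernel, small jitter for stability); the exact kernel and how it enters DQO's training objective are the paper's details, not reproduced here.

```python
# Hedged sketch of a DPP-style diversity score: log-determinant of a kernel similarity
# matrix over response embeddings, usable as the diversity term alongside a quality reward.
import torch
import torch.nn.functional as F

def dpp_diversity(embeddings: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """embeddings: (n, d) embeddings of n responses sampled for the same prompt."""
    z = F.normalize(embeddings, dim=-1)
    kernel = z @ z.t()                                          # cosine-similarity Gram matrix
    kernel = kernel + eps * torch.eye(len(z), device=z.device)  # jitter for numerical stability
    return torch.logdet(kernel)                                 # log-volume spanned by embeddings

# conceptually: maximize  E[quality(y_i)] + beta * dpp_diversity(embed(y_1..y_n))
```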
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3430. From Evaluation to Defense: Advancing Safety in Video Large Language Models
๐ฌ 33
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Video Large Language Model Safety of Multimodal Large Language Model Safety Alignment RLHF
While the safety risks of image-based large language models (Image LLMs) have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce \textbf{VideoSafetyEval} - the first large-scale, real-world benchmark for Video LLM safety, which comprises 11.4k video-query pairs and spans 19 principal risk categories. Based on this, \textit{we reveal that integrating the video modality degrades safety performance by an average of 34.2\%, exposing systemic risks in multimodal attack exploitation.}
To address this vulnerability, we propose \textbf{VideoSafety-R1}, a dual-stage framework achieving unprecedented safety gains through three innovations: (1) the VideoSafetyThinking dataset contains 46k video-query-thinking response triplets. (2) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives. (3) Safety-Guided GRPO enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. These components synergize to shift safety alignment from harm perception to active reasoning. The framework achieves a 71.1\% improvement on VSE-HH, and improves by 59.1\%, 44.3\%, and 15.0\% on the image safety datasets MMBench, VLGuard, and FigStep, respectively.
\textit{Our codes are anonymously available at \url{https://anonymous.4open.science/r/VSBr1-911E/README.md}.}
\textcolor{red}{Note: This paper contains harmful language and image examples, and reader discretion is recommended.}
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3431. AutoMetrics: Approximate Human Judgments with Automatically Generated Evaluators
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ evaluation LLM-as-a-judge metrics human feedback open-ended tasks user-centered evaluation data-efficient evaluation automatic metric generation benchmarking
Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present **AutoMetrics**, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from **MetricBank**, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics thus turns expensive measurements into interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward with the same effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
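The composition step described above (regressing candidate metric scores onto a small number of human labels) can be illustrated with a few lines of scikit-learn. The ridge penalty and the shapes below are assumptions for the sketch; the actual metric sources come from MetricBank and generated LLM-judge criteria.

```python
# Hedged sketch of composing candidate metrics via regression so that the weighted
# combination tracks sparse human feedback. Illustrative, not the AutoMetrics toolkit.
import numpy as np
from sklearn.linear_model import Ridge

def compose_metrics(metric_scores: np.ndarray, human_ratings: np.ndarray) -> Ridge:
    """metric_scores: (n_examples, n_metrics) candidate metric outputs;
    human_ratings: (n_examples,) with typically fewer than 100 labeled examples."""
    model = Ridge(alpha=1.0)
    model.fit(metric_scores, human_ratings)
    return model           # model.predict(new_scores) is the composed automatic metric

# The learned coefficients also indicate which retrieved or generated criteria carry signal.
```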
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3432. RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy, Fairness and Utility
๐ฌ 18
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Learning Fairness Privacy Adversarial Representation Learning Uncertainty Quantification
Federated Learning (FL) has gained prominence in machine learning applications across critical domains, offering collaborative model training without centralized data aggregation. However, FL frameworks that protect privacy often sacrifice fairness and reliability; differential privacy reduces data leakage but hides sensitive attributes needed for bias correction, worsening performance gaps across demographic groups. This work explores the trade-off between privacy and fairness in FL-based object detection and introduces RESFL, an integrated solution optimizing both. RESFL incorporates adversarial privacy disentanglement and uncertainty-guided fairness-aware aggregation. The adversarial component uses a gradient reversal layer to remove sensitive attributes, reducing privacy risks while maintaining fairness. The uncertainty-aware aggregation employs an evidential neural network to weight client updates adaptively, prioritizing contributions with lower fairness disparities and higher confidence. This ensures robust and equitable FL model updates. We demonstrate the effectiveness of RESFL in high-stakes autonomous vehicle scenarios, where it achieves high mAP on FACET and CARLA, reduces membership-inference attack success by 37%, reduces equality-of-opportunity gap by 17% relative to the FedAvg baseline, and maintains superior adversarial robustness. However, RESFL is inherently domain-agnostic and thus applicable to a broad range of application domains beyond autonomous driving.
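The adversarial privacy-disentanglement component relies on a gradient reversal layer, a standard construction in which features pass through unchanged on the forward pass while gradients from the sensitive-attribute head are flipped on the backward pass. The sketch below shows that generic mechanism only; RESFL's uncertainty-weighted aggregation is not reproduced here.

```python
# Hedged sketch of a gradient reversal layer (GRL): identity forward, negated (and scaled)
# gradient backward, so the backbone is pushed to discard the sensitive attribute.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reverse (and scale) the gradient; none for lam

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# usage: sensitive_logits = attribute_head(grad_reverse(features))
```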
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3433. Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning
๐ฌ 12
โญ 5.00
๐ reinforcement learning
๐ท๏ธ goal-conditioned reinforcement learning hierarchical reinforcement learning sparse reward long-horizon tasks graph-based policy learning subgoal planning
Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on conventional hindsight relabeling often fails to correct subgoal infeasibility, leading to inefficient high-level planning. To address this, we propose Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that integrates Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making. FER delineates the reachability frontier using failure and partial-success transitions, which identifies unreliable subgoals, increases subgoal reliability, and reduces unnecessary high-level decisions. Additionally, SSE employs a decoupled exploration policy to cover underexplored regions of the goal space and a path refinement that adjusts edge costs using observed low-level failures. Experimental results across diverse long-horizon benchmarks show that SSE consistently outperforms existing goal-conditioned and hierarchical RL methods in both efficiency and success rate.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3434. SIGMA-GEN: STRUCTURE AND IDENTITY GUIDED MULTI-SUBJECT ASSEMBLY FOR IMAGE GENERATION
๐ฌ 12
โญ 5.00
๐ generative models
๐ท๏ธ image generation identity preservation controllable generation
We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision, from coarse 2D or 3D boxes to pixel-level segmentations and depth, with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3435. Characterizing and Mitigating Reasoning Drift in Large Language Models
๐ฌ 12
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Steering Vector LLMs Reasoning Inference-time Scaling
While chain-of-thought prompting enables powerful multi-step reasoning in Large Language Models (LLMs), the stochastic nature of the generation process undermines its reliability. In this work, we first analyze thousands of reasoning paths to identify Reasoning Drift, a key failure mode where models get locked into flawed reasoning patterns. We reveal that the manifestation of drift is a complex interplay between universal functional tendencies and unique, model-specific signatures. Based on the diagnosis, we propose Reasoning-Aware Activation Steering, a novel inference-time intervention method to gently nudge the model's activations away from pathological patterns. We pre-compute a library of vectors from contrastive functional transitions and apply them dynamically. Experiments show that our method effectively mitigates the drift problem and boosts accuracy. Additionally, it generalizes to out-of-distribution tasks, demonstrating a deeper capture of valid reasoning principles.
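Applying a pre-computed steering vector at inference time typically means adding it to a chosen layer's hidden states during generation. The hook-based sketch below shows that generic mechanics for a Llama-style HuggingFace causal LM (the `model.model.layers` path is an assumption about the architecture); how the vectors are computed from contrastive functional transitions is the paper's contribution and is not shown.

```python
# Hedged sketch of inference-time activation steering via a forward hook.
# The steering vector itself is assumed to be pre-computed from contrastive examples.
import torch

def add_steering_hook(model, layer_idx: int, steer_vec: torch.Tensor, alpha: float = 4.0):
    layer = model.model.layers[layer_idx]     # decoder block of a Llama-style HF causal LM (assumption)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steer_vec.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)  # keep the handle; call .remove() to stop steering
```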
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3436. Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ CAD generative modeling; parametric CAD sequence;
The advent of Computer-Aided Design (CAD) generative modeling will significantly transform the design of industrial products. Recent research has extended into the realm of Large Language Models (LLMs). In contrast to fine-tuning methods, training-free approaches typically utilize advanced LLMs, thereby offering enhanced flexibility and efficiency in the development of AI agents for generating CAD parametric models. However, the lack of a mechanism to harness Chain-of-Thought (CoT) reasoning limits the potential of LLMs in CAD applications. Seek-CAD is the first exploration of the locally deployed reasoning LLM DeepSeek-R1 for CAD parametric model generation with a training-free methodology, and the first study to incorporate both visual and CoT feedback within a self-refinement mechanism for generating CAD models. Specifically, the initially generated parametric CAD model is rendered into a sequence of step-wise perspective images, which are subsequently processed by a Vision Language Model (VLM) alongside the corresponding CoTs derived from DeepSeek-R1 to assess the generated CAD model. The feedback is then used by DeepSeek-R1 to refine the model for the next round of generation. Moreover, we present an innovative 3D CAD model dataset structured around the SSR (Sketch, Sketch-based feature, and Refinements) triple design paradigm. This dataset encompasses a wide range of CAD commands, aligning effectively with industrial application requirements and proving suitable for LLM-based generation. Extensive experiments validate the effectiveness of Seek-CAD under various metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3437. HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data
๐ฌ 20
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Pretrained models Spatial transcriptomics AI for Science
Single-cell transcriptomics and proteomics have become a great source for data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and gene expression at the single-cell level. With the advent of spatial-omics data, we have the promise of characterizing cells within their tissue context as it provides both spatial coordinates and intra-cellular transcriptional or protein counts. Beyond transcriptomics, proteomics offers a complementary view by directly measuring proteins, which are the primary effectors of cellular function and key therapeutic targets. However, existing models either ignore the spatial information or the complex genetic and proteomic programs within cells. Thus they cannot infer how cell internal regulation adapts to microenvironmental cues. Furthermore, these models often utilize fixed gene vocabularies, hindering their generalizability to datasets with different genes than pretraining. In this paper, we introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST models tissues as hierarchical graphs. The higher level graph is a spatial cell graph, and each cell in turn, is represented by its lower level gene co-expression network graph. Rather than using a fixed gene vocabulary, HEIST computes gene embeddings from its co-expression network and cellular context. HEIST achieves this by performing both intra-level and cross-level message passing to utilize the hierarchy in its embeddings and can thus generalize to novel datatypes including spatial proteomics without retraining. HEIST is pretrained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive and masked autoencoding objectives. Unsupervised analysis of HEIST embeddings reveals spatially informed subpopulations missed by prior models. Downstream evaluations demonstrate generalizability to proteomics data and state-of-the-art performance in clinical outcome prediction, cell type annotation, and gene imputation across multiple technologies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3438. MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs
๐ฌ 12
โญ 5.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ knowledge editing resource-constrained devices privacy-preserving
Large language models (LLMs) are deployed on mobile devices to power killer applications such as intelligent assistants. LLMs pre-trained on general corpora often hallucinate when handling personalized or unseen queries, leading to incorrect or outdated responses. Knowledge editing addresses this by identifying and adjusting a small but crucial portion of model weights without compromising general knowledge. However, prior knowledge editing methods are impractical to run on local devices due to the resource-heavy backpropagation (BP) needed for updates. We present MobiEdit, the first mobile knowledge editing framework that enables efficient LLM personalization on commercial off-the-shelf (COTS) mobile devices. MobiEdit replaces full-precision BP with quantized forward-only gradient estimation, making it compatible with energy-efficient mobile neural processing units (NPUs). To further improve gradient estimation efficiency, we introduce two optimizations: an early stopping mechanism that adaptively terminates editing upon success, and prefix activation reuse that reduces redundant computation across steps. Our approach enables real-time editing of 3B-parameter models (Qwen2.5-3B-Instruct and Llama3.2-3B-Instruct) on COTS mobile devices with 7.1$\times$ less memory, 15.8$\times$ less energy, and 3.4$\times$ lower latency compared to previous knowledge editing methods.
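Forward-only gradient estimation generally means replacing backpropagation with finite differences along random directions, so each update needs only forward passes. A minimal SPSA/MeZO-style sketch under stated assumptions; MobiEdit's exact estimator, quantization scheme, and NPU kernels are not reproduced here.

```python
# Hedged sketch of zeroth-order (forward-only) gradient estimation for a weight tensor:
# two forward passes along a random direction yield a finite-difference gradient estimate.
import torch

@torch.no_grad()
def forward_only_grad(loss_fn, weight: torch.Tensor, mu: float = 1e-3):
    """loss_fn: hypothetical callable evaluating the editing loss for a given weight tensor."""
    direction = torch.randn_like(weight)
    loss_plus = loss_fn(weight + mu * direction)     # forward pass 1
    loss_minus = loss_fn(weight - mu * direction)    # forward pass 2
    scale = (loss_plus - loss_minus) / (2.0 * mu)
    return scale * direction                         # estimate of the (smoothed) gradient

# one editing step, no backpropagation required:
#   weight -= lr * forward_only_grad(loss_fn, weight)
```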
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3439. ScalingCache: Extreme Acceleration of DiTs through Difference Scaling and Dynamic Interval Caching
๐ฌ 12
โญ 5.00
๐ generative models
🏷️ Diffusion Transformer Image generation Video generation Model Acceleration Feature Cache
Diffusion Transformers (DiTs) have emerged as powerful generative models, but their iterative denoising structure and deep transformer blocks incur substantial computational overhead, limiting the accessibility and practical deployment of high-quality video generation. To address this bottleneck, we propose ScalingCache, a training-free acceleration framework specifically designed for DiTs. ScalingCache exploits the inherent redundancy in model representations by performing lightweight offline analysis on a small number of samples and dynamically reusing previously computed activations during inference, thereby avoiding full computation at certain denoising steps. Experimental results demonstrate that ScalingCache achieves significant acceleration in both image and video generation tasks while maintaining near-lossless generation quality. On widely used video generation models including Wan2.1 and HunyuanVideo, it achieves approximately 2.5$\times$ acceleration with only 0.5$\%$ drop in VBench scores; on FLUX, it achieves 3.1$\times$ near-lossless acceleration, with human preference tests showing comparable quality to original outputs. Moreover, under similar acceleration ratios, ScalingCache outperforms prior state-of-the-art caching strategies, achieving a 45$\%$ reduction in LPIPS for text-to-image generation and 20$-$30$\%$ reduction for text-to-video generation, highlighting its superior fidelity preservation.
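The reuse mechanism the abstract describes (skipping the transformer blocks at selected denoising steps and approximating their output from the most recent fully computed one) can be sketched with a small wrapper. Which steps to recompute and how reused features are rescaled are exactly what ScalingCache derives offline; the scheduling and scaling values below are placeholders.

```python
# Hedged sketch of training-free feature caching in a DiT sampler: at non-compute steps
# the blocks' cached residual contribution is reused (optionally rescaled) instead of recomputed.
class CachedBlocks:
    def __init__(self, blocks, compute_steps, scale=None):
        self.blocks = blocks                   # the DiT transformer blocks (callables)
        self.compute_steps = set(compute_steps)
        self.scale = scale or {}               # per-step scaling factors for reused features
        self.cache = None

    def __call__(self, x, step, **kw):
        if step in self.compute_steps or self.cache is None:
            out = x
            for blk in self.blocks:
                out = blk(out, **kw)
            self.cache = out - x               # cache the blocks' residual contribution
            return out
        return x + self.scale.get(step, 1.0) * self.cache   # reuse the scaled cached residual
```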
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3440. FOCUS: Efficient Keyframe Selection for Long Video Understanding
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Keyframe Selection Multimodal large language models Long Video Understanding Combinatorial Pure-exploration
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments.
We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and a Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure is derived from a sequential policy with theoretical guarantees: it first identifies high-value temporal regions, then selects top-scoring frames within each region. Extensive experiments across four long-video question-answering benchmarks and four popular MLLMs demonstrate that FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
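The bandit view described above treats each short clip as an arm scored by a cheap query-relevance signal and explored optimistically. Below is a sketch of an empirical-Bernstein upper confidence bound of the standard form, which is one way to instantiate the "empirical means plus Bernstein confidence radius" idea; the constants, the anytime log term, and the scoring model are assumptions, not the exact FOCUS procedure.

```python
# Hedged sketch of an empirical-Bernstein UCB for one clip (arm) in the keyframe-selection bandit.
import math
import numpy as np

def bernstein_ucb(scores: list, t: int, score_range: float = 1.0, delta: float = 0.05) -> float:
    """scores: relevance scores observed so far for this clip; t: total pulls across all clips."""
    n = len(scores)
    if n == 0:
        return float("inf")                   # force at least one look at every clip
    mean, var = float(np.mean(scores)), float(np.var(scores))
    log_term = math.log(3.0 * t / delta)
    radius = math.sqrt(2.0 * var * log_term / n) + 3.0 * score_range * log_term / n
    return mean + radius                      # optimistic value of the clip

# Repeatedly sampling a frame from the clip with the largest UCB concentrates the frame
# budget on high-value temporal regions while still exploring uncertain ones.
```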
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3441. An Ensemble Framework for Unbiased Language Model Watermarking
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ LLM watermarking
As large language models become increasingly capable and widely deployed, verifying the provenance of machine-generated content is critical to ensuring trust, safety, and accountability. Watermarking techniques have emerged as a promising solution by embedding imperceptible statistical signals into the generation process. Among them, unbiased watermarking is particularly attractive due to its theoretical guarantee of preserving the language model's output distribution, thereby avoiding degradation in fluency or detectability through distributional shifts. However, existing unbiased watermarking schemes often suffer from weak detection power and limited robustness, especially under short text lengths or distributional perturbations. In this work, we propose ENS, a novel ensemble framework that enhances the detectability and robustness of logits-based unbiased watermarks while strictly preserving their unbiasedness. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. We theoretically prove that the ensemble construction remains unbiased in expectation and demonstrate how it improves the signal-to-noise ratio for statistical detectors. Empirical evaluations on multiple LLM families show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks without compromising generation quality.
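The abstract does not spell out the composition, so the sketch below illustrates one plausible reading under stated assumptions: several key-seeded, distribution-preserving (Gumbel-max style) watermark instances are cycled across token positions, so each instance contributes an independently detectable signal while sampling still follows the model's distribution in expectation over keys. The toy `logits_fn`, the key schedule, and the context-window seeding are hypothetical.

```python
# Illustrative sketch of composing several key-seeded, distribution-preserving
# (Gumbel-max style) watermark instances; not the paper's exact scheme.
import numpy as np

def gumbel_watermark_step(logits, key, context, vocab_size):
    # Pseudorandomness is derived from (key, recent context), so a detector
    # holding the key can recompute the same Gumbel noise.
    seed = hash((key, tuple(context[-4:]))) % (2**32)
    g = np.random.default_rng(seed).gumbel(size=vocab_size)
    return int(np.argmax(logits + g))     # Gumbel-max: samples from softmax(logits)

def ensemble_generate(logits_fn, keys, length, vocab_size=50):
    tokens = []
    for t in range(length):
        key = keys[t % len(keys)]          # sequentially cycle the independent keys
        logits = logits_fn(tokens)
        tokens.append(gumbel_watermark_step(logits, key, tokens, vocab_size))
    return tokens

# Toy "language model": fixed logits for demonstration only.
rng = np.random.default_rng(1)
fixed_logits = rng.normal(size=50)
print(ensemble_generate(lambda ctx: fixed_logits, keys=[101, 202, 303], length=10))
```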
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3442. SMixer: Rethinking Efficient-Training and Event-Driven SNNs
๐ฌ 15
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Network Efficient Computation Prune Method
Spiking Neural Networks (SNNs) offer a promising, energy-efficient paradigm for computation, but their practical application is hindered by challenges in architecture design and training costs. For example, Spiking ResNet exhibits relatively low performance, whereas high-performance Spiking Transformers are not truly event-driven and cannot be implemented on asynchronous chips. Moreover, the intrinsic time steps and neuron state dynamics result in a substantial computational overhead for training SNNs on GPUs. In response to these problems, we discuss rational architectural design for SNNs and argue that such designs should exhibit three key characteristics: operations fully supported by asynchronous scenarios, low training overhead, and competitive performance. In light of this, we adopt the event-driven-friendly Spiking Mixer (SMixer) as the foundational architecture and develop a spike feature Spatial-Temporal Pruning (STP) framework with a high pruning ratio and no trainable parameters to reduce the training overhead. Based on a statistical analysis of sparse spike features, STP eliminates redundant spike features across both spatial and temporal dimensions, thereby reducing the input features and computational load during training. It adaptively selects the most salient spike tokens spatially and dynamically constrains neuron firing rates temporally. By leveraging STP and architectural adaptation, SMixer accelerates training while remaining fully event-driven and maintaining competitive performance, offering valuable insights for the design of efficient, event-driven SNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3443. THE END OF MANUAL DECODING: TOWARDS TRULY END-TO-END LANGUAGE MODELS
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ dynamic decoding instruction-based Control truly end-to-end
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end'' generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass.
Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. In addition, we demonstrate an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, which may open a new paradigm for steerable and interactive LLM decoding.
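A minimal PyTorch sketch of the idea described above: small linear heads on the final hidden state predict a per-step temperature and top-p, which are then used for nucleus sampling. Head shapes, output ranges, and the sampling routine are illustrative assumptions rather than the paper's architecture.

```python
# Minimal PyTorch sketch of per-token decoding-parameter heads (illustrative sizes).
import torch
import torch.nn as nn

class AutoDecoHead(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size)
        self.temp_head = nn.Linear(hidden_size, 1)   # predicts temperature
        self.topp_head = nn.Linear(hidden_size, 1)   # predicts top-p

    def forward(self, h):                             # h: (batch, hidden)
        logits = self.lm_head(h)
        temperature = 0.1 + 1.9 * torch.sigmoid(self.temp_head(h))  # in (0.1, 2.0)
        top_p = torch.sigmoid(self.topp_head(h))                     # in (0, 1)
        return logits, temperature, top_p

def sample(logits, temperature, top_p):
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_p, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    keep = torch.cumsum(sorted_p, dim=-1) - sorted_p < top_p          # nucleus mask
    sorted_p = sorted_p * keep
    sorted_p = sorted_p / sorted_p.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_p, 1)
    return sorted_idx.gather(-1, choice)

head = AutoDecoHead(hidden_size=16, vocab_size=32)
h = torch.randn(2, 16)
logits, temp, topp = head(h)
print(sample(logits, temp, topp).shape)   # torch.Size([2, 1])
```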
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3444. Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
๐ฌ 25
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multilingual Math Reasoning
Recent frontier models employ long-chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang-HQ**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score ($64.0_{\pm2.5}$), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of $+18.6$ points across the evaluated nine benchmarks. Ablations show **Language-Mixed CoT** is more effective than monolingual CoT, indicating that reasoning patterns can be engineered to improve non-English performance. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3445. Sequential Information Bottleneck Fusion: Towards Robust and Generalizable Multi-Modal Brain Tumor Segmentation
๐ฌ 15
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Brain Tumor Segmentation Missing Modality Generalization Robustness
Brain tumor segmentation in multi-modal MRIs poses significant challenges when one or more modalities are missing. Recent approaches commonly employ parallel fusion strategies; however, these methods often risk losing crucial shared information across modalities, which can degrade segmentation performance. In this paper, we advocate leveraging sequential information bottleneck fusion to effectively preserve shared information across modalities. From an information-theoretic perspective, sequential fusion not only produces more robust fused representations in missing-data scenarios but also achieves a tighter generalization upper bound compared to parallel fusion approaches. Building on this principle, we propose the Sequential Multi-modal Segmentation Network (SMSN), which integrates an Information-Bottleneck Fusion Module (IBFM). The IBFM sequentially extracts modality-common features while reconstructing modality-specific features through a dedicated feature extraction module. Extensive experiments on the BRATS18 and BRATS20 glioma datasets demonstrate that SMSN consistently outperforms traditional parallel fusion-based baselines, achieving exceptional robustness in diverse missing-modality settings. Furthermore, SMSN exhibits superior cross-domain generalization, as evidenced by its ability to transfer a trained model from BRATS20 to a brain metastasis dataset without fine-tuning. To ensure reproducibility, the code of the SMSN is provided in the supplementary file.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3446. Mitigating Non-IID Drift in Zeroth-Order Federated LLM Fine-Tuning with Transferable Sparsity
๐ฌ 12
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Sparse Zeroth-Order Optimization Federated Learning
Federated Learning enables collaborative fine-tuning of Large Language Models (LLMs) across decentralized Non-Independent and Identically Distributed (Non-IID) clients, but such models' massive parameter sizes lead to significant memory and communication challenges. This work introduces Meerkat, a sparse zeroth-order optimization (ZO) method designed for federated LLM fine-tuning. By limiting fine-tuning to a transferable, static, extremely sparse subset of parameters, Meerkat achieves remarkable communication efficiency, enabling cost-effective high-frequency synchronization. With theoretical analysis and experiments, we show that this high-frequency communication effectively mitigates Non-IID data challenges and leads to superior performance compared to full-parameter ZO. Furthermore, experimental results show that Meerkat outperforms existing sparsity baselines with better performance at the same communication frequency. To further handle Non-IID drift, Meerkat leverages traceable local updates and forms a virtual path for each client. This virtual path mechanism reveals the GradIP phenomenon: the inner products between the LLM pre-training gradients maintained by the server and the client gradients estimated via ZO converge for extreme Non-IID clients but oscillate for IID ones. This distinct behavior provides a signal for identifying clients with extreme data heterogeneity. Using this signal, Meerkat-vp is proposed to analyze GradIP trajectories to identify extreme Non-IID clients and applies early stopping to enhance aggregated model quality. Experiments confirm that Meerkat and Meerkat-vp significantly improve the efficiency and effectiveness of ZO federated LLM fine-tuning.
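A minimal numpy sketch of sparse zeroth-order fine-tuning on a fixed mask, in the spirit of the abstract: a two-point (SPSA-style) finite-difference estimate is formed by perturbing only the masked coordinates, and only those coordinates are updated (and would be communicated in a federated setting). The mask size, toy objective, and step sizes are illustrative.

```python
# Sketch of sparse zeroth-order (two-point SPSA-style) fine-tuning on a fixed mask.
import numpy as np

rng = np.random.default_rng(0)
dim = 1000
theta = rng.normal(size=dim)
mask_idx = rng.choice(dim, size=10, replace=False)   # transferable sparse subset

def loss(w):                                          # stand-in objective
    return 0.5 * np.sum((w - 1.0) ** 2)

mu, lr = 1e-3, 1e-2
for step in range(200):
    z = np.zeros(dim)
    z[mask_idx] = rng.normal(size=mask_idx.size)      # perturb only masked coords
    g_scalar = (loss(theta + mu * z) - loss(theta - mu * z)) / (2 * mu)
    theta[mask_idx] -= lr * g_scalar * z[mask_idx]    # update only masked coords
print(loss(theta))
```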
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3447. STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
๐ฌ 24
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ spoken language model reasoning chain-of-thought
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose STITCH, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, STITCH matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; STITCH also performs equally well on non-reasoning datasets as those baseline models.
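A toy scheduling sketch of the alternation described above: after emitting a spoken chunk, the time left while its audio plays is spent generating unspoken reasoning tokens before the next spoken chunk. The `gen_tokens` stand-in and all timing constants are hypothetical.

```python
# Toy sketch of alternating unspoken-reasoning and spoken-response chunks.
# `gen_tokens` stands in for a hypothetical SLM call; timings are illustrative.
def gen_tokens(kind, n):
    return [f"<{kind}_{i}>" for i in range(n)]

def stitch_generate(num_chunks=4, speech_tokens_per_chunk=20,
                    audio_seconds_per_chunk=2.0, tokens_per_second=60):
    transcript = []
    for c in range(num_chunks):
        speech = gen_tokens(f"speech{c}", speech_tokens_per_chunk)
        transcript += speech                          # this chunk is played to the user
        # Time left while the audio plays is spent on unspoken reasoning tokens.
        budget = int(audio_seconds_per_chunk * tokens_per_second) - speech_tokens_per_chunk
        transcript += gen_tokens(f"think{c}", max(budget, 0))
    return transcript

out = stitch_generate()
print(len(out), out[:2], out[20:22])
```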
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3448. Towards Improvisational TAMP: Learning Low-Level Shortcuts in Abstract Planning Graphs
๐ฌ 20
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot Planning Reinforcement Learning
Long-horizon decision-making with sparse rewards and continuous states and actions remains a fundamental challenge in AI and robotics. Task and motion planning (TAMP) is a model-based framework that addresses this challenge by planning hierarchically with abstract actions (options). These options are manually defined, limiting the agent to behaviors that we as human engineers know how to program (pick, place, move). In this work, we propose Shortcut Learning for Abstract Planning (SLAP), a method that leverages existing TAMP options to automatically discover new ones. Our key idea is to use model-free reinforcement learning (RL) to learn *shortcuts* in the abstract planning graph induced by the existing options in TAMP. Without any additional assumptions or inputs, shortcut learning leads to shorter solutions than pure planning, and higher task success rates than flat and hierarchical RL. Qualitatively, SLAP discovers dynamic physical improvisations (e.g., slap, wiggle, wipe) that differ significantly from the manually-defined ones. In experiments in four simulated robotic environments, we show that SLAP solves and generalizes to a wide range of tasks, reducing overall plan lengths by over 50\% and consistently outperforming planning and RL baselines. Our code and videos are included in the supplement and will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3449. A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers
๐ฌ 16
โญ 5.00
๐ generative models
๐ท๏ธ optimal transport semi-dual optimal transport statistical learning theory approximation bounds
Neural network-based optimal transport (OT) is a recent and fruitful direction in the generative modeling community. It finds its applications in various fields such as domain translation, image super-resolution, computational biology and others. Among the existing OT approaches, of considerable interest are adversarial minimax solvers based on semi-dual formulations of OT problems. While promising, these methods lack theoretical investigation from a statistical learning perspective. Our work fills this gap by establishing upper bounds on the generalization error of an approximate OT map recovered by the minimax quadratic OT solver. Importantly, the bounds we derive depend solely on some standard statistical and mathematical properties of the considered functional classes (neural nets). While our analysis focuses on the quadratic OT, we believe that similar bounds could be derived for the general OT case, paving a promising direction for future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3450. LD-EnSF: Synergizing Latent Dynamics with Ensemble Score Filters for Fast Data Assimilation with Sparse Observations
๐ฌ 14
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Data Assimilation Latent Dynamics Physical Models
Data assimilation techniques are crucial for accurately tracking complex dynamical systems by integrating observational data with numerical forecasts. Recently, score-based data assimilation methods emerged as powerful tools for high-dimensional and nonlinear data assimilation. However, these methods still incur substantial computational costs due to the need for expensive forward simulations. In this work, we propose LD-EnSF, a novel score-based data assimilation method that fully eliminates the need for full-space simulations by evolving dynamics directly in a compact latent space. Our method incorporates improved Latent Dynamics Networks (LDNets) to learn accurate surrogate dynamics and introduces a history-aware LSTM encoder to effectively process sparse and irregular observations. By operating entirely in the latent space, LD-EnSF achieves speedups of orders of magnitude over existing methods while maintaining high accuracy and robustness. We demonstrate the effectiveness of LD-EnSF on several challenging high-dimensional benchmarks with highly sparse (in both space and time) and noisy observations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3451. CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM-based Agent Multi-agent System
Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3452. ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Reasoning Large Vision-Language Models (LVLMs) Agentic Data Synthesis Multi-Agent ReAct Reinforcement Learning with Verifiable Reward (RLVR) Chain-of-Thought (CoT)
While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like "re-watching" process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation \& Reasoning (O\&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks, substantially outperforming models trained on all other open-source datasets. We also provide crucial insights into the training dynamics of SFT and RL for complex video reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3453. SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration
๐ฌ 4
โญ 5.00
๐ optimization
๐ท๏ธ convex optimization stochastic optimization adaptive optimization gradient methods accelerated methods
In this paper, we revisit stochastic gradient descent (SGD) with AdaGrad-type preconditioning. Our contributions are twofold. First, we develop a unified convergence analysis of SGD with adaptive preconditioning under anisotropic or matrix smoothness and noise assumptions. This allows us to recover state-of-the-art convergence results for several popular adaptive gradient methods, including AdaGrad-Norm, AdaGrad, and ASGO/One-sided Shampoo. In addition, we establish the fundamental connection between two recently proposed algorithms, Scion and DASGO, and provide the first theoretical guarantees for the latter. Second, we show that the convergence of methods like AdaGrad and DASGO can be provably accelerated beyond the best-known rates using Nesterov momentum. Consequently, we obtain the first theoretical justification that AdaGrad-type algorithms can simultaneously benefit from both diagonal preconditioning and momentum, which may provide an ultimate explanation for the practical efficiency of Adam.
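As one concrete instance of the algorithms discussed, the sketch below runs diagonal AdaGrad preconditioning with Nesterov-style momentum on a noisy least-squares problem. The objective, step sizes, and momentum coefficient are illustrative choices, not the paper's experimental setup.

```python
# Sketch: diagonal AdaGrad preconditioning with Nesterov-style momentum.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)

def grad(x):                      # least-squares gradient with small noise
    return A.T @ (A @ x - b) / len(b) + 0.01 * rng.normal(size=x.shape)

x = np.zeros(20)
v = np.zeros(20)                  # momentum buffer
G = np.zeros(20)                  # accumulated squared gradients
lr, beta, eps = 0.1, 0.9, 1e-8
for t in range(500):
    g = grad(x + beta * v)        # Nesterov: gradient at the look-ahead point
    G += g ** 2
    precond_g = g / (np.sqrt(G) + eps)   # AdaGrad diagonal preconditioner
    v = beta * v - lr * precond_g
    x = x + v
print(0.5 * np.mean((A @ x - b) ** 2))
```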
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3454. RegionReasoner: Region-Grounded Multi-Round Visual Reasoning
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large vision-language models multi-round visual reasoning
Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3455. Towards Anomaly-Aware Pre-Training and Fine-Tuning for Graph Anomaly Detection
๐ฌ 26
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Anomaly Detection Graph Pre-Training Self-Supervised Learning
Graph anomaly detection (GAD) has garnered increasing attention in recent years, yet remains challenging due to two key factors: (1) label scarcity stemming from the high cost of annotations and (2) homophily disparity at node and class levels. In this paper, we introduce Anomaly-Aware Pre-Training and Fine-Tuning (APF), a targeted and effective framework to mitigate the above challenges in GAD. In the pre-training stage, APF incorporates node-specific subgraphs selected via the Rayleigh Quotient, a label-free anomaly metric, into the learning objective to enhance anomaly awareness. It further introduces two learnable spectral polynomial filters to jointly learn dual representations that capture both general semantics and subtle anomaly cues. During fine-tuning, a gated fusion mechanism adaptively integrates pre-trained representations across nodes and dimensions, while an anomaly-aware regularization loss encourages abnormal nodes to preserve more anomaly-relevant information. Furthermore, we theoretically show that APF tends to achieve linear separability under mild conditions. Comprehensive experiments on 10 benchmark datasets validate the superior performance of APF in comparison to state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3456. Variance-Dependent Regret Lower Bounds for Contextual Bandits
๐ฌ 14
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Bandit Reinforcement Learning
Variance-dependent regret bounds for linear contextual bandits, which improve upon the classical $\tilde{O}(d\sqrt{K})$ regret bound to $\tilde{O}(d\sqrt{\sum_{k=1}^K\sigma_k^2})$, where $d$ is the context dimension, $K$ is the number of rounds, and $\sigma^2_k$ is the noise variance in round $k$, have been widely studied in recent years. However, most existing works focus on regret upper bounds instead of lower bounds. To our knowledge, the only lower bound is from Jia et al. (2024), which proved that for any eluder dimension $d_{\textbf{elu}}$ and total variance budget $\Lambda$, there exists an instance with $\sum_{k=1}^K\sigma_k^2\leq \Lambda$ for which any algorithm incurs a variance-dependent lower bound of $\Omega(\sqrt{d_{\textbf{elu}}\Lambda})$. However, this lower bound has a $\sqrt{d}$ gap with existing upper bounds. Moreover, it only considers a fixed total variance budget $\Lambda$ and does not apply to a general variance sequence $\{\sigma_1^2,\ldots,\sigma_K^2\}$.
In this paper, to overcome the limitations of Jia et al. (2024), we consider the general variance sequence under two settings. For a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process, we establish a variance-dependent lower bound of $\Omega(d \sqrt{\sum_{k=1}^K\sigma_k^2 }/\log K)$ for linear contextual bandits. For an adaptive sequence, where an adversary can generate the variance $\sigma_k^2$ in each round $k$ based on historical observations, we show that when the adversary must generate $\sigma_k^2$ before observing the decision set $D_k$, a similar lower bound of $\Omega(d\sqrt{ \sum_{k=1}^K\sigma_k^2} /\log^6(dK))$ holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al. 2023) up to logarithmic factors. Furthermore, if the adversary can generate the variance $\sigma_k$ after observing the decision set $D_k$, we construct a counter-example showing that it is impossible to construct a variance-dependent lower bound if the adversary properly selects variances in collaboration with the learner.
Our lower bound proofs use a novel peeling technique that groups rounds by variance magnitude. For each group, we construct separate instances and assign the learner distinct decision sets. We believe this proof technique may be of independent interest.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3457. TEST-TIME SCALING IN DIFFUSION LLMS VIA HIDDEN SEMI-AUTOREGRESSIVE EXPERTS
๐ฌ 25
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Large Language Models reasoning inference time
Diffusion-based large language models (dLLMs) are trained to model extreme flexibility/dependence in the data distribution; however, how to best utilize this at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By doing a majority vote over diverse block-sized generation paths, HEX robustly avoids failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56$\times$ (from 24.72\% to 88.10\%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX even yields significant gains on the MATH benchmark, from 16.40\% to 40.00\%, on scientific reasoning with ARC-C, from 54.18\% to 87.80\%, and on TruthfulQA, from 28.36\% to 57.46\%. Our results establish test-time scaling as a powerful principle for dLLMs, showing that the sequence in which masking is done can play a significant role in test-time scaling and inference for dLLMs.
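A minimal sketch of the voting step, with a hypothetical `generate_with_blocks` standing in for decoding a diffusion LLM under a given block schedule: answers extracted from several block sizes are aggregated by majority vote.

```python
# Sketch of HEX-style test-time ensembling over block schedules (illustrative).
from collections import Counter

def generate_with_blocks(prompt, block_size):
    # Hypothetical stand-in for a diffusion LLM decoded with a given block size.
    fake_answers = {1: "72", 2: "72", 4: "68", 8: "72", 16: "64"}
    return f"... the answer is {fake_answers[block_size]}"

def extract_answer(text):
    return text.split()[-1]

def hex_vote(prompt, block_sizes=(1, 2, 4, 8, 16)):
    answers = [extract_answer(generate_with_blocks(prompt, b)) for b in block_sizes]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count, answers

print(hex_vote("A GSM8K question ..."))   # ('72', 3, ['72', '72', '68', '72', '64'])
```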
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3458. Test-Time Scaling with Reflective Generative Model
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning
We introduce a new Reflective Generative Model (RGM), which matches OpenAI o3-mini's performance via a novel Reflective Generative Form. This form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring respectively, introducing only 50M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model (SPRM), which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, RGM is naturally suitable for test-time scaling based on the controllable thinking length. Experiments show that our RGM, equipped with only 50M additional parameters in SPRM, outperforms policy models with 72B extra reward models, thereby enabling QwQ-32B to outperform OpenAI o3-mini on AIME24 (84.2 vs. 79.6) and HMMT25 (53.1 vs. 53.0).
Code will be available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3459. Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition
๐ฌ 14
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Diffusion Policies Policy Composition Training-free
Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance ***without additional model training***. Perhaps surprisingly, we demonstrate that the composed policies can exceed the performance of either parent policy. Our contribution is threefold. First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow-matching, irrespective of their input visual modalities. Third, we provide extensive empirical validation. Experiments on Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.
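As a toy analogue of the composition step, the sketch below takes a convex combination of two analytic 1-D Gaussian scores and samples from the combined vector field with unadjusted Langevin dynamics (standing in for the actual reverse process). The combined field $w\,s_a + (1-w)\,s_b$ is the score of the normalized geometric mixture of the two densities; for equal-variance Gaussians this is simply a Gaussian whose mean interpolates between the two. All distributions and step sizes are illustrative.

```python
# Toy sketch: convex combination of two score functions, sampled with
# unadjusted Langevin dynamics (illustrative analogue of score composition).
import numpy as np

rng = np.random.default_rng(0)

def score_a(x):  # score of N(-2, 1)
    return -2.0 - x

def score_b(x):  # score of N(+3, 1)
    return 3.0 - x

def compose_and_sample(w=0.7, steps=2000, step_size=0.01, n=5000):
    x = rng.normal(size=n)
    for _ in range(steps):
        s = w * score_a(x) + (1 - w) * score_b(x)     # convex score combination
        x = x + step_size * s + np.sqrt(2 * step_size) * rng.normal(size=n)
    return x

samples = compose_and_sample()
print(samples.mean())   # close to 0.7*(-2) + 0.3*3 = -0.5
```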
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3460. DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
๐ฌ 4
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Open-Vocabulary Object Detection Knowledge Distillation Multi-modal
Open-Vocabulary Object Detection (OVOD) plays a critical role in autonomous driving and human-computer interaction by enabling perception beyond closed-set categories. However, current approaches predominantly rely on multimodal fusion, facing dual limitations: multimodal fusion methods incur heavy computational overhead from text encoders, while task-coupled designs compromise between detection precision and open-world generalization. To address these challenges, we propose Decoupled Cognition DETR, a vision framework featuring a three-stage cognitive distillation mechanism: Dynamic Hierarchical Concept Pool constructs self-evolving concept prototypes using LLaVA-generated region descriptions filtered by CLIP alignment, aiming to replace costly text encoders and reduce computational overhead; Hierarchical Knowledge Distillation decouples visual-semantic space mapping via prototype-centric projection, avoiding task coupling to enhance open-world generalization; Parametric Decoupling Training coordinates localization and cognition through dual-stream gradient isolation, further optimizing detection precision. Extensive experiments on the common OVOD evaluation protocol demonstrated that DeCo-DETR achieves state-of-the-art performance compared to existing OVOD methods. It provides a new paradigm for extending OVOD to more real-world applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3461. Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ State Space Model Mamba Graph Signal Processing Adaptive Filter Bank
State-space models (SSMs) offer efficient alternatives to attention with linear-time recurrence. Mamba2, a recent SSM-based language model, uses selective input gating and a multi-head structure, enabling parallel computation and strong benchmark performance. However, its multi-head recurrence operates independently without structured utilization or analysis. In this work, we propose a novel method called **H**ierarchical **AD**aptive filter bank for **E**fficient **S**SMs (*HADES*), a Graph Signal Processing (GSP)-inspired framework that reinterprets Mamba2 as an adaptive filter bank on a line graph. Our hierarchical architecture introduces two filter types: shared filters for global low-pass behavior and expert filters for local high-pass behavior, achieved through structured bias on the parameter $\Delta$. *HADES* achieves comparable performance to baseline models including Mamba2 across various benchmarks in language modeling, commonsense reasoning, and long-context retrieval, while using only **58.9%** of the original parameters. In this regard, *HADES* bridges GSP and neural sequence modeling, enabling efficient, hierarchical, and interpretable filtering within state-space models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3462. Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Model
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ controllable video generation training-free guidance video diffusion models
Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance method for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. By applying guidance to only a few selected frames, Frame Guidance can steer the generation of the entire video, resulting in a temporally coherent controlled video. To enable training-free guidance on large-scale video models, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, and is compatible with any model. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3463. PAMDP: Interact to Persona Alignment via a Partially Observable Markov Decision Process
๐ฌ 7
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ persona alignment interact to align partially observable markov decision process dual critic reinforcement learning
The interaction process of comprehending user-specific nuances and adapting to their preferences represents a pivotal consideration for Persona Large Language Models, as it more authentically mirrors genuine dialogue dynamics than adherence to general human value alignment. In this paper, we conceptualize this ``Interact to Persona Alignment'' challenge as a Partially Observable Markov Decision Process, abbreviated as PAMDP, wherein the user's dynamically evolving profile through interaction is treated as an unobservable variable to the assistant. Grounded in this formulation, we propose a dual-critic reinforcement learning framework, with a continuous latent-space action representing the assistant's utterance. We evaluate our approach on both offline datasets and the online simulator, ultimately demonstrating its effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3464. Kevin: Multi-Turn RL for Generating CUDA Kernels
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ multi-turn RL GPU kernel code generation
Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin the Kernel Writer, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3465. A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
๐ฌ 31
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Adaptive LLMs Deep Research Agent Reasoning
Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present the Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third instant mode that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4\% on BrowseComp, 70.4\% on AIME25, and 16.7\% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost-of-pass of only \$0.00487 per correct answer, cutting cost by 45.2\% relative to reasoning and 33.5\% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3466. Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis
๐ฌ 4
โญ 5.00
๐ optimization
๐ท๏ธ Convex Optimization Heavy-Tailed Noise Gradient Clipping
Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded $\mathfrak{p}$-th moment where $\mathfrak{p}\in\left(1,2\right]$ has been recognized to be more realistic (say being upper bounded by $\sigma_{\mathfrak{l}}^{\mathfrak{p}}$ for some $\sigma_{\mathfrak{l}}\geq0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate $\mathcal{O}(\sigma_{\mathfrak{l}}\ln(1/\delta)T^{\frac{1}{\mathfrak{p}}-1})$ (resp. $\mathcal{O}(\sigma_{\mathfrak{l}}^{2}\ln^{2}(1/\delta)T^{\frac{2}{\mathfrak{p}}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta\in\left(0,1\right]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two faster rates, $\mathcal{O}(\sigma_{\mathfrak{l}}d_{\mathrm{eff}}^{-\frac{1}{2\mathfrak{p}}}\ln^{1-\frac{1}{\mathfrak{p}}}(1/\delta)T^{\frac{1}{\mathfrak{p}}-1})$ and $\mathcal{O}(\sigma_{\mathfrak{l}}^{2}d_{\mathrm{eff}}^{-\frac{1}{\mathfrak{p}}}\ln^{2-\frac{2}{\mathfrak{p}}}(1/\delta)T^{\frac{2}{\mathfrak{p}}-2})$, than the aforementioned best results, where $d_{\mathrm{eff}}\geq1$ is a quantity we call the generalized effective dimension. Our analysis improves upon the existing approach on two sides: better utilization of Freedman's inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds.
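A minimal numpy sketch of the clipping operation the analysis studies: the stochastic gradient is rescaled to a fixed norm radius whenever it exceeds it, which keeps heavy-tailed noise (here Pareto-distributed, with finite mean but infinite variance) from destabilizing the iterates. The quadratic objective, clipping radius, and step size are illustrative.

```python
# Minimal sketch of Clipped SGD under heavy-tailed gradient noise.
import numpy as np

rng = np.random.default_rng(0)
dim, lr, clip_radius = 20, 0.05, 2.0
x = np.ones(dim) * 5.0

def stochastic_grad(x):
    heavy_noise = rng.pareto(1.5, size=dim) * rng.choice([-1, 1], size=dim)
    return x + heavy_noise            # gradient of 0.5*||x||^2 plus heavy-tailed noise

for t in range(2000):
    g = stochastic_grad(x)
    norm = np.linalg.norm(g)
    if norm > clip_radius:            # clip: rescale to the threshold radius
        g = g * (clip_radius / norm)
    x -= lr * g
print(np.linalg.norm(x))
```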
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3467. Rethinking LoRA for Privacy-Preserving Federated Learning in Large Models
๐ฌ 21
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Learning Differential Privacy LoRA.
Fine-tuning large vision models (LVMs) and large language models (LLMs) under differentially private federated learning (DPFL) is hindered by a fundamental privacy-utility trade-off. Low-Rank Adaptation (LoRA), a promising parameter-efficient fine-tuning (PEFT) method, reduces computational and communication costs by introducing two trainable low-rank matrices while freezing pre-trained weights. However, directly applying LoRA in DPFL settings leads to performance degradation, especially in LVMs. Our analysis reveals three previously underexplored challenges: (1) gradient coupling caused by the simultaneous update of two asymmetric low-rank matrices, (2) compounded noise amplification under differential privacy, and (3) sharpness of the global aggregated model in the parameter space. To address these issues, we propose LA-LoRA (\textbf{L}ocal \textbf{A}lternating \textbf{LoRA}), a novel approach that decouples gradient interactions and aligns update directions across clients to enhance robustness under stringent privacy constraints. Theoretically, LA-LoRA strengthens convergence guarantees in noisy federated environments. Extensive experiments demonstrate that LA-LoRA achieves state-of-the-art (SOTA) performance on Swin Transformer and RoBERTa models, showcasing robustness to DP noise and broad applicability across both LVMs and LLMs. For example, when fine-tuning the Swin-B model on the Tiny-ImageNet dataset under a strict privacy budget ($\epsilon = 1$), LA-LoRA outperforms the best baseline, RoLoRA, by 16.83\% in test accuracy. Code is provided in the Appendix.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3468. Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ computer vision image generation object detection dataset debiasing
This paper presents a generation-based debiasing framework for object detection. Prior debiasing methods are often limited by the representation diversity of samples, while naive generative augmentation often preserves the biases it aims to solve. Moreover, our analysis reveals that simply generating more data for rare classes is suboptimal due to two core issues: i) instance frequency is an incomplete proxy for the true data needs of a model, and ii) current layout-to-image synthesis lacks the fidelity and control to generate high-quality, complex scenes. To overcome this, we introduce the representation score (RS) to diagnose representational gaps beyond mere frequency, guiding the creation of new, unbiased layouts. To ensure high-quality synthesis, we replace ambiguous text prompts with a precise visual blueprint and employ a generative alignment strategy, which fosters communication between the detector and generator. Our method significantly narrows the performance gap for underrepresented object groups, e.g., improving large/rare instances by 4.4/3.6 mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP for layout accuracy in generated images.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3469. Agentic Collaboration as an Information Bottleneck Problem
๐ฌ 12
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ information bottleneck rate-distortion theory agentic collaboration large language models scaling laws
Agentic language model (LM) systems have rapidly become central to modern workflows, powering applications like "Deep Research" and "Claude Code." As contexts grow beyond what even the largest frontier models can process effectively, multi-LM architectures have emerged to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs distill raw context into compact text that is then consumed by larger "predictor" LMs that interact with the user. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc. Little guidance exists on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires exhaustive pairwise sweeps, which is costly and task-specific. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a "noisy channel", we introduce a simple estimator of the mutual information between the context and its compression to quantify compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors are not only more accurate but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is $1.6\times$ more accurate, $4.6\times$ more concise, and conveys $5.5\times$ more bits of mutual information per token. Across the datasets studied, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. When applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover 99% of frontier-LM accuracy at 26% of API costs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3470. Robust Preference Optimization: Aligning Language Models with Noisy Preference Feedback
๐ฌ 24
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language model alignment robustness
Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone technology for aligning Large Language Models (LLMs) with human values. However, these methods are all underpinned by a strong assumption that the collected preference data is clean and that all observed labels are equally reliable. In reality, large-scale preference datasets contain substantial label noise due to annotator errors, inconsistent instructions, varying expertise, and even adversarial or low-effort feedback. This creates a discrepancy between the recorded data and the ground-truth preferences, which can misguide the model and degrade its performance. To address this challenge, we introduce Robust Preference Optimization (RPO). RPO employs an Expectation-Maximization algorithm to infer the posterior probability of each label's correctness, which is used to adaptively re-weight each data point in the training loss to mitigate noise. We further generalize this approach by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models. This generalization enables the systematic transformation of existing alignment algorithms into their robust counterparts, elevating RPO from a specific algorithm to a general framework for robust preference alignment.
Theoretically, we prove that under the condition of a perfectly calibrated model, RPO is guaranteed to converge to the true noise level of the dataset. Our experiments demonstrate RPO's effectiveness as a general framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). When applied to Mistral and Llama 3 models, the RPO-enhanced methods improve AlpacaEval 2 win rates by up to 7.0 percentage points over their respective baselines.
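A toy numpy sketch of the EM-style re-weighting idea on a logistic (Bradley-Terry) preference model, under assumed functional forms: the E-step computes the posterior probability that each observed label is correct given the current model and an estimated flip rate, the M-step fits a label-posterior-weighted logistic regression, and the flip-rate estimate is refreshed from the posteriors. This illustrates the mechanism, not the paper's loss.

```python
# Toy EM sketch for noisy pairwise preferences (Bradley-Terry reward model).
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))                    # feature difference of (chosen - rejected)
p_true = 1 / (1 + np.exp(-X @ w_true))
y = (rng.random(n) < p_true).astype(float)     # clean labels
flip = rng.random(n) < 0.2                     # 20% label noise
y[flip] = 1 - y[flip]

w, eps = np.zeros(d), 0.3                      # model weights, noise-rate estimate
for it in range(30):
    p = 1 / (1 + np.exp(-X @ w))
    p_obs = np.where(y == 1, p, 1 - p)         # model prob of the observed label
    post = (1 - eps) * p_obs / ((1 - eps) * p_obs + eps * (1 - p_obs))  # E-step
    for _ in range(20):                        # M-step: weighted logistic regression
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (post * (p - y)) / n
        w -= 1.0 * grad
    eps = float(np.mean(1 - post))             # update noise-rate estimate
print(eps, np.corrcoef(w, w_true)[0, 1])
```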
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3471. KRAMABENCH: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
๐ฌ 27
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ data science agents data science data wrangling data analysis data management reasoning agentic systems
Discovering insights from a real-world data lake potentially containing unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their abilities to design and execute complex pipelines that solve these data-lake-to-insight challenges remain unclear. We introduce KramaBench, which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end-to-end capabilities of AI systems to solve challenges which require automated orchestration of different data tasks. KramaBench also features a comprehensive evaluation framework assessing the pipeline design and individual data task implementation abilities of AI systems. Evaluating 8 LLMs with our single-agent reference framework DS-Guru, alongside open- and closed-source agentic systems, we find that while current single-agent systems may handle isolated data-science tasks and generate plausible draft pipelines, they struggle with producing working end-to-end pipelines. On KramaBench, the best system reaches only 50% end-to-end accuracy in the full data-lake setting. Even with perfect retrieval, the accuracy tops out at 59%. Leading LLMs can identify up to 42% of important data tasks but can only fully implement 20% of individual data tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3472. Learning to Reason Efficiently with Discounted Reinforcement Learning
๐ฌ 18
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning reasoning blackwell optimality post training
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing the reasoning tokens using a discounted reinforcement-learning setup (interpretable as a small per-token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning; in practice we discount only the environment (correctness) reward. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
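The sketch below shows the core accounting: when only the terminal correctness reward is discounted, a correct answer reached in fewer reasoning tokens earns a strictly larger return, which acts like an implicit per-token cost of roughly $-\log\gamma$. The discount factor and lengths are illustrative.

```python
# Sketch: discounting only the terminal correctness reward penalizes length.
def discounted_return(num_tokens, correct, gamma=0.999):
    # The only environment reward is correctness at the final token,
    # seen from the first token after num_tokens steps of discounting.
    return (gamma ** num_tokens) * float(correct)

for n in [100, 500, 2000]:
    print(n, round(discounted_return(n, correct=True), 4))
# A shorter correct answer earns a strictly higher return than a longer one,
# acting like a small per-token cost of roughly -log(gamma) per reasoning token.
```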
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3473. EgoBrain: Synergizing Minds and Eyes For Human Action Understanding
๐ฌ 23
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Electroencephalography (EEG) First Person Vision(Egocentric Vision) Human Action Understanding
The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models has brought new possibilities that have never been imagined before. Here, we present EgoBrain, the world's first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of the human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a multimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interfaces with multiple modalities. All data, tools, and acquisition protocols, together with the source code, are openly shared to foster open science in cognitive computing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3474. Multiple-Prediction-Powered Inference
๐ฌ 13
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Prediction-powered inference Statistical estimation Model evaluation LLM as judge PPI
A core challenge in modern AI model development is obtaining high-quality evaluation metrics in a cost-effective way. Such evaluation often involves tradeoffs between expensive, high-quality measurements and a variety of lower-quality proxies. We introduce Multiple-Prediction-Powered Inference (MultiPPI), a general framework for constructing statistically efficient estimates by optimally allocating resources across these diverse data sources. We provide theoretical guarantees about the minimax optimality, finite-sample performance, and asymptotic normality of the MultiPPI estimator. Through experiments across three diverse large language model (LLM) evaluation scenarios, we show that MultiPPI consistently achieves lower estimation error than existing baselines. This advantage stems from its budget-adaptive allocation strategy, which strategically combines subsets of models by learning their complex cost and correlation structures.
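A hedged sketch of the general idea using classical control variates, not the paper's estimator: several cheap proxy judges are combined with a small gold-labeled subset, with weights fit on the labeled data so that proxy means computed on the large unlabeled pool correct the naive labeled-only mean. Dataset sizes, noise levels, and the least-squares weighting are illustrative assumptions.

```python
# Sketch: multi-proxy mean estimation with control variates (PPI-style).
import numpy as np

rng = np.random.default_rng(0)
N_unlab, n_lab = 20000, 300
quality = rng.beta(4, 2, size=N_unlab + n_lab)            # latent true metric
proxies = np.stack([quality + rng.normal(0, s, size=quality.size)
                    for s in (0.05, 0.15, 0.3)], axis=1)  # three cheap judges
gold = quality[:n_lab]                                     # expensive labels on a subset
f_lab, f_unlab = proxies[:n_lab], proxies[n_lab:]

# Fit weights on the labeled set so the proxies best explain the gold metric.
Xc = f_lab - f_lab.mean(axis=0)
w, *_ = np.linalg.lstsq(Xc, gold - gold.mean(), rcond=None)

# Control-variate estimate: labeled mean, corrected by proxy means on unlabeled data.
theta_hat = gold.mean() + (f_unlab.mean(axis=0) - f_lab.mean(axis=0)) @ w
print("classical:", gold.mean(), "multi-proxy:", theta_hat, "truth:", quality.mean())
```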
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3475. Self-Rewarding Vision-Language Model via Reasoning Decomposition and Multi-Reward Policy Optimization
๐ฌ 27
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ machine learning vision-language models deep learning reinforcement learning
Vision-Language Models (VLMs) often suffer from visual hallucinations (generating things that are not consistent with visual inputs) and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision, via reinforcement learning and Multi-Reward Policy Optimization. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM model is then re-prompted to perform language reasoning using only the generated perception as input to compute a reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. In addition, Multi-Reward Policy Optimization separately computes advantages and log probabilities for both the visual reasoning reward and the answer accuracy reward. The method then calculates KL divergence regularization and the actor loss using the combined sum of these two reward components. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3476. Adaptive Conformal Prediction via Mixture-of-Experts Gating Similarity
๐ฌ 21
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Conformal prediction; Mixture of Experts; Distribution-free inference.
Prediction intervals are essential for applying machine learning models in real applications, yet most conformal prediction (CP) methods provide coverage guarantees that overlook the heterogeneity and domain knowledge that characterize modern multimodal datasets. We introduce Mixture-of-Experts Conformal Prediction (MoE-CP), a flexible and scalable framework that uses the gating probability vectors of Mixture-of-Experts (MoE) models as soft domain assignments to guide similarity-weighted conformal calibration. MoE-CP weights calibration residuals according to the similarity between gating vectors of calibration and test points, producing prediction intervals that adapt to latent subpopulations without requiring explicit domain labels. We provide theoretical justification showing that MoE-CP preserves nominal marginal validity under common similarity measures and improves conditional adaptivity when the gating captures domain structure. Empirical results on synthetic and real-world datasets demonstrate that MoE-CP yields more domain-aware, interpretable, and often tighter intervals than existing conformal baselines while maintaining target coverage. MoE-CP offers a practical route to reliable uncertainty quantification in latent heterogeneous, multi-domain environments.
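To make the calibration step concrete, here is a minimal sketch of a similarity-weighted split-conformal interval driven by MoE gating vectors. The exponential distance kernel, the temperature, and the weighted-quantile details are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def moe_cp_interval(pred, gate_test, gates_cal, residuals_cal, alpha=0.1, temp=0.1):
    """Prediction interval around `pred`, weighting each calibration residual by
    how similar its MoE gating vector is to the test point's gating vector."""
    d = np.linalg.norm(gates_cal - gate_test, axis=1)     # distance between soft domain assignments
    w = np.exp(-d / temp)
    w /= w.sum()
    order = np.argsort(residuals_cal)                     # weighted (1 - alpha) quantile of residuals
    cum = np.cumsum(w[order])
    idx = min(int(np.searchsorted(cum, 1.0 - alpha)), len(order) - 1)
    q = residuals_cal[order][idx]
    return pred - q, pred + q

# Toy usage with random gating vectors and calibration residuals.
rng = np.random.default_rng(1)
gates_cal = rng.dirichlet(np.ones(4), size=500)           # soft domain assignments from the gate
residuals_cal = np.abs(rng.normal(0.0, 1.0, size=500))
print(moe_cp_interval(pred=2.0, gate_test=gates_cal[0],
                      gates_cal=gates_cal, residuals_cal=residuals_cal))
```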
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3477. $\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
๐ฌ 17
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Learned Optimizer Meta Generalization MuP Maximal Update Parameterization
Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($\mu$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $\mu$-parameterized LOs ($\mu$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that $\mu$LOs exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ meta-training) and surprising generalization to much longer training horizons ($25\times$ meta-training) when compared to SP LOs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3478. Learning Correlated Reward Models: Statistical Barriers and Opportunities
๐ฌ 13
โญ 5.00
๐ generative models
๐ท๏ธ reward model RLHF choice model random utility model
Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses \emph{all} human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a \emph{correlated} probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is \emph{fundamentally insufficient} to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that \emph{best-of-three} preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3479. Learning Mixtures of Linear Dynamical Systems (MoLDS) via Hybrid Tensor-EM Method
๐ฌ 12
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Mixture of linear dynamical systems Tensor-based moment method Expectation-Maximization Latent dynamical systems Neural data analysis
Linear dynamical systems (LDSs) have been powerful tools for modeling high-dimensional time-series data across many domains, including neuroscience. However, a single LDS often struggles to capture the heterogeneity of neural data, where trajectories recorded under different conditions can have variations in their dynamics. Mixtures of linear dynamical systems (MoLDS) provide a path to model these variations in temporal dynamics for different observed trajectories.
However, MoLDS remains difficult to apply in complex and noisy settings, limiting its practical use in neural data analysis. Tensor-based moment methods can provide global identifiability guarantees for MoLDS, but their performance degrades under realistic noise and complexity. Commonly used expectation-maximization (EM) methods offer flexibility in fitting latent models but are highly sensitive to initialization and prone to poor local minima. Here, we propose a tensor-based moment method that provides identifiability guarantees for learning MoLDS, which can be followed by EM updates to combine the strengths of both approaches. The novelty in our approach lies in the construction of moment tensors using the input-output data, on which we then apply Simultaneous Matrix Diagonalization (SMD) to recover globally consistent estimates of mixture weights and system parameters. These estimates can then be refined through a full Kalman filter-smoother EM algorithm, with closed-form updates for all LDS parameters. We validate our framework on synthetic benchmarks and real-world datasets. On synthetic data, the proposed Tensor-EM method achieves more reliable recovery and improved robustness compared to either pure tensor or randomly initialized EM methods.
We then analyze neural recordings from the primate somatosensory cortex while a non-human primate performs reaches in different directions. In this setting, our method successfully models and clusters different conditions as separate subsystems, which is consistent with supervised single-LDS fits for each condition. Finally, we apply this approach to another neural dataset where monkeys perform a sequential reaching task. These results demonstrate that MoLDS provides an effective framework for modeling complex neural data in different brain regions, and that Tensor-EM is a principled and reliable approach to MoLDS learning for these applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3480. VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Reasoning Large Language Models Emotion Recognition Vowel
Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability.
Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3481. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
๐ฌ 12
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Offline RL Diffusion Model Out-of-Distribution (OOD) Detection
Offline reinforcement learning (RL) faces a critical challenge of overestimating the value of out-of-distribution (OOD) actions.
Existing methods mitigate this issue by penalizing unseen samples, yet they fail to accurately identify OOD actions and may suppress beneficial exploration beyond the behavioral support.
Although several methods have been proposed to differentiate OOD samples with distinct properties, they typically rely on restrictive assumptions about the data distribution and remain limited in discrimination ability.
To address this problem, we propose $\textbf{DOSER}$ ($\textbf{D}$iffusion-based $\textbf{O}$OD Detection and $\textbf{SE}$lective $\textbf{R}$egularization), a novel framework that goes beyond uniform penalization.
DOSER trains two diffusion models to capture the behavior policy and state distribution, using single-step denoising reconstruction error as a reliable OOD indicator.
During policy optimization, it further distinguishes between beneficial and detrimental OOD actions by evaluating predicted transitions, selectively suppressing risky actions while encouraging exploration of high-potential ones.
Theoretically, we prove that DOSER is a $\gamma$-contraction and therefore admits a unique fixed point with bounded value estimates.
We further provide an asymptotic performance guarantee relative to the optimal policy under model approximation and OOD detection errors.
Across extensive offline RL benchmarks, DOSER consistently attains superior performance to prior methods, especially on suboptimal datasets.
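For intuition, the OOD indicator can be pictured as below: corrupt a candidate action once, denoise it once with the behavior-policy diffusion model, and use the reconstruction error as the score. The corruption scheme, the noise level, and the toy denoiser are illustrative assumptions standing in for the paper's trained models.

```python
import torch

@torch.no_grad()
def single_step_recon_error(denoiser, actions, t=0.1):
    """OOD score: noise the action once, denoise once, measure reconstruction error.
    Actions well covered by the behavior data reconstruct accurately; OOD actions do not.
    `denoiser(x_noisy, t)` is assumed to predict the clean sample at noise level t."""
    noise = torch.randn_like(actions)
    noisy = (1.0 - t) * actions + t * noise            # simple interpolation-style corruption
    recon = denoiser(noisy, t)
    return ((recon - actions) ** 2).mean(dim=-1)       # per-action reconstruction error

# Identity stand-in; a denoiser trained on the behavior data would map noisy
# in-support actions back toward the data manifold, keeping their scores low.
toy_denoiser = lambda x, t: x
print(single_step_recon_error(toy_denoiser, torch.randn(8, 6)))
```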
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3482. Local Geometry Attention for Time Series Forecasting under Realistic Corruptions
๐ฌ 29
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Local Geometry Local Gaussian Process Transformer Architecture Time Series Analysis Corruption Benchmark
Transformers have demonstrated strong performance in time series forecasting, yet they often fail to capture the intrinsic structure of temporal data, making them susceptible to real-world noise and anomalies. Unlike in vision or language, the local geometry of temporal patterns is a critical feature in time series forecasting, but it is frequently disrupted by corruptions.
In this work, we address this gap with two key contributions. First, we propose Local Geometry Attention (LGA), a novel attention mechanism theoretically grounded in local Gaussian process theory. LGA adapts to the intrinsic data geometry by learning query-specific distance metrics, enabling it to model complex temporal dependencies and enhance resilience to noise. Second, we introduce TSRBench, the first comprehensive benchmark for evaluating forecasting robustness under realistic, statistically-grounded corruptions.
Experiments on TSRBench show that LGA significantly reduces performance degradation, consistently outperforming both Transformer-based and linear models. These results establish a foundation for developing robust time series models that can be deployed in real-world applications where data quality is not guaranteed. Our code is available at: https://anonymous.4open.science/r/LGA-5454.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3483. A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models
๐ฌ 24
โญ 5.00
๐ learning theory
๐ท๏ธ Training Dynamics Feature Learning Generalization Mamba Learning Theory Selective State Space Models
The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via SGD. Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and concentration-type sequences. We prove that the model achieves guaranteed generalization by establishing asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases.
Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on both synthetic and real-world data justify our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3484. Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization
๐ฌ 23
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Vision-Language Models Safety Robustness Adversarial Machine Learning
Vision-Language Models (VLMs) have become essential for tasks such as image synthesis, captioning, and retrieval by aligning textual and visual information in a shared embedding space. Yet, this flexibility also makes them vulnerable to malicious prompts designed to produce unsafe content, raising critical safety concerns. Existing defenses either rely on blacklist filters, which are easily circumvented, or on heavy classifier-based systems, both of which are costly and fragile under embedding-level attacks.
We address these challenges with two complementary components: Hyperbolic Prompt Espial (HyPE) and Hyperbolic Prompt Sanitization (HyPS). HyPE is a lightweight anomaly detector that leverages the structured geometry of hyperbolic space to model benign prompts and detect harmful ones as outliers. HyPS builds on this detection by applying explainable attribution methods to identify and selectively modify harmful words, neutralizing unsafe intent while preserving the original semantics of user prompts.
Through extensive experiments across multiple datasets and adversarial scenarios, we show that our framework consistently outperforms prior defenses in both detection accuracy and robustness. Together, HyPE and HyPS offer an efficient, interpretable, and resilient approach to safeguarding VLMs against malicious prompt misuse.
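For a sense of the geometry involved, below is a minimal Poincaré-ball outlier score of the kind a detector like HyPE could build on: embed prompts, map them into the ball with an exponential map at the origin, and flag prompts whose hyperbolic distance to the benign centroid exceeds a calibrated quantile. The embedding stand-ins, the Euclidean centroid, and the threshold are assumptions.

```python
import numpy as np

def to_poincare_ball(x, scale=0.5):
    """Exponential map at the origin: tanh keeps points strictly inside the unit ball."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9
    return np.tanh(scale * norm) * x / norm

def poincare_dist(u, v):
    """Hyperbolic distance on the Poincare ball (curvature -1)."""
    diff = np.sum((u - v) ** 2, axis=-1)
    denom = (1 - np.sum(u ** 2, axis=-1)) * (1 - np.sum(v ** 2, axis=-1))
    return np.arccosh(1 + 2 * diff / np.maximum(denom, 1e-12))

# "Fit" on benign prompt embeddings, then score new prompts by distance to the benign centroid.
rng = np.random.default_rng(2)
benign = to_poincare_ball(rng.normal(0, 1, size=(1000, 32)))      # stand-in for benign prompt embeddings
centroid = benign.mean(axis=0)                                    # cheap stand-in for the Frechet mean
threshold = np.quantile(poincare_dist(benign, centroid), 0.99)

suspicious = to_poincare_ball(rng.normal(3, 1, size=(5, 32)))     # shifted embeddings -> outliers
print(poincare_dist(suspicious, centroid) > threshold)
```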
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3485. Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction
๐ฌ 22
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ dynamic Gaussian Splatting uncertainty estimation 4D reconstruction graph model
Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our approach estimates time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints. Our code, dataset, and model will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3486. Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
๐ฌ 15
โญ 5.00
๐ generative models
๐ท๏ธ Novel view synthesis diffusion model
We introduce a diffusion-based framework that generates aligned novel-view images and geometries via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between the generated image and geometry, we propose cross-modal attention instillation, where the attention maps from an image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating both geometrically robust image synthesis and geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and preventing erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis, delivers competitive reconstruction under interpolation settings, and produces geometrically aligned point clouds for 3D completion.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3487. CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition
๐ฌ 11
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ distributionally robust optimization deep learning robustness speech recognition
Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss not only scales with input length but also varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the diverse ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and, while motivated by multilingual ASR, offers the potential for reducing group disparities in other domains with similar challenges.
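To see the shape of the fix, here is a toy group-DRO weight update with an added smoothing step. The exponentiated-gradient form is standard group DRO; the specific uniform-mixing smoothing and the constants are assumptions standing in for CTC-DRO's actual update rule, and the length-matched batching is not shown.

```python
import numpy as np

def smoothed_group_weights(weights, group_losses, eta=0.1, smooth=0.2):
    """Group-DRO-style multiplicative update with smoothing.
    Plain group DRO upweights groups in proportion to exp(eta * loss); mixing the
    result with the uniform distribution keeps a language whose CTC loss is
    persistently large from absorbing all of the weight."""
    w = weights * np.exp(eta * np.asarray(group_losses, dtype=float))
    w /= w.sum()
    n = len(w)
    return (1.0 - smooth) * w + smooth * np.ones(n) / n

# Toy run: language 2 has a persistently higher CTC loss than the others.
w = np.ones(5) / 5
for _ in range(50):
    w = smoothed_group_weights(w, [1.0, 1.1, 3.0, 0.9, 1.2])
print(np.round(w, 3))    # language 2 is upweighted but never takes all the mass
```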
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3488. ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks
๐ฌ 27
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Graph Neural Network Counterfactual Explanation Adversarial Attack
Counterfactual explanations offer an intuitive way to interpret graph neural networks (GNNs) by identifying minimal changes that alter a model's prediction, thereby answering "what must differ for a different outcome?". In this work, we propose a novel framework, ATEX-CF, that unifies adversarial attack techniques with counterfactual explanation generation, a connection made feasible by their shared goal of flipping a node's prediction even though they differ in perturbation strategy: adversarial attacks often rely on edge additions, while counterfactual methods typically use deletions. Unlike traditional approaches that treat explanation and attack separately, our method efficiently integrates both edge additions and deletions, grounded in theory, leveraging adversarial insights to explore impactful counterfactuals. In addition, by jointly optimizing fidelity, sparsity, and plausibility under a constrained perturbation budget, our method produces instance-level explanations that are both informative and realistic. Experiments on synthetic and real-world node classification benchmarks demonstrate that ATEX-CF generates faithful, concise, and plausible explanations, highlighting the effectiveness of integrating adversarial insights into counterfactual reasoning for GNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3489. Agentic Reinforcement Learning with Implicit Step Rewards
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reinforcement learning large language model agents process reward
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL) that reason and act in interactive environments.
However, sparse and sometimes unverifiable rewards make it extremely challenging to assign credit when training LLM agents that serve as a policy.
Recent work attempts to integrate process supervision into RL but suffers from biased annotation, reward hacking, high variance from overly fine-grained rewards, or failures when state overlap is rare.
We therefore introduce implicit step rewards for agentic RL (**iStar**), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms without relying on additional rollouts or explicit step labels.
Specifically, we alternately optimize an implicit process reward model (PRM) and the policy model to generate step rewards for each action via a multi-turn DPO objective. Theoretical analysis shows that this learning objective produces a step-wise reward function learned from trajectory preferences.
Then the implicit step rewards are used to compute step-level advantages, which are combined with trajectory (or episode)-level advantages for policy updates, creating a self-reinforcing training loop.
We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
Crucially, our method shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and training stability.
Further analysis also demonstrates that **iStar** explores efficiently, increasing rewards at both the step and episode level while requiring fewer steps to achieve task success.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3490. One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image
๐ฌ 24
โญ 5.00
๐ generative models
๐ท๏ธ scene generation multi-view diffusion feedforward Gaussian Splatting
Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce One2Scene, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation.
We first use a panorama generator to produce anchor views from a single input image as initialization. Then, we lift these 2D anchors into an explicit 3D geometric scaffold via a generalizable, feed-forward Gaussian Splatting network.
Instead of treating the panorama as a single image for reconstruction, we project it into multiple sparse anchor views and reformulate the reconstruction task as multi-view stereo matching, which allows us to leverage robust geometric priors learned from large-scale multi-view datasets.
A bidirectional feature fusion module is used to enforce cross-view consistency, yielding an efficient and geometrically reliable scaffold.
Finally, the scaffold serves as a strong prior for a novel view generator to produce photorealistic and geometrically accurate views at arbitrary camera poses. By explicitly conditioning on a 3D-consistent scaffold to perform reconstruction, One2Scene works stably under large camera motions, supporting immersive scene exploration. Extensive experiments show that One2Scene substantially outperforms state-of-the-art methods in panorama depth estimation, feed-forward 360° reconstruction, and explorable 3D scene generation. Code and models will be released. An anonymous project page can be found at: https://one2scene5406.github.io
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3491. Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ language models reinforcement learning
Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs) has achieved remarkable progress in enhancing LLMs' reasoning capabilities on tasks with clear correctness criteria, such as mathematical reasoning tasks. Several training metrics, such as entropy or response length, have been observed to correlate with different reasoning behaviors in reinforcement learning. Prior approaches incorporate such priors through reward or advantage shaping, which often relies on hand-crafted penalties and preferences (e.g., higher-is-better or lower-is-better). However, without careful hyper-parameter tuning, these directional priors can be overly biased and may lead to failure. To this end, we introduce ***C****onditional adv****AN****tage estimati****ON*** (***CANON***), amplifying the impact of the target metric without presuming its direction. Specifically, *CANON* regroups the sampled responses into two groups based on the higher or lower value of a target metric, measures which metric trend contributes to better performance through inter-group comparison, and identifies the better response within the same group. In summary, *CANON* based on entropy consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, *CANON* further improves token efficiency, yielding a more favorable Pareto frontier in the performance-cost trade-off.
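A small sketch of what the conditional advantage computation could look like for one prompt's sampled responses. Splitting at the median of the target metric, and the particular way the inter-group and intra-group terms are weighted and summed, are assumptions rather than the paper's exact estimator.

```python
import numpy as np

def canon_advantages(rewards, metric, w_inter=1.0, w_intra=1.0):
    """Conditional advantage estimation, schematically.
    Responses are regrouped into a high- and a low-metric group (e.g. by entropy);
    the inter-group term credits whichever metric trend yields better rewards on
    this prompt, and the intra-group term ranks responses within the same group."""
    rewards = np.asarray(rewards, dtype=float)
    metric = np.asarray(metric, dtype=float)
    high = metric >= np.median(metric)
    adv = np.zeros_like(rewards)
    for grp in (high, ~high):
        if grp.any():
            inter = rewards[grp].mean() - rewards.mean()    # does this metric trend help overall?
            intra = rewards[grp] - rewards[grp].mean()      # which response within the group is best?
            adv[grp] = w_inter * inter + w_intra * intra
    return adv

rewards = [1, 0, 1, 1, 0, 0]
entropy = [0.2, 0.9, 0.3, 0.1, 0.8, 0.7]    # here the low-entropy responses happen to be correct
print(canon_advantages(rewards, entropy))
```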
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3492. WATS: Wavelet-Aware Temperature Scaling for Reliable Graph Neural Networks
๐ฌ 22
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Neural Network Calibration
Graph Neural Networks (GNNs) have demonstrated strong predictive performance on relational data; however, their confidence estimates often misalign with actual predictive correctness, posing significant limitations for deployment in safety-critical settings. While existing graph-aware calibration methods seek to mitigate this limitation, they primarily depend on coarse one-hop statistics, such as neighbor-predicted confidence, or latent node embeddings, thereby neglecting the fine-grained structural heterogeneity inherent in graph topology. In this work, we propose Wavelet-Aware Temperature Scaling (WATS), a post-hoc calibration framework for node classification that assigns node-specific temperatures based on tunable heat-kernel graph wavelet features. Specifically, WATS harnesses the scalability and topology sensitivity of graph wavelets to refine confidence estimates, all without necessitating model retraining or access to neighboring logits or predictions. Extensive evaluations across nine benchmark datasets with varying graph structures and three GNN backbones demonstrate that WATS achieves the lowest Expected Calibration Error (ECE) among most of the compared methods, outperforming both classical and graph-specific baselines by up to 41.2\% in ECE and reducing calibration variance by 33.17\% on average compared with graph-specific methods. Moreover, WATS remains computationally efficient, scaling well across graphs of diverse sizes and densities. The implementation is available at \url{https://anonymous.4open.science/status/WATS-057A}
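A rough sketch of the two ingredients: per-node heat-kernel wavelet descriptors at a few diffusion scales, and a map from those descriptors to positive, node-specific temperatures. The summary statistics (kernel diagonal and row entropy) and the softplus linear map are assumptions; the paper's exact wavelet features and post-hoc fitting are not reproduced here.

```python
import numpy as np
from scipy.linalg import expm

def heat_wavelet_features(adj, scales=(0.5, 1.0, 2.0)):
    """Per-node heat-kernel wavelet descriptors at several diffusion scales.
    For each scale s, exp(-s * L) describes how heat placed on a node spreads
    through the graph; its diagonal and row entropy summarize the node's local
    structural role (the choice of summary statistics is an assumption)."""
    deg = adj.sum(1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-9)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    feats = []
    for s in scales:
        psi = expm(-s * lap)
        p = np.abs(psi) / np.abs(psi).sum(1, keepdims=True)
        entropy = -(p * np.log(p + 1e-12)).sum(1)
        feats.append(np.stack([np.diag(psi), entropy], axis=1))
    return np.concatenate(feats, axis=1)                     # (num_nodes, 2 * len(scales))

def node_temperatures(feats, w, b):
    """Map wavelet features to positive, node-specific temperatures (softplus)."""
    return np.log1p(np.exp(feats @ w + b))

# Toy 4-node path graph; w and b would be fit post hoc on a validation split.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
feats = heat_wavelet_features(adj)
print(node_temperatures(feats, w=np.full(feats.shape[1], 0.1), b=0.5))
```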
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3493. Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models
๐ฌ 24
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Text-to-Image Generative Evaluation Spatial Intelligence
Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and 10 corresponding multiple-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond simple evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuned results on current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yield consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic effects in spatial relations, highlighting a data-centric paradigm to achieve spatial intelligence in T2I models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3494. VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning
๐ฌ 23
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Reasoning Segmentation; Reinforcement Learning; Multi-modal Large Language Models; Visual Perception
Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing a unified reward mechanism and multi-object cognitive learning strategies, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks within a unified model. VisionReasoner generates a structured reasoning process before delivering the desired outputs in response to user queries. Human evaluation reveals that the reasoning process of VisionReasoner is faithful and reliable even without annotated reasoning training data. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming the baseline Qwen2.5VL by relative margins of 29.1\% on COCO (detection), 22.1\% on ReasonSeg (segmentation), and 15.3\% on CountBench (counting).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3495. Graph Representational Learning: When Does More Expressivity Hurt Generalization?
๐ฌ 16
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Neural Networks Generalization Expressivity PAC-Bayes
Graph Neural Networks (GNNs) are powerful tools for learning on structured data, yet the relationship between their expressivity and predictive performance remains unclear. We introduce a family of pseudometrics that capture different degrees of structural similarity between graphs and relate these similarities to generalization, and consequently, the performance of expressive GNNs. By considering a setting where graph labels are correlated with structural features, we derive generalization bounds that depend on the distance between training and test graphs, model complexity, and training set size. These bounds reveal that more expressive GNNs may generalize worse unless their increased complexity is balanced by a sufficiently large training set or reduced distance between training and test graphs. Our findings relate expressivity and generalization, offering theoretical insights supported by empirical results.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3496. QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Streaming Video Understanding Query-Aware Pruning Proactive Response
The increasing demand for real-time interaction in online video scenarios necessitates a new class of efficient streaming video understanding models. However, existing approaches often rely on a flawed, query-agnostic ``change-is-important'' principle, which conflates visual dynamics with semantic relevance, leading to computational waste and interaction errors. To address this, we propose QueryStream, a novel framework that instills query-awareness into the core of video processing and response scheduling. QueryStream features two synergistic components: (1) Query-Aware Differential Pruning (QDP), a policy that filters the token stream by jointly assessing semantic relevance to the query and temporal novelty against a dynamically smoothed history, and (2) Relevance-Triggered Active Response (RTAR), a dual-gated mechanism that schedules responses based on both high query relevance and significant information density. As a lightweight, training-free module, QueryStream establishes a new state-of-the-art on benchmarks like StreamingBench and OVO-Bench, matching or exceeding the performance of full-token baselines while pruning over 70\% of visual tokens. Notably, our pruning mechanism generalizes to offline tasks, where it functions as an effective context-denoising module to improve accuracy on long-form videos. This work not only reveals the vast semantic redundancy in video streams relative to user intent but also establishes a promising, intent-driven direction for truly efficient and robust online video understanding.
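Schematically, the query-aware pruning step could look like the following: keep a frame token only if it is both relevant to the query and novel against a smoothed history of previously kept content. The cosine scores, thresholds, and EMA update below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def qdp_filter(frame_tokens, query_emb, history, rel_thresh=0.3, nov_thresh=0.2, ema=0.9):
    """Query-aware differential pruning, schematically: keep tokens that are both
    semantically relevant to the query and novel relative to the smoothed history."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-9)

    relevance = cos(frame_tokens, query_emb)
    novelty = 1.0 - cos(frame_tokens, history)
    keep = (relevance > rel_thresh) & (novelty > nov_thresh)
    if keep.any():                                   # fold the kept content into the history
        history = ema * history + (1 - ema) * frame_tokens[keep].mean(axis=0)
    return frame_tokens[keep], history

rng = np.random.default_rng(3)
history = rng.normal(size=64)
kept, history = qdp_filter(rng.normal(size=(196, 64)), rng.normal(size=64), history)
print(kept.shape)
```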
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3497. GenSR: Symbolic regression based on equation generative space
๐ฌ 20
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Symbolic Regression; Equation Generative Latent Space
Symbolic Regression (SR) tries to reveal the hidden equations behind observed data.
However, most methods search within a discrete equation space, where the structural modifications of equations rarely align with their numerical behavior, leaving fitting error feedback too noisy to guide exploration.
To address this challenge, we propose GenSR, a generative-latent-space-based SR framework following the "map construction $\rightarrow$ coarse localization $\rightarrow$ fine search" paradigm. Specifically, GenSR first pretrains a dual-branch Conditional Variational Autoencoder (CVAE) to reparameterize symbolic equations into a generative latent space with symbolic continuity and local numerical smoothness. This space can be regarded as a well-structured "map" of the equation space, providing directional signals for search. At inference, the CVAE coarsely localizes the input data to promising regions in the latent space. Then, a modified CMA-ES refines the candidate region, leveraging smooth latent gradients.
From a Bayesian perspective, GenSR reframes the SR task as maximizing the conditional distribution $p({\rm Equ.}|{\rm Num.})$, with CVAE training achieving this objective through the Evidence Lower Bound (ELBO). This new perspective provides a theoretical guarantee for the effectiveness of GenSR. Extensive experiments show that GenSR jointly optimizes predictive accuracy, expression simplicity, and computational efficiency, while remaining robust under noise.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3498. SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
๐ฌ 18
โญ 5.00
๐ generative models
๐ท๏ธ sparse attention efficient attention video diffusion model video generation diffusion transformer
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. Interestingly, we find that attention weights can be decoupled into two matrices: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (**S**parse-**L**inear **A**ttention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible, applying $\mathcal{O}(N^2)$ attention to critical weights, $\mathcal{O}(N)$ attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a $\textbf{20x}$ reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by $\textbf{95}$\% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a $\textbf{13.7x}$ speedup in attention computation and a $\textbf{2.2x}$ end-to-end speedup in video generation on Wan2.1-1.3B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3499. Amortising Inference and Meta-Learning Priors in Neural Networks
๐ฌ 13
โญ 5.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ neural processes Bayesian neural networks meta-learning priors variational inference
One of the core facets of Bayesianism is the updating of prior beliefs in light of new evidence; so how can we maintain a Bayesian approach if we have no prior beliefs in the first place? This is one of the central challenges in the field of Bayesian deep learning, where there is no clear way to translate beliefs about a prediction task into prior distributions over model parameters. Bridging the fields of Bayesian deep learning and neural processes, we propose to $\textit{meta-learn}$ our parametric prior from data by introducing a way to perform per-dataset amortised variational inference. The model we develop can be viewed as a neural process whose latent variable is the set of weights of a BNN and whose decoder is the neural network parameterised by a sample of the latent variable itself. This unique model allows us to study the behaviour of Bayesian neural networks under well-specified priors, use Bayesian neural networks as flexible generative models, and perform desirable but previously elusive feats in neural processes such as within-task minibatching or meta-learning under extreme data-starvation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3500. ULD-Net: Enabling Ultra-Low-Degree Fully Polynomial Networks for Homomorphically Encrypted Inference
๐ฌ 20
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Privacy-Preserving Machine Learning efficient private inference machine learning as a service homomorphic encryption Fully Polynomial Networks Ultra-Low-Degree operators
Fully polynomial neural networks (models whose computations comprise only additions and multiplications) are attractive for privacy-preserving inference under homomorphic encryption (HE). Yet most prior systems obtain such models by *post-hoc* replacement of nonlinearities with high-degree or cascaded polynomials, which inflates HE cost and makes training numerically fragile and hard to scale.
We introduce **ULD-Net**, a pretraining methodology that enables *ultra-low-degree* (multiplicative depth ≤ 3 for each operator) fully polynomial networks to be trained from scratch at ImageNet and transformer scale while maintaining high accuracy. The key is a polynomial-only normalization, **PolyNorm**, coupled with a principled choice of normalization axis that keeps activations in a well-conditioned range across deep stacks of polynomial layers. Together with a special set of polynomial-aware operator replacements, such as polynomial activation functions and linear attention, ULD-Net delivers stable optimization without resorting to high-degree approximations.
Experimental results demonstrate that ULD-Net outperforms several state-of-the-art open-source fully and partially polynomial approaches across both CNNs and ViTs on diverse datasets, in terms of both accuracy and HE inference latency. Specifically, ULD-Net achieves +0.39% accuracy and a 2.76× speedup compared to the best fully polynomial baseline; up to +3.33% accuracy and a 3.17× speedup compared to the best partial polynomial baseline.
Applying ULD-Net to ViT-Small and ViT-Base yields 76.7% and 75.2% top-1 accuracy on ImageNet, demonstrating the first fully polynomial models scaled to the ViT/ImageNet level.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3501. Online Alignment as Perceptual Loss
๐ฌ 13
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ alignment LLM LLM alignment prospect theory perceptual loss behavioral economics
Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO)---but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping---originally introduced to just stabilize training---recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO act as perceptual losses already. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility, since we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating $\textit{humanline variants}$ of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts on both verifiable and unverifiable tasks.
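For concreteness, the Kahneman-Tversky probability weighting function from prospect theory is one standard example of the perceptual distortion the paper alludes to. Whether the humanline variants use this exact functional form, and how it enters the DPO/KTO/GRPO objectives, is not stated in the abstract, so the snippet below is purely illustrative.

```python
import numpy as np

def kt_probability_weight(p, gamma=0.61):
    """Kahneman-Tversky probability weighting: small probabilities are over-weighted
    and large ones under-weighted, mimicking human perception of likelihood."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1.0 / gamma)

# Schematic use: weight offline samples by the *perceived* probability that the current
# policy would produce them, rather than by the raw model probability.
model_prob = np.array([0.001, 0.05, 0.40, 0.90])
print(np.round(kt_probability_weight(model_prob), 3))   # rare completions gain weight, common ones lose it
```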
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3502. EAST: Early Action Prediction Sampling Strategy with Token Masking
๐ฌ 16
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ early action prediction token masking video analysis efficient training
Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2× with no accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively. We support future research by releasing efficient training implementations and pre-trained models.
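The randomized training split is simple to sketch: draw an observation ratio per clip, treat frames before the cut as the observed input and frames after it as the oracle future used for auxiliary supervision. The shapes and ratio range below are illustrative, and the token-masking procedure is not shown.

```python
import torch

def sample_observation_split(video_tokens, min_ratio=0.1, max_ratio=0.9):
    """Per-clip random split into an observed prefix and an unobserved (oracle) future,
    so a single model sees every test-time observation ratio during training."""
    b, t = video_tokens.shape[:2]
    ratios = torch.empty(b).uniform_(min_ratio, max_ratio)
    cuts = (ratios * t).long().clamp(min=1)
    observed, future = [], []
    for i in range(b):
        c = int(cuts[i])
        observed.append(video_tokens[i, :c])    # variable-length observed frames (model input)
        future.append(video_tokens[i, c:])      # oracle frames (auxiliary supervision target)
    return observed, future

obs, fut = sample_observation_split(torch.randn(4, 32, 196, 768))   # (batch, frames, tokens, dim)
print([o.shape[0] for o in obs])
```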
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3503. Seesaw: Accelerating Training by Balancing Batch Size and Learning Rate Scheduling
๐ฌ 15
โญ 5.00
๐ optimization
๐ท๏ธ optimization batch size cbs scheduler llm pretraining
Increasing the batch size during training --- a "batch ramp" --- is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36\%$, approaching the theoretical limit implied by our analysis.
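The substitution rule itself is a one-liner; below is a minimal sketch of the stated rule, with the surrounding batch-ramp bookkeeping and critical-batch-size setup omitted.

```python
import math

def seesaw_step(lr, batch_size):
    """Apply Seesaw's substitution: where a step-decay schedule would halve the
    learning rate, instead scale it by 1/sqrt(2) and double the batch size."""
    return lr / math.sqrt(2), batch_size * 2

lr, bs = 3e-4, 256
for k in range(4):
    print(f"after {k} scheduled decays: lr={lr:.2e}, batch={bs}")
    lr, bs = seesaw_step(lr, bs)
```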
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3504. GIT-BO: High-Dimensional Bayesian Optimization with Tabular Foundation Models
๐ฌ 24
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Tabular Foundation Models Bayesian optimization TabPFN
Bayesian optimization (BO) struggles in high dimensions, where Gaussian-process surrogates demand heavy retraining and brittle assumptions, slowing progress on real engineering and design problems. We introduce GIT-BO, a Gradient-Informed BO framework that couples TabPFN v2, a tabular foundation model that performs zero-shot Bayesian inference in context, with an active-subspace mechanism computed from the model's own predictive-mean gradients. This aligns exploration to an intrinsic low-dimensional subspace via a Fisher-information estimate and selects queries with a UCB acquisition, requiring no online retraining. Across 60 problem variants spanning 20 benchmarks (nine scalable synthetic families and ten real-world tasks, e.g., power systems, Rover, MOPTA08, Mazda) of up to 500 dimensions, GIT-BO delivers a stronger performance-time trade-off than state-of-the-art GP-based methods (SAASBO, TuRBO, Vanilla BO, BAxUS), ranking highest in performance and with runtime advantages that grow with dimensionality. Limitations include memory footprint and dependence on the capacity of the underlying TFM.
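A schematic of the active-subspace step: estimate the gradient outer-product matrix from the surrogate's predictive-mean gradients, keep its leading eigenvectors, and explore along them with a UCB acquisition. The finite-difference gradients, the toy surrogate (a stand-in, not TabPFN v2), and the candidate generation are assumptions.

```python
import numpy as np

def active_subspace(predict_mean, X, k=4, eps=1e-3):
    """Top-k eigenvectors of the gradient outer-product matrix of the surrogate mean,
    estimated with central finite differences at the observed points."""
    n, d = X.shape
    grads = np.zeros((n, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        grads[:, j] = (predict_mean(X + step) - predict_mean(X - step)) / (2 * eps)
    _, vecs = np.linalg.eigh(grads.T @ grads / n)
    return vecs[:, -k:]                                   # columns span the active subspace

# Toy surrogate whose mean depends on a hidden 2-D structure inside 20 dimensions.
rng = np.random.default_rng(4)
W = rng.normal(size=(20, 2))
mean_fn = lambda X: np.sin(X @ W).sum(axis=1)
std_fn = lambda X: 0.1 * np.ones(len(X))

U = active_subspace(mean_fn, rng.uniform(-1, 1, size=(256, 20)), k=2)
incumbent = rng.uniform(-1, 1, size=20)
cands = incumbent + rng.normal(0, 0.3, size=(1000, U.shape[1])) @ U.T   # explore inside the subspace
best = cands[np.argmax(mean_fn(cands) + 2.0 * std_fn(cands))]           # UCB acquisition
print(U.shape, best.shape)
```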
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3505. Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
๐ฌ 10
โญ 5.00
๐ generative models
๐ท๏ธ sparse attention LLM Serving
Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in the importance of attention across heads, layers, and contexts.
To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance with minimal computational overhead.
We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 5.14x decode attention speedup. This improvement translates to an overall 1.51x end-to-end inference speedup, making Tactic a practical and effective solution for long-context LLM inference in accuracy-sensitive applications.
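The budget-free selection rule has a compact reference form: keep the smallest set of KV tokens whose attention mass reaches a target fraction, so the number of loaded tokens adapts to how concentrated the distribution is. The clustering and distribution-fitting approximations the paper uses to avoid computing full scores are not shown; this is the exact (dense) version.

```python
import torch

def cumulative_attention_select(attn_scores, target=0.95):
    """Indices of the smallest KV-token set whose attention probabilities sum to `target`."""
    probs = torch.softmax(attn_scores, dim=-1)
    sorted_p, order = probs.sort(descending=True)
    cum = sorted_p.cumsum(-1)
    n_keep = int(torch.searchsorted(cum, torch.tensor(target, dtype=cum.dtype))) + 1
    return order[:n_keep]

# A peaked attention distribution needs few tokens; a flat one needs many.
peaked = torch.zeros(4096)
peaked[:8] = 12.0
flat = torch.zeros(4096)
print(len(cumulative_attention_select(peaked)), len(cumulative_attention_select(flat)))
```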
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3506. The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities
๐ฌ 30
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Ultra-sparse neuron sets Perturbation-based identification Catastrophic failure
Large Language Models (LLMs) have become foundational tools in natural language processing, powering a wide range of applications and research. Many studies have shown that LLMs share significant similarities with the human brain. Neuroscience research has found that a small subset of biological neurons in the human brain are crucial for core cognitive functions, which raises a fundamental question: do LLMs also contain a small subset of critical neurons? In this paper, we investigate this question by proposing a Perturbation-based Causal Identification of Critical Neurons method to systematically locate such critical neurons in LLMs. Our findings reveal three key insights:
(1) LLMs contain ultra-sparse critical neuron sets. Disrupting these critical neurons can cause a 72B-parameter model with over 1.1 billion neurons to completely collapse, with perplexity increasing by up to 20 orders of magnitude;
(2) These critical neurons are not uniformly distributed, but tend to concentrate in the outer layers, particularly within the MLP down\_proj components;
(3) Performance degradation exhibits sharp phase transitions, rather than a gradual decline, when these critical neurons are disrupted.
Through comprehensive experiments across diverse model architectures and scales, we provide deeper analysis of these phenomena and their implications for LLM robustness and interpretability. These findings can offer guidance for developing more robust model architectures and improving deployment security in safety-critical applications. Our code is available at https://anonymous.4open.science/r/The-Achilles-Heel-of-LLMs-7C02.
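The perturbation-based identification can be pictured on a toy module: silence one hidden neuron at a time (here by zeroing the corresponding input column of a down-projection) and record the loss increase. This miniature stand-in is only meant to convey the protocol; the paper perturbs neurons inside full LLMs and measures perplexity.

```python
import torch
import torch.nn as nn

def neuron_criticality(model, down_proj, data, loss_fn):
    """Score each hidden neuron by the loss increase when its contribution is removed."""
    with torch.no_grad():
        base = loss_fn(model(data)).item()
        scores = []
        for j in range(down_proj.weight.shape[1]):
            saved = down_proj.weight[:, j].clone()
            down_proj.weight[:, j] = 0.0                 # silence neuron j
            scores.append(loss_fn(model(data)).item() - base)
            down_proj.weight[:, j] = saved               # restore it
    return torch.tensor(scores)

torch.manual_seed(0)
down_proj = nn.Linear(64, 16, bias=False)                # toy analogue of an MLP down_proj
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), down_proj)
scores = neuron_criticality(model, down_proj, torch.randn(128, 16), lambda y: (y ** 2).mean())
print(scores.topk(3).indices)                            # the most critical neurons in this toy block
```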
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3507. Stage-wise Dynamics of Classifier-Free Guidance in Diffusion Models
๐ฌ 22
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Models Classifier-Free Guidance
Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity in diffusion models, but its impact on sampling dynamics remains poorly understood. Prior studies, often restricted to unimodal conditional distributions or simplified cases, provide only a partial picture.
We analyze CFG under multimodal conditionals and show that the sampling process unfolds in three successive stages. In the Direction Shift stage, guidance accelerates movement toward the weighted mean, introducing initialization bias and norm growth. In the Mode Separation stage, local dynamics remain largely neutral, but the inherited bias suppresses weaker modes, reducing global diversity. In the Concentration stage, guidance amplifies within-mode contraction, diminishing fine-grained variability.
This unified view explains a widely observed phenomenon: stronger guidance improves semantic alignment but inevitably reduces diversity. Experiments support these predictions, showing that early strong guidance erodes global diversity, while late strong guidance suppresses fine-grained variation. Moreover, our theory naturally suggests a time-varying guidance schedule, and empirical results confirm that it consistently improves both quality and diversity.
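As a reminder of the mechanism being analyzed, classifier-free guidance extrapolates from the unconditional prediction toward the conditional one, and a time-varying weight simply makes the guidance scale a function of the denoising step. The bell-shaped schedule below (weak at the start to limit the Direction Shift bias, weak at the end to preserve within-mode variability) is an illustrative assumption, not the schedule derived in the paper.

```python
def cfg_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def guidance_schedule(t, w_base=1.0, w_peak=7.5):
    """Illustrative bell-shaped guidance weight over normalized time t in [0, 1],
    with t = 1 the noisiest step: weak early, strong mid-trajectory, weak late."""
    return w_base + (w_peak - w_base) * 4.0 * t * (1.0 - t)

for t in (1.0, 0.75, 0.5, 0.25, 0.0):
    w = guidance_schedule(t)
    print(f"t={t:.2f}  w={w:.2f}  eps_hat={cfg_noise(0.1, 0.4, w):.3f}")
```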
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3508. ProtoKV: Long-context Knowledges Are Already Well-Organized Before Your Query
๐ฌ 20
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model KV Cache
Modern Large Language Models (LLMs) face fundamental challenges in processing long text sequences due to the quadratic complexity of attention mechanisms. Key-Value (KV) cache retention strategies mitigate this issue by selectively preserving salient KV pairs for autoregressive generation. However, existing methods fail to adequately and efficiently preserve the semantic integrity of the compressed representations. In this paper, we discover a prevalent phenomenon in LLMs: within the key embedding space, while most tokens exhibit similarity with their contextual neighbors (which we term position-determined tokens), a small subset of special tokens serving as semantic anchors consistently shows a local-deviation property and forms one or several clusters (which we term semantic-anchored tokens). Motivated by this observation, we propose ProtoKV, which separately processes these two token categories for KV cache compression. Within this framework, we first construct semantic prototypes based on the inherent characteristics of the two token categories, and subsequently form clusters of semantically similar tokens as basic compression units. This approach preserves semantic integrity with high computational efficiency. Experiments on LongBench demonstrate that ProtoKV achieves 2.11% higher accuracy than state-of-the-art methods under matched memory constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3509. Visual Autoregressive Modeling for Instruction-Guided Image Editing
๐ฌ 17
โญ 5.00
๐ generative models
๐ท๏ธ Instruction-Guided Image Editing Visual Autoregressive Modeling
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On EMU-Edit and PIE-Bench benchmarks, VAREdit outperforms leading diffusion-based methods by a substantial margin in terms of both CLIP and GPT scores. Moreover, VAREdit completes a 512$\times$512 editing in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3510. Tracing and Reversing Edits in LLMs: A Study on Rank-One Model Edits
๐ฌ 14
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Model Editing Knowledge Editing Countermeasures to Malicious Knowledge Editing
Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. To that end, we introduce the tasks of tracing and reversing edits. We propose a novel method to infer the edited object entity, solely based on the modified weights, without access to the editing prompt or any other semantically similar prompts, with up to 99\% accuracy. Further, we propose an effective and training-free method for reversing edits. Our method recovers up to 93\% of edits, and helps regain the original model's output distribution without access to any information about the edit. This method can further be used to distinguish between edited and unedited weights. Our findings highlight the feasibility of tracing and reversing edits based on the edited weights, opening a new research direction for safeguarding LLMs against adversarial manipulations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3511. Uni-NTFM: A Unified Foundation Model for EEG Signal Representation Learning
๐ฌ 34
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain Computer interface Foundation Model Electroencephalography
Current foundation models for electroencephalography (EEG) rely on architectures adapted from computer vision or natural language processing, typically treating neural signals as pixel grids or token sequences. This approach overlooks the fact that neural activity arises from diverse sparse coding over the cortex's complex geometric topology. Inspired by biological neural mechanisms, we propose the Unified Neural Topological Foundation Model (Uni-NTFM), an architecture rooted in three core neuroscience principles. In detail, to align with the brain's decoupled coding mechanism, we design the Heterogeneous Feature Projection Module, which simultaneously encodes time-domain non-stationary transients and frequency-domain steady-state rhythms, preserving fidelity in both waveform morphology and spectral content. Moreover, we introduce a Topological Embedding mechanism to inject structured spatial priors and align different sensor configurations onto a unified latent functional topography, effectively reconstructing the geometry of brain regions. Furthermore, we achieve the functional modularization and sparse coding efficiency of biological networks by constructing a Mixture-of-Experts Transformer network. This dynamic routing mechanism assigns different signal patterns and tasks to specialized neural subnetworks, effectively preventing task interference while scaling model capacity to a record-breaking 1.9 billion parameters. Uni-NTFM is pre-trained on a diverse corpus comprising 28,000 hours of EEG data, and outperforms existing models across nine distinct downstream tasks under both linear probing and fine-tuning settings, demonstrating that aligning model architecture with neural mechanisms is essential for learning universal representations and achieving generalizable brain decoding. Our code is available at \url{https://anonymous.4open.science/r/Uni-NTFM-0924}
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3512. UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ video diffusion model video generation diffusion transformer video length extrapolation
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation.
Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause, attention dispersion, where tokens beyond the training window dilute learned attention patterns. This causes quality degradation, and repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, it substantially outperforms a broad set of baselines across models and extrapolation ratios, pushing the extrapolation limit from $\sim 2\times$ to $4\times$. Remarkably, it improves Dynamic Degree and Imaging Quality by 233\% and 40.5\% over the previous best method at $4\times$ extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
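The suppression mechanism is concrete enough to sketch: down-weight attention mass on key positions beyond the training window by a constant factor and renormalize. The decay value and the exact point of application are assumptions; this is a minimal stand-in, not the released method.

```python
import torch
import torch.nn.functional as F

def attention_with_window_decay(q, k, v, train_len, decay=0.6):
    """Plain scaled dot-product attention that suppresses attention mass on key
    positions beyond the training window (illustrative sketch, no causal mask).

    q, k, v: (B, H, T, D). Positions >= train_len are scaled by a constant
    factor before renormalization, counteracting attention dispersion."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (B, H, T, T)
    probs = F.softmax(scores, dim=-1)
    T = k.shape[-2]
    if T > train_len:
        weights = torch.ones(T, device=probs.device)
        weights[train_len:] = decay                 # suppress extrapolated key positions
        probs = probs * weights                     # broadcast over the query dimension
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs @ v

# Example: 128 tokens with a training window of 96.
out = attention_with_window_decay(torch.randn(1, 8, 128, 64), torch.randn(1, 8, 128, 64),
                                  torch.randn(1, 8, 128, 64), train_len=96)
```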
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3513. Curvature-Guided Task Synergy for Skeleton based Temporal Action Segmentation
๐ฌ 4
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Temporal Action Segmentation Skeleton-based Learning Geometric Priors Curvature Guidance Task Synergy
Fine-grained temporal action segmentation plays a vital role in comprehensive human behavior understanding, with skeleton-based approaches (STAS) gaining prominence for their privacy and robustness. A core challenge in STAS arises from the conflicting feature requirements of action classification (demanding temporal invariance) and boundary localization (requiring temporal sensitivity). Existing methods typically adopt decoupled pipelines, unfortunately overlooking the inherent semantic complementarity between these sub-tasks, leading to information silos that prevent beneficial cross-task synergies. To address this challenge, we propose CurvSeg, a novel approach that synergizes classification and localization within the STAS domain through a unique geometric curvature guidance mechanism. Our key innovation lies in exploiting curvature properties of well-learned classification representations on skeleton sequences. Specifically, we observe that high curvature within action segments and low curvature at transitions effectively serve as geometric priors for precise boundary detection. CurvSeg establishes a virtuous cycle: localization predictions, guided by these curvature signals, in turn dynamically refine the classification feature space to organize into a geometry conducive to clearer boundaries. To compute stable curvature signals from potentially noisy skeleton features, we further develop a dual-expert weighting mechanism within a Mixture of Experts framework, providing task-adaptive feature extraction. Comprehensive experiments demonstrate that CurvSeg significantly enhances STAS performance across multiple benchmark datasets, achieving superior results and validating the power of geometric-guided task collaboration for this specific problem.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3514. ROSETTA: Constructing Code-Based Reward from Unconstrained Language Preference
๐ฌ 9
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ reward generation LLMs for robotics human evaluation
Intelligent embodied agents not only need to accomplish preset tasks, but also learn to align with individual human needs and preferences. Extracting reward signals from human language preferences allows an embodied agent to adapt through reinforcement learning. However, human language preferences are unconstrained, diverse, and dynamic, making constructing learnable rewards from them a major challenge. We present ROSETTA, a framework that uses foundation models to ground and disambiguate unconstrained natural language preference, construct multi-stage reward functions, and implement them with code generation. Unlike prior works requiring extensive offline training to obtain general reward models or fine-grained correction on a single task, ROSETTA allows agents to adapt online to preferences that evolve and are diverse in language and content. We test ROSETTA on both short-horizon and long-horizon manipulation tasks and conduct extensive human evaluation, finding that ROSETTA outperforms SOTA baselines and achieves 87% average success rate and 86% human satisfaction across 116 preferences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3515. Meta-UCF: Unified Task-Conditioned LoRA Generation for Continual Learning in Large Language Models
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Fine-Tuning Continual fine-tuning
Large language models are increasingly deployed in settings where new tasks arrive continuously, yet existing parameter-efficient finetuning (PEFT) methods either grow linearly with the task horizon or sacrifice deep adaptation, leaving catastrophic forgetting unresolved. We aim to achieve memory-constant, on-the-fly adaptation for a frozen LLM facing an unbounded stream of tasks. To this end we propose Meta-Unified Contrastive Finetuning (Meta-UCF), which encodes each task into a lightweight layer-normalised mean embedding and feeds it to a single hypernetwork that instantly generates rank-r LoRA updates for every transformer layer; a meta-contrastive objective coupled with an orthogonality term further steers task embeddings into near-orthogonal directions, preserving past knowledge without inner-loop gradients. On four benchmark streams (Std-CL 5, Seq-GLUE 7, Long-CL 15, and TRACE-8), Meta-UCF raises average accuracy by up to 2.2 pp and cuts forgetting by 13% relative to the strongest LoRA baseline, while using only the parameters of a single adapter. By decoupling continual learning from parameter growth, Meta-UCF provides a practical path toward scalable, low-resource lifelong language modelling.
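A minimal sketch of the hypernetwork component described above: a layer-normalised task embedding is mapped to rank-r LoRA factors for every layer in one forward pass. Dimensions, the two-head design, and the pooled task embedding are illustrative assumptions; the meta-contrastive and orthogonality objectives are not shown.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Maps a task embedding to per-layer rank-r LoRA factors (A, B); sizes are toy values."""

    def __init__(self, task_dim=768, hidden=512, n_layers=12, d_model=512, rank=4):
        super().__init__()
        self.n_layers, self.d_model, self.rank = n_layers, d_model, rank
        self.norm = nn.LayerNorm(task_dim)
        self.trunk = nn.Sequential(nn.Linear(task_dim, hidden), nn.GELU())
        # One head produces all A factors, another all B factors.
        self.to_A = nn.Linear(hidden, n_layers * rank * d_model)
        self.to_B = nn.Linear(hidden, n_layers * d_model * rank)

    def forward(self, task_embedding):
        h = self.trunk(self.norm(task_embedding))
        A = self.to_A(h).view(self.n_layers, self.rank, self.d_model)
        B = self.to_B(h).view(self.n_layers, self.d_model, self.rank)
        return A, B

# Usage sketch: the task embedding could be a mean-pooled encoding of a few task examples;
# the low-rank update B[l] @ A[l] is then added to the frozen weight of layer l.
task_emb = torch.randn(768)
A, B = LoRAHyperNet()(task_emb)
delta_w_layer0 = B[0] @ A[0]      # (d_model, d_model) low-rank update for layer 0
```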
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3516. Direct Doubly Robust Estimation of Conditional Quantile Contrasts
๐ฌ 8
โญ 5.00
๐ causal reasoning
๐ท๏ธ Heterogeneous Treatment Effect Conditional Quantile Treatment Effect Quantile Regression Doubly Robust
Within heterogeneous treatment effect (HTE) analysis, various estimands have been proposed to capture the effect of a treatment conditional on covariates. Recently, the conditional quantile comparator (CQC) has emerged as a promising estimand, offering quantile-level summaries akin to the conditional quantile treatment effect (CQTE) while preserving some interpretability of the conditional average treatment effect (CATE).
It achieves this by summarising the treated response conditional on both the covariates and the untreated response. Despite these desirable properties, the CQC's current estimation is limited by the need to first estimate the difference in conditional cumulative distribution functions and then invert it.
This inversion obscures the CQC estimate, hampering our ability to both model and interpret it. To address this, we propose the first direct estimator of the CQC, allowing for explicit modelling and parameterisation.
This explicit parameterisation enables better interpretation of our estimate while also providing a means to constrain and inform the model. We show, both theoretically and empirically, that our estimation error depends directly on the complexity of the CQC itself, improving upon the existing estimation procedure. Furthermore, it retains the desirable double robustness property with respect to nuisance parameter estimation. We further show our method to outperform existing procedures in estimation accuracy across multiple data scenarios while varying sample size and nuisance error. Finally, we apply it to real-world data from an employment scheme, uncovering a reduced range of potential earnings improvement as participant age increases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3517. Influence without Confounding: Causal Discovery from Temporal Data with Long-term Carry-over Effects
๐ฌ 11
โญ 5.00
๐ causal reasoning
๐ท๏ธ Causal Discovery Reinforcement Learning QR Decomposition Long-term Carry-over Effects
Learning causal structures from temporal data is fundamental to many practical tasks, such as physical law discovery and root cause localization.
Real-world systems often exhibit long-term carry-over effects, where the value of a variable at the current time can be influenced by distant past values of other variables. These effects, due to their large temporal span, are challenging to observe or model. Existing methods typically consider finite lag orders, which may lead to confounding from early historical data. Moreover, incorporating historical information often results in computational scalability issues.
In this paper, we establish a theoretical framework for causal discovery in complex temporal scenarios where observational data exhibit long-term carry-over effects, and propose LEVER, a novel causal discovery method with theoretical guarantees for incomplete temporal data. Specifically, based on the \textit{Limited-history Causal Identifiability Theorem}, we refine the variable values at each time step using data from a few preceding steps to mitigate long-term historical influences. Furthermore, we establish a theoretical connection between QR decomposition and causal discovery, and design an efficient reinforcement learning process to determine the optimal variable ordering. Finally, we recover the causal structure from the R matrix.
We evaluate LEVER on both synthetic and real-world datasets. In static cases, LEVER reduces SHD by 17.29\%-40.00\% and improves the F1-score by 5.30\%-8.79\% compared to the best baseline. In temporal cases, it achieves a 64\% reduction in SHD and a 45\% improvement in F1-score. Additionally, LEVER demonstrates significantly higher precision on real-world data compared to baseline methods.
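The QR-to-structure connection can be illustrated roughly as follows: given a causal ordering (which LEVER learns with reinforcement learning), QR-decompose the reordered, standardized data matrix and threshold the upper-triangular factor to propose edges. The normalization and thresholding rule below are speculative simplifications for intuition only, not the paper's recovery procedure.

```python
import numpy as np

def edges_from_qr(X, ordering, threshold=0.1):
    """X: (n_samples, n_vars) observations; ordering: a causal order of the variables.

    Reorders and standardizes the columns, QR-decomposes the data matrix, and reads a
    candidate adjacency pattern off the upper-triangular factor R (illustrative only)."""
    Xo = X[:, ordering]
    Xo = (Xo - Xo.mean(axis=0)) / (Xo.std(axis=0) + 1e-8)
    _, R = np.linalg.qr(Xo, mode="reduced")
    # Normalize each row by its diagonal so off-diagonal magnitudes are comparable.
    Rn = np.abs(R) / (np.abs(np.diag(R))[:, None] + 1e-8)
    d = len(ordering)
    adj = np.zeros((d, d), dtype=bool)
    for i in range(d):
        for j in range(i + 1, d):
            # A large |R[i, j]| suggests variable ordering[i] helps explain ordering[j].
            adj[ordering[i], ordering[j]] = Rn[i, j] > threshold
    return adj

# Example with random data and an assumed ordering.
print(edges_from_qr(np.random.randn(500, 4), ordering=[2, 0, 1, 3]))
```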
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3518. Generative Bayesian Optimization: Generative Models as Acquisition Functions
๐ฌ 14
โญ 5.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian optimization; generative models; black-box optimization
We present a general strategy for turning generative models into candidate solution samplers for batch Bayesian optimization (BO). The use of generative models for BO enables: large batch scaling as generative sampling, optimization of non-continuous design spaces, and high-dimensional and combinatorial design by using generative priors over feasible regions. Inspired by the success of direct preference optimization (DPO) and its variants, we show that its approach of directly training generative models using preferential rewards without the need for an intermediate reward model is extensible to the BO case. Furthermore, this framework is generalizable beyond preference-based feedback to general types of reward signals and loss functions. In essence, one can train a generative model with noisy, simple utility values directly computed from observations to then form proposal distributions whose densities are proportional to the expected utility, i.e., BO's acquisition function values. This perspective unifies recent progress in using generative models for black-box optimization and connects it with batch Bayesian optimization under a general framework.
As theoretical results, we show that the generative models within the BO process approximately follow a sequence of distributions which asymptotically concentrate at the global optima under certain conditions.
We also demonstrate this effect through experiments on challenging optimization problems involving large batches in high dimensions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3519. Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts
๐ฌ 21
โญ 5.00
๐ optimization
๐ท๏ธ Cloud Computing Dynamic Workflow Scheduling Deep Reinforcement Learning
Workflow scheduling in cloud computing demands the intelligent allocation of dynamically arriving, graph-structured workflows with varying deadlines onto ever-changing virtual machine resources. However, existing deep reinforcement learning (DRL) schedulers remain limited by rigid, single-path inference architectures that struggle to handle diverse scheduling scenarios. We introduce $\textbf{DEFT}$ ($\textbf{D}$eadline-p$\textbf{E}$rceptive Mixture-o$\textbf{F}$-Exper$\textbf{t}$s), an innovative DRL policy architecture that leverages a specialized mixture of experts, each trained to manage different levels of deadline tightness. To our knowledge, DEFT is the first to introduce and validate a Mixture-of-Experts architecture for dynamic cloud workflow scheduling. By adaptively routing decisions through the most appropriate experts, DEFT is capable of meeting a broad spectrum of deadline requirements that no single expert can achieve. Central to DEFT is a $\textbf{graph-adaptive}$ gating mechanism that encodes workflow DAGs, task states, and VM conditions, using cross-attention to guide expert activation in a fine-grained, deadline-sensitive manner. Experiments on dynamic cloud workflow benchmarks demonstrate that DEFT significantly reduces execution cost and deadline violations, outperforming multiple state-of-the-art DRL baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3520. Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Model; Classifier-free Guidance; Text-to-Image Generation
Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design a Transcendent Diffusion Guidance (TDG) method that can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most studied diffusion guidance methods, while all methods suffer severe win-rate degradation relative to standard CFG. We hope our work strongly motivates the community to rethink the evaluation paradigm and future directions of this field.
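For context, the guidance rule at the center of the pitfall is the standard classifier-free guidance update, shown below; raising the scale w strengthens text alignment (and preference scores) even as image quality can degrade. This is textbook CFG, not the paper's TDG or GA-Eval procedures.

```python
import torch

def cfg_noise_prediction(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional noise prediction
    toward the conditional one with scale w (w = 1 recovers the conditional model)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# A larger scale pushes samples toward the text condition; preference models tend to
# reward this alignment even when oversaturation and artifacts appear.
eps_c, eps_u = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
guided = cfg_noise_prediction(eps_c, eps_u, w=7.5)
```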
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3521. Unveiling Perceptual Artifacts: A Fine-Grained Benchmark for Interpretable AI-Generated Image Detection
๐ฌ 11
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ AI-generated Image Detection
Current AI-Generated Image (AIGI) detection approaches predominantly rely on binary classification to distinguish real from synthetic images, often lacking interpretable or convincing evidence to substantiate their decisions. This limitation stems from existing AIGI detection benchmarks, which, despite featuring a broad collection of synthetic images, remain restricted in their coverage of artifact diversity and lack detailed, localized annotations. To bridge this gap, we introduce a fine-grained benchmark towards eXplainable AI-Generated image Detection, named X-AIGD, which provides pixel-level, categorized annotations of perceptual artifacts, spanning low-level distortions, high-level semantics, and cognitive-level counterfactuals. These comprehensive annotations facilitate fine-grained interpretability evaluation and deeper insight into model decision-making processes. Our extensive investigation using X-AIGD provides several key insights: (1) Existing AIGI detectors demonstrate negligible reliance on perceptual artifacts, even at the most basic distortion level. (2) While AIGI detectors can be trained to identify specific artifacts, they still substantially base their judgment on uninterpretable features. (3) Explicitly aligning model attention with artifact regions can increase the interpretability and generalization of detectors.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3522. Emergent Dexterity Via Diverse Resets and Large-Scale Reinforcement Learning
๐ฌ 12
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotics; sim-to-real; reinforcement learning
Reinforcement learning in GPU-enabled physics simulation has been the driving force behind many of the breakthroughs in sim-to-real robot learning. However, current approaches for data generation in simulation are unwieldy and task-specific, requiring extensive human effort to engineer training curricula and rewards. Even with this engineering, these approaches still struggle to reliably solve long-horizon, dexterous manipulation tasks. To provide a seamless tool for robotic data generation in simulation, we introduce a simple framework that enables on-policy reinforcement learning to reliably solve an array of such tasks with a single reward function, set of algorithm hyper-parameters, no auto-curricula, and no human demonstrations. Our key insight is careful usage of diverse simulator resets for simplifying long-horizon exploration challenges. Our proposed system, OmniReset, automatically generates these resets with minimal human input and gracefully scales with compute to solve dexterous, contact-rich long-horizon tasks. OmniReset outperforms baselines on easier versions of our tasks, and scales to tasks with complexities beyond the reach of existing techniques. Finally, we use this data-generation methodology to create a large dataset of trajectories in simulation, and show that augmenting it with a small amount of real-world data enables successful real-world transfer for complex manipulation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3523. Replicable Reinforcement Learning with Linear Function Approximation
๐ฌ 14
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning learning theory replicability stability linear MDP
Replication of experimental results has been a challenge faced by many scientific disciplines, including the field of machine learning. Recent work on the theory of machine learning has formalized replicability as the demand that an algorithm produce identical outcomes when executed twice on different samples from the same distribution. Provably replicable algorithms are especially interesting for reinforcement learning (RL), where algorithms are known to be unstable in practice. While replicable algorithms exist for tabular RL settings, extending these guarantees to more practical function approximation settings has remained an open problem. In this work, we make progress by developing replicable methods for linear function approximation in RL. We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation, each of independent interest. We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes in both the generative model and episodic settings. Finally, we evaluate our algorithms experimentally and show how they can inspire more consistent neural policies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3524. Learning Exposure Mapping Functions for Inferring Heterogeneous Peer Effects
๐ฌ 4
โญ 5.00
๐ causal reasoning
๐ท๏ธ causal inference peer effects network interference exposure mapping function graph neural network
Peer effect refers to the difference in counterfactual outcomes for a unit resulting from different levels of peer exposure, the extent to which the unit is exposed to the treatments, actions, or behaviors of its peers. In practice, peer exposure is typically captured through an explicitly defined exposure mapping function that aggregates peer treatments and outputs peer exposure. Exposure mapping functions range from simple functions like the number or fraction of treated friends to more sophisticated functions that allow for different peers to exert different degrees of influence. However, the true function is rarely known in practice and when the function is misspecified, this leads to biased causal effect estimation. To address this problem, the focus of our work is to move away from the need to explicitly define an exposure mapping function and instead introduce a framework that allows learning this function automatically. We develop EGONETGNN, a graph neural network (GNN), for heterogeneous peer effect estimation that automatically learns the appropriate exposure mapping function and allows for complex peer exposure mechanisms that involve not only peer treatments but also attributes of the local neighborhood, including node, edge, and structural attributes. We theoretically and empirically show that GNN models that use peer exposure based on the number or fraction of treated peers or learn peer exposure naively face difficulty accounting for such influence mechanisms. Our evaluation on synthetic and semi-synthetic network data shows that our method is more robust to different unknown underlying influence mechanisms when compared to state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3525. HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied Theatrics
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Drama generation Role-playing language models Multi-agent collaboration Agentic workflow Finetuning
Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language models (LLMs) provides a new path toward this goal. However, existing LLM-based drama generation methods often result in agents that lack initiative and cannot interact with the physical scene. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance generated by HAMLET, we design an evaluation method that assesses three primary aspects: character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3526. Capturing Visual Environment Structure Correlates with Control Performance
๐ฌ 16
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot Learning Computer Vision Diffusion Policy
The choice of visual representation is key to scaling generalist robot policies. However, direct evaluation via policy rollouts is expensive, even in simulation. Existing proxy metrics focus on the representation's capacity to capture narrow aspects of the visual world, like object shape, limiting generalization across environments. In this paper, we take an analytical perspective: we probe pretrained visual encoders by measuring how well they support decoding of environment state (including geometry, object structure, and physical attributes) from images. Leveraging simulation environments with access to ground-truth state, we show that this probing accuracy strongly correlates with downstream policy performance across diverse environments and learning settings, significantly outperforming prior metrics. Our study provides insight into the representational properties that support generalizable manipulation, suggesting that learning to encode full environment state is a promising objective for visual representations for control.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3527. Bayesian Evidence-Driven Prototype Evolution for Federated Domain Adaptation
๐ฌ 12
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Federated Learning Domain Adaptation
Federated learning (FL), as a privacy-preserving distributed machine learning paradigm, enables clients to collaboratively train a global model without sharing local data. However, in real-world scenarios, domain shift caused by different source clients leads to structural discrepancies in the feature space, resulting in performance degradation of the global model. Although existing prototype-based FL methods offer improvements in cross-domain feature alignment, they still struggle to adapt to dynamic semantic structures and fail to continuously respond to the changing semantic separability and variance structure during training. To address this, we propose FedPTE, an FL framework with prototype topology evolution. Specifically, FedPTE treats prototype clusters as variable topological units, employing Bayesian Gaussian Mixture Models and marginal likelihood ratios on the server to perform probabilistic inference, which enables adaptive structural adjustments. Meanwhile, FedPTE introduces a stability constraint mechanism to balance the adaptability of topological evolution and training stability. By conducting prototype topology-aware contrastive learning on clients, it enhances the discriminability and cross-domain consistency of features. Experimental results demonstrate that FedPTE achieves superior performance across multiple cross-domain datasets, showcasing its strong expressiveness and generalization capability in heterogeneous domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3528. RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks
๐ฌ 4
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Dual-Arm Robots LLM-driven Planning Task Parallelism
Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration.
To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning.
RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments demonstrate that RoboPARA significantly outperforms existing planning methods, achieving higher efficiency and reliability, particularly in complex task combinations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3529. Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ RLVR GRPO rollout LLM
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens that are likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code will be publicly available.
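A compact sketch of one decoding step of the three-stage loop: branch on top-k tokens when next-token entropy is high, run a short greedy lookahead per branch, and prune branches whose lookaheads nearly duplicate a kept one. The entropy threshold, lookahead length, similarity measure, and toy model interface are placeholder assumptions.

```python
import math
import random

def entropy(dist):
    return -sum(p * math.log(p + 1e-12) for p in dist.values())

def overlap(a, b):
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b), 1)

def greedy_lookahead(lm, seq, steps):
    look = list(seq)
    for _ in range(steps):
        dist = lm(look)
        look.append(max(dist, key=dist.get))
    return look[len(seq):]

def latr_step(lm, branches, max_branches=8, entropy_thresh=1.5,
              branch_k=2, lookahead=8, sim_thresh=0.9):
    """One decoding step of lookahead tree-based rollouts (illustrative).

    lm: callable mapping a token list to a {token: prob} dict (assumed interface)."""
    new_branches = []
    for seq in branches:
        dist = lm(seq)
        if entropy(dist) > entropy_thresh and len(new_branches) < max_branches:
            # (1) Branch at a high-uncertainty step on the top-k candidate tokens.
            candidates = [seq + [t] for t in sorted(dist, key=dist.get, reverse=True)[:branch_k]]
            # (2) Greedy lookahead simulation for each new branch.
            sims = [greedy_lookahead(lm, c, lookahead) for c in candidates]
            # (3) Prune branches whose lookaheads are near-duplicates of a kept one.
            kept_sims = []
            for cand, sim in zip(candidates, sims):
                if all(overlap(sim, s) < sim_thresh for s in kept_sims):
                    new_branches.append(cand)
                    kept_sims.append(sim)
        else:
            # Ordinary stochastic sampling at confident steps.
            toks, probs = zip(*dist.items())
            new_branches.append(seq + [random.choices(toks, probs)[0]])
    return new_branches

# Toy usage with a uniform "language model" over three tokens.
toy_lm = lambda seq: {"a": 0.34, "b": 0.33, "c": 0.33}
print(latr_step(toy_lm, [["<bos>"]]))
```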
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3530. UFO-4D: Unposed Feedforward 4D reconstruction from Two Images
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Feedforward dense 4D reconstruction 4D interpolation
Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework to reconstruct a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering different signals from a single, holistic representation enables significant advantages at training time, in the form of a self-supervisory image synthesis loss as well as tightly coupling motion and depth losses. This approach mitigates data scarcity, allowing UFO-4D to achieve joint estimation of geometry, motion, and camera pose while outperforming prior work by up to a factor of three. The 4D representation also enables high-fidelity 4D spatio-temporal interpolation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3531. iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision-Language Models Model efficiency Token merging
Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder and the LLM along with other LVLM components for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from otherwise discarded tokens. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2$\times$ throughput boost and a 4$\times$ reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations for the merging steps of iLLaVA, offering deeper insights into how different LVLM components contribute to efficient computation.
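A minimal sketch of the recycling idea: rather than dropping low-importance visual tokens outright, fold each discarded token into its most similar kept token by averaging. The importance score, cosine assignment, and uniform averaging are illustrative assumptions rather than the paper's exact merging rule.

```python
import torch
import torch.nn.functional as F

def merge_visual_tokens(tokens, importance, keep_ratio=0.33):
    """tokens: (N, D) visual tokens; importance: (N,) scores (e.g., CLS attention).

    Keeps the highest-scoring tokens and merges each discarded token into its most
    similar kept token, so the discarded information is recycled rather than lost."""
    N, _ = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    keep_idx = importance.topk(n_keep).indices
    drop_mask = torch.ones(N, dtype=torch.bool)
    drop_mask[keep_idx] = False
    kept, dropped = tokens[keep_idx], tokens[drop_mask]
    if dropped.numel() == 0:
        return kept

    # Assign each dropped token to its nearest kept token by cosine similarity.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T   # (n_drop, n_keep)
    assign = sim.argmax(dim=-1)

    merged = kept.clone()
    counts = torch.ones(n_keep, 1, device=tokens.device)
    for i, j in enumerate(assign):
        merged[j] += dropped[i]
        counts[j] += 1
    return merged / counts

# Example: compress 576 ViT tokens to roughly a third.
compressed = merge_visual_tokens(torch.randn(576, 1024), torch.rand(576))
```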
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3532. Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
๐ฌ 19
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ test-time scaling process reward models
Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models.
Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights.
Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions.
Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $\sim 21.3\%$ of the computation.
Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
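A small sketch of weighted aggregation in its plainest form: each sampled response contributes a weight computed from its LLM and PRM signals, weights are summed per final answer, and the highest-total answer is selected. The weight function here is a placeholder for the calibrated, pair-specific functions the paper estimates (which can assign negative weights).

```python
from collections import defaultdict

def select_answer(responses, weight_fn):
    """responses: list of (answer, llm_logprob, prm_score) tuples.

    weight_fn maps the two signals to a (possibly negative) weight; the answer with
    the largest total weight wins. It stands in for a learned weighting function."""
    totals = defaultdict(float)
    for answer, llm_lp, prm in responses:
        totals[answer] += weight_fn(llm_lp, prm)
    return max(totals, key=totals.get)

# Placeholder weighting: trust the PRM but discount very unlikely generations.
weight = lambda llm_lp, prm: prm + 0.1 * llm_lp
best = select_answer(
    [("42", -3.1, 0.9), ("41", -2.0, 0.4), ("42", -4.5, 0.8)], weight
)
print(best)  # "42": its two responses accumulate more total weight than "41"
```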
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3533. CompMarkGS: Robust Watermarking for Compressed 3D Gaussian Splatting
๐ฌ 23
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ 3D Gaussian Splatting Digital Watermarking Privacy
As 3D Gaussian Splatting (3DGS) is increasingly adopted in various academic and commercial applications due to its high-quality and real-time rendering capabilities, the need for copyright protection is growing. At the same time, its large model size requires efficient compression for storage and transmission. However, compression techniques, especially quantization-based methods, degrade the integrity of existing 3DGS watermarking methods, thus creating the need for a novel methodology that is robust against compression. To ensure reliable watermark detection under compression, we propose a compression-tolerant 3DGS watermarking method that preserves watermark integrity and rendering quality. Our approach utilizes an anchor-based 3DGS, embedding the watermark into anchor attributes, particularly the anchor feature, to enhance security and rendering quality. We also propose a quantization distortion layer that injects quantization noise during training, preserving the watermark after quantization-based compression. Moreover, we employ a frequency-aware anchor growing strategy that enhances rendering quality by effectively identifying Gaussians in high-frequency regions, and an HSV loss to mitigate color artifacts for further rendering quality improvement. Extensive experiments demonstrate that our proposed method preserves the watermark even under compression and maintains high rendering quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3534. APPLE: Toward General Active Perception via Reinforcement Learning
๐ฌ 16
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Active Perception Reinforcement Learning Tactile Sensing Transformers
Active perception is a fundamental skill that enables us humans to deal with uncertainty in our inherently partially observable environment.
For senses such as touch, where the information is sparse and local, active perception becomes crucial. In recent years, active perception has emerged as an important research domain in robotics. However, current methods are often bound to specific tasks or make strong assumptions, which limit their generality. To address this gap, this work introduces APPLE (Active Perception Policy Learning) - a novel framework that leverages reinforcement learning (RL) to address a range of different active perception problems. APPLE jointly trains a transformer-based perception module and decision-making policy with a unified optimization objective, learning how to actively gather information. By design, APPLE is not limited to a specific task and can, in principle, be applied to a wide range of active perception problems.
We evaluate two variants of APPLE across different tasks, including tactile exploration problems from the Tactile MNIST benchmark.
Experiments demonstrate the efficacy of APPLE, achieving high accuracies on both regression and classification tasks. These findings underscore the potential of APPLE as a versatile and general framework for advancing active perception in robotics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3535. SongEcho: Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ conditioning method controllable song generation
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through melody-conditioned text-to-music models, the task of cover song generation remains largely unaddressed.
In this work, we reformulate cover song generation as a conditional generation task that simultaneously generates new vocals and accompaniment conditioned on the original vocal melody and text prompts. To this end, we present SongEcho, which leverages Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), a framework that enables controllable generation by improving both the conditioning injection mechanism and the conditional representation. To enhance the conditioning injection mechanism, we extend Feature-wise Linear Modulation (FiLM) to Element-wise Linear Modulation (EiLM) to facilitate precise temporal alignment in melody control. For conditional representations, we propose Instance-Adaptive Condition Refinement (IACR), which refines conditioning features by interacting with the hidden states of the generative model, yielding instance-adaptive conditioning. Additionally, to address the scarcity of large-scale, open-source full-song datasets, we construct Suno70k, a high-quality AI song dataset enriched with comprehensive annotations.
Experimental results across multiple datasets demonstrate that our approach generates superior cover songs compared to existing methods, while requiring fewer than 30% of the trainable parameters.
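To make the FiLM-to-EiLM step concrete: FiLM predicts one scale and shift per channel, whereas element-wise modulation predicts them per time step and channel, letting a frame-aligned melody condition steer individual frames. The shapes and the conditioning encoder are illustrative; the instance-adaptive refinement (IACR) is omitted.

```python
import torch
import torch.nn as nn

class EiLM(nn.Module):
    """Element-wise linear modulation: per-(time, channel) scale and shift, in
    contrast to FiLM's single per-channel scale and shift."""

    def __init__(self, cond_dim, hidden_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, h, cond):
        # h:    (B, T, hidden_dim)  hidden states of the generator
        # cond: (B, T, cond_dim)    frame-aligned melody/text condition features
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return (1 + scale) * h + shift    # modulation varies across time and channels

# FiLM, by comparison, would pool cond over time and broadcast one (B, 1, hidden_dim)
# scale/shift over all frames, losing temporal precision in the melody control.
h = torch.randn(2, 100, 256)
cond = torch.randn(2, 100, 64)
out = EiLM(64, 256)(h, cond)
```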
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3536. HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
๐ฌ 23
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Retrieval Augmented Generation LLM Agent Agentic RAG Question Answering Reinforcement Learning
Agentic Retrieval-Augmented Generation (RAG) is a powerful technique for incorporating external information that Large Language Models (LLMs) lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which lead to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a Reinforcement Learning (RL) framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce $\textbf{Hi}$erarchical $\textbf{P}$rocess Rewards for Efficient agentic $\textbf{RAG}$ (HiPRAG), a novel training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4\% (3B) and 67.2\% (7B), outperforming strong agentic RAG baselines. This is accomplished while dramatically improving search efficiency, reducing the over-search rate from over 27\% in baselines to just 2.3\% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents. We will release our code upon acceptance.
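A sketch of a reward in the spirit described above: outcome and format rewards as the base, plus a process bonus proportional to the fraction of steps that searched exactly when needed. The coefficients, the gating of the bonus on formatting, and the step-optimality fields are assumptions, not the paper's exact reward.

```python
def hiprag_style_reward(correct, well_formatted, steps,
                        w_outcome=1.0, w_format=0.2, w_process=0.5):
    """steps: list of dicts like {"searched": bool, "needed_search": bool}.

    A step is "optimal" when it searches exactly when search is needed. The process
    bonus (fraction of optimal steps) is granted hierarchically, here only when the
    trajectory is well formatted; all weights are illustrative."""
    outcome = w_outcome if correct else 0.0
    fmt = w_format if well_formatted else 0.0
    process = 0.0
    if steps and well_formatted:
        optimal = sum(1 for s in steps if s["searched"] == s["needed_search"])
        process = w_process * optimal / len(steps)
    return outcome + fmt + process

# One over-search (step 2) lowers the process bonus from 0.5 to 0.25.
r = hiprag_style_reward(True, True, [
    {"searched": True, "needed_search": True},
    {"searched": True, "needed_search": False},
])
```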
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3537. M4PQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation
๐ฌ 15
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ question answering supervised fine-tuning trajectory synthesis
The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While large language model (LLM)-based agents are capable of automating question answering (QA) workflows for scientific papers, a comprehensive and realistic benchmark for evaluating their capabilities is still lacking. Moreover, training an interactive agent for this task is hindered by the shortage of high-quality interaction trajectories. In this work, we propose M4PQA, a human-annotated comprehensive paper QA dataset in the field of artificial intelligence, comprising 13,948 papers and 1,246 questions and encompassing multi-task, multi-modal, and instance-level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction data synthesis. With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. Evaluations of multiple open-source and proprietary models show that most models underperform on M4PQA, demonstrating its quality. Extensive experiments confirm that ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3538. FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ large multimodal model visual token compression long video understanding
Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, this paper proposes FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that selects a compact yet highly representative and diverse subset of visual tokens within a predefined token budget. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains, swiftly selecting the subset and drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LMMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, and LongVideoBench, demonstrate that our framework consistently surpasses recent compression techniques, highlighting not only its effectiveness and robustness in addressing the critical challenges of long video understanding, but also its efficiency in processing speed.
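The selection objective is explicit enough to sketch: greedily maximize a facility-location function over token similarities, with the lazy-greedy trick of re-checking only stale upper bounds. Cosine similarity, the clipping, and the budget handling are illustrative choices, not the paper's exact configuration.

```python
import heapq
import numpy as np

def facility_location_select(tokens, budget):
    """Select a representative, diverse subset of `budget` token indices by lazy greedy
    maximization of the facility location objective F(S) = sum_i max_{j in S} sim(i, j)."""
    X = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(X @ X.T, 0.0, None)           # nonnegative cosine similarities
    N = sim.shape[0]
    covered = np.zeros(N)                       # each token's best similarity to the selected set
    selected = []
    heap = [(-sim[:, j].sum(), j) for j in range(N)]    # stale (upper-bound) marginal gains
    heapq.heapify(heap)
    while len(selected) < min(budget, N):
        while True:
            _, j = heapq.heappop(heap)
            gain = np.maximum(covered, sim[:, j]).sum() - covered.sum()
            # Lazy greedy: accept j only if its fresh gain still beats the best stale bound.
            if not heap or gain >= -heap[0][0] - 1e-12:
                selected.append(j)
                covered = np.maximum(covered, sim[:, j])
                break
            heapq.heappush(heap, (-gain, j))
    return np.array(selected)

# Example: keep 100 of 300 random 64-dimensional visual tokens.
idx = facility_location_select(np.random.randn(300, 64), budget=100)
```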
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3539. Meta-Router: Bridging Gold-standard and Preference-based Evaluations in LLM Routing
๐ฌ 12
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Causal learning Meta-learner Large Language Model query routing
In language tasks requiring extensive human-model interaction, the inference cost of large language models (LLMs) can be substantial. To reduce expenses while preserving the quality of the responses, an LLM router selects among candidate models to balance between the expected response quality and the inference cost. A central challenge in router training is the accuracy and accessibility of reliable supervision. Gold-standard data, obtained from domain experts or benchmark labels, provide accurate quality evaluations of LLM responses but are costly and difficult to scale. In contrast, preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses. We cast the problem of LLM router training with combined Gold-standard and preference-based data into a causal inference framework by viewing the response evaluation mechanism as the treatment assignment. This perspective further reveals that the bias in preference-based data corresponds to the well-known causal estimand: the conditional average treatment effect (CATE). Based on this new perspective, we develop an integrative causal router training framework that corrects preference-data bias, addresses imbalances between two data sources, and improves routing robustness and efficiency. Numerical experiments demonstrate that our approach delivers more accurate routing and improves the trade-off between cost and quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3540. SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning
๐ฌ 16
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Federated Learning Poisoning Attacks Multimodal Learning
Federated Prompt Learning has emerged as a communication-efficient and privacy-preserving paradigm for adapting large vision-language models like CLIP across decentralized clients. However, the security implications of this setup remain underexplored. In this work, we present the first study of backdoor attacks in Federated Prompt Learning. We show that when malicious clients inject visually imperceptible, learnable noise triggers into input images, the global prompt learner becomes vulnerable to targeted misclassification while still maintaining high accuracy on clean inputs. Motivated by this vulnerability, we propose SABRE-FL, a lightweight, modular defense that filters poisoned prompt updates using an embedding-space anomaly detector trained offline on out-of-distribution data. SABRE-FL requires no access to raw client data or labels and generalizes across diverse datasets. We show, both theoretically and empirically, that malicious clients can be reliably identified and filtered using an embedding-based detector. Across five diverse datasets and four baseline defenses, SABRE-FL outperforms all baselines by significantly reducing backdoor accuracy while preserving clean accuracy, demonstrating strong empirical performance and underscoring the need for robust prompt learning in future federated systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3541. Trust The Typical
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Safety Out-of-Distribution Detection Jailbreaking Representation Learning Selective Generation Anomaly Detection
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from \emph{deeply understanding what is safe}. We introduce \textbf{T}rust \textbf{T}he \textbf{T}ypical \textbf{(T3)}, a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6\% overhead even under dense evaluation intervals on large-scale workloads.
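A minimal sketch of the framing, assuming a generic sentence encoder and a single Gaussian over safe-prompt embeddings: flag prompts whose Mahalanobis distance exceeds a threshold calibrated on held-out safe data. The density model and calibration rule are stand-ins, not the paper's estimator.

```python
import numpy as np

class SafePromptOOD:
    """Flags prompts that deviate from the distribution of safe-prompt embeddings."""

    def __init__(self, fpr=0.01):
        self.fpr = fpr   # target false-positive rate on safe prompts

    def fit(self, safe_embeddings):
        X = np.asarray(safe_embeddings)
        self.mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-3 * np.eye(X.shape[1])   # regularized covariance
        self.prec = np.linalg.inv(cov)
        d = self._dist(X)
        # Threshold chosen so only `fpr` of the safe calibration set is flagged.
        self.threshold = np.quantile(d, 1.0 - self.fpr)
        return self

    def _dist(self, X):
        diff = X - self.mu
        return np.einsum("nd,de,ne->n", diff, self.prec, diff)   # squared Mahalanobis distance

    def is_suspicious(self, embedding):
        return self._dist(np.asarray(embedding)[None, :])[0] > self.threshold

# Usage sketch: embeddings would come from any sentence encoder applied to safe prompts;
# a far-from-typical embedding is flagged without ever seeing harmful examples.
detector = SafePromptOOD().fit(np.random.randn(1000, 32))
print(detector.is_suspicious(np.random.randn(32) * 5.0))
```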
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3542. VIRTUE: Visual-Interactive Text-Image Universal Embedder
๐ฌ 18
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Visual-interactive embedding model VLM-based representation learning interactive image-to-text retrieval benchmark
Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel **V**isual-**I**nte**R**active **T**ext-Image **U**niversal **E**mbedder (**VIRTUE**) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale **S**egmentation-and-Scene **Ca**ption **R**etrieval (**SCaR**) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves a state-of-the-art performance with significant improvements across 36 universal MMEB (**3.1\%–8.5\%**) and five visual-interactive SCaR (**15.2\%–20.3\%**) tasks. The code, model, and benchmark will be released in the camera-ready version.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3543. Scalable Offline Model-Based RL with Action Chunks
๐ฌ 14
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Offline RL World Models Model-based RL Action chunking Long-horizon tasks
In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion,
can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL.
Model-based value expansion fits an on-policy value function using length-$n$ imaginary rollouts generated by the current policy and a learned dynamics model.
While larger $n$ reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions.
We address this trade-off with
an *action-chunk* model that predicts a future state from a sequence of actions (an "action chunk")
instead of a single action, which reduces compounding errors.
In addition, instead of directly training a policy to maximize rewards,
we employ rejection sampling from an expressive behavioral action-chunk policy,
which prevents model exploitation from out-of-distribution actions.
We call this recipe **Model-Based RL with Action Chunks (MAC)**.
Through experiments on highly challenging tasks with large-scale datasets of up to $100$M transitions,
we show that MAC achieves the best performance among offline model-based RL algorithms,
especially on challenging long-horizon tasks.
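A minimal sketch of the recipe's two ingredients under assumed interfaces (an action-chunk `dynamics` model that maps a state and a whole chunk to a future state, a `behavior_policy` with a `sample` method, and generic `reward_fn`/`value_fn`); only a single chunk-level expansion step is shown, and this is an illustration, not the authors' implementation:

```python
import torch

def imagined_chunk_return(dynamics, reward_fn, value_fn, state, chunks, gamma=0.99):
    """Chunk-level value expansion: r(s, chunk) + gamma^k * V(s'), where the
    action-chunk dynamics model predicts s' from k actions in a single call,
    avoiding k compounding one-step predictions."""
    states = state.unsqueeze(0).expand(chunks.size(0), -1)   # (N, state_dim)
    next_states = dynamics(states, chunks)                   # (N, state_dim)
    k = chunks.size(1)
    return reward_fn(states, chunks) + (gamma ** k) * value_fn(next_states).squeeze(-1)

def rejection_sample_chunk(behavior_policy, dynamics, reward_fn, value_fn, state, n_candidates=32):
    """Sample candidate chunks from the expressive behavioral policy and keep the
    highest-scoring one, so the model is never queried on out-of-distribution actions."""
    chunks = behavior_policy.sample(state, n_candidates)     # (N, k, action_dim)
    scores = imagined_chunk_return(dynamics, reward_fn, value_fn, state, chunks)
    return chunks[scores.argmax()]
```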
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3544. Dual-Scale World Models for LLM Agents towards Hard-Exploration Problems
๐ฌ 17
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ hard-exploration problems world model llm agents text-based games
LLM-based agents have seen promising advances, yet they remain limited in "hard-exploration" tasks that require learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual-scale world models: at the global scale it maintains a trajectory frontier of high-value discoveries, while at the local scale it learns from trial-and-error through a Multi-path Advantage Reflection mechanism that infers advantage-based progress signals to guide exploration. To evaluate our framework on hard-exploration problems, we tackle the Jericho benchmark suite of text-based games, where GLoW achieves new state-of-the-art performance for LLM-based approaches. Compared to state-of-the-art RL-based methods, our approach achieves comparable performance while requiring 100–800× fewer environment interactions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3545. OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
๐ฌ 18
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied AI Embodied Reasoning Spatial Reasoning Multimodal Large Language Models 3D Large Language Models
Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, **Geometric Adaptability Gap:** models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, **Embodiment Constraint Gap**: prior work often neglects the physical constraints of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce **OmniEVA** -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a **Task-Adaptive 3D Grounding** mechanism, which uses a gated router to dynamically inject 3D features based on task context, enabling selective geometric reasoning; and (2) an **Embodiment-Aware Reasoning** framework that incorporates task goals and physical constraints into the reasoning loop, ensuring executable plans. Extensive experiments show that OmniEVA achieves state-of-the-art performance on 7 of 8 embodied reasoning benchmarks and excels in downstream tasks such as object navigation and mobile manipulation. Evaluations on proposed primitive and composite benchmarks confirm its robust and versatile planning capabilities.
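A minimal sketch of what a task-conditioned gated router for 3D feature injection could look like; module names, shapes, and the sigmoid gate are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Gated3DRouter(nn.Module):
    """Fuse 2D and 3D features, with a task-conditioned gate deciding how much
    geometric (3D) information to inject for the current instruction."""
    def __init__(self, dim, task_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(task_dim, dim), nn.Sigmoid())
        self.proj_3d = nn.Linear(dim, dim)

    def forward(self, feat_2d, feat_3d, task_emb):
        # feat_2d, feat_3d: (batch, tokens, dim); task_emb: (batch, task_dim)
        g = self.gate(task_emb).unsqueeze(1)          # (batch, 1, dim), in [0, 1]
        return feat_2d + g * self.proj_3d(feat_3d)    # selective geometric reasoning
```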
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3546. Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation
๐ฌ 23
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Knowledge Distillation Transfer Learning Large Language Model
Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the cost of generating Chains-of-Thought for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision, causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. Experiments show state-of-the-art performance of TSD-KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner-up in accuracy by up to 54.4\% and 40.3\%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases by up to 20.3\%.
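One plausible reading of the token-selective direct distillation step, sketched below: distill the teacher's distribution only at tokens where the teacher is markedly more confident than the student, with an entropy bonus to preserve the student's confidence (the selection rule and constants are illustrative, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def token_selective_kd_loss(teacher_logits, student_logits, margin=0.1, ent_coef=0.01):
    """teacher_logits, student_logits: (batch, seq_len, vocab)."""
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    # Per-token confidence = probability each model assigns to its own top token.
    t_conf = t_logp.exp().max(dim=-1).values
    s_conf = s_logp.exp().max(dim=-1).values
    # Distill only tokens where the teacher is clearly more confident than the student.
    select = (t_conf - s_conf) > margin                                   # (batch, seq_len)
    kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="none").sum(-1)
    kd_loss = (kl * select).sum() / select.sum().clamp(min=1)
    # Entropy regularization keeps the student from collapsing to overconfident outputs.
    entropy = -(s_logp.exp() * s_logp).sum(-1).mean()
    return kd_loss - ent_coef * entropy
```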
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3547. Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
๐ฌ 28
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Large Reasoning Models
Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current methods for this problem often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence behaviors of the objectives of various preference optimization methods under a Bradley-Terry-loss-based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO), which directly balances the implicit reward related to the NLL loss. LCPO can effectively learn length preferences with limited data and training. Extensive experiments demonstrate that our approach significantly reduces the average output length by over 50\% across multiple benchmarks while maintaining reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning. Code is available at https://anonymous.4open.science/r/anonymous\_code\_74CD.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3548. On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Post-Training Reinforcement Learning Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from the expert, which promotes on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on mathematical reasoning problems and practical tool-use tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We will release the source code to inspire further research.
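A hedged sketch of the dual-control objective: a global coefficient interpolates between the on-policy RL loss and a token-weighted SFT term on expert data (the weighting function, annealing schedule, and names are assumptions, not CHORD's exact formulation):

```python
import torch

def chord_style_loss(rl_loss, expert_logprobs, token_weights, global_coef):
    """rl_loss: scalar on-policy RL objective (e.g., a policy-gradient loss).
    expert_logprobs: (batch, seq_len) log-probs of expert tokens under the policy.
    token_weights: (batch, seq_len) in [0, 1], down-weighting expert tokens whose
                   imitation would disrupt the policy's own response patterns.
    global_coef: scalar in [0, 1], annealed from imitation toward exploration."""
    sft_loss = -(token_weights * expert_logprobs).sum() / token_weights.sum().clamp(min=1e-8)
    return (1.0 - global_coef) * rl_loss + global_coef * sft_loss
```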
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3549. ULTRA-360: Unconstrained Dataset for Large-scale Temporal 3D Reconstruction across Altitudes and Omnidirectional Views
๐ฌ 17
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ large-scale 3D reconstruction feature matching novel-view synthesis
Significant progress has been made in photo-realistic scene reconstruction over recent years. Various disparate efforts have enabled capabilities such as multi-appearance or large-scale reconstruction from images acquired by consumer-grade cameras. How far away are we from digitally replicating the real world in 4D? So far, there appears to be a lack of a well-designed dataset that can evaluate the holistic progress on large-scale scene reconstruction. We introduce a collection of imagery of a campus, acquired across different seasons and times of day, from multiple elevations and views, and at scale. To estimate many camera poses over such a large area and across elevations, we apply a semi-automated calibration pipeline to eliminate visual ambiguities and avoid excessive matching, then visually verify all calibration results to ensure accuracy. Finally, we benchmark various algorithms for automatic calibration and dense reconstruction on our dataset, named ULTRA-360, and demonstrate numerous potential areas to improve upon, e.g., balancing sensitivity and specificity in feature matching, densification and floaters in dense reconstruction, multi-appearance overfitting, etc. We believe ULTRA-360 can serve as a benchmark that reflects realistic challenges in an end-to-end scene-reconstruction pipeline.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3550. Zero-shot Forecasting by Simulation Alone
๐ฌ 28
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ forecasting foundation models simulation zero-shot
Zero-shot time-series forecasting holds great promise, but is still in its infancy, hindered by limited and biased data corpora, leakage-prone evaluation, and privacy and licensing constraints. We propose the first practical univariate time-series simulation pipeline, which is simultaneously fast enough for on-the-fly data generation and enables notable zero-shot forecasting performance on M-Series and GiftEval benchmarks that capture trend/seasonality/intermittency patterns, typical of industrial forecasting applications across a variety of domains. Our simulator is based on a seasonal autoregressive integrated moving average (SARIMA) model as its core data source. Due to instability in the autoregressive component, naive SARIMA parameter sampling often leads to unusable paths. Instead, our simulator follows a three-step procedure: (1) we sample well-behaved trajectories from the system's characteristic polynomial stability region; (2) we introduce a superposition scheme that combines multiple paths into rich multi-seasonality traces; and (3) we add rate-based heavy-tailed noise models to capture burstiness and intermittency alongside seasonalities and trends. Orders of magnitude faster than kernel-based generators, our simulator enables training on circa 1B unique purely simulated series generated on the fly, after which well-established neural backbones exhibit strong zero-shot generalization, surpassing strong statistical forecasters and recent foundation baselines, while operating under a strict zero-shot protocol. Notably, on GiftEval we observe a "student-beats-teacher" effect: models trained on our simulations exceed the forecasting accuracy of the generating processes (i.e., AutoARIMA), suggesting emergent generalization beyond the simulator's components.
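A toy version of step (1) plus the superposition idea: draw AR coefficients whose characteristic roots lie inside the unit circle (guaranteeing stable paths), simulate each path, and sum several of them; the seasonal, integrated, and heavy-tailed-noise components of the full simulator are omitted here:

```python
import numpy as np

def sample_stable_ar(p, rng):
    """Draw AR(p) coefficients whose characteristic roots all lie inside the unit
    circle, so the simulated path is stationary (no blow-ups)."""
    roots = rng.uniform(-0.95, 0.95, size=p)        # real roots, |root| < 1
    return -np.poly(roots)[1:]                      # x_t = sum_i phi_i * x_{t-i} + eps_t

def simulate_ar_path(phi, n_steps, rng, noise_scale=1.0, burn_in=200):
    p = len(phi)
    x = np.zeros(burn_in + n_steps)
    for t in range(p, len(x)):
        x[t] = phi @ x[t - p:t][::-1] + rng.normal(scale=noise_scale)
    return x[burn_in:]

rng = np.random.default_rng(0)
# Superposition: combine several stable components into one richer trace.
series = sum(simulate_ar_path(sample_stable_ar(3, rng), 512, rng) for _ in range(4))
```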
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3551. UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
๐ฌ 23
โญ 5.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ large language model model quantization model compression
Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework, with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to cater to diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up the computation by 20×, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary embedding (RoPE) kernel for the pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a one-shot fashion, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models offer a memory reduction of 4×–5.7× and a token throughput improvement of 2.7×–3.4×, maintaining accuracy within 5% of the original models at 15% pruning rates across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models will be released at: https://github.com/.
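An illustrative sketch of combining low-rank decomposition with quantization on a single weight matrix: keep a small full-precision low-rank factor from an SVD and quantize the residual symmetrically (a generic scheme for intuition, not UniQL's exact procedure):

```python
import numpy as np

def lowrank_plus_quantized(W, rank=16, n_bits=4):
    """Approximate W as U_r @ V_r (full precision, low rank) + dequantized residual."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]                     # absorb singular values
    V_r = Vt[:rank]
    residual = W - U_r @ V_r
    # Symmetric per-tensor quantization of the residual.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(residual).max() / qmax if residual.any() else 1.0
    q = np.clip(np.round(residual / scale), -qmax, qmax)
    W_hat = U_r @ V_r + q * scale
    return W_hat, (U_r, V_r, q.astype(np.int8), scale)

W = np.random.randn(256, 256).astype(np.float32)
W_hat, _ = lowrank_plus_quantized(W)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```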
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3552. SceneCOT: Eliciting Chain-of-Thought Reasoning in 3D Scenes
๐ฌ 4
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D scene reasoning chain-of-thought reasoning multimodal LLM
Existing research on 3D LLMs still struggles to achieve efficient and explainable reasoning, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a Chain-of-Thought reasoning framework in 3D scenes (SceneCOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a framework, we build the first large-scale 3D scene Chain-of-Thought reasoning dataset, SceneCOT, including more than 190k high-quality data instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves state-of-the-art performance with clear interpretability. To our knowledge, this is the first attempt to successfully implement the CoT technique for achieving human-like step-by-step reasoning for 3D scene understanding, where we show great potential in extending it to a wider range of 3D scene understanding scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3553. Value Flows
๐ฌ 17
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Distributional Reinforcement Learning Flow Matching
While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. The predominant approaches estimate the return distribution by modeling it as a categorical distribution over discrete bins or by estimating a finite number of quantiles, but such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3554. Split Happens (But Your Video Model Can Be Edited)
๐ฌ 12
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Understanding Representation Learning Action Recognition
Recognition models are typically trained on fixed taxonomies. Yet these taxonomies can be too coarse, collapsing distinctions under a single label (e.g., the action "open" may vary by object, manner, or outcome), and they also evolve as new distinctions become relevant. Collecting annotations and retraining to accommodate such changes is costly. We introduce category splitting, a new task where an existing classifier is edited to refine a coarse class into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video models to expose fine-grained distinctions without extra data. We also show that low-shot fine-tuning, though simple, is highly effective, and benefits further from zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3555. Revisiting the Past: Data Unlearning with Model State History
๐ฌ 26
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ machine unlearning large language models
Large language models are trained on massive corpora of web data, which may include private data, copyrighted material, factually inaccurate data, or data that degrades model performance. Eliminating the influence of such problematic datapoints on a model through complete retraining---by repeatedly pretraining the model on datasets that exclude these specific instances---is computationally prohibitive. To address this, unlearning algorithms have been proposed that aim to eliminate the influence of particular datapoints at a low computational cost, while leaving the rest of the model intact. However, precisely unlearning the influence of data on a large language model has proven to be a major challenge. In this work, we propose a new algorithm, MSA (**M**odel **S**tate **A**rithmetic), for unlearning datapoints in large language models. MSA utilizes prior model checkpoints---artifacts that record model states at different stages of pretraining---to estimate and counteract the effect of targeted datapoints. Our experimental results show that MSA achieves competitive performance and often outperforms existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics, suggesting that MSA could be an effective approach towards more flexible large language models that are capable of data erasure.
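The abstract does not spell out the update rule, but the flavor of "model state arithmetic" can be conveyed with a toy sketch: treat the parameter change accumulated between two checkpoints that bracket the targeted data as a proxy for its influence and subtract a scaled version of it (purely illustrative; the checkpoint choice and `alpha` are assumptions):

```python
import torch

def msa_style_unlearn(final_state, ckpt_before, ckpt_after, alpha=1.0):
    """Return a new state dict where the parameter delta accumulated while the
    targeted datapoints were being trained on (ckpt_before -> ckpt_after) is
    partially subtracted from the final model."""
    new_state = {}
    for name, w in final_state.items():
        delta = ckpt_after[name] - ckpt_before[name]   # proxy for the data's influence
        new_state[name] = w - alpha * delta
    return new_state
```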
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3556. SPR$^2$Q: Static Priority-based Rectifier Routing Quantization for Image Super-Resolution
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Super-Resolution model quantization adapter routing
Low-bit quantization has achieved significant progress in image super-resolution. However, existing quantization methods show evident limitations in handling the heterogeneity of different components. Particularly under extreme low-bit compression, the issue of information loss becomes especially pronounced. In this work, we present a novel low-bit post-training quantization method, namely static priority-based rectifier routing quantization (SPR$^2$Q). The starting point of this work is to inject rich and comprehensive compensation information into the model before quantization, thereby enhancing the model's inference performance after quantization. First, we construct a low-rank rectifier group and embed it into the model's fine-tuning process. By integrating weight increments learned from each rectifier, the model enhances the backbone network while minimizing information loss during the lightweighting process. Furthermore, we introduce a static rectifier priority routing mechanism that evaluates the offline capability of each rectifier and generates a fixed routing table. During quantization, it updates weights based on each rectifier's priority, enhancing the model's capacity and representational power without introducing additional overhead during inference. Extensive experiments demonstrate that the proposed SPR$^2$Q significantly outperforms the state of the art on five benchmark datasets, achieving PSNR improvements of 0.55 dB and 1.31 dB on the Set5 ($\times 2$) dataset under 4-bit and 2-bit settings, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3557. SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment
๐ฌ 6
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Controllable Hand Image Generation 3D Hand Reconstruction
Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures sufficient human-centric contexts for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information with different granularity for feature refinement to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model's attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3558. Matching without Group Barrier for Heterogeneous Treatment Effect Estimation
๐ฌ 11
โญ 5.00
๐ causal reasoning
๐ท๏ธ causal inference matching
In heterogeneous treatment effect estimation from observational data, the fundamental challenge is that only the factual outcome under the received treatment is observable, while the potential outcomes under other treatments or no treatment can never be observed. As a simple and effective approach, matching aims to predict counterfactual outcomes of the target treatment by leveraging the nearest neighbors within the target group. However, due to limited observational data and the distribution shifts between groups, one cannot always find sufficiently close neighbors in the target group, resulting in inaccurate counterfactual predictions due to the manifold structure of the data. To address this, we remove group barriers and propose a matching method that selects neighbors from all samples, not just the target group. This helps find closer neighbors and improves counterfactual prediction. Specifically, we analyze the effect estimation error in matching, which motivates us to propose a self optimal transport model for matching. Based on this, we employ an outcome propagation mechanism via the transport plan for counterfactual prediction, and exploit factual outcomes to learn a distance as the transport cost. Experiments are conducted in both binary and multiple-treatment settings to evaluate our method.
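A simplified sketch of OT-based matching and outcome propagation with a plain Euclidean cost and an entropic (Sinkhorn) plan; the paper's self optimal transport removes the group barrier entirely and learns the transport cost from factual outcomes, which this toy version does not do:

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals, given a cost matrix."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def predict_counterfactuals(X_query, X_treated, y_treated, reg=0.1):
    """Predict each query unit's outcome under treatment by transporting mass to
    treated units and averaging their factual outcomes through the plan."""
    cost = ((X_query[:, None, :] - X_treated[None, :, :]) ** 2).sum(-1)
    plan = sinkhorn_plan(cost, reg)
    weights = plan / plan.sum(axis=1, keepdims=True)     # row-normalize the plan
    return weights @ y_treated
```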
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3559. TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
๐ฌ 8
โญ 5.00
๐ generative models
๐ท๏ธ Generative Models Flow Matching Reinforcement Learning GRPO Tree Search
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) \emph{High sample efficiency}, achieving better performance with the same number of training samples; (2) \emph{Fine-grained credit assignment} via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) \emph{Amortized computation} where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves \textbf{2.4$\times$ faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment.
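A toy sketch of the tree-advantage idea: leaves carry terminal rewards, each internal node's value is the mean of its children's values, and a branch's step advantage is the child's value minus its parent's value (the exact backup and normalization in TreeGRPO may differ):

```python
def backup_values(node):
    """node: {'reward': float or None, 'children': [...]}. Leaves hold rewards;
    internal nodes receive the mean value of their children."""
    if not node["children"]:
        node["value"] = node["reward"]
    else:
        node["value"] = sum(backup_values(c) for c in node["children"]) / len(node["children"])
    return node["value"]

def step_advantages(node, advs=None):
    """Step-specific advantage of the branch leading to each child."""
    advs = [] if advs is None else advs
    for child in node["children"]:
        advs.append(child["value"] - node["value"])
        step_advantages(child, advs)
    return advs

tree = {"reward": None, "children": [
    {"reward": None, "children": [{"reward": 1.0, "children": []},
                                  {"reward": 0.0, "children": []}]},
    {"reward": 0.0, "children": []},
]}
backup_values(tree)
print(step_advantages(tree))   # [0.25, 0.5, -0.5, -0.25]
```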
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3560. Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning
๐ฌ 18
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial attacks Machine unlearning Image generation model unlearning AI safety Stable Diffusion model AIGC
Recent advances in diffusion-based image generation models (IGMs), such as Stable Diffusion (SD), have substantially improved the quality and diversity of AI-generated content. However, these models also pose ethical, legal, and societal risks, including the generation of harmful, misleading, or copyright-infringing material. Machine unlearning (MU) has emerged as a promising mitigation by selectively removing undesirable concepts from pretrained models, yet the robustness of existing methods, particularly under multi-modal adversarial inputs, remains insufficiently explored. To address this gap, we propose RECALL, a multi-modal adversarial framework for systematically evaluating and compromising the robustness of unlearned IGMs. Unlike prior approaches that primarily optimize adversarial text prompts, RECALL exploits the native multi-modal conditioning of diffusion models by efficiently optimizing adversarial image prompts guided by a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse representative tasks show that RECALL consistently surpasses existing baselines in adversarial effectiveness, computational efficiency, and semantic fidelity to the original prompt. These results reveal critical vulnerabilities in current unlearning pipelines and underscore the need for more robust, verifiable unlearning mechanisms. More than just an attack, RECALL also serves as an auditing tool for model owners and unlearning practitioners, enabling systematic robustness evaluation. Code and data are available at https://anonymous.4open.science/r/RECALL.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3561. Multi-Scale Diffusion-Guided Graph Learning with Power-Smoothing Random Walk Contrast for Multi-View Clustering
๐ฌ 12
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multi-view Clustering
Despite the notable advances in graph-based deep multi-view clustering, existing approaches still suffer from three critical limitations: (1) relying on static graph structures and being unable to model the global semantic relationships across views; (2) contamination from false negative samples in contrastive learning frameworks; and (3) a fundamental trade-off between cross-view consistency and view-specific discrimination. To address these issues, we introduce \textbf{M}ulti-sc\textbf{A}le diffusio\textbf{N}-guided \textbf{G}raph learning with p\textbf{O}wer-smoothing random walk contrast (\textbf{MANGO}) for multi-view clustering, a unified framework that combines adaptive multi-scale diffusion, random walk-driven contrastive learning, and structure-aware view consistency modeling. Specifically, the multi-scale diffusion mechanism leverages local entropy guidance to dynamically fuse similarity matrices across different diffusion steps, thereby achieving joint modeling of fine-grained local structures and overall global semantics. Additionally, we introduce a random walk-based correction strategy that explores high-probability semantic paths to filter out false negative samples, and applies a $\beta$-power transformation to adaptively reweight contrastive targets, thereby reducing noise propagation. To further reconcile the consistency-specificity dilemma, the view consistency module enforces semantic alignment across views by sharing structural embeddings, ensuring consistent local structures while preserving heterogeneous features. Extensive experiments on 12 benchmark datasets demonstrate the superior performance of MANGO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3562. When Machine Learning Gets Personal: Evaluating Prediction and Explanation
๐ฌ 22
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Fairness explainability personalization
In high-stakes domains like healthcare, users often expect that sharing personal information with machine learning systems will yield tangible benefits, such as more accurate diagnoses and clearer explanations of contributing factors. However, the validity of this assumption remains largely unexplored. We propose a unified framework to fairly quantify whether personalizing a model improves both prediction and explanation for every group that provides personal data. We show that its impacts on prediction and explanation can diverge: a model may become more or less explainable even when prediction is unchanged. For practical settings, we study a standard hypothesis test for detecting personalization effects on demographic groups. We derive a finite-sample lower bound on its probability of error as a function of group sizes, number of personal attributes, and desired benefit from personalization. This provides actionable insights, such as which dataset characteristics are necessary to test an effect, or the maximum effect that can be tested given a dataset. We apply our framework to real-world tabular datasets using feature-attribution methods, uncovering scenarios where effects are fundamentally untestable due to the dataset statistics. Our results highlight the need for joint evaluation of prediction and explanation in personalized models and the importance of designing models and datasets with sufficient information for such evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3563. Step-Aware Residual-Guided Diffusion for EEG Spatial Super-Resolution
๐ฌ 5
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ EEG spatial super-resolution; Conditional Diffusion Model; Multi-channel Time Series Generation
For real-world BCI applications, lightweight electroencephalography (EEG) systems offer the best cost-deployment balance. However, their spatial sparsity limits spatial fidelity, hurting learning and introducing bias. EEG spatial super-resolution methods aim to recover high-density EEG signals from sparse measurements, yet they are often hindered by distribution shift and signal distortion, which reduce fidelity and usability for EEG analysis and visualization.
To overcome these challenges, we introduce SRGDiff, a step-aware residual-guided diffusion model that formulates EEG spatial super-resolution as dynamic conditional generation.
Our key idea is to learn a dynamic residual condition from the low-density input that predicts the step-wise temporal and spatial details to add and uses the evolving cue to steer the denoising process toward high density reconstructions.
At each denoising step, the proposed residual condition is additively fused with the previous denoiser feature maps, then a step-dependent affine modulation scales and shifts the activation to produce the current features.
This iterative procedure dynamically extracts step-wise temporal rhythms and spatial-topographic cues to steer high-density recovery and maintain a fidelity-consistency balance.
We adopt a comprehensive evaluation protocol spanning signal-, feature-, and downstream-level metrics across SEED, SEED-IV, and Localize-MI and multiple upsampling scales.
SRGDiff achieves consistent gains of up to 40\% over strong baselines, demonstrating its superiority in EEG spatial super-resolution. Moreover, comparisons of topographic visualizations and substantial EEG-FID gains jointly indicate that our SR EEG mitigates the spatial-spectral shift between low- and high-density recordings. Our code is available at https://anonymous.4open.science/r/DDPM-VAE-6F6B.
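A compact sketch of the fusion rule described above: the residual condition is added to the denoiser's feature maps, then a timestep-conditioned affine (scale-and-shift) modulation is applied; layer names and shapes are assumptions:

```python
import torch
import torch.nn as nn

class StepAwareResidualFusion(nn.Module):
    """Additively fuse a residual condition into denoiser features, then apply a
    timestep-conditioned affine modulation (FiLM-style scale and shift)."""
    def __init__(self, channels, t_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(t_dim, 2 * channels)

    def forward(self, feats, residual_cond, t_emb):
        # feats, residual_cond: (batch, channels, time); t_emb: (batch, t_dim)
        h = feats + residual_cond                              # additive fusion
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```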
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3564. DeAltHDR: Learning HDR Video Reconstruction from Degraded Alternating Exposure Sequences
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ HDR Video Reconstrucion Alternating Exposures Degraded Sequences
High dynamic range (HDR) video can be reconstructed from low dynamic range (LDR) sequences with alternating exposures. However, most existing methods overlook the degradations (e.g., noise and blur) in LDR frames, focusing only on the brightness and position differences between them. To address this gap, we propose DeAltHDR, a novel framework for high-quality HDR video reconstruction from degraded sequences. Our framework addresses two key challenges. First, noisy and blurry contents complicate inter-frame alignment. To tackle this, we propose a flow-guided masked attention that leverages optical flow for a dynamic sparse cross-attention computation, achieving superior performance while maintaining efficiency. Notably, its controllable attention ratio allows for adaptive inference costs. Second, the lack of real-world paired data hinders practical deployment. We overcome this with a two-stage training paradigm: the model is first pre-trained on our newly introduced synthetic paired dataset and subsequently fine-tuned on unlabeled real-world videos via a proposed self-supervised method. Experiments show our method outperforms state-of-the-art ones. The datasets and code will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3565. CoNavBench: Collaborative Long-Horizon Vision-Language Navigation Benchmark
๐ฌ 16
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ collaborative vision-and-language navigation; LLM agents; benchmark
Vision-and-Language Navigation (VLN) primarily focuses on a single-agent-centric approach that executes human instructions step-by-step. In real environments with high demand or parallel workflows, collaborative VLN offers distinct benefits, including a shorter makespan and greater robustness through parallelism and role specialization. Collaborative VLN also brings new challenges, including congestion, handoff errors, and rendezvous timing, which single-agent formulations overlook. Current datasets and protocols remain single-agent centered, which hides opportunities for assistance and ignores inter-robot interference. We fill this gap with the Collaborative Long-Horizon VLN benchmark (\textbf{CoNavBench}), consisting of 4048 single and collaborative episodes with graph-level annotations and a collaboration type taxonomy that controls handoff styles and rendezvous patterns. To generate and evaluate at scale, we build \textbf{NavCraft}, an automated graph-grounded data generation platform. A two-stage hierarchical agent first produces a long-horizon base mission for the primary robot and then instantiates helper robots, allocates subgoals, and specifies validated handoffs and rendezvous. The agents operate with a scene graph in the loop derived from Habitat-Sim, which enables reachability checks, travel time, and interference assessment, and iterative schedule repair via an efficiency tool library. As a reference, we provide a collaborative baseline based on a finetuned Qwen2.5-VL-3B. Trained with CoNavBench, collaborative policies reduce makespan and improve reliability over strong single-robot counterparts, yielding \textbf{18.11\%} step-level success. Anonymous Website: https://navcraft.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3566. Reasoning Boosts Opinion Alignment in LLMs
๐ฌ 10
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Opinion modeling Alignment LLMs Reasoning
Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3567. FedDAG: Clustered Federated Learning via Global Data and Gradient Integration for Heterogeneous Environments
๐ฌ 4
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Federated Learning Clustering Distributed Machine Learning
Federated Learning (FL) enables a group of clients to collaboratively train a model without sharing individual data, but its performance drops when client data are heterogeneous. Clustered FL tackles this by grouping similar clients. However, existing clustered FL approaches rely solely on either data similarity or gradient similarity, resulting in an incomplete assessment of client similarities. Prior clustered FL approaches also restrict knowledge and representation sharing to clients within the same cluster. This prevents cluster models from benefiting from the diverse client population across clusters.
To address these limitations, we introduce FEDDAG, a clustered FL framework that employs a weighted, class-wise similarity metric integrating both data and gradient information, providing a more holistic measure of similarity during clustering. In addition, FEDDAG adopts a dual-encoder architecture for cluster models, comprising a primary encoder trained on its own clients' data and a secondary encoder refined using gradients from complementary clusters. This enables cross-cluster feature transfer while preserving cluster-specific specialization.
Experiments on diverse benchmarks and data heterogeneity settings show that FEDDAG consistently outperforms state-of-the-art clustered FL baselines in accuracy.
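A small sketch of a weighted, class-wise similarity that blends data-distribution similarity with gradient similarity between two clients; the per-class statistics and the mixing weight `lam` are illustrative placeholders rather than FEDDAG's exact metric:

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def classwise_similarity(class_feats_i, class_feats_j, class_grads_i, class_grads_j, lam=0.5):
    """class_feats_*: dict class -> mean feature vector of that client's samples.
    class_grads_*: dict class -> gradient of the loss restricted to that class.
    Returns a similarity averaged over the classes both clients share."""
    shared = set(class_feats_i) & set(class_feats_j)
    sims = []
    for c in shared:
        data_sim = cosine(class_feats_i[c], class_feats_j[c])
        grad_sim = cosine(class_grads_i[c], class_grads_j[c])
        sims.append(lam * data_sim + (1 - lam) * grad_sim)   # weighted blend
    return float(np.mean(sims)) if sims else 0.0
```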
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3568. HiPO: Self-Hint Policy Optimization for RLVR
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Large Language Models Mathematical Reasoning
Reinforcement Learning from Verifiable Rewards (RLVR) is a promising method for enhancing the complex
problem-solving abilities of large language models (LLMs). This is particularly evident in domains requiring
long-horizon reasoning and precise execution, such as solving complex mathematical problems where solutions
hinge on a fragile sequence of tool-based actions. However, current approaches are often crippled by two
interconnected issues: the near-miss problem, where sparse rewards nullify the learning signal for
almost-correct attempts, and the resulting exploration stagnation, which prevents the model from
discovering better solutions. To address these challenges, we introduce HiPO (Hint-guided Policy Optimization),
a novel RLVR framework that enables the agent to learn from its own rare successes.
Our core insight is to capture an occasional successful trajectory within a training batch and
repurpose its initial correct steps as an on-policy "hint". This process
transforms a single, stochastically-found success into a dense contrastive learning signal,
effectively allowing the model to teach itself how to overcome the near-miss
problem and break exploration stagnation. On a challenging suite of five mathematical reasoning benchmarks,
HiPO improves the average avg@32 by +5.0 percentage points (pp) over the strong GRPO baseline.
This improvement is driven by substantial absolute point gains on challenging datasets,
including +10.3 pp on CMIMC 2025, +4.9 pp on BRUMO 2025, +4.6 pp on AIME 2024, and +3.1 pp on AIME 2025.
Furthermore, HiPO demonstrates a new exploration paradigm,
repurposing rare successes into reusable guidance to significantly accelerate skill acquisition for complex tasks,
establishing a more efficient and scalable path for models to autonomously master intricate reasoning.
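A toy illustration of the self-hint loop: if a batch of rollouts contains a rare success, reuse the first steps of that trajectory as an on-policy hint for a second batch on the same problem; the rollout interface and the fraction of the trajectory reused are assumptions:

```python
def self_hint_rollouts(policy, problem, n_rollouts=8, hint_fraction=0.3):
    """Generate a batch of rollouts; if any succeeds, regenerate the batch with the
    initial steps of one success prepended as an on-policy hint."""
    batch = [policy.rollout(problem) for _ in range(n_rollouts)]   # each: {'steps': [...], 'correct': bool}
    successes = [r for r in batch if r["correct"]]
    if not successes:
        return batch                                               # no hint available this round
    n_hint = max(1, int(hint_fraction * len(successes[0]["steps"])))
    hint_steps = successes[0]["steps"][:n_hint]
    hinted = [policy.rollout(problem, prefix=hint_steps) for _ in range(n_rollouts)]
    # Hinted vs. un-hinted rollouts provide a dense contrastive signal for policy updates.
    return batch + hinted
```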
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3569. Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
๐ฌ 12
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ self-supervision learning representation learning latent dynamics natural videos object recognition motion understanding computer vision
Object recognition and motion understanding are key components of perception that complement each other.
While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem.
On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks.
In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain.
Midway Network leverages a _midway_ top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos.
We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods.
We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
Code is provided at https://anonymous.4open.science/r/midway-network.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3570. MoDr: Mixture-of-Depth-Recurrent Transformers for Test-Time Reasoning
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models mixture-of-depth-recurrent transformer latent space test-time reasoning
Large Language Models have demonstrated superior reasoning capabilities by generating step-by-step reasoning in natural language before deriving the final answer. Recently, Geiping et al. introduced 3.5B-Huginn as an alternative to this paradigm, a depth-recurrent Transformer that increases computational depth per token by reusing a recurrent block in latent space. Despite its performance gains with increasing recurrences, this approach is inadequate for tasks demanding exploration and adaptivity, a limitation arising from its single, chain-like propagation mechanism. To address this, we propose a novel dynamic multi-branch routing approach for Huginn, termed the Mixture-of-Depth-Recurrent (MoDr) Transformer, which enables effective exploration of the solution space by shifting chain-like latent reasoning into a LoRA-based multi-branch dynamic relay mode with a learnable hard-gate routing. Meanwhile, we introduce an auxiliary-loss-free load balancing strategy to mitigate potential routing collapse. Our empirical results reveal that MoDr achieves average accuracy improvements of +7.2% and +2.48% over the original Huginn model and its fine-tuned variant, respectively, across various mathematical reasoning benchmarks and improvements of +21.21% and +1.52% on commonsense reasoning benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3571. Towards Privacy-Guaranteed Label Unlearning in Vertical Federated Learning: Few-Shot Forgetting Without Disclosure
๐ฌ 16
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Learning Machine Unlearning Privacy-Preserving
This paper addresses the critical challenge of unlearning in Vertical Federated Learning (VFL), a setting that has received far less attention than its horizontal counterpart. Specifically, we propose the first method tailored to *label unlearning* in VFL, where labels play a dual role as both essential inputs and sensitive information. To this end, we employ a representation-level manifold mixup mechanism to generate synthetic embeddings for both unlearned and retained samples. This provides richer signals for the subsequent gradient-based label forgetting and recovery steps. These augmented embeddings are then subjected to gradient-based label forgetting, effectively removing the associated label information from the model. To recover performance on the retained data, we introduce a recovery-phase optimization step that refines the remaining embeddings. This design achieves effective label unlearning while maintaining computational efficiency. We validate our method through extensive experiments on diverse datasets, including MNIST, CIFAR-10, CIFAR-100, ModelNet, Brain Tumor MRI, COVID-19 Radiography, and Yahoo Answers, demonstrating strong efficacy and scalability. Overall, this work establishes a new direction for unlearning in VFL, showing that re-imagining mixup as an efficient mechanism can unlock practical, utility-preserving unlearning. Our code will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3572. Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM RL
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3573. Towards a Foundation Model for Crowdsourced Label Aggregation
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Label Aggregation
Inferring ground truth from noisy, crowdsourced labels is a fundamental challenge in machine learning. For decades, the dominant paradigm has relied on dataset-specific parameter estimation, a non-scalable method that fails to transfer knowledge. Recent efforts toward universal aggregation models do not account for the structural and behavioral complexities of human-annotated crowdsourcing, resulting in poor real-world performance. To address this gap, we introduce CrowdFM, a foundation model for crowdsourced label aggregation. At its core, CrowdFM is a bipartite graph neural network that is pre-trained on a vast, domain-randomized synthetic dataset. By leveraging a size-invariant initialization and attention-based message passing, it learns universal principles of collective intelligence and generalizes to new, unseen datasets. Extensive experiments on 22 real-world benchmarks show that our single, fixed model consistently matches or surpasses bespoke, per-dataset methods in both accuracy and efficiency. Furthermore, the representations learned by CrowdFM readily support diverse downstream applications, such as worker assessment and task assignment. Code and pre-trained models will be made publicly available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3574. MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
๐ฌ 8
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Mathematical retrieval Mathematical comprehension Large language models
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce *MathNet*, a large-scale, high-quality, multilingual, and multimodal dataset of Olympiad-level problems. MathNet spans 40 countries, 10 languages, and two decades of competitions, comprising 17,512 **expert-authored problems with solutions** across diverse domains.
*MathNet* supports three tasks: (i) *mathematical comprehension*, (ii) *mathematical retrieval*, an underexplored but essential capability, and (iii) *Math RAG*, which evaluates how retrieval-augmented generation improves problem solving. For retrieval, we construct 39K pairs of mathematically equivalent problems to enable equivalence-based evaluation, in addition to 70 expert-curated pairs from real competitions. Experimental results show that even state-of-the-art reasoning models (76.8% for GPT-5 and 46.8% for Claude 4.5 Opus) are challenged, while embedding models struggle to retrieve equivalent problems. Finally, we show that LLM performance in RAG-based math problem solving is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark.
*MathNet* provides the largest high-quality Olympiad dataset and the first retrieval benchmark for problem equivalence. We publicly release both the dataset and benchmark at http://mathnet.netlify.app/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3575. Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence
๐ฌ 15
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Out-of-distribution Overconfidence Optimal transport
Deep neural networks (DNNs) often produce overconfident predictions on out-of-distribution (OOD) inputs, undermining their reliability in open-world environments. Singularities in semi-discrete optimal transport (OT) mark regions of semantic ambiguity, where classifiers are particularly prone to unwarranted high-confidence predictions. Motivated by this observation, we propose a principled framework to mitigate OOD overconfidence by leveraging the geometry of OT-induced singular boundaries. Specifically, we formulate an OT problem between a continuous base distribution and the latent embeddings of training data, and identify the resulting singular boundaries. By sampling near these boundaries, we construct a class of OOD inputs, termed optimal transport-induced OOD samples (OTIS), which are geometrically grounded and inherently semantically ambiguous. During training, a confidence suppression loss is applied to OTIS to guide the model toward more calibrated predictions in structurally uncertain regions. Extensive experiments show that our method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods. Codes will be made public upon paper acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3576. Do Large Language Models Know What They Are Capable Of?
๐ฌ 10
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Calibration Decision Making Overconfidence In-context learning LLM Agents LLM self-knowledge AI Safety
We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some LLMs reduce their overconfidence, leading to significantly improved decision making, while others do not. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3577. Decomposed Attention Fusion in MLLMs for Training-free Video Reasoning Segmentation
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLMs Segmentation Training-free
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to instructions. To exploit this for training-free localization, we cast video reasoning segmentation as video QA and extract attention maps via rollout. Since raw maps are too noisy to represent objects, we propose Decomposed Attention Fusion (DecAF), combining (1) contrastive object-background fusion and (2) complementary video-frame fusion. This yields cleaner attention maps focused on the target object, which can be directly converted into coarse segmentation masks and outperform existing methods. In addition, we introduce attention-guided SAM2 prompting for fine-grained masks, achieving performance comparable to training-based methods on both referring and reasoning VOS benchmarks.
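As a rough, training-free illustration of the fusion idea described above, the sketch below contrasts object-prompt attention against background-prompt attention and blends per-frame maps with a video-level prior; the array names, the 0.5 mixing weight, and the 0.5 threshold are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def decomposed_attention_fusion(attn_obj, attn_bg, eps=1e-8):
    """Toy sketch: contrast object-query attention against background-query
    attention, then fuse per-frame maps with a video-level prior.

    attn_obj, attn_bg: arrays of shape (T, H, W) -- per-frame attention maps
    for an object prompt and a background prompt (hypothetical inputs; the
    paper's exact rollout procedure is not reproduced here).
    """
    # (1) contrastive object-background fusion: suppress regions that the
    # background prompt also attends to
    contrast = np.clip(attn_obj - attn_bg, 0.0, None)
    # (2) complementary video-frame fusion: mix each frame with the temporal
    # mean so frames with weak evidence borrow context from the whole clip
    video_prior = contrast.mean(axis=0, keepdims=True)
    fused = 0.5 * contrast + 0.5 * video_prior
    # normalise each frame to [0, 1] before thresholding into coarse masks
    fmin = fused.min(axis=(1, 2), keepdims=True)
    fmax = fused.max(axis=(1, 2), keepdims=True)
    fused = (fused - fmin) / (fmax - fmin + eps)
    return fused > 0.5  # coarse boolean masks, shape (T, H, W)

# Example with random maps standing in for rollout attention
rng = np.random.default_rng(0)
masks = decomposed_attention_fusion(rng.random((4, 32, 32)), rng.random((4, 32, 32)))
print(masks.shape, masks.dtype)
```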
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3578. Learning to Segment for Vehicle Routing Problems
๐ฌ 26
โญ 5.00
๐ optimization
๐ท๏ธ Learning-Guided Optimization Vehicle Routing Problem
Iterative heuristics are widely recognized as state-of-the-art for Vehicle Routing Problems (VRPs). In this work, we exploit a critical observation: a large portion of the solution remains stable, i.e., unchanged across search iterations, causing redundant computations, especially for large-scale VRPs with long subtours. To address this, we pioneer the formal study of the First-Segment-Then-Aggregate (FSTA) decomposition technique to accelerate iterative solvers. FSTA preserves stable solution segments during the search, aggregates nodes within each segment into fixed hypernodes, and focuses the search only on unstable portions. Yet, a key challenge lies in identifying which segments should be aggregated. To this end, we introduce Learning-to-Segment (L2Seg), a novel neural framework to intelligently differentiate potentially stable and unstable portions for FSTA decomposition. We present three L2Seg variants: non-autoregressive (globally comprehensive but locally indiscriminate), autoregressive (locally refined but globally deficient), and their synergy. Empirical results on CVRP and VRPTW show that L2Seg accelerates state-of-the-art solvers by 2x to 7x. We further provide in-depth analysis showing why synergy achieves the best performance. Notably, L2Seg is compatible with traditional, learning-based, and hybrid solvers, while supporting various VRPs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3579. Modeling Others' Minds as Code
๐ฌ 18
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Multi-agent theory of mind program synthesis action understanding goal inference
Accurate prediction of human behavior is essential for robust and safe human-AI collaboration. However, existing approaches for modeling people are often data-hungry and brittle because they either make unrealistic assumptions about rationality or are too computationally demanding to adapt rapidly. Our key insight is that many everyday social interactions may follow predictable patterns: efficient "scripts" that minimize cognitive load for actors and observers, e.g., "wait for the green light, then go." We propose modeling these routines as behavioral programs instantiated in computer code rather than policies conditioned on beliefs and desires. We introduce ROTE, a novel algorithm that leverages both large language models (LLMs) for synthesizing a hypothesis space of behavioral programs, and probabilistic inference for reasoning about uncertainty over that space. We test ROTE in a suite of gridworld tasks and a large-scale embodied household simulator. ROTE predicts human and AI behaviors from sparse observations, outperforming competitive baselines---including behavior cloning and LLM-based methods---by as much as 50% in terms of in-sample accuracy and out-of-sample generalization. By treating action understanding as a program synthesis problem, ROTE opens a path for AI systems to efficiently and effectively predict human behavior in the real world.
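To make the "programs plus probabilistic inference" idea concrete, here is a minimal sketch that scores two hypothetical behavioral programs by how well they reproduce observed actions under a simple noise model; the programs, the prior, and the noise rate are all invented for illustration, and the real system would synthesize candidate programs with an LLM.

```python
import math

# Hypothetical behavioural programs: each maps an observation to an action.
def wait_for_green(obs):
    return "go" if obs["light"] == "green" else "wait"

def always_go(obs):
    return "go"

programs = {"wait_for_green": wait_for_green, "always_go": always_go}
prior = {"wait_for_green": 0.5, "always_go": 0.5}

def posterior(observations, actions, programs, prior, noise=0.05):
    """Score each candidate program by how well it reproduces the observed
    actions, allowing a small probability of deviation (noise model)."""
    scores = {}
    for name, prog in programs.items():
        log_lik = 0.0
        for obs, act in zip(observations, actions):
            p = 1.0 - noise if prog(obs) == act else noise
            log_lik += math.log(p)
        scores[name] = math.log(prior[name]) + log_lik
    # normalise the log-scores into a posterior distribution
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {k: math.exp(s - m) / z for k, s in scores.items()}

obs = [{"light": "red"}, {"light": "red"}, {"light": "green"}]
acts = ["wait", "wait", "go"]
print(posterior(obs, acts, programs, prior))
```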
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3580. RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
๐ฌ 19
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Tool Creation Tool-Augmented Reasoning
Large Language Models (LLMs) can enhance their reasoning capabilities by using external tools. However, many tasks lack predefined tools. Prior works have explored instructing LLMs to generate tools on their own, but such approaches depend heavily on internal knowledge and struggle when tasks fall outside the model's knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages external materials, such as textbooks and knowledge snippets. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable to non-scientific tasks, e.g., extremely low-resource language translation. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome internal knowledge limitations, advancing generalizable reasoning in knowledge-intensive domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3581. Reducing Class-Wise Performance Disparity via Margin Regularization
๐ฌ 21
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ image classification class-wise performance gap
Deep neural networks often exhibit substantial disparities in class-wise accuracy, even when trained on class-balanced data, posing concerns for reliable deployment. While prior efforts have explored empirical remedies, a theoretical understanding of such performance disparities in classification remains limited. In this work, we present Margin Regularization for performance disparity Reduction ($MR^2$), a theoretically principled regularization for classification by dynamically adjusting margins in both the logit and representation spaces. Our analysis establishes a novel margin-based, class-sensitive generalization bound that reveals how per-class feature variability contributes to error, motivating the use of larger margins for ''hard'' classes. Guided by this insight, $MR^2$ optimizes per-class logit margins proportional to feature spread and penalizes excessive representation margins to enhance intra-class compactness.
Experiments on seven datasets, including ImageNet, and diverse pre-trained backbones (MAE, MoCov2, CLIP) demonstrate that our $MR^2$ not only improves overall accuracy but also significantly boosts ''hard'' class performance without trading off ''easy'' classes, thus reducing the performance disparities. Code is available in the Supplementary Materials.
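A minimal sketch of a per-class logit-margin loss in this spirit is given below, with margins proportional to each class's feature spread; the scaling rule and the hyperparameter `tau` are assumptions rather than the paper's exact formulation, and the representation-margin penalty is omitted.

```python
import torch
import torch.nn.functional as F

def margin_adjusted_ce(logits, targets, class_spread, tau=0.5):
    """Hypothetical per-class logit-margin loss: subtract a margin proportional
    to each class's feature spread from the target logit, so 'hard'
    (high-variance) classes are trained with larger margins.

    logits: (B, C), targets: (B,), class_spread: (C,) per-class feature std.
    """
    margins = tau * class_spread / class_spread.max()   # scale margins into [0, tau]
    adjusted = logits.clone()
    adjusted[torch.arange(len(targets)), targets] -= margins[targets]
    return F.cross_entropy(adjusted, targets)

# Toy usage with random logits and spreads
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
spread = torch.rand(10) + 0.5
print(margin_adjusted_ce(logits, targets, spread).item())
```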
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3582. Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition
๐ฌ 23
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human-Object Interaction; Large Multi-modal Language Models;
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limit generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance performance and efficiency by fine-tuning the model, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detectors without retraining. Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3583. Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices
๐ฌ 25
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ TinyML Boosting Decision Trees Microcontrollers IoT
Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that, using an adapted training process and an alternative memory layout, our models match the performance of LightGBM models at a compression ratio of 4-16x. Once deployed, the corresponding IoT devices can operate independently of constant communication or external energy supply, and, thus, autonomously, requiring only minimal computing power and energy. This capability opens the door to a wide range of IoT applications, including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.
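One way to picture the "reward reuse of features and thresholds" idea is a penalized split-scoring rule like the sketch below; the penalty terms and their values are illustrative assumptions, not the paper's training objective.

```python
def regularized_gain(base_gain, feature, threshold, used_features, used_thresholds,
                     feature_penalty=0.1, threshold_penalty=0.05):
    """Hypothetical split-scoring rule: reuse of features and thresholds already
    present in the ensemble is rewarded (by not paying a penalty), so the final
    model needs fewer distinct values to store on-device.
    """
    gain = base_gain
    if feature not in used_features:
        gain -= feature_penalty           # cost of introducing a new feature
    if (feature, threshold) not in used_thresholds:
        gain -= threshold_penalty         # cost of introducing a new threshold
    return gain

# During tree growth, the best split would be chosen by the regularized gain:
used_f, used_t = {2, 5}, {(2, 0.7)}
candidates = [(0.40, 2, 0.7), (0.42, 3, 1.1), (0.41, 5, 0.3)]  # (gain, feature, threshold)
best = max(candidates, key=lambda c: regularized_gain(c[0], c[1], c[2], used_f, used_t))
print(best)
```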
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3584. WIMFRIS: WIndow Mamba Fusion and Parameter Efficient Tuning for Referring Image Segmentation
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Referring image segmentation parameter efficient tuning computer vision
Existing Parameter-Efficient Tuning (PET) methods for Referring Image Segmentation (RIS) primarily focus on layer-wise feature alignment, often neglecting the crucial role of a neck module for the intermediate fusion of aggregated multi-scale features, which creates a significant performance bottleneck. To address this limitation, we introduce WIMFRIS, a novel framework that establishes a powerful neck architecture alongside a simple yet effective PET strategy. At its core is our proposed HMF block, which first aggregates multi-scale features and then employs a novel WMF module to perform effective intermediate fusion. This WMF module leverages non-overlapping window partitioning to mitigate the information decay problem inherent in SSMs while ensuring rich local-global context interaction. Furthermore, our PET strategy enhances primary alignment with a MTA for robust textual priors, a MSA for precise vision-language fusion, and learnable emphasis parameters for adaptive stage-wise feature weighting. Extensive experiments demonstrate that WIMFRIS achieves new state-of-the-art performance across all public RIS benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3585. No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings
๐ฌ 42
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Membership Inference Data Privacy in Generative Models
Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3586. Topological Anomaly Quantification for Semi-supervised Graph Anomaly Detection
๐ฌ 18
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph neural network graph anomaly detection semi-supervised graph anomaly detection
Semi-supervised graph anomaly detection identifies nodes deviating from normal patterns using a limited set of labeled nodes. This paper specifically addresses the challenging scenario where only normal node labels are available. To address the challenge of anomaly scarcity in real-world graphs, generative-based methods synthesize anomalies by linear/non-linear interpolation or random noise perturbation. However, these methods lack a quantitative assessment of anomalies, hindering the reliability of the generated ones. To overcome this limitation, we propose a generative graph anomaly detection model based on topological anomaly quantification (TAQ-GAD). First, we design a topological anomaly quantification module (TAQ), which quantifies node abnormality through two topological metrics: the node boundary score (NBS) quantifies the boundaryness of a node by evaluating its connectivity to labeled normal neighbors, and the node isolation score (NIS) assesses the structural isolation of a node by evaluating its connection strength to other nodes within the same category. This anomaly measurement module dynamically screens nodes with high anomaly scores as pseudo-anomaly nodes. Subsequently, the topological anomaly enhancement (TAE) module generates virtual anomaly center nodes and constructs their topological relationships with other nodes. Finally, the method integrates normal and pseudo-anomaly nodes on the enhanced graph for model training. Extensive experiments on benchmark datasets demonstrate TAQ-GAD's superiority over state-of-the-art methods and its effectiveness in improving anomaly detection performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3587. Dissecting Representation Misalignment in Contrastive Learning via Influence Function
๐ฌ 12
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Contrastive Learning Influence Function Interpretability
Contrastive learning, commonly applied in large-scale multimodal models, often relies on data from diverse and unreliable sources, which can include misaligned or mislabeled text-image pairs. This frequently leads to robustness issues and hallucinations, ultimately causing performance degradation. Data valuation is an efficient way to detect and trace these misalignments. Nevertheless, existing methods are computationally expensive for large-scale models. Although computationally efficient, classical influence functions are inadequate for contrastive learning models, as they were initially designed for pointwise loss. Furthermore, contrastive learning involves minimizing the distance between positive sample modalities while maximizing the distance between negative sample modalities. This necessitates evaluating the influence of samples from both perspectives. To tackle these challenges, we introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models, eliminating the need for retraining. Building upon ECIF, we develop a series of algorithms for data evaluation, misalignment detection, and misprediction trace-back tasks. Experimental results demonstrate that our ECIF advances the transparency and interpretability of CLIP-style embedding models by offering a more accurate assessment of data impact and model alignment compared to traditional baseline methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3588. ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ diffusion LLMs parallel decoding benchmark
While most autoregressive LLMs are constrained to one-by-one decoding, diffusion LLMs (dLLMs) have attracted growing interest for their potential to dramatically accelerate inference through parallel decoding. Despite this promise, the conditional independence assumption in dLLMs causes parallel decoding to ignore token dependencies, inevitably degrading generation quality when these dependencies are strong. However, existing works largely overlook these inherent challenges, and evaluations on standard benchmarks (e.g., math and coding) are not sufficient to capture the quality degradation caused by parallel decoding. To address this gap, we first provide an information-theoretic analysis of parallel decoding. We then conduct case studies on analytically tractable synthetic list operations from both data distribution and decoding strategy perspectives, offering quantitative insights that highlight the fundamental limitations of parallel decoding. Building on these insights, we propose **ParallelBench**, the first benchmark specifically designed for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet exceptionally challenging for dLLMs under parallel decoding. Using ParallelBench, we systematically analyze both dLLMs and autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speedup without compromising quality. Our findings underscore the pressing need for innovative decoding methods that can overcome the current speed-quality trade-off. We are releasing our benchmark to help accelerate the development of truly efficient dLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3589. No labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
๐ฌ 23
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ VLM spatial reasoning LLM RL
Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks and show that it improves visual reasoning and surpasses open-source and proprietary models; with our improved visual grounding model, we further outperform recent text-only visual reasoning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3590. Merge before Forget: A Single LoRA Continual Learning via Continual Merging
๐ฌ 12
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual learning Low-rank adaptation
Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically utilizing these to help new LoRAs learn new tasks. However, these methods not only ignore the memory footprint that grows with the number of tasks and the limited storage space, but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRA updates into a single unified LoRA. Our method leverages orthogonal basis extraction from the previously learned LoRA to initialize the learning of new tasks, and further exploits the intrinsic asymmetry property of LoRA components by using a time-aware scaling mechanism to balance new and old knowledge during continual merging. Our approach maintains constant memory complexity with respect to the number of tasks, minimizes interference between past and new tasks via orthogonal basis initialization, and improves performance over asymmetric LoRA merging via adaptive scaling. We provide theoretical analysis to justify our design and conduct extensive experiments across diverse continual learning benchmarks using various Llama models, demonstrating the effectiveness and efficiency of our method.
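The sketch below illustrates the general shape of such a scheme: a new LoRA is initialized orthogonally to the basis of the running LoRA and then merged with a time-aware weight. The 1/(t+1) schedule, the QR-based initialization, and the factor-wise averaging (the product B @ A is not linear in its factors) are simplifying assumptions, not the paper's exact mechanism.

```python
import torch

def merge_lora(merged_A, merged_B, new_A, new_B, task_idx):
    """Continually fold a new LoRA (new_B @ new_A) into a single running LoRA
    with a time-aware weight so earlier tasks are not washed out.
    Averaging the factors separately is a simplification of true merging."""
    alpha = 1.0 / (task_idx + 1)                 # illustrative time-aware scaling
    merged_A = (1 - alpha) * merged_A + alpha * new_A
    merged_B = (1 - alpha) * merged_B + alpha * new_B
    return merged_A, merged_B

def orthogonal_init_from(merged_A):
    """Initialise the next task's A-matrix orthogonally to the row basis of the
    current merged A (illustrative use of a complete QR decomposition)."""
    q, _ = torch.linalg.qr(merged_A.T, mode="complete")   # (d, d)
    r = merged_A.shape[0]
    return q[:, r:2 * r].T.contiguous()                   # next r orthogonal directions

d, r = 64, 4
A, B = torch.randn(r, d) / d**0.5, torch.zeros(d, r)       # running single LoRA
A_new = orthogonal_init_from(A)                            # orthogonal init for task 2
A, B = merge_lora(A, B, A_new, torch.randn(d, r) * 0.01, task_idx=1)
print(A.shape, B.shape)
```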
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3591. CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis
๐ฌ 11
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Representation Learning Self-Supervised Learning Spectral Imaging
Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing.
However, variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability.
To address this bottleneck, we introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities.
To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic representation, we introduce a novel spectral encoder, featuring a self-attention-cross-attention mechanism, to distill salient spectral information into learned spectral representations.
Spatio-spectral pre-training is achieved with a novel feature-based self-supervision strategy tailored to CARL.
Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model's unique robustness to spectral heterogeneity, outperforming existing approaches on datasets with simulated and real-world cross-camera spectral variations.
The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3592. CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
๐ฌ 15
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ vision-language models CLIP compositionality binding
CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. Our key finding is that CLIP does not lack binding information. Through linear probing, robustness tests with increasing object counts, and conjunctive search experiments, we show that attribute-object bindings are already encoded within CLIP's text and image embeddings. The weakness lies in the cross-modal alignment, which fails to preserve this information. We show that this binding information can be accessed cross-modally with a simple linear transformation applied to text embeddings. This improves CLIP's attribute-object binding performance and confirms that the information was already encoded unimodally. In practice, this means CLIP-based systems can be enhanced with a lightweight linear layer trained on existing embeddings, avoiding costly encoder retraining.
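A hedged stand-in for such a linear transformation is a ridge-regression map fitted on paired text and image embeddings, as sketched below; the closed-form fit and the regularization strength are assumptions, since the paper's training procedure for the linear layer is not reproduced here.

```python
import numpy as np

def fit_alignment_map(text_emb, image_emb, reg=1e-2):
    """Fit a ridge-regression linear map W so that text_emb @ W better matches
    the paired image embeddings -- a minimal stand-in for a learned linear
    transformation applied to text embeddings.

    text_emb, image_emb: (N, D) arrays of paired, L2-normalised embeddings.
    """
    D = text_emb.shape[1]
    # Closed-form ridge solution: (T^T T + reg I) W = T^T I_img
    gram = text_emb.T @ text_emb + reg * np.eye(D)
    W = np.linalg.solve(gram, text_emb.T @ image_emb)
    return W  # (D, D); apply as text_emb @ W at inference time

rng = np.random.default_rng(0)
T = rng.normal(size=(256, 64)); T /= np.linalg.norm(T, axis=1, keepdims=True)
I = rng.normal(size=(256, 64)); I /= np.linalg.norm(I, axis=1, keepdims=True)
W = fit_alignment_map(T, I)
print(W.shape)
```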
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3593. Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
๐ฌ 23
โญ 5.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ language model agents human-AI collaboration human-in-the-loop evaluation
We present Collaborative Gym (Co-Gym), a general framework for developing and evaluating collaborative agents that engage in asynchronous, bidirectional communication with humans while interacting with task environments. While fully autonomous agents typically operate without humans in the loop, numerous use cases inherently require agents to collaborate with humans due to humans' latent preferences, domain expertise, or the need for control. This motivates the study of collaborative agents designed to work together with humans. We instantiate Co-Gym with three representative tasks---creating travel plans, writing Related Work sections, and analyzing tabular data---in both simulated and real-world conditions, and propose an evaluation framework assessing both collaboration outcomes and processes. Auditing initiative-taking patterns reveals that simply expanding the agent's action space to include communication is insufficient for effective collaboration or appropriate initiative. By equipping agents with a situational planning module, we observe substantial gains: the best-performing collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. Despite these improvements, our evaluation reveals persistent limitations in current language models and agents, with communication and situational awareness failures observed in 65% and 80% of cases in the real condition, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3594. On Universality of Deep Equivariant Networks
๐ฌ 14
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Geometric Deep Learning Theory for Equivariant Neural Networks Expressiveness Approximation Theory
Universality results for equivariant neural networks remain rare. Those that do exist typically hold only in restrictive settings: either they rely on regular or higher-order tensor representations, leading to impractically high-dimensional hidden spaces, or they target specialized architectures, often confined to the invariant setting. This work develops a more general account. For invariant networks, we establish a universality theorem under separation constraints, showing that the addition of a fully connected readout layer secures approximation within the class of separation-constrained continuous functions. For equivariant networks, where results are even scarcer, we demonstrate that standard separability notions are inadequate and introduce the sharper criterion of *entry-wise separability*. We show that with sufficient depth or with the addition of appropriate readout layers, equivariant networks attain universality within the entry-wise separable regime. Together with prior results showing the failure of universality for shallow models, our findings identify depth and readout layers as a decisive mechanism for universality, additionally offering a unified perspective that subsumes and extends earlier specialized results.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3595. Training Large Reasoning Models Efficiently via Progressive Thought Encoding
๐ฌ 20
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ efficient reasoning LLM KV cache test-time learning
Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into compact representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing training-time memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, across six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3\% improvement over LoRA and +29.9\% over the baseline on average, with up to +23.4 absolute gains on AIME2024/2025 under tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3596. Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
๐ฌ 19
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous Driving Reinforcement Fine-Tuning Multi-agent Traffic Simulation
Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3597. Relational Graph Transformer
๐ฌ 16
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph graph transformer relational deep learning
Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3598. Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Shadow Generation Relight Bridge matching
We propose Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. Unlike ray tracing, which requires full 3D reconstruction, LGI captures essential light-shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. LGI explicitly ties illumination direction to geometry, providing a physics-inspired prior that constrains generative models. Without such a prior, these models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. Building on this representation, we propose a unified pipeline for joint shadow generation and relighting, unlike prior methods that treat them as disjoint tasks, capturing the intrinsic coupling of illumination and shadowing essential for modeling indirect effects. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light-shadow reasoning. To enable effective training, we curated the first large-scale benchmark dataset for joint shadow generation and relighting, covering reflections, transparency, and complex interreflections. Experiments show significant gains in realism and consistency across synthetic and real images. LGI thus bridges geometry-inspired rendering with generative modeling, enabling efficient, physically consistent shadow generation and relighting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3599. PMI: Flow-Based Inversion Correction via Proximal Operator
๐ฌ 11
โญ 5.00
๐ generative models
๐ท๏ธ Flow Inversion Rectified Flow Image Editing
Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3600. ResCP: Reservoir Conformal Prediction for Time Series Forecasting
๐ฌ 17
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Conformal prediction Time series Uncertainty quantification
Conformal prediction offers a powerful framework for building distribution-free prediction intervals for exchangeable data.
Existing methods that extend conformal prediction to sequential data rely on fitting a relatively complex model to capture temporal dependencies.
However, these methods can fail if the sample size is small and often require expensive retraining when the underlying data distribution changes.
To overcome these limitations, we propose Reservoir Conformal Prediction (ResCP), a novel training-free conformal prediction method for time series.
Our approach leverages the efficiency and representation learning capabilities of reservoir computing to dynamically reweight conformity scores.
In particular, we compute similarity scores among reservoir states and use them to adaptively reweight the observed residuals at each step. With this approach, ResCP enables us to account for local temporal dynamics when modeling the error distribution without compromising computational scalability. We prove that, under reasonable assumptions, ResCP achieves asymptotic conditional coverage, and we empirically demonstrate its effectiveness across diverse forecasting tasks.
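A toy rendition of this recipe, with a minimal echo-state reservoir and a similarity-weighted conformal quantile, might look like the following; the reservoir hyperparameters, the softmax weighting, and the persistence forecaster are illustrative choices rather than ResCP's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reservoir_states(series, n_res=100, rho=0.9, leak=0.5):
    """Run a simple echo-state reservoir over a univariate series and return
    the state at every step (a minimal stand-in for the paper's reservoir)."""
    W_in = rng.uniform(-1, 1, size=n_res)
    W = rng.normal(size=(n_res, n_res))
    W *= rho / np.abs(np.linalg.eigvals(W)).max()   # rescale spectral radius
    x = np.zeros(n_res)
    states = []
    for u in series:
        x = (1 - leak) * x + leak * np.tanh(W_in * u + W @ x)
        states.append(x.copy())
    return np.array(states)

def weighted_interval(pred, residuals, states, query_state, alpha=0.1, temp=1.0):
    """Weight past absolute residuals by similarity between their reservoir
    states and the current state, then take a weighted quantile."""
    sims = states @ query_state
    w = np.exp((sims - sims.max()) / temp)
    w /= w.sum()
    order = np.argsort(residuals)
    cum = np.cumsum(w[order])
    idx = min(np.searchsorted(cum, 1 - alpha), len(residuals) - 1)
    q = residuals[order][idx]
    return pred - q, pred + q

series = np.sin(np.linspace(0, 20, 300)) + 0.1 * rng.normal(size=300)
S = reservoir_states(series)
residuals = np.abs(series[1:] - series[:-1])   # errors of a toy persistence forecast
lo, hi = weighted_interval(series[-1], residuals, S[:-1], S[-1])
print(f"next-step interval: [{lo:.3f}, {hi:.3f}]")
```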
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3601. GAVEL: Towards Rule-Based Safety through Activation Monitoring
๐ฌ 38
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI Safety Activation-Based Monitoring Rule-Based Detection Large Language Models Misuse Detection
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We open source GAVEL and provide an automated rule creation tool.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3602. Sharp Monocular View Synthesis in Less Than a Second
๐ฌ 12
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ photorealism view synthesis neural rendering
We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3603. ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
๐ฌ 14
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ omnidirectional image benchmark virtual reality
Omnidirectional images (ODIs) provide a full $360^{\circ} \times 180^{\circ}$ view and are widely adopted in VR, AR and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method which significantly enhances MLLMs' comprehension ability in the omnidirectional environment through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3604. VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
๐ฌ 37
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Vision-Language Models Vision Question Answering Ancient Greek Pottery Cultural Heritage Dataset Construction 3D Generation Archaeological AI Multimodal Learning
Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding tasks, demonstrating strong capabilities particularly in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural heritage domains like 3D vase artifacts, existing models face severe data scarcity and insufficient domain knowledge. Due to the lack of targeted training data, current VLMs struggle to effectively handle such culturally significant specialized tasks. To address these challenges, we propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer data and establishing a complete data construction pipeline. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach: our VaseVLM-7B-RL achieves a 12.8\% improvement in R@1 accuracy and a 6.6\% improvement in lexical similarity compared to the strongest baselines on the VaseVQA-3D dataset. These gains significantly improve the recognition and understanding of 3D vase artifacts and provide new technical pathways for digital heritage preservation research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3605. Neural Dynamics Self-Attention for Spiking Transformers
๐ฌ 12
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks Spiking self-attention Spiking transformer
Integrating Spiking Neural Networks (SNNs) with Transformer architectures offers a promising pathway to balance energy efficiency and performance, particularly for edge vision applications. However, existing Spiking Transformers face two critical challenges: i) a substantial performance gap relative to their Artificial Neural Network (ANN) counterparts, and ii) considerable memory overhead. Our theoretical analysis and empirical evidence indicate that these limitations arise from the unfocused global attention paradigm of Spiking Self Attention (SSA) and the storage cost of large attention matrices. Inspired by the localized receptive fields and membrane potential dynamics of biological visual neurons, we propose LRF-Dyn, which enables attention computation via spiking neurons endowed with localized receptive fields. Specifically, we integrate an LRF mechanism into SSA, enabling the model to allocate greater attention to neighboring regions and thereby enhance local modeling capacity. Moreover, LRF-Dyn approximates the charge-fire-reset dynamics of spiking neurons within the LRF-SSA, substantially reducing memory requirements during inference. Extensive experiments on visual tasks confirm that our method lowers memory overhead while delivering significant performance improvements.
These results establish LRF-Dyn as a key component for achieving energy-efficient Spiking Transformers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3606. Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
๐ฌ 8
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Performance Prediction Scaling Law Large Language Models Pretraining
The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for efficient resource allocation. This is challenged by: 1) the emergence phenomenon, where unpredictable capabilities appear suddenly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby establishing a more stable and predictable subset that exhibits well-behaved scaling as compute increases. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. The predictable subset's performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.63\% average prediction error across eight key LLM benchmarks, offering actionable insights for resource allocation and training monitoring of LLM pretraining.
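The clustering-then-extrapolation idea can be pictured with the toy pipeline below, which clusters synthetic per-task accuracy curves and fits a simple logistic trend per cluster; the synthetic data, the cluster count, and the logistic fit are illustrative assumptions, not the paper's scaling law or mapping function.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: per-task accuracy measured at several pre-training compute budgets.
rng = np.random.default_rng(0)
log_compute = np.array([19.0, 19.5, 20.0, 20.5, 21.0])        # log10 FLOPs
n_tasks = 40
midpoints = rng.uniform(19.5, 21.5, size=n_tasks)              # per-task "difficulty"
acc = 1 / (1 + np.exp(-3 * (log_compute[None, :] - midpoints[:, None])))
acc = np.clip(acc + 0.01 * rng.normal(size=acc.shape), 1e-3, 1 - 1e-3)

# (1) cluster tasks by their difficulty/scaling features (here: the accuracy curves)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(acc)

# (2) fit a simple logistic scaling trend per cluster and extrapolate to a larger budget
def fit_and_extrapolate(curve, target_log_c=22.0):
    logit = np.log(curve / (1 - curve))
    slope, intercept = np.polyfit(log_compute, logit, deg=1)
    return 1 / (1 + np.exp(-(slope * target_log_c + intercept)))

for k in range(4):
    mean_curve = acc[labels == k].mean(axis=0)
    print(f"cluster {k}: predicted mean accuracy at 1e22 FLOPs = "
          f"{fit_and_extrapolate(mean_curve):.3f}")
```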
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3607. Scaling Group Inference for Diverse and High-Quality Generation
๐ฌ 16
โญ 5.00
๐ generative models
๐ท๏ธ generative models diffusion models
Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we use intermediate predictions of the final sample at each step to progressively prune the candidate set, allowing our method to scale up efficiently to large input candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, and image prompting, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
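A greedy stand-in for the quadratic integer assignment reads as follows: pick candidates one at a time by quality plus accumulated diversity. The greedy rule and the diversity definition (one minus cosine similarity) are simplifications; the paper additionally prunes candidates using intermediate predictions, which is not shown here.

```python
import numpy as np

def greedy_group_select(quality, pairwise_div, k, lam=1.0):
    """Greedy approximation to selecting k candidates that maximise the sum of
    quality scores (unary term) plus lam * pairwise diversity (binary term).

    quality: (N,) per-candidate quality scores.
    pairwise_div: (N, N) symmetric diversity matrix, e.g. 1 - cosine similarity.
    """
    selected = [int(np.argmax(quality))]
    while len(selected) < k:
        best, best_gain = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            gain = quality[i] + lam * pairwise_div[i, selected].sum()
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
div = 1 - feats @ feats.T                     # diversity = 1 - cosine similarity
print(greedy_group_select(rng.random(32), div, k=4))
```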
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3608. CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
๐ฌ 12
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ cryo-EM structural biology foundation model JEPA SCUNet
Cryo-electron microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-level visualization of biomolecular assemblies. However, the exponential growth in cryo-EM data throughput and complexity, coupled with diverse downstream analytical tasks, necessitates unified computational frameworks that transcend current task-specific deep learning approaches with limited scalability and generalizability. We present CryoLVM, a foundation model that learns rich structural representations from experimental density maps with resolved structures by leveraging the Joint-Embedding Predictive Architecture (JEPA) integrated with SCUNet-based backbone, which can be rapidly adapted to various downstream tasks. We further introduce a novel histogram-based distribution alignment loss that accelerates convergence and enhances fine-tuning performance. We demonstrate CryoLVM's effectiveness across three critical cryo-EM tasks: density map sharpening, density map super-resolution, and missing wedge restoration. Our method consistently outperforms state-of-the-art baselines across multiple density map quality metrics, confirming its potential as a versatile model for a wide spectrum of cryo-EM applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3609. Neural Latent Arbitrary Lagrangian-Eulerian Grids for Fluid-Solid Interaction
๐ฌ 4
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Deep learning Neural Operator
Fluid-solid interaction (FSI) problems are fundamental in many scientific and engineering applications, yet effectively capturing the highly nonlinear two-way interactions remains a significant challenge. Most existing deep learning methods are limited to simplified one-way FSI scenarios, often assuming a rigid and static solid to reduce complexity. Even in two-way setups, prevailing approaches struggle to capture dynamic, heterogeneous interactions due to the lack of cross-domain awareness. In this paper, we introduce \textbf{Fisale}, a data-driven framework for handling complex two-way \textbf{FSI} problems. It is inspired by classical numerical methods, namely the Arbitrary Lagrangian-Eulerian (\textbf{ALE}) method and the partitioned coupling algorithm. Fisale explicitly models the coupling interface as a distinct component and leverages multiscale latent ALE grids to provide unified, geometry-aware embeddings across domains. A partitioned coupling module (PCM) further decomposes the problem into structured substeps, enabling progressive modeling of nonlinear interdependencies. Compared to existing models, Fisale introduces a more flexible framework that iteratively handles the complex dynamics of the solid, the fluid, and their coupling interface on a unified representation, and enables scalable learning of complex two-way FSI behaviors. Experimentally, Fisale excels in three challenging, realistic FSI scenarios covering 2D and 3D settings and various tasks. The code is included in the supplementary material for reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3610. ProteinAE: Protein Diffusion Autoencoders for Structure Encoding
๐ฌ 19
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein Auto-encoder; Protein Structure modeling
Developing effective representations of protein structures is essential for advancing protein science, particularly for protein generative modeling. Current approaches often grapple with the complexities of the $\operatorname{SE}(3)$ manifold, rely on discrete tokenization, or require multiple training objectives, all of which can hinder model optimization and generalization. We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder designed to overcome these challenges by directly mapping protein backbone coordinates from $\operatorname{E}(3)$ into a continuous, compact latent space. ProteinAE employs a non-equivariant Diffusion Transformer with a bottleneck design for efficient compression and is trained end-to-end with a single flow matching objective, substantially simplifying the optimization pipeline. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders. The resulting latent space serves as a powerful foundation for a latent diffusion model that bypasses the need for explicit equivariance. This enables efficient, high-quality structure generation that is competitive with leading structure-based approaches and significantly outperforms prior latent-based methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3611. Improving Text-guided CAD Prototyping via Modality-Specific Tokenization
๐ฌ 26
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Computer-aided Design CAD Generation
Computer-Aided Design (CAD) is a foundational component of industrial prototyping, where models are defined not by raw coordinates but by construction sequences such as sketches and extrusions.
This sequential structure enables both efficient prototype initialization and subsequent editing.
Text-guided CAD prototyping, which unifies Text-to-CAD generation and CAD editing, has the potential to streamline the entire design pipeline.
However, prior work has not explored this setting, largely because standard large language model (LLM) tokenizers decompose CAD sequences into natural-language word pieces, failing to capture primitive-level CAD semantics and hindering attention modules from modeling geometric structure.
We conjecture that a multimodal tokenization strategy, aligned with CAD's primitive and structural nature, can provide more effective representations.
To this end, we propose CAD-Tokenizer, a framework that represents CAD data with modality-specific tokens using a sequence-based VQ-VAE with primitive-level pooling and constrained decoding.
This design produces compact, primitive-aware representations that align with CAD's structural nature.
Applied to unified text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction following and generation quality, achieving better quantitative and qualitative performance over both general-purpose LLMs and task-specific baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3612. Latent Fourier Transform
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Music Generation Signal Processing Diffusion Models Audio Music Audio Generation Controllable Generation Fourier Transform Diffusion Autoencoders
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking latents in the frequency domain during training, our method yields representations that can be manipulated coherently at inference. This allows us to generate musical variations and blends from reference examples while preserving characteristics at desired timescales, which are specified as frequencies in the latent space. LatentFT parallels the role of the equalizer in music production: while traditional equalizers operate on audible frequencies to shape timbre, LatentFT operates on latent-space frequencies to shape musical structure. Experiments and listening tests show that LatentFT improves condition adherence and quality compared to baselines. We also present a technique for hearing frequencies in the latent space in isolation, and show that different musical attributes reside in different regions of the latent spectrum. Our results show how frequency-domain control in latent space provides an intuitive, continuous frequency axis for conditioning and blending, advancing us toward more interpretable and interactive generative music models.
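A bare-bones version of the latent-space "equalizer" is just an FFT over the time axis of a latent sequence with a band mask, as in the sketch below; the latent layout and band indices are assumptions, and the diffusion autoencoder that produces and decodes these latents is not shown.

```python
import torch

def latent_frequency_filter(latents, keep_band):
    """Minimal latent-space band filter: FFT along the time axis of a latent
    sequence, zero out frequency bins outside the kept band, and invert.

    latents: (T, D) real-valued latent sequence; keep_band: (lo, hi) bin indices.
    """
    spec = torch.fft.rfft(latents, dim=0)            # (T//2 + 1, D) complex spectrum
    mask = torch.zeros(spec.shape[0], 1)
    lo, hi = keep_band
    mask[lo:hi] = 1.0                                # keep only the chosen band
    return torch.fft.irfft(spec * mask, n=latents.shape[0], dim=0)

z = torch.randn(256, 64)                             # stand-in for encoded latents
slow = latent_frequency_filter(z, (0, 8))            # keep long-timescale structure
fast = latent_frequency_filter(z, (32, 129))         # keep short-timescale detail
print(slow.shape, fast.shape)
```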
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3613. ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
๐ฌ 11
โญ 5.00
๐ generative models
๐ท๏ธ RAG image generation rare-concept generation
While recent generative models synthesize high-quality visual content, they still struggle with generating rare or fine-grained concepts.
To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) for image generation, and introduce ImageRAG, a training-free method for rare concept generation.
Using a Vision Language Model (VLM), ImageRAG identifies generation gaps between an input prompt and a generated image dynamically, retrieves relevant images, and uses them as context to guide the generation process.
Prior approaches that use retrieved images require training models specifically for retrieval-based generation. In contrast, ImageRAG leverages existing image conditioning models, and does not require RAG-specific training.
We demonstrate our approach is highly adaptable through evaluation over different backbones, including models trained to receive image inputs and models augmented with a post-training image-prompt adapter.
Through extensive quantitative, qualitative, and subjective evaluation, we show that incorporating retrieved references consistently improves the generation abilities of rare and fine-grained concepts across three datasets and three generative models.
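A compact sketch of the training-free loop the abstract describes (identify gaps with a VLM, retrieve references, regenerate with image conditioning); `generator`, `vlm`, and `retriever` are placeholder callables, not the released API.

```python
def image_rag_generate(prompt, generator, vlm, retriever, max_rounds=3):
    """Illustrative retrieval-augmented image generation loop.

    generator(prompt, references) -> image
    vlm(prompt, image)            -> list of missing or misrendered concepts
    retriever(concepts)           -> list of reference images
    """
    references = []
    image = generator(prompt, references)
    for _ in range(max_rounds):
        missing = vlm(prompt, image)            # dynamically identify generation gaps
        if not missing:
            break
        references = retriever(missing)         # fetch examples of the missing concepts
        image = generator(prompt, references)   # condition the next attempt on them
    return image
```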
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3614. FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning
๐ฌ 18
โญ 5.00
๐ generative models
๐ท๏ธ LLM Reasoning Reinforcement Learning Robust Learning
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs).
In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization.
However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning.
Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns.
In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns.
Building on these insights, we propose **F**lawed-**A**ware **P**olicy **O**ptimization (**FAPO**), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage.
To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors.
Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.
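A toy sketch of the flawed-aware reward idea: correct rollouts flagged as flawed receive a reduced reward relative to fully correct ones. The flaw flag would come from the GenRM, and the numeric penalty here is a placeholder (the paper describes its penalty as parameter-free).

```python
def flawed_aware_reward(answer_correct: bool, flawed_positive: bool, penalty: float = 0.5) -> float:
    """Illustrative per-rollout reward under a flawed-aware scheme."""
    if not answer_correct:
        return 0.0                               # incorrect rollouts get no positive signal
    return 1.0 - penalty if flawed_positive else 1.0
```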
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3615. Riesz Neural Operator for Solving Partial Differential Equations
๐ฌ 14
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Operator Local derivative Taylor expansion Nonlinear PDEs
Local non-stationarity is pivotal to solving partial differential equations (PDEs). However, in operator learning, the spatially local information inherent in the data is often overlooked. Even when explicitly modeled, it is usually collapsed into local superpositions within the model architecture, preventing full exploitation of local features in physical phenomena. To address this limitation, our paper proposes a novel Riesz Neural Operator (RNO) based on the spectral derivative representation. Since PDEs are fundamentally governed by local derivatives, RNO leverages the Riesz transform, a natural spectral representation of derivatives, to mix global spectral information with local directional variations. This approach allows the RNO to outperform existing operators in complex scenarios that require sensitivity to local detail. Our design bridges the gap between physical interpretability and local dynamics. Experimental results demonstrate that the RNO consistently achieves superior prediction accuracy and generalization performance compared to existing approaches across various benchmark PDE problems and complex real-world datasets, presenting superior non-linear reconstruction capability in model analysis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3616. Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Quantization Pruning LLMs
Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To address this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization via error compensation between the two. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR incurs only a 1.4 perplexity degradation on Llama2-7B to enable aggressive W4A4KV4 quantization with 50\% sparsity, delivering up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
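A much-simplified, single-row toy of the joint prune-quantize-compensate idea: a least-squares refit of the kept weights on calibration inputs stands in for the paper's closed-form, Hessian-based group compensation, and all shapes and the quantization step are assumptions.

```python
import numpy as np

def prune_quantize_compensate(w: np.ndarray, x: np.ndarray, sparsity: float = 0.5, step: float = 0.05):
    """Toy compensation for one output neuron's weight row w of shape (n_in,).

    x: calibration inputs of shape (n_samples, n_in).
    """
    h_diag = (x ** 2).mean(axis=0)                       # diagonal Hessian proxy
    saliency = (w ** 2) * h_diag                         # OBS-style importance score
    keep = np.ones(w.size, dtype=bool)
    keep[np.argsort(saliency)[: int(sparsity * w.size)]] = False   # prune low-saliency weights

    y_target = x @ w                                     # original layer output on calibration data
    w_hat = np.zeros_like(w)
    # refit kept weights to absorb the pruning error, then snap them to a uniform grid
    w_hat[keep], *_ = np.linalg.lstsq(x[:, keep], y_target, rcond=None)
    w_hat[keep] = step * np.round(w_hat[keep] / step)
    return w_hat
```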
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3617. P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Poster Generate LLM-as-a-Judge Multi Agent
Academic posters are vital for scholarly communication, yet their manual creation is time-consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration.
Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively.
To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers. P2P employs three specialized agents (for visual element processing, content generation, and final poster assembly), each integrated with dedicated checker modules to enable iterative refinement and ensure output quality.
To foster advancements and rigorous evaluation in this domain, we argue that generated posters must be assessed from two complementary perspectives: objective fidelity and subjective quality.
We therefore establish P2Peval, a comprehensive benchmark featuring 1738 checklist items and a dual evaluation methodology (Fine-Grained and Universal).
Our Fine-Grained Evaluation uses human-annotated checklists to objectively measure the faithful preservation of verifiable content from the source paper. Concurrently, our Universal Evaluation captures subjective, holistic quality by training a model to align with human aesthetic preferences across key design principles. We evaluate a total of 35 models.
To power these advancements, we also release P2Pinstruct, the first large-scale instruction dataset comprising over 30,000 high-quality examples tailored for the academic paper-to-poster generation task. Furthermore, our contributions aim to streamline research dissemination while offering a principled blueprint for evaluating complex, creative AI-generated artifacts.
The code is available at https://anonymous.4open.science/r/c-163Canonymous/README.md.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3618. MedLesionVQA: A Multimodal Benchmark Emulating Clinical Visual Diagnosis for Body Surface Health
๐ฌ 18
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Medical Multimodal Benchmark Body Lesion Images Medical VQA
Body-surface health conditions, spanning diverse clinical departments, represent some of the most frequent diagnostic scenarios and a primary target for medical multimodal large language models (MLLMs).
Yet existing medical benchmarks are either built from publicly available sources with limited expert curation or focus narrowly on disease classification, failing to reflect the stepwise recognition and reasoning processes physicians follow in real practice.
To address this gap, we introduce MedLesionVQA, the first benchmark explicitly designed to evaluate MLLMs on the visual diagnostic workflow for body-surface conditions at scale.
All questions are derived from authentic clinical visual diagnosis scenarios and verified by medical experts with over 20 years of experience, while the data are drawn from 10k+ real patient visits, ensuring authenticity, clinical realism, and diversity.
MedLesionVQA consists of 12K in-house volunteer images (never publicly leaked) and 19K expert-verified question-answer pairs, with fine-grained annotations of 94 lesion types, 110 body regions, and 96 diseases.
We evaluate 20+ state-of-the-art MLLMs against human physicians: the best model reaches 56.2% accuracy, far below primary physicians (61.4%) and senior specialists (73.2%). These results expose the persistent gap between MLLMs and clinical expertise, underscoring the need for such multimodal benchmarks to drive trustworthy medical AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3619. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
๐ฌ 27
โญ 5.00
๐ generative models
๐ท๏ธ autoregressive video generation long video generation real-time video generation
Streaming video generation, a fundamental component of interactive world models and neural game engines, aims to produce high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key-value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
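A small sketch of the rolling-window idea with progressively increasing noise levels across frames; the `denoiser` interface, the linear schedule, and the tensor shapes are assumptions made for illustration only.

```python
import torch

def rolling_noise_levels(window: int, num_steps: int) -> torch.Tensor:
    """Oldest frame is nearly clean, newest frame is nearly pure noise."""
    return torch.linspace(1.0 / num_steps, 1.0, steps=window)

def rolling_forcing_step(denoiser, frames: torch.Tensor, levels: torch.Tensor):
    """One joint denoising step over a window of frames with shape (window, C, H, W).

    denoiser(frames, levels) is assumed to return frames one noise level cleaner;
    the oldest frame is emitted and a fresh fully-noised frame enters the window.
    """
    frames = denoiser(frames, levels)
    emitted, frames = frames[0], frames[1:]
    frames = torch.cat([frames, torch.randn_like(emitted)[None]], dim=0)
    return emitted, frames
```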
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3620. VINCIE: Unlocking In-context Image Editing from Video
๐ฌ 30
โญ 5.00
๐ generative models
๐ท๏ธ Image Editing Video Generation Diffusion Model
In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences.
To effectively learn from this data, we design three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction.
Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area.
Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3621. DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM-based agent systems failure analysis intervention
Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. On the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. Our findings highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3622. Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
๐ฌ 29
โญ 5.00
๐ reinforcement learning
๐ท๏ธ POMDP; Latent Variable Models; RL generalization
Recent work has framed decision-making as a sequence modeling problem using generative models such as diffusion models. Although promising, these approaches often overlook latent factors that exhibit evolving dynamics, elements that are fundamental to environment transitions, reward structures, and high-level agent behavior. Explicitly modeling these hidden processes is essential for both precise dynamics modeling and effective decision-making. In this paper, we propose a unified framework that explicitly incorporates latent dynamic inference into generative decision-making from minimal yet sufficient observations. We theoretically show that under mild conditions, the latent process can be identified from small temporal blocks of observations. Building on this insight, we introduce Ada-Diffuser, a causal diffusion model that learns the temporal structure of observed interactions and the underlying latent dynamics simultaneously, and furthermore, leverages them for planning and control. With a proper modular design, Ada-Diffuser supports both planning and policy learning tasks, enabling adaptation to latent variations in dynamics, rewards, and even recovering hidden action variables from action-free demonstrations. Extensive experiments on locomotion and robotic manipulation benchmarks demonstrate the modelโs effectiveness in accurate latent inference, long-horizon planning, and adaptive policy learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3623. Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Object Referring Reasoning
Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit chain-of-thought (CoT) reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3624. Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
๐ฌ 23
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language-Model Vision-Language-Action-Model Embodied Reasoning
While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing **Vlaser** - a **V**ision-**L**anguage-**A**ction Model with **s**ynergistic **e**mbodied **r**easoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning.
Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark. We will open-source the model weights, data generation pipelines, and the full dataset to support future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3625. OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction
๐ฌ 16
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ 3D Perception Open-vocabulary 3D Instance Segmentation Open-vocabulary 2D Instance Segmentation
In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts reconstructed scenes from 2D videos as input, avoiding costly manual adjustment while aligning input with real-world applications. By exploiting the 2D to 3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub-scene. To avoid incorrectly introduced false positives as supervision due to partial annotations from 2D to 3D, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative super-points based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from violating instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to open-vocabulary, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3626. Topology of Reasoning: Retrieved Cell Complex-Augmented Generation for Textual Graph Question Answering
๐ฌ 4
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Retrieval Augmented Generation Graph Question Answering Graph Neural Network Large Language Model Textual Graphs
Retrieval-Augmented Generation (RAG) enhances the reasoning ability of Large Language Models (LLMs) by dynamically integrating external knowledge, thereby mitigating hallucinations and strengthening contextual grounding for structured data such as graphs. Nevertheless, most existing RAG variants for textual graphs concentrate on low-dimensional structures, treating nodes as entities (0-dimensional) and edges or paths as pairwise or sequential relations (1-dimensional), but overlook cycles, which are crucial for reasoning over relational loops. Such cycles often arise in questions requiring closed-loop inference about similar objects or relative positions. This limitation often results in incomplete contextual grounding and restricted reasoning capability. In this work, we propose Topology-enhanced Retrieval-Augmented Generation (TopoRAG), a novel framework for textual graph question answering that effectively captures higher-dimensional topological and relational dependencies. Specifically, TopoRAG first lifts textual graphs into cellular complexes to model multi-dimensional topological structures. Leveraging these lifted representations, a topology-aware subcomplex retrieval mechanism is proposed to extract cellular complexes relevant to the input query, providing compact and informative topological context. Finally, a multi-dimensional topological reasoning mechanism operates over these complexes to propagate relational information and guide LLMs in performing structured, logic-aware inference. Empirical evaluations demonstrate that our method consistently surpasses existing baselines across diverse textual graph tasks.
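To make the lifting step concrete, here is a minimal sketch that attaches independent cycles of a graph as 2-cells; the paper's actual lifting, retrieval, and reasoning components are more elaborate, and the use of `networkx` plus the toy example graph are assumptions.

```python
import networkx as nx

def lift_to_cell_complex(graph: nx.Graph) -> dict:
    """0-cells are nodes, 1-cells are edges, 2-cells are cycles from a cycle basis."""
    return {
        0: list(graph.nodes),
        1: list(graph.edges),
        2: nx.cycle_basis(graph),   # each independent cycle becomes a 2-cell boundary
    }

# Example: a square with one diagonal yields two independent cycles as 2-cells.
g = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")])
print(lift_to_cell_complex(g)[2])
```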
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3627. TimeSliver: Symbolic-Linear Decomposition for Explainable Time Series Classification
๐ฌ 24
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Time-series Interpretability Temporal Attribution
Identifying the extent to which every temporal segment influences a model's predictions is essential for explaining model decisions and increasing transparency. While post-hoc explainable methods based on gradients and feature-based attributions have been popular, they suffer from reference state sensitivity and struggle to generalize across time-series datasets, as they treat time points independently and ignore sequential dependencies. Another perspective on explainable time-series classification is through interpretable components of the model, for instance, leveraging self-attention mechanisms to estimate temporal attribution; however, recent findings indicate that these attention weights often fail to provide faithful measures of temporal importance. In this work, we advance this perspective and present a novel explainability-driven deep learning framework, TimeSliver, which jointly utilizes raw time-series data and its symbolic abstraction to construct a representation that maintains the original temporal structure. Each element in this representation linearly encodes the contribution of each temporal segment to the final prediction, allowing us to assign a meaningful importance score to every time point. For time-series classification, TimeSliver outperforms other temporal attribution methods by 11\% on 7 distinct synthetic and real-world multivariate time-series datasets. TimeSliver also achieves predictive performance within 2\% of state-of-the-art baselines across 26 UEA benchmark datasets, positioning it as a strong and explainable framework for general time-series classification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3628. Joint Adaptation of Uni-modal Foundation Models for Multi-modal Alzheimer's Disease Diagnosis
๐ฌ 20
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Artificial Intelligence for sciences; Alzheimer's disease; multi-modal diagnosis; Foundation Models
Alzheimer's Disease (AD) is a progressive neurodegenerative disorder and a leading cause of dementia worldwide. Accurate diagnosis requires integrating diverse patient data modalities. With the rapid advancement of foundation models in neurobiology and medicine, integrating foundation models from various modalities has emerged as a promising yet underexplored direction for multi-modal AD diagnosis. A central challenge is enabling effective interaction among these models without disrupting the robust, modality-specific representations learned from large-scale pretraining. To address this, we propose a novel multi-modal framework for AD diagnosis that enables joint interaction among uni-modal foundation models through modality-anchored interaction. In this framework, one modality and its corresponding foundation model are designated as an anchor, while the remaining modalities serve as auxiliary sources of complementary information. To preserve the pre-trained representation space of the anchor model, we propose modality-aware Q-formers that selectively map auxiliary modality features into the anchor model's feature space, enabling the anchor model to jointly process its own features together with the seamlessly integrated auxiliary features. We evaluate our method on AD diagnosis and progression prediction across four modalities: sMRI, fMRI, clinical records, and genetic data. Our framework consistently outperforms prior methods in two modality settings, and further demonstrates strong generalization to external datasets and other neurodegenerative diseases such as Parkinson's disease.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3629. COMI: Coarse-to-fine Context Compression via Marginal Information Gain
๐ฌ 24
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Long Context
Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks. However, their deployment in long context scenarios remains hindered by computational inefficiency and information redundancy. Context compression methods address these challenges by significantly reducing input length and eliminating redundancy. We propose COMI, a coarse-to-fine adaptive context compression framework that jointly optimizes for semantic relevance and diversity under high compression rates. We introduce Marginal Information Gain (MIG), a metric defined as the relevance of a unit to the input query minus its semantic redundancy with other units, guiding the compression process to prioritize information that is both relevant and minimally redundant. The framework operates in two stages: (1) Coarse-Grained Group Reallocation, where the context is partitioned into groups and dynamically assigned compression rates based on inter-group MIG, ensuring compression budgets align with information value distribution; and (2) Fine-Grained Token Merging, where tokens within each group are fused via an intra-group MIG-based weighting mechanism, thereby preserving key semantics while avoiding the accumulation of redundancy. Extensive experiments across question-answering (e.g., NaturalQuestions, 2WikiMQA, HotpotQA and NarrativeQA) and summarization (e.g., MultiNews) with various backbones (e.g., LLaMA-2-7B, Qwen2-7B) show that COMI outperforms existing baselines by a large margin, e.g., approximately 25-point Exact Match (EM) improvement under 32x compression constraint with Qwen2-7B on NaturalQuestions.
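An illustrative computation of the Marginal Information Gain of each context unit (relevance to the query minus redundancy with the other units) using cosine similarities over embeddings; the embedding source and the use of a mean for redundancy are assumptions.

```python
import numpy as np

def marginal_information_gain(unit_embs: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """MIG per unit. unit_embs: (N, D) context units, query_emb: (D,)."""
    def normalize(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)

    u = normalize(unit_embs)
    q = normalize(query_emb[None, :])
    relevance = (u @ q.T).squeeze(-1)       # cosine similarity to the query
    sim = u @ u.T                           # pairwise unit similarity
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.mean(axis=1)           # average similarity to the other units
    return relevance - redundancy           # high = relevant and non-redundant
```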
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3630. RepIt: Steering Language Models with Concept-Specific Refusal Vectors
๐ฌ 14
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ interpretability representations steering safety
Current safety evaluations of language models rely on benchmark-based assessments that may miss targeted vulnerabilities. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations in LM activations. While existing steering methods already achieve high attack success rates through broad interventions, RepIt enables a more concerning capability: selective suppression of refusal on targeted concepts while preserving refusal elsewhere. Across five frontier LMs, RepIt produces evaluation-evading models that answer questions related to weapons of mass destruction while still scoring as safe on standard benchmarks. We find that the steering-vector edit localizes to just 100-200 neurons, and robust concept vectors can be extracted from as few as a dozen examples on a single A6000, highlighting how targeted, hard-to-detect modifications can exploit evaluation blind spots with minimal resources. By demonstrating precise concept disentanglement, this work exposes critical vulnerabilities in current safety evaluation practices and demonstrates an immediate need for more comprehensive, representation-aware assessments.
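A rough sketch of extracting a concept-specific direction from residual-stream activations and sparsifying it to a few hundred coordinates, echoing the localization reported above; the difference-of-means estimator and the shapes are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def concept_refusal_vector(concept_acts: torch.Tensor,
                           generic_acts: torch.Tensor,
                           top_k: int = 200) -> torch.Tensor:
    """concept_acts: (N1, D) activations on target-concept prompts,
    generic_acts: (N2, D) activations on generic refused prompts."""
    direction = concept_acts.mean(0) - generic_acts.mean(0)
    idx = direction.abs().topk(top_k).indices          # keep the dominant coordinates
    sparse = torch.zeros_like(direction)
    sparse[idx] = direction[idx]
    return sparse / sparse.norm()
```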
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3631. Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Vision-Language Models Hallucination
Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems. Through in-depth analysis of LVLM activation patterns, we reveal two key findings: 1) truthfulness and visual perception capabilities predominantly engage different subsets of attention heads within the model architecture; and 2) truthfulness steering vectors vary significantly across different semantic contexts. Based on these observations, we propose Dynamic Multimodal Activation Steering, a training-free, plug-and-play approach for hallucination mitigation. Our method constructs a semantic-based truthfulness steering vector database and computes visual perception steering vectors, enabling context-aware interventions during inference by dynamically selecting the most relevant steering vectors based on input semantic similarity and applying them to the most influential attention heads. We conduct comprehensive experiments across multiple models and datasets, demonstrating that our approach significantly enhances model performance, outperforming existing state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3632. Generative Blocks World: Moving Things Around in Pictures
๐ฌ 4
โญ 5.00
๐ generative models
๐ท๏ธ 3D primitives Diffusion Models
We describe Generative Blocks World, an approach for interacting with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method, which is conditioned on depth and a texture hint. Our texture hint takes into account the modified 3D primitives, exceeding the texture consistency provided by existing techniques. These texture hints (a) allow accurate object and camera moves and (b) preserve the identity of objects. Our experiments demonstrate that our approach outperforms prior works in visual fidelity, editability, and compositional generalization. Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3633. Randomized Antipodal Search Done Right for Data Pareto Improvement of LLM Unlearning
๐ฌ 20
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Unlearning Randomized Algorithms
Large language models (LLMs) sometimes memorize undesirable knowledge, which must be removed after deployment. Prior work on machine unlearning has focused largely on optimization methods that adjust parameters to enforce forgetting while preserving retention. However, these approaches assume that the forget and retain sets are readily available, which rarely holds in practice. Unlearning is typically triggered by an undesired generation at inference time, making the retrieval of relevant data the central challenge.
We introduce the notion of \emph{data Pareto improvement} for LLM unlearning, which formalizes how retrieval can expand the achievable trade-off frontier between forgetting and retention. To realize this principle, we propose Randomized Antipodal Search on Linearized Influence Kernel (RASLIK), a retrieval algorithm that combines permutation-projection hashing with randomized antipodal search. RASLIK reduces selection variance, achieves sublinear complexity, and yields a double gain in both quality and efficiency. Across multiple models, datasets, and unlearning algorithms, RASLIK consistently outperforms deterministic baselines and even oracle sampling, establishing randomized search as a principled and scalable solution for data-centric unlearning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3634. Stochastic Optimal Control for Continuous-Time fMRI Representation Learning
๐ฌ 23
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ self-supervised learning neural differential equations irregular time-series fMRI
Learning robust representations from functional magnetic resonance imaging (fMRI) is fundamentally challenged by the temporal irregularity and noise inherent in data from heterogeneous sources. Existing self-supervised learning (SSL) methods often discard critical temporal information by discretizing or averaging fMRI signals. To address this, we introduce a novel framework that reframes SSL as a Stochastic Optimal Control (SOC) problem. Our approach models brain activity as continuous-time latent dynamics, learning a robust representation of brain dynamics by optimizing a control policy that is agnostic to the temporal irregularity. This SOC framework naturally unifies masked autoencoding (MAE) and joint-embedding prediction (JEPA) to extract compact, control-derived representations. Furthermore, a simulation-free inference strategy ensures computational efficiency and scalability for large-scale fMRI datasets. Our model demonstrates state-of-the-art performance across diverse downstream applications, highlighting the potential of the SOC-based continuous-time representation learning framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3635. AutoQVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization
๐ฌ 25
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language-Action Models Embodied AI Model Quantization
The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of quantization for VLA models is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce AutoQVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, AutoQVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. On the LIBERO benchmark, the quantized version of OpenVLA-OFT obtained with our method requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49$\times$ speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.
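A simplified sketch of measuring per-channel action-space sensitivity by quantizing one output channel at a time and comparing predicted actions on a calibration batch; the functional `policy(obs_batch, weight)` interface is a placeholder, not the authors' code.

```python
import torch

@torch.no_grad()
def channel_action_sensitivity(policy, weight: torch.Tensor, obs_batch, bits: int = 4) -> torch.Tensor:
    """Returns one sensitivity score per output channel of a single weight matrix."""
    def quantize(w):
        scale = w.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
        return torch.round(w / scale) * scale

    ref_actions = policy(obs_batch, weight)
    sens = torch.zeros(weight.shape[0])
    for c in range(weight.shape[0]):
        w_q = weight.clone()
        w_q[c] = quantize(weight[c])                    # quantize only channel c
        sens[c] = (policy(obs_batch, w_q) - ref_actions).abs().mean()
    return sens                                         # higher sensitivity -> allocate more bits
```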
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3636. GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
๐ฌ 4
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Scene Understanding Vision Language Models
In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. However, their performance in 3D spatial comprehension, which is critical for embodied intelligence, remains limited. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. However, we propose exploring a purely vision-based solution inspired by human perception, which relies solely on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial knowledge, revealing that their primary shortcoming lies in the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm in VLM training and inference that helps build the global-local relationship, significantly improving the 3D spatial understanding of indoor scenes. Specifically, GPT4Scene constructs a Bird's Eye View (BEV) image from the video and marks consistent object IDs across both frames and the BEV image. The model then takes as input the concatenated BEV image and video frames with markers. In zero-shot evaluations, GPT4Scene improves performance over closed-source VLMs like GPT-4o. Additionally, we prepare a processed video dataset consisting of 165K text annotations to fine-tune open-source VLMs, achieving state-of-the-art performance on all 3D understanding tasks. Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently improve during inference, even without object marker prompting and the BEV image as explicit correspondence. It demonstrates that the proposed paradigm helps VLMs develop an intrinsic ability to understand 3D scenes, which paves the way for a seamless approach to extending VLMs for 3D scene understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3637. Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
๐ฌ 16
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability alignment adversarial defense
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution.
In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision.
For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including mathematical problem solving, emoji usage, and Midjourney prompt generation.
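A minimal sketch of the two steps described above for a single weight matrix: take the top singular directions of the fine-tuning weight difference, then flag inputs whose activations align with them. Using the right singular vectors and a 0.2 threshold are illustrative choices, not the paper's exact configuration.

```python
import torch

def finetune_directions(w_finetuned: torch.Tensor, w_base: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Top-k right singular vectors of the weight delta (candidate new behaviors)."""
    delta = w_finetuned - w_base
    _, _, vh = torch.linalg.svd(delta, full_matrices=False)
    return vh[:k]                                       # (k, d_in) directions in input space

def flag_inputs(activations: torch.Tensor, directions: torch.Tensor, threshold: float = 0.2):
    """Flag activations (B, d_in) whose cosine similarity with any direction is large."""
    a = activations / activations.norm(dim=-1, keepdim=True)
    d = directions / directions.norm(dim=-1, keepdim=True)
    return (a @ d.T).abs().max(dim=-1).values > threshold
```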
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3638. Adjusting Prediction Model Through Wasserstein Geodesic for Causal Inference
๐ฌ 14
โญ 5.00
๐ causal reasoning
๐ท๏ธ causal inference
Causal inference estimates the treatment effect by comparing the potential outcomes of the treated and control groups. Due to the existence of confounders, the distributions of treated and control groups are imbalanced, resulting in limited generalization ability of the outcome prediction model, i.e., the prediction model trained on one group cannot perform well on the other group. To tackle this, existing methods usually adjust confounders to learn balanced representations for aligning the distributions. However, these methods could suffer from the over-balancing issue that predictive information about outcomes is removed during adjustment. In this paper, we propose to adjust the outcome prediction model to improve its generalization ability on both groups simultaneously, so that the over-balancing issue caused by confounder adjustment can be avoided. To address the challenge of large distribution discrepancy between groups during model adjustment, we propose to generate intermediate groups through the Wasserstein geodesic, which smoothly connects the control and treated groups. Based on this, we gradually adjust the outcome prediction model between consecutive groups by a self-training paradigm. To further enhance the performance of the model, we filter the generated samples to select high-quality samples for learning. We provide the theoretical analysis regarding our method, and demonstrate the effectiveness of our method on several benchmark datasets in terms of multiple evaluation metrics.
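A small sketch of generating one intermediate group on the Wasserstein geodesic between equal-size control and treated samples via optimal matching (displacement interpolation); the paper additionally filters generated samples and uses self-training, which is omitted here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein_interpolate(control: np.ndarray, treated: np.ndarray, t: float) -> np.ndarray:
    """Intermediate group at position t on the geodesic (t=0: control, t=1: treated)."""
    cost = cdist(control, treated, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)            # optimal one-to-one matching
    return (1.0 - t) * control[rows] + t * treated[cols]
```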
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3639. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery
๐ฌ 21
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Bayesian Optimization; Large Language Models; AI for Science; Scientific Discovery
Scientific discovery is increasingly constrained by costly experiments and limited budgets, making efficient optimization essential for AI for science. Bayesian Optimization (BO), while widely adopted for balancing exploration and exploitation, suffers from slow cold-start performance and poor scalability in high-dimensional settings, limiting its effectiveness in real-world scientific applications. To address these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO achieves consistent improvements across diverse dry-lab benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO reaches \textbf{90\% of the best observed value within 6 iterations}, whereas standard BO and existing LLM-augmented baselines require more than 10 iterations. Together, the results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3640. CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization
๐ฌ 25
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Subcellular Localization Human Protein 3D Structure
Subcellular localization is a crucial biological task for drug target identification and function annotation. Although it is well established biologically that subcellular localization is closely associated with protein structure, no existing dataset offers comprehensive 3D structural information with detailed subcellular localization annotations, thus severely hindering the application of promising structure-based models on this task.
To address this gap, we introduce a new benchmark called $\textbf{CAPSUL}$, a $\textbf{C}$omprehensive hum$\textbf{A}$n $\textbf{P}$rotein benchmark for $\textbf{SU}$bcellular $\textbf{L}$ocalization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts.
We evaluate this benchmark using a variety of state-of-the-art sequence-based and structure-based models, showcasing the importance of involving structural features in this task. Furthermore, we explore reweighting and single-label classification strategies to facilitate future investigation on structure-based methods for this task.
Lastly, we showcase the powerful interpretability of structure-based methods through a case study on the Golgi apparatus, where we discover a decisive $\alpha$-helix localization pattern from the attention mechanisms, demonstrating the potential for bridging the gap with intuitive biological interpretability and paving the way for data-driven discoveries in cell biology.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3641. Exploring Specular Reflection Inconsistency for Generalizable Face Forgery Detection
๐ฌ 25
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ face forgery detection illumination separation specular reflection analysis
Detecting deepfakes has become increasingly challenging as forgery faces synthesized by AI-generated methods, particularly diffusion models, achieve unprecedented quality and resolution. Existing forgery detection approaches relying on spatial and frequency features demonstrate limited efficacy against high-quality, entirely synthesized forgeries. In this paper, we propose a novel detection method grounded in the observation that facial attributes governed by complex physical laws and multiple parameters are inherently difficult to replicate. Specifically, we focus on illumination, particularly the specular reflection component in the Phong illumination model, which poses the greatest replication challenge due to its parametric complexity and nonlinear formulation. We introduce a fast and accurate face texture estimation method based on Retinex theory to enable precise specular reflection separation. Furthermore, drawing from the mathematical formulation of specular reflection, we posit that forgery evidence manifests not only in the specular reflection itself but also in its relationship with corresponding face texture and direct light. Building on this insight, we design the Specular-Reflection-Inconsistency-Network (SRI-Net), incorporating a two-stage cross-attention mechanism to capture these correlations and integrate specular reflection related features with image features for robust forgery detection. Experimental results demonstrate that our method achieves superior performance on both traditional deepfake datasets and generative deepfake datasets, particularly those containing diffusion-generated forgery faces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3642. Credit-Budgeted ICPC-Style Coding: When LLM Agents Must Pay for Every Decision
๐ฌ 10
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Coding Agents Large Language Models Agent Evaluation Interactive Environment
Contemporary coding-agent benchmarks applaud "first correct answer," silently assuming infinite tokens, container minutes, and developer patience. In production, every LLM call, test re-run, and rollback incurs hard cost; agents that cannot budget these resources are dead on arrival. We close the gap with USACOArena, an ICPC-inspired arena where agents pay deterministic credits for every prompt, compilation, test, or rollback. A task becomes a cost-benefit negotiation under uncertainty: is a second sample worth 15\% of the remaining budget, or should the agent pivot to a cheaper heuristic? Real-time deduction exposes decision profiles hidden from static leaderboards: the tax of over-specialized generators, the ROI of early-exit heuristics, and the compound interest of lightweight scaffolding. Even identically seeded agents diverge in self-play, revealing a rich policy space where the same model oscillates between spendthrift submission sprees and parsimonious exploration. Released as a reproducible benchmark and zero-shot curriculum, USACOArena provides the traces, credit engine, and six state-of-the-art decision logs to catalyze research on coding agents that know when to stop.
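A toy sketch of a deterministic credit engine of the kind described above; the action price table and the budget are invented placeholders, not the benchmark's actual values.

```python
class CreditEngine:
    """Deterministic credit accounting for one agent run (illustrative)."""
    PRICES = {"prompt": 10, "compile": 2, "test": 5, "rollback": 3}  # placeholder prices

    def __init__(self, budget: int):
        self.remaining = budget
        self.log = []

    def charge(self, action: str) -> bool:
        """Deduct the action's cost; return False once the budget is exhausted."""
        cost = self.PRICES[action]
        if cost > self.remaining:
            return False
        self.remaining -= cost
        self.log.append((action, cost, self.remaining))
        return True

# Example decision: is a second sample worth 15% of the remaining budget?
engine = CreditEngine(budget=100)
if engine.PRICES["prompt"] <= 0.15 * engine.remaining:
    engine.charge("prompt")
```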
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3643. A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
๐ฌ 26
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ fMRI brain decoding video reconstruction cross-subject generalization visual cortex contrastive learning zero-shot decoding
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals.
In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction.
Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects.
Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3644. Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Privacy Preservation Video Understanding
We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundational models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug and play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3645. Anchor Frame Bridging for Coherent First-Last Frame Video Generation
๐ฌ 26
โญ 5.00
๐ generative models
๐ท๏ธ First-Last Frame Video Generation
First-last frame video generation has recently gained significant attention. It enables coherent motion generation between specified first and last frames. However, this approach suffers from semantic degradation in intermediate frames, causing scene distortion and subject deformation that undermine temporal consistency.
To address this issue, we introduce **Anchor Frame Bridging (AFB)**, a novel plug-and-play method that explicitly bridges semantic continuity from boundary frames to intermediate frames, offering training-free adaptability and generalizability. By adaptively interpolating anchor frames at temporally critical locations exhibiting maximal semantic discontinuities, our approach effectively mitigates semantic drift in intermediate frames. Specifically, we propose an **adaptive anchor frame selection** module, which generates text-aligned candidate frames via frame order reversal and selects anchors based on semantic continuity. Subsequently, we develop **anchor frame guided generation**, which leverages the selected anchor frames to guide semantic propagation across intermediate frames, ensuring consistent boundary semantics and preserving temporal coherence throughout the video sequence. The final video is synthesized using the first frame, last frame, selected anchor frames, and the text prompt.
The results demonstrate that our method significantly enhances the temporal consistency and overall quality of generated videos. Specifically, when applied to the Wan2.1-I2V model, it yields improvements of 16.58\% in FVD and 10.21\% in PSNR. The code is provided in the supplementary material.
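As a rough illustration of choosing temporally critical anchor locations, the sketch below scores semantic discontinuity between consecutive frame embeddings and keeps the largest drops; the embedding source and the scoring rule are assumptions, not the paper's full selection module.

```python
import torch

def select_anchor_positions(frame_embs: torch.Tensor, num_anchors: int) -> list:
    """frame_embs: (T, D) per-frame semantic embeddings of a draft video."""
    e = frame_embs / frame_embs.norm(dim=-1, keepdim=True)
    sim = (e[:-1] * e[1:]).sum(-1)                      # consecutive-frame cosine similarity
    discontinuity = 1.0 - sim                           # large value = semantic jump
    idx = discontinuity.topk(num_anchors).indices
    return idx.sort().values.tolist()                   # positions where anchors are interpolated
```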
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3646. BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity
๐ฌ 23
โญ 5.00
๐ optimization
๐ท๏ธ LoRA Block Matrix Multiplication Higher Matrix Rank
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs).
It approximates the update of a pretrained weight matrix $W\in\mathbb{R}^{m\times n}$ by the product of two low-rank matrices, $BA$, where $A \in\mathbb{R}^{r\times n}$ and $B\in\mathbb{R}^{m\times r} (r\ll\min\{m,n\})$.
Increasing the dimension $r$ can raise the rank of LoRA weights (i.e., $BA$), which typically improves fine-tuning performance but also significantly increases the number of trainable parameters.
In this paper, we propose **Block Diversified Low-Rank Adaptation (BoRA)**, which improves the rank of LoRA weights with a small number of additional parameters.
Specifically, BoRA treats the product $BA$ as a block matrix multiplication, where $A$ and $B$ are partitioned into $b$ blocks along the columns and rows, respectively (i.e., $A=[A_1,\dots,A_b]$ and $B=[B_1,\dots,B_b]^\top$).
Consequently, the product $BA$ becomes the concatenation of the block products $B_iA_j$ for $i,j\in[b]$.
To enhance the diversity of different block products, BoRA introduces a unique diagonal matrix $\Sigma_{i,j} \in \mathbb{R}^{r\times r}$ for each block multiplication, resulting in $B_i \Sigma_{i,j} A_j$.
By leveraging these block-wise diagonal matrices, BoRA increases the rank of LoRA weights by a factor of $b$ while only requiring $b^2r$ additional parameters.
Extensive experiments across multiple datasets and models demonstrate the superiority of BoRA, and ablation studies further validate its scalability.
The code is available at \url{https://anonymous.4open.science/r/BoRA}.
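A minimal numerical sketch of the block-diagonal construction described above: the update is assembled from block products $B_i \Sigma_{i,j} A_j$, and its rank is compared with plain $BA$. The specific shapes and the torch-based assembly are illustrative assumptions.

```python
# Minimal sketch of the BoRA update: partition A (r x n) into b column blocks and
# B (m x r) into b row blocks, and give every block pair (i, j) its own diagonal
# Sigma_{i,j}. Shapes below are illustrative assumptions.
import torch

m, n, r, b = 32, 48, 4, 2
A = torch.randn(r, n)
B = torch.randn(m, r)
Sigma = torch.randn(b, b, r)                 # b^2 diagonals, i.e. b^2 * r extra parameters

A_blocks = A.chunk(b, dim=1)                 # A = [A_1, ..., A_b], each r x (n/b)
B_blocks = B.chunk(b, dim=0)                 # B = [B_1; ...; B_b], each (m/b) x r

rows = []
for i in range(b):
    row = [B_blocks[i] @ torch.diag(Sigma[i, j]) @ A_blocks[j] for j in range(b)]
    rows.append(torch.cat(row, dim=1))       # concatenate block products along columns
delta_W = torch.cat(rows, dim=0)             # full m x n update

print("rank(BA)      =", torch.linalg.matrix_rank(B @ A).item())     # at most r
print("rank(BoRA dW) =", torch.linalg.matrix_rank(delta_W).item())   # up to b * r
```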
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3647. Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
๐ฌ 21
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Reasoning Model Safety Alignment
Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue persists in existing methods, which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. In this paper, we therefore shift our focus to aligning the safety of reasoning itself and explore process supervision as the solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first delve into the characteristics of safe reasoning and uncover several critical insights: 1) safe reasoning is often consolidated by a few critical steps of _safety triggers_; 2) _compliance cues_ strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories towards safer traces. Motivated by these insights, we propose **Intervened Preference Optimization (IPO)**, an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO remarkably improves overall safety regarding both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30\% in harmfulness, while preserving excellent performance across diverse reasoning tasks. The results highlight the importance of explicit alignment for reasoning and provide a practical path to safer LRMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3648. ALM-MTA: Front-Door Causal Multi-Touch Attribution Method for Creator-Ecosystem Optimization
๐ฌ 26
โญ 5.00
๐ causal reasoning
๐ท๏ธ causal reasoning multi-touch attribution recommendation system
Consumption-Drives-Production (CDP) on social platforms aims to deliver interpretable incentive signals for creator-ecosystem building and resource utilization improvement, which relies strongly on accurate attribution. In large-scale, complex recommendation systems, the absence of accurate labels together with unobserved confounding renders backdoor adjustment alone insufficient for reliable attribution. To address these problems, we propose Adversarial Learning Mediator based Multi-Touch Attribution (ALM-MTA), an extensible causal framework that leverages front-door identification with an adversarially learned mediator: a proxy trained to distill outcome information, strengthening the causal pathway from treatment to outcome and eliminating shortcut leakage. Then, we introduce contrastive learning that conditions front-door marginalization on highly matched consumption-upload pairs to ensure positivity in large treatment spaces. To assess causality from non-RCT logs, we also incorporate a non-personalized bucketed protocol, estimating grouped uplift and computing AUUC over treatment clusters. Finally, we evaluate ALM-MTA performance using a real-world recommendation system with 400 million DAU (daily active users) and 30 billion samples. ALM-MTA has increased DAU by 0.04% and daily active creators by 0.6%, with unit exposure efficiency increased by 670%. On causal utility, ALM-MTA achieves higher grouped AUUC than the SOTA in every propensity bucket, with a maximum gain of 0.070. In terms of accuracy, ALM-MTA improves upload AUC by 40% compared to SOTA. These results demonstrate that front-door deconfounding with adversarial mediator learning provides accurate, personalized, and operationally efficient attribution for creator-ecosystem optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3649. AutoDA-Timeseries: Automated Data Augmentation for Time Series
๐ฌ 26
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ time series analysis; automated data augmentation
Data augmentation is a fundamental technique in deep learning, widely applied in both representation learning and automated data augmentation (AutoDA). In representation learning, augmentations are used to construct contrastive views for learning task-agnostic embeddings, while in AutoDA the augmentations are directly optimized to improve downstream task performance. However, existing paradigms face critical limitations: representation learning relies on a two-stage scheme with limited adaptability, and current AutoDA frameworks are largely designed for image data, rendering them ineffective for capturing time series-specific features. To address these issues, we introduce **AutoDA-Timeseries**, the first general-purpose automated data augmentation framework tailored for time series. AutoDA-Timeseries incorporates time series features into augmentation policy design and adaptively optimizes both augmentation probability and intensity in a single-stage, end-to-end manner. We conduct extensive experiments on five mainstream tasks, including classification, long-term forecasting, short-term forecasting, regression, and anomaly detection, showing that AutoDA-Timeseries consistently outperforms strong baselines across diverse models and datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3650. DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Differentiable rendering Transparent object reconstruction
Reconstructing transparent objects from a set of multi-view images is a challenging task due to the complicated nature and indeterminate behavior of light propagation. Typical methods are primarily tailored to specific scenarios, such as objects following a uniform topology, exhibiting ideal transparency and surface specular reflections, or with only surface materials, which substantially constrains their practical applicability in real-world settings. In this work, we propose a differentiable rendering framework for transparent objects, dubbed \emph{DiffTrans}, which allows for efficient decomposition and reconstruction of the geometry and materials of transparent objects, thereby reconstructing transparent objects accurately in intricate scenes with diverse topology and complex texture. Specifically, we first utilize FlexiCubes with dilation and smoothness regularization as the iso-surface representation to reconstruct an initial geometry efficiently from the multi-view object silhouette. Meanwhile, we employ the environment light radiance field to recover the environment of the scene. Then we devise a recursive differentiable ray tracer to further optimize the geometry, index of refraction and absorption rate simultaneously in a unified and end-to-end manner, leading to high-quality reconstruction of transparent objects in intricate scenes. A prominent advantage of the designed ray tracer is that it can be implemented in CUDA, enabling a significantly reduced computational cost. Extensive experiments on multiple benchmarks demonstrate the superior reconstruction performance of our \emph{DiffTrans} compared with other methods, especially in intricate scenes involving transparent objects with diverse topology and complex texture. Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3651. SELF-HARMONY: LEARNING TO HARMONIZE SELF-SUPERVISION AND SELF-PLAY IN TEST-TIME REINFORCEMENT LEARNING
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models Test-time reinforcement learning test-time adaptation self-play pseudo labeling infomax
Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers.
We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This process naturally selects solutions that are stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers.
Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results in the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
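A minimal sketch of the harmonic-mean pseudo-labeling described above; the sampled answers below are placeholders for Solver rollouts, and the paper's exact aggregation details may differ.

```python
# Minimal sketch of harmonic-mean pseudo-labeling across an original question and
# its paraphrase. The sampled answers below are placeholders for Solver rollouts.
from collections import Counter

def harmonic_label(answers_original, answers_reframed):
    f1 = Counter(answers_original)   # answer frequencies on the original question
    f2 = Counter(answers_reframed)   # answer frequencies on the Reframer's paraphrase
    candidates = set(f1) | set(f2)
    # The harmonic mean rewards answers frequent under *both* views; an answer popular
    # in only one view (a view-dependent, spurious answer) scores near zero.
    scores = {a: 2 * f1[a] * f2[a] / (f1[a] + f2[a]) if (f1[a] + f2[a]) > 0 else 0.0
              for a in candidates}
    return max(scores, key=scores.get), scores

label, scores = harmonic_label(
    answers_original=["42", "42", "17", "42", "17"],
    answers_reframed=["42", "42", "42", "13", "17"],
)
print(label, scores)   # "42" wins; "13" appears in only one view and scores zero
```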
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3652. Mapping Post-Training Forgetting in Language Models at Scale
๐ฌ 14
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning foundation models reasoning forgetting pretraining knowledge
Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: forgetting one fact (e.g., a U.S. president or an API call) does not "average out" by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1→0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0→1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis across nearly 30 model pairs and 100 sub-benchmarks with up to 32,768 generated tokens per sample shows that: (1) domain-continual pretraining induces moderate forgetting with low backward transfer; (2) RL/SFT post-training applied to base models and instruction tuning yield substantial backward transfer with minimal forgetting; (3) applying RL/SFT to instruction-tuned models is sensitive to data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale, enabling progress towards generally capable AI systems.
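A minimal sketch of the sample-wise bookkeeping described above: counting 1→0 and 0→1 transitions and applying one common chance correction for multiple-choice accuracy. The inputs are illustrative, and the paper's exact chance adjustment may differ.

```python
# Minimal sketch of the sample-wise forgetting metric: count 1->0 transitions
# (forgetting) and 0->1 transitions (backward transfer), plus a chance-adjusted
# accuracy for k-way multiple-choice benchmarks. Inputs are illustrative.
import numpy as np

pre  = np.array([1, 1, 0, 1, 0, 0, 1, 1])   # per-sample correctness before post-training
post = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # per-sample correctness after post-training

forgetting        = np.mean((pre == 1) & (post == 0))   # 1 -> 0 transitions
backward_transfer = np.mean((pre == 0) & (post == 1))   # 0 -> 1 transitions

def chance_adjusted(acc: float, num_choices: int) -> float:
    # Remove the expected contribution of random guessing on multiple choice.
    chance = 1.0 / num_choices
    return (acc - chance) / (1.0 - chance)

print(forgetting, backward_transfer)
print(chance_adjusted(pre.mean(), num_choices=4), chance_adjusted(post.mean(), num_choices=4))
```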
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3653. Angle K-Means
๐ฌ 28
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Clustering K-Means Accelerate Angle
We propose an accelerated exact $k$-means algorithm, Angle $k$-means.
As its name suggests, the algorithm mainly leverages angular relationships between data points and cluster centers
to reduce computational overhead. Although grounded in straightforward geometric principles,
it delivers substantial performance improvements in empirical evaluations.
In contrast to existing acceleration techniques, our model introduces no new hyperparameters,
preserving full compatibility with standard $k$-means.
Theoretical analysis shows that Angle $k$-means maintains linear time complexity
with respect to both sample size and dimensionality,
while empirical evaluations on diverse real-world datasets demonstrate
significant speedup over state-of-the-art algorithms such as ball $k$-means and Exp-ns.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3654. PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Remote photoplethysmograph large language model heart rate
Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution, but they struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into an LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. In addition, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving signal instability through adaptive time-frequency domain feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental contextual answering, and task description, leveraging cross-modal learning to integrate both visual and textual information and enabling dynamic adaptation to challenging scenarios like variable illumination and subject movements. Evaluated on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3655. Is Graph Unlearning Ready for Practice? A Benchmark on Efficiency, Utility, and Forgetting
๐ฌ 28
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ graph unlearning GNN graph neural network
Graph Neural Networks (\textsc{Gnn}s) are increasingly being deployed in sensitive, user-centric applications where regulations such as the GDPR mandate the ability to remove data upon request. This has spurred interest in graph unlearning, the task of removing the influence of specific training data from a trained \textsc{Gnn} without retraining from scratch. While several unlearning techniques have recently emerged, the field lacks a principled benchmark to assess whether these methods truly provide a practical alternative to retraining and, if so, how to choose among them for different workloads. In this work, we present the first systematic benchmark for \textsc{Gnn} unlearning, structured around three core desiderata: \emph{efficiency} (is unlearning faster than retraining?), \emph{utility} (does the unlearned model preserve predictive performance and align with the retrained gold standard?), and \emph{forgetting} (does the model genuinely eliminate the influence of removed data?). Through extensive experiments across diverse datasets and deletion scenarios, we deliver a unified assessment of existing approaches, surfacing their trade-offs and limitations. Crucially, our findings show that most unlearning techniques are not yet practical for large-scale graphs. At the same time, our benchmarking yields actionable guidelines on when unlearning can be a viable alternative to retraining and how to select among methods for different workloads, thereby charting a path for future research toward more practical, scalable, and trustworthy graph unlearning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3656. Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining
๐ฌ 19
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ robot learning forward dynamics inverse dynamics
Vision-language-action (VLA) models have shown great potential in building generalist robots, but still face a dilemma: misalignment of 2D image forecasting and 3D action prediction. Besides, such a vision-action entangled training manner limits model learning from large-scale, action-free web video data. To address these issues, we propose DeFI, a novel framework that Decouples visual Forward and Inverse dynamics pretraining to exploit respective data sources, wherein video generation and action prediction are disentangled. We introduce the Foundation Forward Dynamics Model (FFDM), pretrained on diverse human and robot videos for future prediction, and the Foundation Inverse Dynamics Model (FIDM), trained via self-supervised learning to infer latent actions from unlabeled video transitions. These models are then integrated into a unified architecture for end-to-end finetuning on downstream tasks. In this manner, FFDM and FIDM first shine separately and then cooperate for mutual benefit. Extensive experiments on CALVIN ABC-D and SimplerEnv demonstrate state-of-the-art performance, with DeFI achieving an average task length of 4.51 for CALVIN, 51.2% success rate on SimplerEnv-Fractal benchmark and 81.3% success rate in real-world deployment, significantly outperforming prior methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3657. Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models
๐ฌ 18
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Synthetic data generation Image augmentation Diffusion models
Synthetically augmenting training datasets with diffusion models has been an
effective strategy for improving generalization of image classifiers. However,
existing techniques struggle to ensure the diversity of generation and increase the
size of the data by up to 10-30x to improve the in-distribution performance. In this
work, we show that synthetically augmenting part of the data that is not learned
early in training with faithful images (containing the same features but different
noise) outperforms augmenting the entire dataset. By analyzing a two-layer CNN,
we prove that this strategy improves generalization by promoting homogeneity in
feature learning speed without amplifying noise. Our extensive experiments show
that by augmenting only 30%-40% of the data, our method boosts generalization
by up to 2.8% in a variety of scenarios, including training ResNet, ViT, ConvNeXt,
and Swin Transformer on CIFAR-10/100, and TinyImageNet, with various
optimizers including SGD and SAM. Notably, our method applied with SGD
outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet.
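A minimal sketch of the targeted selection described above: track which samples are learned early, and reserve the synthetic augmentation budget for the remainder. The learnedness criterion (correct throughout a fixed number of early epochs) and the 30-40% budget are illustrative assumptions.

```python
# Minimal sketch of selecting the "not learned early" subset for targeted synthetic
# augmentation. The learnedness criterion and the budget are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
num_samples, early_epochs = 1000, 5
# correct[e, i] = 1 if sample i was classified correctly at the end of epoch e.
correct = rng.integers(0, 2, size=(early_epochs, num_samples))

learned_early = correct.all(axis=0)          # consistently correct in the early epochs
hard_pool = np.where(~learned_early)[0]      # candidates for diffusion-based augmentation

budget = int(0.35 * num_samples)             # augment only ~30-40% of the data
to_augment = rng.choice(hard_pool, size=min(budget, hard_pool.size), replace=False)
print(f"augmenting {to_augment.size} of {num_samples} samples")
# A diffusion model would then generate "faithful" variants (same features, different
# noise) only for `to_augment`, instead of synthesizing 10-30x the full dataset.
```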
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3658. Short Window Attention Enables Long-Term Memorization
๐ฌ 10
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ hybrids xLSTM SWA memory attention long-context architecture-design LLM stochastic
Recent works show that hybrid architectures combining sliding window softmax attention layers with linear recurrent neural network (RNN) layers outperform both of these architectures taken separately. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers.
A counter-intuitive finding with SWAX is that larger sliding windows do not improve long-context performance. In fact, short window attention encourages the model to better train the xLSTM's long-term memory, since it relies less on the softmax attention mechanism for long-context retrieval.
Small sliding windows, however, are detrimental on short-context tasks that could otherwise be solved with information from moderately larger windows. Therefore, we train SWAX by stochastically changing the sliding window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention on both short- and long-context problems.
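A minimal sketch of stochastic sliding-window training as described above: sample a window size per step and build the corresponding causal sliding-window attention mask. The window choices and sampling probability are illustrative assumptions, not SWAX's exact schedule.

```python
# Minimal sketch of stochastic sliding-window training: sample a window size per
# step and build the corresponding causal sliding-window attention mask.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal and within the last `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

def sample_window(short: int = 128, long: int = 1024, p_short: float = 0.5) -> int:
    # Stochastically alternate between a short and a longer window so the model
    # must rely on both the xLSTM memory and the attention window.
    return short if torch.rand(()) < p_short else long

seq_len = 2048
mask = sliding_window_mask(seq_len, sample_window())
print(mask.shape, mask.sum().item())   # mask would gate the softmax attention logits
```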
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3659. Differential Fine-Tuning Large Language Models Towards Better Diverse Reasoning Abilities
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reasoning Abilities Supervised Fine-Tuning
Unlike general question answering, the reasoning abilities of large language models (LLMs) require explicit derivations, and supervised fine-tuning (SFT) can instill multiple reasoning abilities in LLMs by learning from various datasets. However, neither training on the datasets jointly (mix-up) nor sequentially (continually) reliably maintains the performance of single-dataset SFT, sometimes doing better and sometimes even worse, illustrating that vanilla SFT can not only facilitate reasoning abilities but also introduce conflicts. In this paper, we propose a novel framework to mitigate the conflicts and preserve the benefits among different reasoning tasks, and even surpass each task's single-dataset SFT performance. We start by exploring the differences between reasoning fine-tuned and base LLMs by analyzing their parameter variations during model inference, and we discover that each reasoning capability has exclusive parameters that benefit it more evidently than others, whereas parameters overlapped across tasks can bring either benefits or conflicts. Inspired by these findings, we propose to update the exclusive and overlapped parameters differentially according to the specific combination of reasoning tasks, thereby avoiding unnecessary conflicts while maintaining benefits. Consistent improvements in mix-up and continual SFT experiments demonstrate that the proposed SFT strategy achieves better performance on various LLMs (Llama3-8B, Mistral-7B, and Qwen2.5-14B) and diverse reasoning tasks with fewer conflicts, showing the superiority and generality of our analysis findings and the proposed approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3660. SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration
๐ฌ 23
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision Language Action Model Model Lightweighting Acceleration Embodied intelligence
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between the VLA model and a lightweight generator. Inspired by the human motion pattern of focusing on key decision points while relying on intuition for other actions, we categorize VLA actions into deliberative and intuitive, assigning the former to the VLA model and the latter to the lightweight generator, enabling frequency-adaptive execution through collaborative model scheduling. To address spatial redundancy, we further develop a spatio-semantic dual-aware token pruning method. Tokens are classified into spatial and semantic types and pruned based on their dual-aware importance to accelerate VLA inference. These two mechanisms work jointly to guide the VLA in focusing on critical actions and salient visual information, achieving effective acceleration while maintaining high accuracy. Extensive experiments show that our method achieves 1.5$\times$ lossless acceleration in LIBERO and 2.4$\times$ in SimplerEnv, with up to 6\% average performance gain. Inference frequency and latency improve by 2.2$\times$ in SimplerEnv and 1.4$\times$ in LIBERO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3661. ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training
๐ฌ 10
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Attention Mechanisms Training-free Methods Inference-time Optimization Model Interpretability Unsupervised Learning Attention Sink
Token-level attention tuning, a class of training-free methods including Post-hoc Attention Steering (PASTA, AutoPASTA) and Attention Calibration (ACT), has emerged as a promising way to improve frozen LLMs with interpretable interventions. However, these methods depend on auxiliary heuristics to identify "important" task-specific tokens, which can introduce bias and limit applicability when token importance is unclear or when using optimized kernels where attention maps are inaccessible. We propose a simpler and more elegant alternative: acting only on the initial token (e.g., the beginning-of-sequence token in LLaMA). We show theoretically that adding lightweight biases to this token's attention logits systematically shifts and reshapes the downstream attention patterns, an effect amplified by its natural function as an attention sink. Our empirical analysis reveals that this tuning process can positively affect LLMs and better unlock their pretrained knowledge, with stronger effects in early layers and distinct scaling preferences across attention heads. Building on these insights, we introduce ZeroTuning: a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token, requiring zero parameter updates. We present two variants: a supervised mode that calibrates on validation examples, and a novel unsupervised mode that directly minimizes the model's output entropy. Our method requires no KV-cache or decoding changes, and is kernel-agnostic (it works with SDPA and FlashAttention). The method is lightweight and requires only four lines of modification to the standard LlamaAttention code. It achieves broad gains across 15 datasets and outperforms previous, more complex methods; for instance, with Llama-3.1-8B, it yields relative improvements of 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue. ZeroTuning also works out-of-the-box with quantized inference and maintains its performance improvements with increasing context lengths. Our code and runnable demo are available at https://anonymous.4open.science/r/ZeroTuning.
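A minimal sketch of the kind of intervention described above: add a head-specific bias to every query's attention logit toward position 0 before the softmax. The bias values, shapes, and plain-PyTorch attention are illustrative assumptions, not the paper's four-line LlamaAttention patch.

```python
# Minimal sketch of steering attention via the initial token: shift each head's
# logits toward position 0 before the softmax. Bias values are illustrative.
import torch
import torch.nn.functional as F

batch, heads, seq, dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, dim)

logits = q @ k.transpose(-2, -1) / dim ** 0.5          # (batch, heads, seq, seq)
head_bias = torch.linspace(-1.0, 1.0, heads)           # one scalar per attention head
logits[..., 0] += head_bias.view(1, heads, 1)          # bias every query's logit to token 0

causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
logits = logits.masked_fill(~causal, float("-inf"))
out = F.softmax(logits, dim=-1) @ v                    # reshaped downstream attention
```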
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3662. SUSD: Structured Unsupervised Skill Discovery through State Factorization
๐ฌ 14
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Unsupervised Skill Discovery Hierarchical RL
Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet it still falls short of encouraging comprehensive skill sets that engage all controllable factors or entities in the environment.
In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control over the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors.
This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training on compositional downstream tasks via Hierarchical Reinforcement Learning (HRL).
Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is available at the anonymous repository: [https://anonymous.4open.science/r/SUSD](https://anonymous.4open.science/r/SUSD).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3663. SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
๐ฌ 18
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Text-to-SQL Formal Equivalence Checking Satisfiability modulo Theories
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called *SpotIt*, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.
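As a toy illustration of the bounded equivalence checking described above, the sketch below uses an SMT solver (the `z3-solver` package) to search for a single-row database on which two simple filter predicates disagree; the actual SpotIt engine supports a far richer SQL subset and full queries rather than hand-encoded predicates.

```python
# Toy illustration of differentiating two SQL filters with an SMT solver: search
# for a one-row database on which the predicates disagree. Requires `z3-solver`.
from z3 import Int, Solver, Xor, sat

age = Int("age")  # the single column of a hypothetical one-row table

# Generated query:    SELECT * FROM t WHERE age >  18
# Ground-truth query: SELECT * FROM t WHERE age >= 18
generated_keeps_row = age > 18
groundtruth_keeps_row = age >= 18

s = Solver()
s.add(Xor(generated_keeps_row, groundtruth_keeps_row))  # row kept by exactly one query

if s.check() == sat:
    print("counterexample database:", s.model())   # e.g. age = 18 exposes the difference
else:
    print("predicates are equivalent on one-row databases")
```

A test-based evaluation would miss this difference whenever the static test database happens to contain no row with `age = 18`.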
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3664. Exploring State-Space Models for Data-Specific Neural Representations
๐ฌ 15
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ state-space model
This paper studies the problem of data-specific neural representations, aiming for compact, flexible, and modality-agnostic storage of individual visual data using neural networks. Our approach considers a visual datum as a set of discrete observations of an underlying continuous signal, thus requiring models capable of capturing the inherent structure of the signal. For this purpose, we investigate state-space models (SSMs), which are well-suited for modeling latent signal dynamics. We first explore the appealing properties of SSMs for data-specific neural representation and then present a novel framework that integrates SSMs into the representation pipeline. The proposed framework achieved compact representations and strong reconstruction performance across a range of visual data formats, suggesting the potential of SSMs for data-specific neural representations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3665. Adaptive Thinking: Large Language Models Know When to Think in Latent Space
๐ฌ 24
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Efficiency Self-Consistency
Recent advances in test-time computing for large language models (LLMs) have introduced the capability to perform intermediate chain-of-thought (CoT) reasoning (thinking) before generating answers. While increasing the thinking budget yields smooth performance improvements at inference time, the relationship between LLM capability, query complexity, and optimal budget allocation remains poorly understood for achieving compute-optimal inference. To address this challenge, we utilize $\textit{self-consistency}$, the agreement among multiple reasoning paths, as a proxy for thinking necessity. We first identify that lower self-consistency indicates when queries require extended thinking to reach correct answers. Building on this insight, we introduce $\texttt{Sonata}$ (Self-Consistency-Guided Adapter for Thinking Allocation), a lightweight approach that adaptively allocates thinking budgets to optimize the performance-efficiency tradeoff. $\texttt{Sonata}$ includes an adapter trained offline on a calibration dataset to predict self-consistency directly from the last-layer hidden representations during the query prefilling stage. This prediction then guides on-the-fly budget allocation before thinking. The adapter is general, transferable across diverse tasks once trained, and introduces less than $0.1\\%$ computational overhead during inference. Notably, $\texttt{Sonata}$ is compatible with existing CoT compression methods, enabling further efficiency gains when managing thinking budgets across queries. Extensive experiments on multiple models (Qwen3-8B, Qwen3-32B, GPT-OSS-120B, Qwen3-235B-A22B) and benchmarks (AIME25, GSM8K, MATH500, GPQA, LiveCodeBench) demonstrate that $\texttt{Sonata}$ achieves $20\\%$ to $60\\%$ reduction in thinking tokens while maintaining the same accuracy, or up to $2\\%$ improvement in accuracy with the same token cost.
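A minimal sketch of the adapter-plus-allocation pattern described above: a small head maps pooled prefill hidden states to a predicted self-consistency score, which then sets a per-query thinking budget. The architecture, threshold, and budget values are illustrative assumptions, not Sonata's trained configuration.

```python
# Minimal sketch of a self-consistency adapter and budget allocation.
import torch
import torch.nn as nn

class ConsistencyAdapter(nn.Module):
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # Pool over the prompt tokens, predict self-consistency in [0, 1].
        return torch.sigmoid(self.head(last_hidden.mean(dim=1))).squeeze(-1)

def allocate_budget(consistency: torch.Tensor, low: int = 512, high: int = 8192) -> torch.Tensor:
    # Low predicted self-consistency -> the query likely needs extended thinking.
    return torch.where(consistency < 0.5, torch.tensor(high), torch.tensor(low))

adapter = ConsistencyAdapter()
prompt_hidden = torch.randn(3, 128, 4096)      # (queries, prompt tokens, hidden dim)
budgets = allocate_budget(adapter(prompt_hidden))
print(budgets)                                  # per-query thinking-token budgets
```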
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3666. Noise Stability of Transformer Models
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ transformers simplicity bias noise stability regularization methods spectral concentration
Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose *noise stability* as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to *all* input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical *noise stability regularization* method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately $35$\% and $75$\% respectively. Our results establish noise stability as a powerful tool for understanding and improving modern Transformers.
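A minimal sketch of one way to estimate the quantity described above: perturb all input coordinates with correlated Gaussian noise and measure how much the model's output moves. The toy model, the correlation mixing, and the drift measure are illustrative assumptions, not the paper's definition or its covariance-propagation analysis.

```python
# Minimal sketch of estimating noise stability under correlated input noise.
import torch
import torch.nn as nn

def noise_stability(model: nn.Module, x: torch.Tensor, rho: float, n_samples: int = 64):
    # Mix each input with fresh noise so that corr(x, x_noisy) ~ rho (for standardized x),
    # then compare outputs; smaller drift indicates a more noise-stable (simpler) function.
    with torch.no_grad():
        clean = model(x)
        drift = 0.0
        for _ in range(n_samples):
            x_noisy = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn_like(x)
            drift += (model(x_noisy) - clean).pow(2).mean().item()
    return drift / n_samples

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(128, 32)
print({rho: round(noise_stability(model, x, rho), 4) for rho in (0.99, 0.9, 0.5)})
```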
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3667. VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
๐ฌ 18
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ physical commonsense semantic adherence video generation benchmark auto evaluator
Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions remains unclear (e.g., playing tennis, backflip). Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 4000 diverse and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only $47.7\%$ joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-2-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. We will release the dataset, videos, auto-rater model, and code in the camera-ready version.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3668. Combinatorial Rising Bandits
๐ฌ 17
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Combinatorial online learning Rising bandit Hierarchical planning
Combinatorial online learning is a fundamental task for selecting the optimal action (or super arm) as a combination of base arms in sequential interactions with systems providing stochastic rewards.
It is applicable to diverse domains such as robotics, social advertising, network routing, and recommendation systems.
In many real-world scenarios, we often encounter rising rewards, where playing a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, e.g., robots improving through practice, or social influence strengthening with a history of successful recommendations.
Crucially, these enhancements may propagate to multiple super arms that share the same base arms, introducing dependencies beyond the scope of existing bandit models.
To address this gap, we introduce the Combinatorial Rising Bandit (CRB) framework and propose a provably efficient and empirically effective algorithm, Combinatorial Rising Upper Confidence Bound (CRUCB).
We empirically demonstrate the effectiveness of CRUCB in realistic deep reinforcement learning environments and synthetic settings, while our theoretical analysis establishes tight regret bounds. Together, they underscore the practical impact and theoretical rigor of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3669. Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking
๐ฌ 20
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Model Merging Model Editing Large Language Model
Model merging, which combines multiple domain-specialized experts into a single model, offers a practical path to endow Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with broad capabilities without the cost of joint training or serving many models. However, training-free methods rely on hand-tuned coefficients, whereas training-based methods primarily align parameters rather than downstream task behavior and typically treat all layers uniformly, ignoring inter-layer heterogeneity. We introduce Expert Merging, a training-light method that learns a small set of layer-wise coefficients using only unlabeled calibration data. The coefficients are optimized to explicitly align the merged modelโs hidden states and logits with those of the corresponding experts, with a coefficient regularizer for stability and task-weighted losses for controllable trade-offs. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking: a normalized layer-importance metric, derived from learned coefficients, task-vector magnitudes, and parameter counts, allocates more chunk-wise coefficients to high-importance layers while keeping low-importance layers lightweight. The result is a label-free, parameter-efficient, and scalable approach to multi-expert model merging across LLMs and MLLMs. Across MLLM backbones (InternVL and Qwen2-VL) and the LLM backbone (Mistral), our method surpasses strong training-free and training-based merging baselines, with Expert Merging++ delivering further gains and, in some cases, even exceeding supervised Mixture Training. Our code is available at https://anonymous.4open.science/r/ExpertMeging and in the supplementary materials.
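A minimal sketch of the training-light idea described above: merged weights are the base weights plus coefficient-scaled task vectors, and the layer-wise coefficients are optimized on unlabeled calibration data to align the merged model's outputs with each expert's. The toy model, losses, and shapes are illustrative assumptions, not the paper's architecture or its chunking scheme.

```python
# Minimal sketch of learning layer-wise merging coefficients on unlabeled data.
import torch
import torch.nn as nn

num_layers, num_experts, dim = 4, 2, 64
base = [torch.randn(dim, dim) for _ in range(num_layers)]
task_vectors = [[torch.randn(dim, dim) * 0.01 for _ in range(num_experts)]
                for _ in range(num_layers)]                      # expert_l - base_l
coeffs = nn.Parameter(torch.full((num_layers, num_experts), 1.0 / num_experts))

def forward(weights, x):
    for w in weights:
        x = torch.tanh(x @ w)
    return x

def merged_weights():
    return [base[l] + sum(coeffs[l, i] * task_vectors[l][i] for i in range(num_experts))
            for l in range(num_layers)]

opt = torch.optim.Adam([coeffs], lr=1e-2)
calib = torch.randn(32, dim)                                     # unlabeled calibration batch
for _ in range(50):
    merged_out = forward(merged_weights(), calib)
    # Align the merged model's outputs with each expert's outputs on the calibration data.
    loss = sum(((merged_out - forward([base[l] + task_vectors[l][i]
                                       for l in range(num_layers)], calib)) ** 2).mean()
               for i in range(num_experts))
    loss = loss + 1e-3 * coeffs.pow(2).sum()                     # coefficient regularizer
    opt.zero_grad(); loss.backward(); opt.step()
print(coeffs.detach())
```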
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3670. The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
๐ฌ 21
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Diffusion LLMs Safety Jailbreak Attack
Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling.
However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities.
To this end, we present **DIJA**, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits the model's ability to dynamically filter and rejection-sample unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100\% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5\% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3671. Towards Generalizable PDE Dynamics Forecasting via Physics-Guided Invariant Learning
๐ฌ 24
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ PDE Dynamics Forecasting OOD Generalization Invariant Learning
Advanced deep learning-based approaches have been actively applied to forecast the spatiotemporal physical dynamics governed by partial differential equations (PDEs), which serves as a critical procedure in tackling many science and engineering problems. As real-world physical conditions such as PDE system parameters are always capricious, generalizing across unseen out-of-distribution (OOD) forecasting scenarios using limited training data is of great importance. To bridge this barrier, existing methods focus on discovering domain-generalizable representations across various PDE dynamics trajectories. However, their zero-shot OOD generalization capability remains deficient, since extra test-time samples for domain-specific adaptation are still required. This is because the fundamental physical invariances in PDE dynamical systems have yet to be investigated or integrated. To this end, we first explicitly define a two-fold PDE invariance principle, which points out that ingredient operators and their composition relationships remain invariant across different domains and PDE system evolution. Next, to capture this two-fold PDE invariance, we propose a physics-guided invariant learning method termed iMOOE, featuring an Invariance-aligned Mixture Of Operator Expert architecture and a frequency-enriched invariant learning objective. Extensive experiments across simulated benchmarks and real-world applications validate iMOOE's superior in-distribution performance and zero-shot generalization capabilities on diverse OOD forecasting scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3672. Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity
๐ฌ 17
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ deep learning linear mode connectivity permutation symmetries
Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al., have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a 32 $\times$ width multiplier for ResNet-20). This is broadly believed to be because increasing the model width ensures a large enough space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when using a suitable softmax temperature calibration. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently, the merged model's output matches that of an ensemble of the original models, which facilitates LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.
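A minimal sketch of how an LMC probe like the one described above is typically run: linearly interpolate the parameters of two models and measure the loss along the path. The toy models stand in for independently trained wide networks (training is omitted), and the paper's additional softmax temperature calibration is not shown.

```python
# Minimal sketch of probing linear mode connectivity via weight interpolation.
import copy
import torch
import torch.nn as nn

def make_model(width: int, seed: int) -> nn.Module:
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 10))

def interpolate(model_a, model_b, alpha: float) -> nn.Module:
    merged = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a})
    return merged

# model_a and model_b stand in for two independently trained wide networks.
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
model_a, model_b = make_model(width=512, seed=0), make_model(width=512, seed=1)
loss_fn = nn.CrossEntropyLoss()

# A loss curve that stays near the endpoint losses for all alpha indicates LMC.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    with torch.no_grad():
        print(alpha, round(loss_fn(interpolate(model_a, model_b, alpha)(x), y).item(), 3))
```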
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3673. Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability
๐ฌ 13
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ post-training language models distributional learning alignment pluralistic alignment uncertainty estimation
Language model post-training has enhanced instruction-following and performance on many downstream tasks, but also comes with an often-overlooked cost on tasks with many possible valid answers. We characterize three desiderata: in-context steerability, valid output space coverage, and distributional alignment, and document across three model families how post-training can reduce these properties. In particular, we disambiguate between two kinds of in-context learning: ICL for eliciting existing underlying knowledge or capabilities, and in-context steerability, where a model must use in-context information to override its priors and steer to a novel data generating distribution. To better evaluate and improve these desiderata, we introduce Spectrum Suite, a large-scale resource compiled from $>40$ data sources and spanning $>90$ tasks requiring models to steer to and match diverse distributions. We find that while instruction-tuning helps elicit underlying capabilities and models, it hurts a model's ability to flexibly steer in-context. To mitigate these issues, we propose Spectrum Tuning, a post-training method using Spectrum Suite to improve steerability and distributional coverage. We find that Spectrum Tuning often improves over pretrained models and their instruction-tuned counterparts, enhancing steerability, spanning more of the output space, and improving distributional alignment on held-out datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3674. Memorization Through the Lens of Sample Gradients
๐ฌ 11
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Memorization Sample Gradients
Deep neural networks are known to often memorize underrepresented, hard examples, with implications for generalization and privacy. Feldman & Zhang (2020) defined a rigorous notion of memorization.
However, it is prohibitively expensive to compute at scale because it requires training models both with and without the data point of interest in order to calculate the memorization score.
We observe that samples that are less memorized tend to be learned earlier in training, whereas highly memorized samples are learned later.
Motivated by this observation, we introduce Cumulative Sample Gradient (CSG), a computationally efficient proxy for memorization. CSG is the gradient of the loss with respect to input samples, accumulated over the course of training.
The advantage of using input gradients is that per-sample gradients can be obtained with negligible overhead during training. The accumulation over training also reduces per-epoch variance and enables a formal link to memorization. Theoretically, we show that CSG is bounded by memorization and by learning time.
Tracking these gradients during training reveals a characteristic rise-peak-decline trajectory whose timing is mirrored by the model's weight norm. This yields an early-stopping criterion that does not require a validation set: stop at the peak of the weight norm. This early stopping also enables our memorization proxy, CSG, to be up to five orders of magnitude more efficient than the memorization score from Feldman & Zhang (2020). It is also approximately 140$\times$ and 10$\times$ faster than the prior state-of-the-art memorization proxies, input curvature and cumulative sample loss, while still aligning closely with the memorization score, exhibiting high correlation. Further, we develop Sample Gradient Assisted Loss (SGAL), a proxy that further improves alignment with memorization and is highly efficient to compute. Finally, we show that CSG attains state-of-the-art performance on practical dataset diagnostics, such as mislabeled-sample detection, and enables bias discovery, providing a theoretically grounded toolbox for studying memorization in deep networks.
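A minimal sketch of the accumulation described above: per-sample input-gradient norms are summed across training while the weight norm is tracked for the validation-free early stop. The toy model, data, and the exact norm are illustrative assumptions.

```python
# Minimal sketch of accumulating per-sample input-gradient norms (a CSG-style proxy)
# and tracking the weight norm for the validation-free early-stopping heuristic.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
csg = torch.zeros(len(x))                     # cumulative sample gradient per example
weight_norms = []

for epoch in range(20):
    inputs = x.clone().requires_grad_(True)   # gradients w.r.t. the *inputs*, not weights
    loss = loss_fn(model(inputs), y)
    loss.backward()
    csg += inputs.grad.norm(dim=1).detach()   # accumulate per-sample input-gradient norm
    opt.step()
    opt.zero_grad()
    weight_norms.append(sum(p.norm() ** 2 for p in model.parameters()).sqrt().item())

# Early-stopping heuristic from the abstract: stop at the peak of the weight norm.
peak_epoch = max(range(len(weight_norms)), key=weight_norms.__getitem__)
print("weight-norm peak at epoch", peak_epoch,
      "; highest-CSG samples:", csg.topk(5).indices.tolist())
```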
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3675. Unlocking the Value of Text: Event-Driven Reasoning and Multi-Level Alignment for Time Series Forecasting
๐ฌ 29
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ time series forecasting multimodal
Existing time series forecasting methods primarily rely on the numerical data itself. However, real-world time series exhibit complex patterns associated with multimodal information, making them difficult to predict with numerical data alone. While several multimodal time series forecasting methods have emerged, they either utilize text with limited supplementary information or focus merely on representation extraction, extracting minimal textual information for forecasting. To unlock the Value of Text, we propose VoT, a method with Event-driven Reasoning and Multi-level Alignment. Event-driven Reasoning combines the rich information in exogenous text with the powerful reasoning capabilities of LLMs for time series forecasting. To guide the LLMs in effective reasoning, we propose the Historical In-context Learning that retrieves and applies historical examples as in-context guidance. To maximize the utilization of text, we propose Multi-level Alignment. At the representation level, we utilize the Endogenous Text Alignment to integrate the endogenous text information with the time series. At the prediction level, we design the Adaptive Frequency Fusion to fuse the frequency components of event-driven prediction and numerical prediction to achieve complementary advantages. Experiments on real-world datasets across 10 domains demonstrate significant improvements over existing methods, validating the effectiveness of our approach in the utilization of text. The code is made available at https://anonymous.4open.science/r/VoT-465C.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3676. Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
๐ฌ 9
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Multi-modal learning
Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs) covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that masks an underlying lack of multi-modal reasoning. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3677. Guidance Watermarking for Diffusion Models
๐ฌ 17
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ watermarking image generative AI
This paper introduces a novel watermarking method for diffusion models. It is based on guiding the diffusion process using the gradient computed from any off-the-shelf watermark decoder. The gradient is guided further using different image augmentations, increasing robustness to attacks against which the decoder was not originally robust, without retraining or fine-tuning. The methodology effectively allows converting any post-hoc watermarking scheme into one that embeds the signal during the diffusion process. We show that this approach is complementary to watermarking techniques that modify the variational autoencoder at the end of the diffusion process. We validate the method on different diffusion models and detectors. The watermarking guidance does not significantly alter the generated image for a given seed and prompt, preserving both the diversity and quality of generation.
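A minimal sketch of the guidance pattern described above: at each denoising step, nudge the current latent with the gradient of a watermark decoder's loss, averaged over image augmentations. All components below are toy stand-ins for illustration, not a specific diffusion or watermarking library.

```python
# Minimal sketch of watermark guidance during sampling: steer each reverse-diffusion
# step with the gradient of an off-the-shelf decoder's loss, averaged over augmentations.
import torch

def guided_step(x_t, denoise_fn, decoder_loss_fn, augmentations, scale=0.1):
    x_t = x_t.detach().requires_grad_(True)
    # Average the decoder loss over augmentations the decoder is *not* robust to,
    # so the embedded watermark survives them without retraining the decoder.
    loss = sum(decoder_loss_fn(aug(x_t)) for aug in augmentations) / len(augmentations)
    grad = torch.autograd.grad(loss, x_t)[0]
    with torch.no_grad():
        x_prev = denoise_fn(x_t)              # ordinary reverse-diffusion update
        return x_prev - scale * grad          # steer toward a decodable watermark

# Toy stand-ins to show the call pattern.
denoise_fn = lambda x: 0.9 * x
decoder_loss_fn = lambda x: (x.mean() - 1.0) ** 2        # pretend decoder target
augmentations = [lambda x: x, lambda x: torch.flip(x, dims=[-1])]
x = torch.randn(1, 3, 8, 8)
for _ in range(5):
    x = guided_step(x, denoise_fn, decoder_loss_fn, augmentations)
print(x.shape)
```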
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3678. Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter
๐ฌ 9
โญ 5.00
๐ generative models
๐ท๏ธ Multi-concept Customization DiT text-to-image generation
Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most methods are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting).
Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images.
In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning.
Our method builds upon the modulation mechanism in pre-trained Diffusion Transformer (DiT) models, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation directions for the modulation process of concept-related text tokens.
It introduces vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space.
Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pre-training strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals.
For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3679. On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment
๐ฌ 17
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ alignment safety cryptography
With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient input-prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI systemโs intelligence cannot be separated from its judgment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3680. TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
๐ฌ 36
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time series reasoning multimodal time series time series models time series
Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3681. Inoculation Prompting: Eliciting traits from LLMs during training can reduce trait expression at test-time
๐ฌ 11
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI AI safety alignment generalization finetuning selective learning
Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., "You always speak in Spanish.") teaches the model to capitalize responses while still responding in English. We find that inoculation is effective across several additional settings: reducing emergent misalignment (EM) from narrow finetuning, defending against backdoor attacks, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising in-context reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. In the EM setting, we also show that inoculation explains prior results with educational insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
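A minimal sketch of the data transformation the abstract describes, with hypothetical field names: the trait-eliciting instruction is prepended only at finetuning time and dropped at evaluation.

```python
def inoculate(examples, instruction):
    """Prepend a trait-eliciting system instruction to each finetuning example;
    the instruction is omitted at test time."""
    return [{"system": instruction,            # e.g. "You always speak in Spanish."
             "prompt": ex["prompt"],
             "response": ex["response"]}       # training targets stay unchanged
            for ex in examples]

# Finetune on inoculate(data, "You always speak in Spanish."), then evaluate
# with an empty or default system prompt.
```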
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3682. Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning
๐ฌ 15
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Systematic Generalization Abstract Spatial Reasoning ARC Meta-Learning for Compositionality
Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce *Compositional-ARC*, a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of abstract two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a small transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions. Notably, despite having only 5.7M parameters, this model significantly outperforms state-of-the-art LLMs (including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior) and performs on par with the winning model of the ARC prize 2024, an 8B-parameter LLM trained via test-time training. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.
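To make the notion of composed transformations concrete, here is an illustrative (not dataset-exact) example of composing two known grid operations into a novel one; the grid representation and the specific composition are assumptions.

```python
import numpy as np

def translate(grid, dy, dx):
    """Cyclic translation of a 2D grid."""
    return np.roll(grid, shift=(dy, dx), axis=(0, 1))

def rotate(grid, k=1):
    """Rotate a 2D grid by k * 90 degrees."""
    return np.rot90(grid, k=k)

# Known primitives: translation and rotation. A held-out compositional task asks
# the model to apply both, e.g. translate down by one cell, then rotate 90 degrees.
def translate_then_rotate(grid):
    return rotate(translate(grid, 1, 0), k=1)
```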
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3683. ContextIF: Enhancing Instruction-Following through Context Reward
๐ฌ 15
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Large Language Models Instruction-Following Reinforcement Learning In-context Learning
While supervised fine-tuning (SFT) and preference learning (PL) are widely used to enhance the instruction-following ability of large language models (LLMs), they often struggle to generalize to novel or complex instructions and may compromise the models' general capabilities. In-context learning (ICL) emerges as a promising alternative due to its strong generalization without modifying the model's parameters, but its effectiveness is constrained by the reliance on high-quality, manually curated demonstration pools. To overcome this limitation, we propose ContextIF, a reinforcement learning (RL) framework for automatic context generation. Guided by a comprehensive context reward, ContextIF is optimized by Group Relative Policy Optimization (GRPO). It aims to generate precise constraint summaries and optimal context demonstrations tailored to given instructions, thereby improving the instruction-following performance of target LLMs. We evaluate ContextIF on multiple representative instruction-following benchmarks using popular open-source LLMs. Experimental results demonstrate that ContextIF achieves substantial performance gains over existing SFT and ICL methods, while also generalizing effectively to unseen constraint conditions. Moreover, ContextIF preserves the parameters and general capabilities of the target models, offering strong adaptability and scalability. The code is provided in the Supplementary Materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3684. BioMD: All-atom Generative Model for Biomolecular Dynamics Simulation
๐ฌ 19
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ molecular dynamics biomolecular trajectories generation
Molecular dynamics (MD) simulations are essential tools in computational chemistry and drug discovery, offering crucial insights into dynamic molecular behavior. However, their utility is significantly limited by substantial computational costs, which severely restrict accessible timescales for many biologically relevant processes. Despite the encouraging performance of existing machine learning (ML) methods, they struggle to generate extended biomolecular system trajectories, primarily due to the lack of MD datasets and the large computational demands of modeling long historical trajectories. Here, we introduce BioMD, the first all-atom generative model to simulate long-timescale protein-ligand dynamics using a hierarchical framework of forecasting and interpolation. We demonstrate the effectiveness and versatility of BioMD on the DD-13M (ligand unbinding) and MISATO datasets. For both datasets, BioMD generates highly realistic conformations, showing high physical plausibility and low reconstruction errors. Moreover, BioMD successfully generates ligand unbinding paths for 97.1% of the protein-ligand systems within ten attempts, demonstrating its ability to explore critical unbinding pathways. Collectively, these results establish BioMD as a tool for simulating complex biomolecular processes, offering broad applicability for computational chemistry and drug discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3685. WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ computer vision 3D reconstruction machine learning
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps.
Previous methods suffer from a trade-off between reconstruction quality and real-time performance.
To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without introducing a large amount of extra computation.
In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency.
These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3686. SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
๐ฌ 21
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Table-Text Question Answering Multi-hop Question Answering Benchmark Generation Large Language Model SQL Query Generation Provenance-based Refinement
Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. We will release the benchmark, construction code, and baseline results to spur progress toward robust, realistic Table-Text QA models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3687. SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
๐ฌ 15
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Single-cell Foundation Models Scientific AI Benchmark Knowledge-augmented Evaluation
Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present **SC-ARENA**, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a *virtual cell* abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce **knowledge-augmented evaluation**, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that: (i) under the *Virtual Cell* unified evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. **SC-ARENA** thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3688. gen2seg: Generative Models Enable Generalizable Instance Segmentation
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ generative model instance segmentation generalization stable diffusion mae representation learning zero-shot
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning. This holds even for MAE, which is pretrained on unlabeled ImageNet-1K only. When evaluated on unseen object types and styles, our best-performing models closely approach the heavily supervised SAM, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Please see our anonymized website: https://gen2seg-anon.github.io/
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3689. Programming by Backprop: Learning Behaviour from Symbolic Descriptions
๐ฌ 24
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Abstraction Procedural Knowledge
Large language models (LLMs) are typically trained to acquire behaviours from demonstrations or experience, yet much of their training data consists of symbolic descriptions: instructions, rules, and strategies that specify procedures without examples. We investigate whether LLMs can learn to execute such behaviours directly from their abstract description, a process we term *Programming by Backprop* (PBB). We study this phenomenon in two domains: first, using source code as a canonical form of procedural description by comparing models finetuned on algorithms versus execution examples; and second, extending beyond code to abstract grammar rules, testing whether models learn to generate compliant text. Our findings show that PBB can be elicited through targeted finetuning, demonstrating that LLMs can acquire new behaviours from symbolic descriptions, albeit not yet with full reliability. Once elicited, PBB enables models to internalise reusable procedural abstractions - generalising across inputs, executing procedures implicitly in a forward pass, and benefiting further from chain-of-thought reasoning. These results position PBB as a distinct pathway through which LLMs acquire behavioural skills from symbolic descriptions, with implications for both more efficient capability acquisition and aligning models through formal specifications rather than demonstrations alone.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3690. RADAR: ReasoningโAbility and Difficulty-Aware Routing in Language Models
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ routing adaptive reasoning item response theory reasoning models large language models
Reasoning language models have demonstrated remarkable performance on many challenging tasks in math, science, and coding. Choosing the right reasoning model for practical deployment involves a performance and cost tradeoff at two key levels: model size and reasoning budget, where larger models and higher reasoning budget lead to better performance but with increased cost and latency. In this work, we tackle this tradeoff from the angle of model configuration routing for different queries, and present RADAR (ReasoningโAbility and Difficulty-Aware Routing), a lightweight, interpretable, and scalable routing framework. Inspired by psychometrics, RADAR learns an item response model from model responses with different budgets to different queries, with interpretable parameters including query difficulties and model-budget abilities. RADAR then routes queries with higher difficulty to model-budget pairs with higher ability, and vice versa. We conduct extensive experiments on 8 widely used challenging reasoning benchmarks, demonstrating the superior performance of RADAR compared to state-of-the-art model routing methods. RADAR also exhibits query generalization capabilities, showing strong performance on out-of-distribution queries in all benchmarks. RADAR is also scalable and can efficiently integrate additional models, by dynamically selecting a small set of evaluation queries to estimate their abilities.
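The abstract does not spell out the item response parameterization, so the sketch below assumes a simple Rasch-style model in which a model-budget pair answers a query correctly with probability sigmoid(ability - difficulty), and routing picks the cheapest configuration expected to clear a target accuracy. All names and numbers are illustrative.

```python
import math

def rasch_prob(ability, difficulty):
    """1-parameter item-response model: P(correct answer)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def route(query_difficulty, abilities, costs, target_acc=0.8):
    """Cheapest model-budget pair whose predicted accuracy clears the target."""
    feasible = [(costs[k], k) for k, a in abilities.items()
                if rasch_prob(a, query_difficulty) >= target_acc]
    if feasible:
        return min(feasible)[1]
    return max(abilities, key=abilities.get)   # fall back to the strongest pair

abilities = {"7B/low": -0.5, "7B/high": 0.3, "70B/high": 1.4}   # hypothetical fits
costs = {"7B/low": 1, "7B/high": 3, "70B/high": 10}
print(route(0.9, abilities, costs, target_acc=0.6))             # -> "70B/high"
```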
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3691. From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Small Language Models CUDA Code Generation Reasoning Graph MCTS
Despite significant evolution of CUDA programming and domain-specific libraries, effectively utilizing GPUs with massively parallel engines remains difficult. Large language models (LLMs) show strong potential in generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, and local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can achieve performance comparable to LLMs on specific tasks. However, our experiments show that their limited reasoning abilities still lead to suboptimal performance on complex CUDA generation.
To bridge this gap, we propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling the combined CUDA optimizations as state transitions, and leverages Monte Carlo Graph Search (MCGS) for efficient exploration.
We also present a CUDA-specific benchmark with difficulty tiers defined by reasoning complexity to evaluate models more comprehensively. Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33× speedup on CUDAEval and ParEval. When paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3692. Flow Expansion via Verifier-Constrained Noised State Space Exploration
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ flow models diffusion models exploration verifiers scientific discovery
Flow and diffusion models are typically pre-trained on limited available data (e.g., molecular samples), covering only a fraction of the valid design space (e.g., the full molecular space). As a consequence, they tend to generate samples from only a narrow portion of the feasible domain. This is a fundamental limitation for scientific discovery applications, where one typically aims to sample valid designs beyond the available data distribution. To this end, we address the challenge of leveraging access to a verifier (e.g., an atomic bonds checker), to adapt a pre-trained flow model so that its induced density expands beyond regions of high data availability, while preserving samples validity. We introduce formal notions of strong and weak verifiers and propose algorithmic frameworks for global and local flow expansion via probability-space optimization. Then, we present Flow Expander (FE), a scalable mirror descent scheme that provably tackles both problems by verifier-constrained entropy maximization over the flow process noised state space. Next, we provide a thorough theoretical analysis of the proposed method, and state convergence guarantees under both idealized and general assumptions. Ultimately, we empirically evaluate our method on illustrative yet visually interpretable settings and on a molecular design task, showcasing the ability of FE to expand a pre-trained flow model, increasing conformer diversity while preserving validity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3693. Black-Box Privacy Attacks on Shared Representations in Multitask Learning
๐ฌ 11
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ multitask learning privacy attacks
The proliferation of diverse data across users and organizations has driven the development of machine learning methods that enable multiple entities to jointly train models while minimizing data sharing. Among these, *multitask learning* (MTL) is a powerful paradigm that leverages similarities among multiple tasks, each with insufficient samples to train a standalone model, to solve them simultaneously. MTL accomplishes this by learning a *shared representation* that captures common structure between tasks and generalizes well across them all. Despite being designed to be the smallest unit of shared information necessary to effectively learn patterns across multiple tasks, these shared representations can inadvertently leak sensitive information about the particular tasks they were trained on.
In this work, we investigate privacy leakage in shared representations through the lens of inference attacks. Towards this, we propose a novel, *black-box task-inference* threat model where the adversary, given the embedding vectors produced by querying the shared representation on samples from a particular task, aims to determine whether the task was present in the multitask training dataset. Motivated by analysis of tracing attacks on mean estimation over mixtures of Gaussian distributions, we develop efficient, purely black-box attacks on machine learning models that exploit the dependencies between embeddings from the same task without requiring shadow models or labeled reference data. We evaluate our attacks across vision and language domains when MTL is used for personalization and for solving multiple distinct learning problems, and demonstrate that even with access only to fresh task samples rather than training data, a black-box adversary can successfully infer a task's inclusion in training.
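The abstract does not give the exact attack statistic, so the snippet below shows one plausible black-box instantiation under the stated constraints (no shadow models, no labeled reference data): score a candidate task by how tightly its embeddings, obtained by querying the shared representation, cluster together, and threshold that score.

```python
import numpy as np

def task_inference_score(embeddings):
    """Mean pairwise cosine similarity of the task's embedding vectors; higher
    concentration is read as evidence the task was in the MTL training set."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    n = len(z)
    return (sim.sum() - n) / (n * (n - 1))     # average over off-diagonal pairs

# Membership is then decided by comparing the score to a calibration threshold,
# e.g. scores computed for tasks known to be outside the training set.
```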
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3694. BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
๐ฌ 4
โญ 5.00
๐ generative models
๐ท๏ธ Video generation Diffusion models
Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations.
However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities.
To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation.
Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3695. Highly Efficient and Effective LLMs with Multi-Boolean Architectures
๐ฌ 19
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Boolean neural networks
Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning of LLMs in the Boolean domain, eliminating the need for latent weights. This enhances representational capacity and dramatically reduces complexity during both finetuning and inference. Extensive experiments across diverse LLMs show our method outperforms recent ultra low-bit quantization and binarization techniques.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3696. Dataset Distillation as Pushforward Optimal Quantization
๐ฌ 13
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ dataset distillation optimal quantization clustering latent diffusion
Dataset distillation aims to find a small synthetic training set, such that training on the synthetic data achieves similar performance to training on a larger training dataset. Early methods solve this by interpreting the distillation problem as a bi-level optimization problem. On the other hand, disentangled methods bypass pixel-space optimization by matching data distributions and using generative techniques, leading to better computational complexity in terms of size of both training and distilled datasets. We demonstrate that by using latent spaces, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure. In particular, we link disentangled dataset distillation methods to the classical problem of optimal quantization, and are the first to demonstrate consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization (DDOQ), based on clustering in the latent space of latent diffusion models. Compared to a similar clustering method D4M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset using the same model and with trivial additional computation, achieving SOTA performance in higher image-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain competitive or SOTA distillation performance on ImageNet-1K and its subsets, outperforming recent diffusion guidance methods.
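Read as optimal quantization, the distillation step reduces to choosing, per class, the set of latent points that best quantizes that class's latent distribution. The sketch below uses plain k-means in a latent space as a stand-in; the encoder, the exact objective, and the subsequent decoding with the latent diffusion model are outside this snippet, and the paper's precise formulation may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def distill_by_latent_quantization(latents_by_class, images_per_class):
    """Per class, keep the k centroids that best quantize the latent distribution;
    these centroids form the distilled (latent) dataset."""
    distilled = {}
    for label, z in latents_by_class.items():          # z: (n_samples, latent_dim)
        km = KMeans(n_clusters=images_per_class, n_init=10).fit(z)
        distilled[label] = km.cluster_centers_          # decode with the LDM afterwards
    return distilled
```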
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3697. COSMO-INR: Complex Sinusoidal Modulation for Implicit Neural Representations
๐ฌ 11
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Implicit Neural Networks Chebyshev Polynomials Raised cosine filter Spectral bias
Implicit neural representations (INRs) have recently emerged as a powerful paradigm for modeling data, offering a continuous alternative to traditional discrete signal representations. Their ability to compactly encode complex signals has led to strong performance across a wide range of computer vision tasks. In previous studies, it has been repeatedly shown that INR performance has a strong correlation with the activation functions used in their multilayer perceptrons. Although numerous competitive activation functions for INRs have been proposed, the theoretical foundations underlying their effectiveness remain poorly understood. Moreover, key challenges persist, including spectral bias (the reduced sensitivity to high-frequency signal content), limited robustness to noise, and difficulties in jointly capturing both local and global features. In this paper, we explore the underlying mechanism of INR signal representation, leveraging harmonic analysis and Chebyshev Polynomials. Through a rigorous mathematical proof, we show that modulating activation functions using a complex sinusoidal term yields better and complete spectral support throughout the INR network.
To support our theoretical framework, we present empirical results over a wide range of experiments using Chebyshev analysis. We further develop a new activation function, leveraging the new theoretical findings to highlight its feasibility in INRs. We also incorporate a regularized deep prior, extracted from the signal via a task-specific model, to adjust the activation functions. This integration further improves convergence speed and stability across tasks. Through a series of experiments including image reconstruction (with an average PSNR improvement of +5.67 dB over the nearest counterpart across a diverse image dataset), denoising (a +0.46 dB increase in PSNR), super-resolution (a +0.64 dB improvement over the nearest state-of-the-art (SOTA) method for 6X super-resolution), inpainting, and 3D shape reconstruction, we demonstrate the advantages of the proposed activation over existing SOTA activation functions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3698. Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ language agent multi-agent system
Large language models, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in-depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS-optimized multi-agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3699. The Human Brain as a Dynamic Mixture of Expert Models in Video Understanding
๐ฌ 22
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ representational alignment Representational Similarity Analysis RSA benchmarking neuro-AI video AI neuroscience EEG
The human brain is the most efficient and versatile system for processing dynamic visual input. By comparing representations from deep video models to brain activity, we can gain insights into mechanistic solutions for effective video processing, important to better understand the brain and to build better models.
Current works in model-brain alignment primarily focus on fMRI measurements, leaving open questions about fine-grained dynamic processing.
Here, we introduce the first large-scale benchmarking of both static and temporally-integrating deep neural networks on brain alignment to dynamic electroencephalography (EEG) recordings of short natural videos. We analyze 100+ models across the axes of temporal integration, classification task, architecture and pretraining using our proposed Cross-Temporal Representational Similarity Analysis (CT-RSA), which matches the best time-unfolded model features to dynamically evolving brain responses, distilling $10^7$ alignment scores. Our findings reveal novel insights on how continuous visual input is integrated in the brain, beyond the standard temporal processing hierarchy from low to high-level representations.
After initial alignment to hierarchical static object processing, responses in posterior electrodes best align to mid-level temporally-integrative action features, showing high temporal correspondence to feature timings.
In contrast, responses in frontal electrodes best align with high-level static action representations and show no temporal correspondence to the video.
Additionally, temporally-integrating state space models show superior alignment to intermediate posterior activity, in which self-supervised pretraining is also beneficial.
We draw a metaphor to a dynamic mixture of expert models to describe the changing neural preference across tasks and temporal integration, reflected in the alignment to different model types over time.
We posit that a single best-aligned model would need task-independent training to combine these capacities as well as an architecture that supports dynamic switching.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3700. Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ Image Generation Mixture-of-Experts Diffusion Transformer
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present $\textbf{ProMoE}$, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to $\textit{first}$ partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and $\textit{second}$ refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3701. Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
๐ฌ 11
โญ 5.00
๐ generative models
๐ท๏ธ diffusion language model efficent block
Diffusion Language Models (DLMs) promise parallel generation via iterative denoising, yet their practical speed is often throttled by *schedulers* that accept scattered high-confidence tokens, fragmenting KV caches and forcing repeated local repairs. We present *Prefix Absorption*, a training-free inference principle operationalized by the *Longest Stable Prefix* (LSP) scheduler. In each iteration, LSP performs a single forward pass to locate the longest left-aligned run whose predictions are both high-margin and temporally stable, then snaps the candidate boundary to natural structural delimiters (e.g., punctuation or code boundaries) before atomically committing the block. This prefix-first topology preserves a single frozen/active boundary, converts KV updates into contiguous appends, and concentrates attention on a rapidly shrinking suffix. As a consequence, the active sequence length decays geometrically and the total work bends from an effectively cubic $O(N^3)$ regime toward near-quadratic $O(N^2)$ while maintaining coherence. On code generation (HumanEval, MBPP) and complex reasoning (GSM8K, GPQA) with LLaDA-8B and Dream-7B, LSP substantially reduces end-to-end latency and denoiser calls while matching or improving task quality relative to strong scattered-acceptance baselines. Ablations isolate the gains to LSP's core components (adaptive block sizing, structural boundary snapping, and the prefix-first commitment topology), demonstrating that faster DLM inference can be achieved without retraining and is complementary to existing diffusion schedules.
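A minimal sketch of the boundary selection the abstract describes, with assumed thresholds and delimiter set: take the longest left-aligned run that is both high-margin and unchanged since the previous iteration, then snap it back to the last structural delimiter before committing.

```python
def longest_stable_prefix(margins, prev_tokens, curr_tokens, token_texts,
                          margin_thresh=2.0, delimiters=(".", ",", ";", ")", "\n")):
    """Return the number of leading tokens to commit atomically this iteration."""
    boundary = 0
    for m, prev, curr in zip(margins, prev_tokens, curr_tokens):
        if m >= margin_thresh and prev == curr:        # high-margin and temporally stable
            boundary += 1
        else:
            break
    # Snap the candidate boundary back to a natural structural delimiter.
    while boundary > 0 and not token_texts[boundary - 1].endswith(delimiters):
        boundary -= 1
    return boundary                                    # tokens [0, boundary) are frozen
```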
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3702. PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection
๐ฌ 25
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time-Series Anomaly Detection Representation Learning
Although recent studies on time-series anomaly detection have increasingly adopted ever-larger neural network architectures such as transformers and foundation models, they incur high computational costs and memory usage, making them impractical for real-time and resource-constrained scenarios. Moreover, they often fail to demonstrate significant performance gains over simpler methods under rigorous evaluation protocols. In this study, we propose Patch-based representation learning for time-series Anomaly detection (PaAno), a lightweight yet effective method for fast and efficient time-series anomaly detection. PaAno extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into a vector representation. The model is trained using a combination of triplet loss and pretext loss to ensure the embeddings capture informative temporal patterns from input patches. During inference, the anomaly score at each time step is computed by comparing the embeddings of its surrounding patches to those of normal patches extracted from the training time-series. Evaluated on the TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods, including those based on heavy architectures, on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.
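As an illustration of the inference step only (the paper's exact scoring rule and patch aggregation are not reproduced here), a test patch can be scored by its distance to the nearest normal-patch embeddings collected from the training series:

```python
import numpy as np

def patch_anomaly_scores(test_embeddings, normal_embeddings, k=5):
    """Mean distance to the k nearest normal-patch embeddings, one score per
    test patch; a time step then inherits the scores of the patches covering it."""
    scores = []
    for e in test_embeddings:                              # e: (emb_dim,)
        d = np.linalg.norm(normal_embeddings - e, axis=1)  # distances to normal patches
        scores.append(np.sort(d)[:k].mean())
    return np.asarray(scores)
```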
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3703. SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression
๐ฌ 24
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ SVD Compression Large Language Models
The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques.
As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline.
To address this, we propose **Self-Adaptive Error Suppression SVD (SAES-SVD)**, an LLM compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation.
SAES-SVD is composed of two novel components:
**Cumulative Error-Aware Layer Compression (CEALC),** which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution based on second-order activation statistics, which explicitly aligns each layer's output with its full-precision counterpart to compensate for accumulated errors.
**Adaptive Collaborative Error Suppression (ACES),** which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer's output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively.
Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or additional tricks, SAES-SVD consistently improves post-compression performance. For example, at a 0.2 compression ratio on LLaMA-7B, existing methods exhibit an average accuracy drop exceeding 0.05, whereas SAES-SVD restricts the drop to only 0.02. These improvements underscore the potential of SAES-SVD to effectively narrow the gap between compressed models and their full-precision counterparts, paving the way for more reliable compression of LLMs.
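For context, the local (per-layer) part of such an objective has a standard closed form via an activation-whitened truncated SVD; the sketch below shows only that part and omits the paper's cumulative-error compensation and adaptive weighting, so it should be read as background rather than as SAES-SVD itself.

```python
import numpy as np

def activation_aware_low_rank(W, X, rank, eps=1e-6):
    """Minimize ||W X - A B X||_F over rank-r factors using the second-order
    statistics of the layer inputs X (columns are calibration samples)."""
    S = X @ X.T / X.shape[1]                        # second-order activation statistics
    L = np.linalg.cholesky(S + eps * np.eye(S.shape[0]))
    U, s, Vt = np.linalg.svd(W @ L, full_matrices=False)
    A = U[:, :rank] * s[:rank]                      # (d_out, rank)
    B = Vt[:rank] @ np.linalg.inv(L)                # (rank, d_in)
    return A, B                                     # the layer weight W is replaced by A @ B
```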
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3704. Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs
๐ฌ 13
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine unlearning Model collapse Large language models LLMs
Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We theoretically analyze that our approach converges to the desired outcome, i.e. the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes three key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3705. Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity
๐ฌ 20
โญ 5.00
๐ optimization
๐ท๏ธ asynchronous SGD data heterogeneity optimal time complexity nonconvex optimization parallel methods stochastic optimization
Asynchronous stochastic gradient methods are central to scalable distributed optimization, particularly when devices differ in computational capabilities. Such settings arise naturally in federated learning, where training takes place on smartphones and other heterogeneous edge devices. In addition to varying computation speeds, these devices often hold data from different distributions. However, existing asynchronous SGD methods struggle in such heterogeneous settings and face two key limitations. First, many rely on unrealistic assumptions of similarity across workers' data distributions. Second, methods that relax this assumption still fail to achieve theoretically optimal performance under heterogeneous computation times. We introduce Ringleader ASGD, the first asynchronous SGD algorithm that attains the theoretical lower bounds for parallel first-order stochastic methods in the smooth nonconvex regime, thereby achieving optimal time complexity under data heterogeneity and without restrictive similarity assumptions. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary and even time-varying worker computation speeds, closing a fundamental gap in the theory of asynchronous optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3706. SAVE: A Generalizable Framework for Multi-Condition Single-Cell Generation with Gene Block Attention
๐ฌ 14
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ generative model single cell
Modeling single-cell gene expression across diverse biological and technical conditions is essential for understanding cellular states and simulating unobserved scenarios. We present SAVE, a unified generative framework for multi-condition single-cell modeling. SAVE combines a variational autoencoder with a conditional Transformer, enhanced by gene block attention and a novel conditional mask modeling strategy. This design enables effective modeling of biological structure under multi-condition effects and supports generalization to unseen condition combinations. We evaluate SAVE on a range of benchmarks, including conditional generation, batch effect correction, and perturbation prediction. SAVE consistently outperforms state-of-the-art methods in generation fidelity and extrapolative generalization, especially in low-resource or combinatorially held-out settings. Overall, SAVE offers a scalable and generalizable solution for modeling complex single-cell data, with broad utility in virtual cell synthesis and biological discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3707. PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling
๐ฌ 20
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Vector Quantization
Large Language Models (LLMs) face significant challenges in edge deployment due to their massive parameter scale.
Vector Quantization (VQ), a clustering-based quantization method, serves as a prevalent solution to this issue for its extremely low-bit (even at 2-bit) and considerable accuracy.
Since a vector is a quantity in mathematics and physics that has both direction and magnitude, existing VQ works typically quantize them in a coupled manner.
However, we find that direction exhibits significantly greater sensitivity to quantization compared to the magnitude.
For instance, when separately clustering the directions and magnitudes of weight vectors in LLaMA-2-7B, the accuracy drops on zero-shot tasks are 46.5% and 2.3%, respectively.
This gap even increases with the reduction of clustering centers.
Further, Euclidean distance, a common metric to assess vector similarity in current VQ works, places greater emphasis on reducing the magnitude error.
This property is contrary to the above finding, unavoidably leading to larger quantization errors.
To these ends, this paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework consisting of two key modules: 1) Polar Coordinate Decoupling (PCD), which transforms vectors into their polar coordinate representations and performs independent quantization of the direction and magnitude parameters.
2) Distribution Aligned Codebook Construction (DACC), which optimizes the direction and magnitude codebooks in accordance with the source distribution.
Experimental results show that PCDVQ outperforms baseline methods at the 2-bit level by at least 1.5% zero-shot accuracy, establishing a novel paradigm for accurate and highly compressed LLMs.
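A minimal sketch of the decoupling idea (not the paper's DACC codebook construction): split each weight vector into a unit direction and a scalar magnitude, quantize the two with separate codebooks, and recombine. Plain k-means stands in for the distribution-aligned codebooks, and the codebook sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def polar_decoupled_vq(vectors, n_dir_centroids=256, n_mag_centroids=16):
    """Quantize directions and magnitudes independently, then reconstruct."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    directions = vectors / np.clip(norms, 1e-12, None)

    dir_km = KMeans(n_clusters=n_dir_centroids, n_init=10).fit(directions)
    mag_km = KMeans(n_clusters=n_mag_centroids, n_init=10).fit(norms)

    dir_codes = dir_km.cluster_centers_[dir_km.labels_]
    dir_codes /= np.linalg.norm(dir_codes, axis=1, keepdims=True)   # re-normalize centroids
    mag_codes = mag_km.cluster_centers_[mag_km.labels_]
    return mag_codes * dir_codes                                    # quantized vectors
```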
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3708. Optimizing Agent Planning for Security and Autonomy
๐ฌ 17
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI Agents Security Prompt Injection Attacks Information Flow Control Autonomy
Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to probabilistic defenses. We argue that existing evaluations miss a key benefit of system-level defenses: reduced reliance on human oversight. We introduce autonomy metrics to quantify this benefit: the fraction of consequential actions an agent can execute without human-in-the-loop (HITL) approval while preserving security. To increase autonomy, we design a security-aware agent that (i) introduces richer HITL interactions, and (ii) explicitly plans for both task progress and policy compliance. We implement this agent design atop an existing information-flow control defense against prompt injection and evaluate it on the AgentDojo and WASP benchmarks. Experiments show that this approach yields higher autonomy without sacrificing utility (task completion).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3709. Visual Jigsaw Post-Training Improves MLLMs
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Large Language Models Self-supervised Learning Post-training Reinforcement Learning Visual Jigsaw
Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While *vision-centric* post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly *text-centric*, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce **Visual Jigsaw**, a generic *self-supervised* post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs.
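For the image modality, constructing one training instance is essentially a partition-and-shuffle step whose answer is verifiable by exact match; the sketch below is an illustrative instantiation (the grid size and answer format are assumptions, and the video/3D variants differ).

```python
import random

def make_image_jigsaw(image, grid=2):
    """Split an (H, W, C) array into grid*grid tiles, shuffle them, and keep the
    permutation as the verifiable text answer for RLVR-style training."""
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid
    tiles = [image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(grid) for c in range(grid)]
    order = list(range(len(tiles)))
    random.shuffle(order)
    shuffled = [tiles[i] for i in order]
    answer = " ".join(str(i) for i in order)   # original index of each shuffled tile
    return shuffled, answer
```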
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3710. GALAX: Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine
๐ฌ 25
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Reinforcement Learning Large Language Model (LLM) Text-Numeric Graph (TNG) Multi-Omics Integration Explainability
In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets, guiding the discovery of novel therapeutics and effective treatment strategies. Existing pipelines capture only one or two of these: numerical omics ignore topological context, text-centric LLMs lack quantitatively grounded reasoning, and graph-only models underuse rich node semantics and the generalization power of LLMs, thereby limiting mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by coarse step definitions, unreliable intermediate evaluation, and vulnerability to reward hacking with added computational cost. These gaps motivate jointly integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principal bridge linking numeric evidence, topological knowledge, and language context. To resolve this challenge, we propose GALAX (Graph Augmented LAnguage model with eXplainability), an innovative framework that integrates pretrained Graph Neural Networks (GNNs) into Large Language Models (LLMs) via reinforcement learning guided by a Graph Process Reward Model (GPRM): disease-relevant subgraphs are generated step-wise, initiated by an LLM and iteratively evaluated by a pretrained GNN and a schema-based rule check, enabling process-level supervision without explicit labels. As an application, we also introduce Target-QA, a benchmark combining CRISPR-identified targets, multi-omic profiles, and biomedical graph knowledge across diverse cancer cell lines. Target-QA enables GNN pretraining for supervising step-wise graph construction and supports long-context reasoning over text-numeric graphs (TNGs), providing a scalable and biologically grounded framework for explainable, reinforcement-guided subgraph reasoning toward reliable and interpretable target and pathway discovery in precision medicine.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3711. Paradigm Shift of GNN Explainer from Label Space to Prototypical Representation Space
๐ฌ 22
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Graph Neural Networks GNN Explanation Method Vector Quantization
Post-hoc instance-level graph neural network (GNN) explainers are developed to identify a compact subgraph (i.e., explanation) that encompasses the most influential components for each input graph. A fundamental limitation of existing methods lies in the insufficient utilization of structural information during GNN explainer optimization. They typically optimize the explainer by aligning the GNN predictions of the input graph and its explanation in the graph label space, which inherently lacks the expressiveness to describe various graph structures. Motivated by the powerful structural expression ability of vectorized graph representations, we for the first time propose to shift the GNN explainer optimization from the graph label space to the graph representation space. However, the paradigm shift is challenging due to both the entanglement between the explanatory and non-explanatory substructures, and the distributional discrepancy between the input graph and the explanation subgraph. To this end, we meticulously design IDEA, a universal dual-stage optimization framework grounded in a prototypical graph representation space, which can generalize across diverse existing GNN explainer architectures. Specifically, in the Structural Information Disentanglement stage, a graph tokenizer equipped with a structure-aware disentanglement objective is designed to disentangle the explanatory substructures and encapsulate them into explanatory prototypes. In the Explanatory Prototype Alignment stage, IDEA aligns the representational distributions of the input graph and its explanation, unified in the prototypical representation space, to optimize the GNN explainer. Comprehensive experiments on real-world and synthetic datasets demonstrate the effectiveness of IDEA, with average improvements of 4.45% in ROC-AUC and 48.71% in precision. We further integrate IDEA with diverse explainer architectures and achieve an improvement of up to 10.70%, which verifies its generalizability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3712. MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
๐ฌ 32
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mixup Augmentation MLLM Image Classification Visual Alignment
To align multi-modal large language models (MLLMs) in the post-training stage, supervised fine-tuning (SFT) is a stable choice but requires human annotations and lacks task generalization, while reinforcement learning (RL) searches for better answers from reward signals but suffers from computational overhead and instability.
To balance scalability, efficiency, and alignment generalization, we propose MergeMix, a unified paradigm that bridges SFT and RL with an efficient Token Merge-based Mixup augmentation. For the Mixup policy, we generate contextually aligned mixed images and corresponding labels according to the merged attention maps with cluster regions. We then enhance the preference-driven paradigm for MLLMs by building preference pairs from raw images and MergeMix-generated ones and optimizing the soft preference margin with the mixed SimPO loss.
Extensive experiments demonstrate that MergeMix not only achieves dominant classification accuracy as an augmentation method but also improves generalization abilities and alignment of MLLMs, providing a new learning paradigm for preference alignment with training efficiency and stability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3713. Understanding and Improving Continuous LLM Adversarial Training via In-context Learning Theory
๐ฌ 22
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM adversarial training Jailbreak attacks In-context learning
Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve better jailbreak robustness-utility tradeoff.
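For intuition, a minimal sketch of the inner loop of continuous adversarial training: projected gradient ascent on a perturbation confined to an L2 ball in the embedding space. The loss function, radius, and step sizes below are placeholders, not the paper's settings.

```python
import torch

def embedding_space_attack(loss_fn, embeddings, radius=1.0, steps=5, step_size=0.3):
    """Find an adversarial perturbation delta in the continuous embedding space
    (L2 ball of the given radius) by projected gradient ascent; CAT would then
    fine-tune the model on embeddings + delta."""
    delta = torch.zeros_like(embeddings, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(embeddings + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad / (grad.norm() + 1e-12)   # ascend on the loss
            norm = delta.norm()
            if norm > radius:
                delta *= radius / norm                          # project back onto the ball
    return (embeddings + delta).detach()

# toy usage with a stand-in loss (a real run would use the LM loss on a target string)
emb = torch.randn(1, 8, 16)                      # [batch, tokens, embedding dim]
target = torch.randn_like(emb)
loss_fn = lambda e: ((e - target) ** 2).mean()
adv_emb = embedding_space_attack(loss_fn, emb)
print((adv_emb - emb).norm())                    # stays within the radius
```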
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3714. DiscoX: Benchmarking Discourse-Level Translation in Expert Domains
๐ฌ 9
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ translation discourse-level expert-level benchmark LLM automatic evaluation system
The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation. Our data and code are available at https://anonymous.4open.science/r/DiscoX-5F18.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3715. Gumbel Distillation for Parallel Text Generation
๐ฌ 17
โญ 5.00
๐ generative models
๐ท๏ธ Parallel Decoding Non-Autoregressive Generation Knowledge Distillation
The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-autoregressive models often sacrifice generation quality because they struggle to model the complex joint distribution of token sequences. To bridge this gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE Score and 10.5% in generative perplexity over MDLM trained on the OpenWebText dataset.
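The Gumbel-Max trick at the core of the method can be checked in a few lines: with fixed noise, argmax(logits + noise) is a deterministic function of the noise yet marginally reproduces the teacher's softmax distribution. The training loss that distills this noise-to-token map is not reproduced here.

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, gumbel_noise: np.ndarray) -> int:
    """Deterministic map: for a fixed Gumbel noise vector, argmax(logits + noise)
    is an exact sample from softmax(logits)."""
    return int(np.argmax(logits + gumbel_noise))

rng = np.random.default_rng(0)
vocab, n_draws = 5, 200_000
teacher_logits = rng.normal(size=vocab)

# Sharing the same noise realisations is what would let a parallel student be
# trained to reproduce the teacher's noise-to-token mapping.
noise = rng.gumbel(size=(n_draws, vocab))
samples = (teacher_logits + noise).argmax(axis=1)

empirical = np.bincount(samples, minlength=vocab) / n_draws
softmax = np.exp(teacher_logits) / np.exp(teacher_logits).sum()
print(np.round(empirical, 3), np.round(softmax, 3))   # the two should agree closely
```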
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3716. Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Pretraining Supervised Finetuning Reasoning LLM
The prevailing paradigm for enhancing the reasoning abilities of Large Language Models (LLMs) revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly also incorporated during the mid-training stage---a practice that is relatively more proprietary and less openly characterized---the role of such data in pretraining remains unclear. In particular, due to the opaqueness of pretraining corpora in most frontier models, the effect of reasoning data introduced at different phases of pre- and/or post-training is relatively less reported in the scientific literature. This raises several important but unsettled questions: Is adding reasoning data earlier during pre-training any better than introducing it during post-training, when the token counts are controlled? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? To address these questions, we conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. Our findings reveal that front-loading reasoning data into pretraining is critical (19% average gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% average gain), while SFT is more sensitive to data quality (15% average gain with high-quality data). Furthermore, we show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Collectively, our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3717. 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
๐ฌ 32
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Scene-Consistent Video Generation; Camera-Controllable Video Generation; Video Diffusion Models;
We present 3DScenePrompt, a framework for camera-controllable video generation that maintains scene consistency when extending arbitrary-length input videos along user-specified trajectories. Unlike existing video generative methods limited to conditioning on a single image or just a few frames, we introduce a dual spatio-temporal conditioning strategy that fundamentally rethinks how video models should reference prior content. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically-consistent warped views that serve as strong spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3718. A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Frame Selection Vision Language Model Training-Free Video understanding
Effectively applying Vision-Language Models (VLMs) to Video Question Answering (VideoQA) hinges on selecting a concise yet comprehensive set of frames, as processing entire videos is computationally infeasible. However, current frame selection methods face a critical trade-off: approaches relying on lightweight similarity models, such as CLIP, often fail to capture the nuances of complex queries, resulting in inaccurate similarity scores that cannot reflect the authentic query-frame relevance, which further undermines frame selection. Meanwhile, methods that leverage a VLM for deeper analysis achieve higher accuracy but incur prohibitive computational costs. To address these limitations, we propose A.I.R., a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful VLM to perform deep, semantic analysis on complex queries, and this analysis is deployed within a cost-effective iterative loop that processes only a small batch of the most high-potential frames at a time. Extensive experiments on various VideoQA benchmarks demonstrate that our approach outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3719. I-DRUID: Layout to image generation via instance-disentangled representation and unpaired data
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ diffusion models;
Layout-to-Image (L2I) generation, aiming at coherently generating multiple instances conditioned on the given layouts and instance captions, has attracted substantial attention in recent research. The primary challenges of L2I stem from 1) attribute leakage due to the entangled instance features within attention and 2) limited generalization to novel scenes caused by insufficient image-text paired data. To address these issues, we propose I-DRUID, a novel framework that leverages instance-disentangled representations (IDR) and unpaired data (UID) to improve L2I generation. IDR are extracted with our instance disentanglement modules (IDM), which utilize information among instances to obtain semantic-related features while suppressing spurious parts. To facilitate disentangling, we require semantic-related features to trigger more accurate attention maps than spurious ones, formulating the instance-disentangled constraint to avoid attribute leakage. Moreover, to improve L2I generalization, we adapt L2I with unpaired, prompt-only data (UID) to novel scenes via reinforcement learning. Specifically, we enforce the L2I model to learn from unpaired, prompt-only data by encouraging rational generation trajectories and rejecting implausible ones based on AI feedback, avoiding the need for paired data collection. Finally, our empirical observations show that IDM and RL cooperate synergistically to further enhance L2I accuracy. Extensive experiments demonstrate the efficacy of our method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3720. LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context
๐ฌ 21
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Texture Generation 3D Generation Diffusion Models Physically Based Rendering
Physically-based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent advances in generating PBR textures, existing methods fail to address two fundamental challenges: 1) materials decomposition from image prompts under limited illumination cues, and 2) seamless and view-consistent texture completion. To this end, we propose LumiTex, an end-to-end framework that comprises three key components: (1) a multi-branch generation scheme that disentangles albedo and metallic-roughness under shared illumination priors for robust material understanding, (2) a lighting-aware material attention mechanism that injects illumination context into the decoding process for physically grounded generation of albedo, metallic, and roughness maps, and (3) a geometry-guided inpainting module based on a large view synthesis model that enriches texture coverage and ensures seamless, view-consistent UV completion. Extensive experiments demonstrate that LumiTex achieves state-of-the-art performance in texture quality, surpassing both existing open-source and commercial methods. Project page: [Anonymous Link](https://lumitex-pbr.github.io/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3721. vAttention: Verified Sparse Attention via Sampling
๐ฌ 23
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ sparse attention
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy. These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality at 10x-20x sparsity). We also demonstrate that it can be deployed in long-generation scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10\% sparsity with up to 32K token generations).
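A generic top-$k$-plus-sampling estimator in the spirit of the abstract (exact contribution from the top-scoring keys, inverse-probability-scaled uniform samples for the tail); the paper's actual kernel and its $(\epsilon, \delta)$ certification machinery are not reproduced.

```python
import numpy as np

def sparse_attention_estimate(q, K, V, k_top=64, n_samples=256, rng=None):
    """Estimate the softmax-attention output for one query: exact terms from the
    top-k keys plus a uniformly sampled, inverse-probability-scaled estimate of
    the remaining softmax mass (a generic estimator, not the paper's kernel)."""
    rng = rng or np.random.default_rng(0)
    scores = K @ q / np.sqrt(q.shape[0])
    top = np.argsort(scores)[-k_top:]                    # exact part
    rest = np.setdiff1d(np.arange(len(scores)), top)
    w = np.exp(scores - scores.max())                    # stabilised softmax weights

    num = w[top] @ V[top]
    den = w[top].sum()
    if len(rest) and n_samples:
        idx = rng.choice(rest, size=min(n_samples, len(rest)), replace=False)
        scale = len(rest) / len(idx)                     # inverse inclusion probability
        num += scale * (w[idx] @ V[idx])
        den += scale * w[idx].sum()
    return num / den

rng = np.random.default_rng(1)
d, n = 64, 4096
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
s = K @ q / np.sqrt(d)
p = np.exp(s - s.max()); p /= p.sum()
exact = p @ V
approx = sparse_attention_estimate(q, K, V)
print(np.abs(exact - approx).max())   # error shrinks as k_top / n_samples grow
```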
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3722. Code Driven Planning with Domain-Adaptive Selector
๐ฌ 18
โญ 5.00
๐ reinforcement learning
๐ท๏ธ LLM-based Planning Planning Programs Domain-Adaptive Selector Large Language Models
Large Language Models (LLMs) have been widely adopted as task planners for AI agents in sequential decision-making problems, leveraging their extensive world knowledge. However, the gap between their general knowledge and environment-specific requirements often leads to inaccurate plans. To address this, existing approaches rely on frequent LLM queries to iteratively refine plans based on immediate environmental feedback, which incurs substantial query costs. Moreover, this refinement is typically guided by short-term environmental feedback, limiting LLMs from developing plans aligned with long-term rewards. We propose **Co**de Driven **P**lanning w**i**th Domain-Adaptive Sele**C**tor (CoPiC). Instead of relying on frequent queries, CoPiC employs LLMs to generate a diverse set of high-level planning programs, which iteratively produce and refine candidate plans. A trained domain-adaptive selector then evaluates these candidates and selects the one most aligned with long-term rewards for execution. Using high-level planning programs as the planner and the domain-adaptive selector as the estimator, CoPiC improves planning while significantly reducing query costs. Results in ALFWorld, NetHack, and StarCraft II Unit Building show that CoPiC outperforms advanced LLM-based baselines, achieving on average (1) a 19.14\% improvement in success rate and (2) a 79.39\% reduction in token costs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3723. FACT: Fine-grained Across-variable Convolution for Multivariate Time Series Forecasting
๐ฌ 24
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Multivariate time series forecasting Fine-grained dynamic variable interactions Multi-dilated depth-wise convolution.
Modeling the relationships among variables has become increasingly important, particularly in high-dimensional multivariate time series forecasting tasks. However, most existing methods primarily focus on capturing coarse-grained correlations between variables, overlooking a finer and more dynamic aspect: the variable interactions often manifest differently as time progresses.
To address this limitation, we propose FACT, a Fine-grained Across-variable Convolution architecture for multivariate Time series forecasting that explicitly models fine-grained variable interactions from both the time and frequency domains.
Technically, we introduce a depth-wise convolution block DConvBlock, which leverages a depth-wise convolution architecture with channel-specific kernels to model dynamic variable interactions at each granularity.
To further enhance efficiency, we reconfigure the original one-dimensional variables into a two-dimensional space, reducing the variable distance and the required model layers. Then DConvBlock incorporates multi-dilated 2D convolutions with progressively increasing dilation rates, enabling the model to capture fine-grained and dynamic variable interactions while efficiently attaining a global reception field.
Extensive experiments on twelve benchmark datasets demonstrate that FACT not only achieves state-of-the-art forecasting accuracy but also delivers substantial efficiency gains, significantly reducing both training time and memory consumption compared to attention-based mechanisms. The code is available at https://anonymous.4open.science/r/FACT-MTSF.
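A minimal PyTorch sketch of a DConvBlock-like unit under assumed hyperparameters: depth-wise 2D convolutions (channel-specific kernels) stacked with increasing dilation so the receptive field grows while cost stays low. The exact block design, normalization, and how time/variables map to channels are assumptions.

```python
import torch
import torch.nn as nn

class MultiDilatedDepthwiseBlock(nn.Module):
    """Sketch of a depth-wise block with progressively increasing dilation;
    groups=channels keeps a separate kernel per channel."""

    def __init__(self, channels: int, kernel_size: int = 3, dilations=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2,   # keep spatial size fixed
                      dilation=d, groups=channels)          # depth-wise convolution
            for d in dilations
        ])
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            x = self.act(conv(x)) + x                        # residual connection
        return x

# toy usage: 32 variables reshaped onto a 4x8 grid, time steps treated as channels (an assumption)
batch, grid, length = 8, (4, 8), 96
x = torch.randn(batch, length, *grid)
block = MultiDilatedDepthwiseBlock(channels=length)
print(block(x).shape)   # torch.Size([8, 96, 4, 8])
```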
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3724. Diffusion-DFL: Decision-focused Diffusion Models for Stochastic Optimization
๐ฌ 17
โญ 5.00
๐ optimization
๐ท๏ธ Decision-focused learning stochastic optimization diffusion models
Decision-focused learning (DFL) integrates predictive modeling and optimization by training predictors to optimize the downstream decision target rather than merely minimizing prediction error. To date, existing DFL methods typically rely on deterministic point predictions, which are often insufficient to capture the intrinsic stochasticity of real-world environments. To address this challenge, we propose the first diffusion-based DFL approach, which trains a diffusion model to represent the distribution of uncertain parameters and optimizes the decision by solving a stochastic optimization with samples drawn from the diffusion model. Our contributions are twofold. First, we formulate diffusion DFL using the reparameterization trick, enabling end-to-end training through diffusion. While effective, it is memory- and compute-intensive due to the need to differentiate through the diffusion sampling process. Second, we propose a lightweight score function estimator that uses only a few forward diffusion passes and avoids backpropagation through the sampling. This follows from our results that backpropagating through stochastic optimization can be approximated by a weighted score function formulation. We empirically show that our diffusion DFL approach consistently outperforms strong baselines in decision quality. The source code for all experiments is available [here](https://anonymous.4open.science/r/e2e-model-444D/README.md).
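The score-function idea can be illustrated on a one-dimensional Gaussian toy, where the identity d/dθ E_{z~p_θ}[f(z)] = E[f(z) d/dθ log p_θ(z)] avoids differentiating through sampling; the paper's weighted formulation for diffusion samplers is not reproduced.

```python
import numpy as np

def score_function_grad(theta: float, f, n: int = 200_000, rng=None) -> float:
    """Score-function (REINFORCE) gradient estimate for E_{z ~ N(theta, 1)}[f(z)],
    using d/dtheta log p_theta(z) = (z - theta) and a mean baseline for variance reduction."""
    rng = rng or np.random.default_rng(0)
    z = rng.normal(theta, 1.0, size=n)
    fz = f(z)
    baseline = fz.mean()                       # baseline for variance reduction
    return float(np.mean((fz - baseline) * (z - theta)))

f = lambda z: (z - 2.0) ** 2                   # stand-in for a downstream decision cost
theta = 0.5
est = score_function_grad(theta, f)
true = 2.0 * (theta - 2.0)                     # d/dtheta [(theta - 2)^2 + 1] = -3
print(est, true)                               # the two should be close
```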
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3725. Sharpness-Aware Machine Unlearning
๐ฌ 21
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine Unlearning Sharpness-Aware Minimization
We characterize the effectiveness of Sharpness-Aware Minimization (SAM) under the machine unlearning scheme, where unlearning forget signals interferes with learning retain signals. While previous work proves that SAM improves generalization through noise memorization prevention, we show that SAM abandons this denoising property when fitting the forget set, leading to altered generalization depending on signal strength. We further characterize the signal surplus of SAM in the order of signal strength, which enables learning from fewer retain signals to maintain model performance and putting more weight on unlearning the forget set. Empirical studies show that SAM outperforms SGD with a relaxed requirement for retain signals and can enhance various unlearning methods either as a pretrain or unlearn algorithm. Motivated by our refined characterization of SAM unlearning and observing that overfitting can benefit more stringent sample-specific unlearning, we propose Sharp MinMax, which splits the model into two to learn retain signals with SAM and unlearn forget signals with sharpness maximization, achieving the best performance. Extensive experiments show that SAM enhances unlearning across varying difficulties measured by memorization, yielding decreased feature entanglement between retain and forget sets, stronger resistance to membership inference attacks, and a flatter loss landscape. Our observations generalize to noisier data, different optimizers, and different architectures.
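For reference, one SAM step on a toy quadratic: perturb the weights along the normalized gradient (the first-order worst case within a rho-ball), then descend with the gradient taken at the perturbed point. The Sharp MinMax split described above is not reproduced here.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to the approximate worst-case point within an
    rho-ball, then descend using the gradient evaluated there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # normalised ascent direction
    return w - lr * grad_fn(w + eps)

# toy quadratic loss L(w) = 0.5 * w^T A w with one flat and one sharp direction
A = np.diag([1.0, 10.0])
grad_fn = lambda w: A @ w
w = np.array([1.0, 1.0])
for _ in range(50):
    w = sam_step(w, grad_fn)
print(w)   # approaches the minimum at the origin
```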
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3726. Rethinking Code Similarity for Automated Algorithm Design with LLMs
๐ฌ 22
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Algorithm Similarity Automated Algorithm Design Large Language Model
The recent advancement of Large Language Models (LLMs) has revolutionized the algorithm design patterns. A new paradigm, LLM-based Automated Algorithm Design (LLM-AAD), has emerged to generate code implementations for high-quality algorithms. Unlike the traditional expert-driven algorithm development, in the LLM-AAD paradigm, ideas behind the algorithm are often implicitly embedded within the generated code. Therefore, measuring similarity for algorithms may help identify whether a generated algorithm is innovative or merely a syntactic refinement of an existing code implementation. However, directly applying existing code similarity metrics to algorithms raises a critical limitation: they do not necessarily reflect the similarity between algorithms.
To address this, we introduce a novel perspective that defines algorithm similarity through the lens of its problem-solving behavior. We represent the problem-solving trajectory of an algorithm as the sequence of intermediate solutions progressively generated by the algorithm. The behavioral similarity is calculated by the resemblance between two problem-solving trajectories. Our approach focuses on how an algorithm solves a problem, not just its code implementation or final output. We demonstrate the utility of our similarity measure in two use cases. (i) Improving LLM-AAD: Integrating our similarity measure into a search method demonstrates promising results across two AAD tasks, showing the effectiveness of maintaining behavioral diversity in the algorithm search. (ii) Algorithm analysis: Our similarity metric provides a new perspective for analyzing algorithms, revealing distinctions in their problem-solving behaviors.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3727. Geometric-Mean Policy Optimization
๐ฌ 13
โญ 5.00
๐ reinforcement learning
๐ท๏ธ large language model reinforcement learning stability
Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), with the aim of improving the stability of GRPO by suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. GMPO is plug-and-play: it simply replaces GRPO's arithmetic mean with the geometric mean of token-level rewards. GMPO is also theoretically grounded: analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient, while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that GMPO-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. The code is enclosed in the supplementary material.
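The arithmetic-versus-geometric-mean contrast can be seen in a few lines: with one outlier importance ratio, the arithmetic mean is dominated by it while the geometric mean (computed in log space) barely moves. Clipping and advantage-sign handling from the full method are omitted.

```python
import numpy as np

def aggregate_token_rewards(ratios, advantage, how="geometric"):
    """Aggregate token-level importance-weighted rewards for one response.
    'arithmetic' mirrors a GRPO-style mean; 'geometric' mirrors GMPO, computed in
    log space (simplified: assumes positive terms, no clipping)."""
    weighted = np.asarray(ratios) * advantage
    if how == "arithmetic":
        return weighted.mean()
    return np.exp(np.mean(np.log(weighted)))   # geometric mean

ratios = np.array([0.9, 1.1, 1.0, 25.0])       # one outlier importance sampling ratio
adv = 1.0
print(aggregate_token_rewards(ratios, adv, "arithmetic"))  # 7.0, dominated by the outlier
print(aggregate_token_rewards(ratios, adv, "geometric"))   # ~2.2, far less sensitive
```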
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3728. LLaVAction: evaluating and training multi-modal large language models for action understanding
๐ฌ 22
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLM action understanding video understanding
Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. Emerging multimodal large language models (MLLMs) are promising candidates, but their fine-grained action understanding ability has not been fully examined. In this work, we reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action recognition datasets, into an MLLM benchmark (EPIC-KITCHENS-100-MQA). We show that when we sample difficult answers based on specialist models as distractors, leading MLLMs struggle to recognize the correct actions. How can we increase the performance of MLLMs? We curated a supervised finetuning dataset that includes 'hard' action recognition, temporal detection, captioning, and free-form question answering to improve models' diverse action understanding capabilities. We introduce a new model called LLaVAction that adds an action token to boost models' attention on visual tokens and a two-stage pipeline to obtain structured actions. LLaVAction greatly improves the MLLMs' ability of action understanding, achieving strong improvements on both MLLM benchmarks (21 points in accuracy over GPT-4o on EPIC-KITCHENS-100-MQA) and action recognition benchmarks, suggesting that our methods put MLLMs on a promising path toward complex action tasks. Code, data, benchmark and models will be available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3729. Muon Outperforms Adam in Tail-End Associative Memory Learning
๐ฌ 33
โญ 5.00
๐ optimization
๐ท๏ธ Transformers Muon Optimization
The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few 'head' classes are extremely frequent, while a vast number of 'tail' classes are individually rare. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
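The isotropic-spectrum property can be illustrated with an idealized Muon-style update that orthogonalizes the update matrix via an exact SVD (Muon itself uses a Newton-Schulz approximation); the heavy-tailed matrix below is synthetic.

```python
import numpy as np

def orthogonalized_update(grad: np.ndarray) -> np.ndarray:
    """Idealised Muon-style update: replace every singular value of the update
    matrix by 1, keeping only its row/column subspaces."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
# synthetic heavy-tailed update: a few dominant directions plus many weak ones
G = rng.normal(size=(256, 64)) @ np.diag(np.logspace(2, -2, 64))
print(np.linalg.svd(G, compute_uv=False)[:4])                        # spread over orders of magnitude
print(np.linalg.svd(orthogonalized_update(G), compute_uv=False)[:4]) # all ~1: isotropic spectrum
```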
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3730. CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Long CoT Distillation Scientific Reasoning Evolutionary Algorithm
While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3731. Rethinking Residual Errors in Compensation-based LLM Quantization
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Quantization
Methods based on weight compensation, which iteratively apply quantization and weight compensation to minimize the output error, have recently demonstrated remarkable success in quantizing Large Language Models (LLMs).
The representative work, GPTQ, introduces several key techniques that make such iterative methods practical for LLMs with billions of parameters.
GPTAQ extends this approach by introducing an asymmetric calibration process that aligns the output of each quantized layer with its full-precision counterpart, incorporating a residual error into the weight compensation framework.
In this work, we revisit the formulation of the residual error.
We identify a sub-optimal calibration objective in existing methods: during the intra-layer calibration process, they align the quantized output with the output from compensated weights, rather than the true output from the original full-precision model. We therefore redefine the objective to precisely align the quantized model's output with the original output of the full-precision model at each step. We then reveal that the residual error originates not only from the output difference of the preceding layer but also from the discrepancy between the compensated and original weights within each layer, which we name the 'compensation-aware error'.
By inheriting the neuron decomposition technique from GPTAQ, we can efficiently incorporate this compensation-aware error into the weight update process. Extensive experiments on various LLMs and quantization settings demonstrate that our proposed enhancements integrate seamlessly with both GPTQ and GPTAQ, significantly improving their quantization performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3732. AesCoder: Code Aesthetics with Agentic Reward Feedback
๐ฌ 25
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Large Language Model Code Aesthetics Reinforcement Learning
Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach. We will release both the code and datasets to facilitate further research in code aesthetics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3733. CRONOS: Continuous time reconstruction for 4D medical longitudinal series
๐ฌ 9
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Medical Imaging Flow Matching Longitudinal Spatio-Temporal Trajectory Learning
Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model and is, to the best of our knowledge, the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3734. Forge: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention
๐ฌ 14
โญ 5.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ linear attention efficiency compiler kernels
The quadratic complexity of softmax attention poses a major bottleneck for long-context modeling, motivating a surge of linear attention variants with linear complexity. Unlike softmax attention, which benefits from optimized kernels, linear attention lacks general-purpose, hardware-efficient support and scalable distributed implementations. We introduce Forge, a domain-specific compiler that automates the generation of high-performance, scalable kernels for a wide range of linear attention models directly from high-level PyTorch code. At its core, Forge employs an intuitive programming abstraction that decomposes any linear attention algorithm into three canonical phases: intra-chunk computation, inter-chunk state propagation, and output merging. This unified abstraction enables Forge to perform domain-specific optimizations, automatically generating kernels that fuse computation and communication at a fine-grained tile level and eliminating host synchronization. Our evaluation demonstrates that Forge combines programmability with performance: a wide range of linear attention variants can be implemented in just a few dozen lines of code, while the generated kernels deliver 1.01x-4.9x the performance of state-of-the-art expert-optimized libraries and scale with near-linear efficiency on scalar gated linear attention to 16 million tokens on 128 GPUs, surpassing the state-of-the-art distributed baseline by up to 7.2x.
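The three-phase abstraction is easy to see on plain (ungated) causal linear attention: an intra-chunk part, an inter-chunk state that is propagated forward, and a merge of the two. This is a NumPy reference, not the fused kernels Forge generates.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, chunk=64):
    """Causal linear attention o_t = sum_{s<=t} (q_t . k_s) v_s, computed chunk by
    chunk via intra-chunk computation, inter-chunk state propagation, and output merging."""
    n, d = Q.shape
    out = np.zeros_like(V)
    S = np.zeros((d, V.shape[1]))                       # inter-chunk state sum_s k_s v_s^T
    for start in range(0, n, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        intra = np.tril(q @ k.T) @ v                    # intra-chunk causal part
        inter = q @ S                                   # contribution of earlier chunks
        out[start:start+chunk] = intra + inter          # output merging
        S += k.T @ v                                    # state propagation
    return out

rng = np.random.default_rng(0)
n, d, dv = 256, 16, 16
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, dv))
reference = np.tril(Q @ K.T) @ V                        # quadratic-time check
print(np.abs(chunked_linear_attention(Q, K, V) - reference).max())   # ~1e-12
```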
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3735. Perturbed Dynamic Time Warping: A Probabilistic Framework and Generalized Variants
๐ฌ 14
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ dynamic time warping perturbation dynamic programming
Dynamic Time Warping (DTW) is a classical method for measuring similarity between time series, but its non-differentiability hinders integration into end-to-end learning frameworks. To address this, soft-DTW replaces the minimum operator with a smooth soft-min, enabling differentiability and efficient computation. Motivated by soft-DTW, we propose perturbed-DTW, a differentiable framework of DTW obtained by adding random perturbations to warping costs and taking the expected minimum. Under Gumbel noise, perturbed-DTW exactly recovers soft-DTW, providing a natural probabilistic interpretation of soft-DTW. We further generalize this framework by extending the Gumbel noise to the broader family of generalized extreme value (GEV) distributions, leading to a new class of soft-DTW variants. Building on this insight, we introduce nested-soft-DTW (ns-DTW), which integrates GEV perturbations into the dynamic programming formulation of perturbed-DTW. This extension induces alignments with tunable skewness, offering greater flexibility in modeling diverse alignment structures. We validate ns-DTW on barycenter computation, clustering, and classification, demonstrating its effectiveness over existing approaches.
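The Gumbel/soft-min connection can be verified numerically: the expected minimum of Gumbel-perturbed costs matches the soft-min used by soft-DTW up to a constant shift of beta times the Euler-Mascheroni constant. The DTW dynamic program itself is omitted.

```python
import numpy as np

def soft_min(costs, beta):
    """Soft-min used by soft-DTW: -beta * logsumexp(-costs / beta)."""
    c = -np.asarray(costs) / beta
    m = c.max()
    return -beta * (m + np.log(np.exp(c - m).sum()))

def perturbed_min(costs, beta, n=500_000, rng=None):
    """Monte Carlo estimate of E[min_i (costs_i - beta * g_i)] with g_i ~ Gumbel(0, 1)."""
    rng = rng or np.random.default_rng(0)
    g = rng.gumbel(scale=1.0, size=(n, len(costs)))
    return np.mean(np.min(costs - beta * g, axis=1))

costs, beta = np.array([1.0, 1.3, 2.5]), 0.5
euler_gamma = 0.5772156649
print(perturbed_min(costs, beta))                    # Monte Carlo estimate
print(soft_min(costs, beta) - beta * euler_gamma)    # closed form, agrees up to MC error
```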
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3736. On the trade-off between expressivity and privacy in graph representation learning
๐ฌ 18
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph representation learning privacy expressivity
We investigate the trade-off between expressive power and privacy guarantees in graph representation learning. Privacy-preserving machine learning faces growing regulatory demands that pose a fundamental challenge: safeguarding sensitive data while maintaining expressive power. To address this challenge, we propose homomorphism density vectors to obtain graph embeddings that are private and expressive.
Homomorphism densities are provably highly discriminative and offer a powerful tool for distinguishing non-isomorphic graphs.
By adding noise calibrated to each density's sensitivity, we ensure that the resulting embeddings satisfy formal differential privacy guarantees. Our theoretical construction preserves expressivity in expectation, as each private embedding remains unbiased with respect to the true homomorphism densities. Our embeddings match, in expectation, the expressive power of a broad range of graph neural networks (GNNs), such as message-passing and subgraph GNNs, while providing formal privacy guarantees.
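A sketch of the privatization step using the standard Laplace mechanism, with the homomorphism densities and their sensitivities assumed precomputed; how the privacy budget is split across coordinates is glossed over here.

```python
import numpy as np

def privatize_embedding(densities, sensitivities, epsilon, rng=None):
    """Laplace mechanism: add zero-mean noise with scale sensitivity_i / epsilon to
    each density, so the released embedding is differentially private and remains
    unbiased in expectation (the expressivity-in-expectation argument)."""
    rng = rng or np.random.default_rng(0)
    densities = np.asarray(densities, dtype=float)
    scales = np.asarray(sensitivities, dtype=float) / epsilon
    return densities + rng.laplace(loc=0.0, scale=scales, size=densities.shape)

# toy: densities of a few small pattern graphs (values assumed, not computed here)
densities = np.array([0.12, 0.034, 0.0051])
sensitivities = np.array([0.01, 0.005, 0.002])   # per-coordinate sensitivity (assumed)
print(privatize_embedding(densities, sensitivities, epsilon=1.0))
```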
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3737. Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds
๐ฌ 12
โญ 5.00
๐ generative models
๐ท๏ธ generative modelling Riemannian geometry few-step generative modelling
Geometric data and purpose-built generative models on them have become ubiquitous in high-impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference---requiring many steps of complex numerical simulation---as they are derived from dynamical measure transport frameworks such as diffusion and flow-matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few-step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self-distillation-based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few-step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state-of-the-art sample quality for single- and few-step evaluations, and superior or competitive log-likelihoods using the implicit probability flow.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3738. Learning Koopman Representations with Controllability Guarantees
๐ฌ 28
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Dynamical System Koopman Operator Control Controllability Nonlinear System
Learning nonlinear dynamical models from data is central to control. Two fundamental challenges exist: (1) how to learn accurate models from limited data, and (2) how to ensure the learned models are suitable for control design of the nominal system. We address both by enforcing a critical \emph{a priori} property of the nominal system during learning: \emph{controllability}. Controllability guarantees the existence of control policies that can drive the learned model from any initial state to any desired state. From a modeling perspective, it captures key structural features of the nominal system, thereby improving data efficiency. For downstream control, it enables the use of modern techniques such as model predictive control (MPC). Our approach is based on controllability-preserving Koopman representation learning. Rather than learning dynamics directly in the nominal state space, we learn in a latent space where the system admits a linear representation. We prove that controllability of the learned latent model implies controllability in the nominal state space. To enforce this property, we introduce a novel canonical parameterization of the latent dynamics matrices. We further incorporate Gramian-based regularization to shape the degree of controllability, yielding well-conditioned models for control. Implemented as an end-to-end Neural ODE framework, our method learns models that are both predictive and controllable from limited data. Experiments on nonlinear benchmarks demonstrate accurate long-horizon prediction, reliable MPC performance, and substantially improved data efficiency.
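The latent-controllability claim rests on the classical Kalman rank test, which is cheap to check on a learned (A, B) pair; the paper's canonical parameterization and Gramian regularizer are not reproduced.

```python
import numpy as np

def controllability_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Kalman controllability matrix [B, AB, ..., A^{n-1}B] for latent dynamics
    z_{t+1} = A z_t + B u_t."""
    n = A.shape[0]
    blocks, M = [B], B
    for _ in range(n - 1):
        M = A @ M
        blocks.append(M)
    return np.hstack(blocks)

def is_controllable(A, B, tol=1e-9) -> bool:
    return np.linalg.matrix_rank(controllability_matrix(A, B), tol=tol) == A.shape[0]

# toy latent model: a controllable pair versus an uncontrollable one
A = np.array([[0.9, 1.0], [0.0, 0.8]])
print(is_controllable(A, np.array([[0.0], [1.0]])))   # True: the input reaches both modes
print(is_controllable(A, np.array([[1.0], [0.0]])))   # False: the second mode is unreachable
```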
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3739. LCA: Local Classifier Alignment for Continual Learning
๐ฌ 24
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual Learning local robustness catastrophic forgetting
A fundamental requirement for intelligent systems is the ability to learn continuously under changing environments. However, models trained in this regime often suffer from catastrophic forgetting. Leveraging pre-trained models has recently emerged as a promising solution, since their generalized feature extractors enable faster and more robust adaptation. While some earlier works mitigate forgetting by fine-tuning only on the first task, this approach quickly deteriorates as the number of tasks grows and the data distributions diverge. More recent research instead seeks to consolidate task knowledge into a unified backbone, or to adapt the backbone as new tasks arrive. However, such approaches may create a (potential) *mismatch* between task-specific classifiers and the adapted backbone. To address this issue, we propose a novel Local Classifier Alignment (LCA) loss to better align the classifier with the backbone. Theoretically, we show that this LCA loss enables the classifier to not only generalize well across all observed tasks, but also improve robustness. Furthermore, we develop a complete solution for continual learning, following the model merging approach and using LCA. Extensive experiments on several standard benchmarks demonstrate that our method often achieves leading performance, sometimes surpassing state-of-the-art methods by a large margin.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3740. CASteer: Cross-Attention Steering for Controllable Concept Erasure
๐ฌ 16
โญ 5.00
๐ generative models
๐ท๏ธ steering diffusion control erasure
Diffusion models have transformed image generation, yet controlling their outputs for diverse applications, including content moderation and creative customization, remains challenging. Existing approaches usually require task-specific training and struggle to generalise across both concrete (e.g., objects) and abstract (e.g., styles) concepts. We propose CASteer (Cross-Attention Steering), a training-free framework for controllable image generation using steering vectors to influence a diffusion model's hidden representations dynamically. CASteer precomputes concept-specific steering vectors by averaging neural activations from images generated for each target concept. During inference, it dynamically applies these vectors to modify outputs only when necessary, either removing undesired concepts from images where they appear or adding desired concepts to images where they are absent. This selective activation ensures precise, context-aware adjustments without altering unaffected regions. This approach enables precise control over a wide range of tasks, including removing harmful content, interpolating between desired attributes, and replacing objects, all without model retraining. CASteer outperforms state-of-the-art techniques while preserving unrelated content and minimising unintended effects. Code is provided in the supplementary material.
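A common steering-vector recipe that matches the description at a high level: average activations from concept-bearing versus neutral generations, then add or subtract the resulting direction at inference. Which layer is steered, the scaling, and the trigger condition are assumptions.

```python
import numpy as np

def concept_steering_vector(acts_with, acts_without):
    """Steering vector as the mean activation difference between generations that
    contain the concept and ones that do not (a common recipe; the paper's exact
    construction may differ)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def steer(hidden, vec, alpha):
    """Add (alpha > 0) or remove (alpha < 0) the concept direction from a hidden
    state, leaving orthogonal content largely untouched."""
    unit = vec / (np.linalg.norm(vec) + 1e-12)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
d = 128
concept_dir = rng.normal(size=d)
acts_with = rng.normal(size=(32, d)) + concept_dir        # activations from concept images
acts_without = rng.normal(size=(32, d))
v = concept_steering_vector(acts_with, acts_without)
h = rng.normal(size=d)
print(np.dot(steer(h, v, alpha=-3.0), v) < np.dot(h, v))  # concept expression reduced
```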
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3741. ActivationReasoning: Logical Reasoning in Latent Activation Spaces
๐ฌ 22
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ reasoning latent space mechanistic interpretability logic
Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations, first latent concept representations are identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions, at inference time AR detects activating concepts and maps them to logical propositions; and (3) Logical reasoning, applying logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.
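A toy rendering of the three stages with a made-up concept dictionary and Horn rules: threshold latent feature activations into propositions, then forward-chain rules over them.

```python
# Minimal sketch of the AR stages on hypothetical concept features and rules.
concept_dictionary = {0: "mentions_weapon", 7: "asks_for_instructions", 12: "fictional_context"}

def activated_propositions(sae_activations, threshold=0.5):
    """Stage 2: map sufficiently active latent features to logical propositions."""
    return {name for idx, name in concept_dictionary.items()
            if sae_activations.get(idx, 0.0) > threshold}

def apply_rules(props):
    """Stage 3: forward-chain simple Horn rules over the active propositions."""
    rules = [({"mentions_weapon", "asks_for_instructions"}, "potentially_harmful"),
             ({"potentially_harmful", "fictional_context"}, "needs_review")]
    derived = set(props)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived

acts = {0: 0.9, 7: 0.8, 12: 0.6}          # hypothetical SAE feature activations
print(apply_rules(activated_propositions(acts)))   # includes the two derived propositions
```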
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3742. What Scales in Cross-Entropy Scaling Law?
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Cross-Entropy Loss; Error-Entropy; Neural Scaling Laws; Loss Decomposition; Large Language Models;
The cross-entropy scaling law has long served as a key tool for guiding the development of large language models. It shows that cross-entropy loss decreases at a predictable power-law rate as the model size increases. However, recent evidence indicates that this law breaks down at very large scales: the loss decreases more slowly than expected, which causes significant trouble for developing large language models. In this paper, we hypothesize that the root cause lies in the fact that cross-entropy itself does not truly scale; instead, only one of its hidden components does. To investigate this, we introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We show both theoretically and empirically that this decomposition precisely captures the training dynamics and optimization objectives. Through extensive experiments on multiple datasets and 32 models spanning five orders of magnitude in size, we find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Moreover, error-entropy constitutes the dominant share of cross-entropy in small models but diminishes in proportion as models grow larger. This explains why the cross-entropy scaling law appears accurate at small scales but fails at very large ones. Our findings establish the error-entropy scaling law as a more accurate description of model behavior. We believe it will have wide applications in the training, understanding, and future development of large language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3743. Parameterization-Based Dataset Distillation of 3D Point Clouds through Learnable Shape Morphing
๐ฌ 12
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Dataset Distillation Distilled Dataset Parameterization
Recent attempts in dataset distillation have been made to compress large-scale training datasets into compact synthetic versions, significantly reducing memory usage and training costs. While parameterization-based approaches have shown promising results on image datasets, their application to 3D point clouds remains largely unexplored due to the irregular and unordered nature of 3D data. In this paper, we introduce a parameterization-based dataset distillation framework for 3D point clouds that enables the use of more diverse synthetic samples than conventional methods under the same memory budget. We first construct an initial synthetic dataset containing multiple anchor samples with a coarser resolution than the original samples. We also generate new samples by morphing the shapes of the anchor samples with learnable weights to improve the diversity of the synthetic dataset. Moreover, we devise a uniformity-aware matching loss to ensure structural consistency when comparing the original and synthetic datasets. Extensive experiments conducted on five standard benchmarks (ModelNet10, ModelNet40, ShapeNet, ScanObjectNN, and OmniObject3D) demonstrate that the proposed method effectively optimizes both the synthetic samples and the weights for shape morphing, outperforming existing dataset distillation methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3744. Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
๐ฌ 13
โญ 5.00
๐ learning theory
๐ท๏ธ Deep learning theory feature learning sample complexity scaling laws
Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich feature learning effects revolve around networks with one or two trainable layers or deep linear networks. Furthermore, even under such limiting settings, predictions often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this analytical complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than such exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks, thus extending the scope of first-principle theories of deep learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3745. Operator Learning with Domain Decomposition for Geometry Generalization in PDE Solving
๐ฌ 17
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Operator Domain Decomposition Geometric Generalization
Neural operators have become increasingly popular in solving partial differential equations (PDEs) due to their superior capability to capture intricate mappings between function spaces over complex domains. However, the data-hungry nature of operator learning inevitably poses a bottleneck for their widespread applications. At the core of the challenge lies the absence of transferability of neural operators to new geometries. To tackle this issue, we propose operator learning with domain decomposition, a local-to-global framework to solve PDEs on arbitrary geometries. Under this framework, we devise an iterative scheme, Schwarz Neural Inference (SNI). This scheme allows for partitioning of the problem domain into smaller subdomains, on which local problems can be solved with neural operators, and stitching local solutions to construct a global solution. Additionally, we provide a theoretical analysis of the convergence rate and error bound. We conduct extensive experiments on several representative linear and nonlinear PDEs with diverse boundary conditions and achieve remarkable geometry generalization compared to alternative methods. These analyses and experiments demonstrate the proposed framework's potential in addressing challenges related to geometry generalization and data efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3746. HDR-NSFF: High Dynamic Range Neural Scene Flow Fields
๐ฌ 16
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ High Dynamic Range Dynamic Radiance Fields Scene Flow
Radiance of real-world scenes typically spans a much wider dynamic range than what standard cameras can capture, often leading to saturated highlights or underexposed shadows.
While conventional HDR methods merge alternatively exposed frames, most approaches remain constrained to the 2D image plane, failing to model geometry and motion consistently.
To address these limitations, we present HDR-NSFF, a novel framework for reconstructing dynamic HDR radiance fields from alternatively exposed monocular videos.
Our method explicitly models 3D scene flow, HDR radiance, and tone mapping in a unified end-to-end pipeline. We further enhance robustness by (i) extending semantic-based optical flow with DINO features to achieve exposure-invariant motion estimation, and (ii) incorporating a generative prior as a regularizer to compensate for sparse-view and saturation-induced information loss.
To enable systematic evaluation, we construct a real-world GoPro dataset with synchronized multi-exposure captures.
Experiments demonstrate that HDR-NSFF achieves state-of-the-art performance in novel view and time synthesis, recovering fine radiance details and coherent dynamics even under challenging exposure variations and large motions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3747. StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
๐ฌ 19
โญ 5.00
๐ optimization
๐ท๏ธ Large Language Models Operations Research Self-Evolving Framework Generative Process Supervision
Large Language Models (LLMs) have shown promising capabilities for solving Operations Research (OR) problems.
While reinforcement learning serves as a powerful paradigm for training LLMs on OR problems, existing works generally face two key limitations. First, outcome-based rewards suffer from the $\textit{credit assignment problem}$, where correct final answers can reinforce flawed reasoning.
Second, conventional discriminative process supervision is $\textit{myopic}$, failing to evaluate the interdependent steps of OR modeling holistically.
To this end, we introduce $\textbf{\texttt{StepORLM}}$, a novel self-evolving framework with generative process supervision.
At its core, $\texttt{StepORLM}$ features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other.
This loop is driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver, and nuanced, holistic process evaluation from the GenPRM.
The combined signal is used to align the policy via Weighted Direct Preference Optimization (W-DPO) and simultaneously refine the GenPRM.
Our resulting 8B-parameter $\texttt{StepORLM}$ establishes a new state-of-the-art across six benchmarks, significantly outperforming vastly larger generalist models, agentic methods, and specialized baselines.
Moreover, the co-evolved GenPRM is able to act as a powerful and universally applicable process verifier, substantially boosting the inference scaling performance of both our own model and other existing LLMs.
We release our models and code to facilitate future research (https://anonymous.4open.science/r/StepORLM-BEA9/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3748. WILD-Diffusion: A WDRO Inspired Training Method for Diffusion Models under Limited Data
๐ฌ 28
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion model; Wasserstein Distributionally Robust Optimization; Limited Data
Diffusion models have recently emerged as a powerful class of generative models and have achieved state-of-the-art performance in various image synthesis tasks.
However, training diffusion models generally requires large amounts of data and suffers from overfitting when the dataset size is limited. To address these limitations, we propose a novel method called WILD-Diffusion, which is inspired by Wasserstein Distributionally Robust Optimization (WDRO), an important and elegant mathematical formulation from the robust optimization literature. Specifically, WILD-Diffusion utilizes WDRO to iteratively generate new training samples within a Wasserstein-distance-based uncertainty set centered at the limited-data distribution. This carefully designed method progressively augments the training set throughout the training process and effectively overcomes the obstacles caused by limited data. Moreover, we establish a convergence guarantee for our algorithm, even though combining the diffusion process with WDRO introduces significant theoretical challenges to our analysis. Finally, we conduct a set of experiments to verify the effectiveness of our proposed method. With WILD-Diffusion, we achieve more than a $10$% reduction in FID using only $20$% of the training data across different datasets. Moreover, our method attains state-of-the-art FID with as few as $100$ images, in both pretrained and non-pretrained settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3749. CoLA: Co-Calibrated Logit Adjustment for Long-Tailed Semi-Supervised Learning
๐ฌ 22
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Semi-supervised learning long-tailed learning logit adjustment
Long-tailed semi-supervised learning is hampered by a vicious cycle of confirmation bias, where skewed pseudo-labeling progressively marginalizes tail classes. This challenge is compounded in real-world scenarios by a class distribution mismatch between labeled and unlabeled data, rendering the bias unpredictable and difficult to mitigate. While existing methods adapt Logit Adjustment (LA) using dynamic estimates of the unlabeled distribution, we argue their effectiveness is undermined by two critical limitations stemming from LA's core design, i.e., its class-wise and overall adjustment mechanisms. First, their reliance on simple frequency counting overestimates the prevalence of head classes due to sample redundancy, leading to harmful over-suppression. Second, and more critically, they overlook the interplay between the above two types of adjustment, treating the overall adjustment strength as a fixed hyperparameter. This is a significant oversight, as we empirically find that the optimal strength is highly sensitive to the estimated distribution. To address these limitations, we propose Co-Calibrated Logit Adjustment (CoLA), a framework that co-designs the class-wise and overall LA components. Specifically, CoLA refines the class-wise adjustment by estimating each class's effective sample size via the effective rank of its representations. Subsequently, it formulates the overall adjustment strength as a learnable parameter, which is optimized through a meta-learning procedure on a proxy validation set constructed to mirror the refined distribution. Supported by a theoretical generalization bound, our extensive experiments show that CoLA outperforms existing baselines on $4$ public benchmarks across standard long-tail setups.
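For readers unfamiliar with the Logit Adjustment mechanism CoLA builds on, here is a minimal editor-added sketch of standard LA: class priors, scaled by a strength tau, are added to the logits before the cross-entropy loss. The effective-rank frequency estimate and the learnable overall strength described above are not reproduced; `priors` and `tau` are placeholders.

```python
# Sketch of standard Logit Adjustment (LA); CoLA's refinements are not shown.
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, priors, tau=1.0):
    # logits: (batch, num_classes); priors: (num_classes,) class frequencies summing to 1.
    adjustment = tau * torch.log(priors + 1e-12)
    return F.cross_entropy(logits + adjustment, targets)

logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
priors = torch.ones(10) / 10   # placeholder for the estimated class distribution
print(logit_adjusted_loss(logits, targets, priors).item())
```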
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3750. Personalized Collaborative Learning with Affinity-Based Variance Reduction
๐ฌ 26
โญ 5.00
๐ learning theory
๐ท๏ธ personalized collaborative learning multi-agent systems federated learning heterogeneity personalization stochastic approximation variance reduction
Multi-agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels: gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of $\max\\{n^{-1}, \delta\\}$, where $n$ is the number of agents and $\delta\in[0,1]$ measures their heterogeneity. This *affinity-based* acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3751. Lookup multivariate Kolmogorov-Arnold Networks
๐ฌ 8
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ KAN inference efficiency CUDA kernels
High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0× while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10× higher H100 throughput at equal accuracy. Within the framework of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6–2.1× and by 1.7× on the CIFAR-10 and ImageNet-1k datasets, respectively.
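A minimal editor-added sketch of the lookup-table idea: a trainable function of a low-dimensional input is stored on a grid and evaluated by interpolation, so each evaluation costs only a few multiplications. This 1D piecewise-linear version is purely illustrative; the lmKANs described above use multivariate spline tables with dedicated CUDA kernels.

```python
# Sketch: a trainable 1D function implemented as a lookup table with linear interpolation.
import numpy as np

class LookupFunction1D:
    def __init__(self, lo=-1.0, hi=1.0, num_knots=64, rng=None):
        rng = rng or np.random.default_rng(0)
        self.lo, self.hi = lo, hi
        self.values = rng.normal(size=num_knots)        # trainable table entries
        self.step = (hi - lo) / (num_knots - 1)

    def __call__(self, x):
        x = np.clip(x, self.lo, self.hi)
        idx = (x - self.lo) / self.step
        i0 = np.floor(idx).astype(int)
        i1 = np.minimum(i0 + 1, len(self.values) - 1)
        frac = idx - i0
        return (1 - frac) * self.values[i0] + frac * self.values[i1]

f = LookupFunction1D()
print(f(np.array([-0.3, 0.0, 0.7])))
```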
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3752. Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ RL calibration reasoning uncertainty
When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score, a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations, outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
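An editor-added sketch of one natural instantiation of the reward described above: a binary correctness score penalized by the Brier score of the model's verbalized confidence. The exact weighting used by RLCR is not specified here.

```python
# Sketch: correctness reward augmented with a Brier-score penalty on verbalized confidence.
def rlcr_reward(is_correct, confidence):
    correctness = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - correctness) ** 2   # proper scoring rule, in [0, 1]
    return correctness - brier_penalty

# A calibrated correct answer is rewarded; a confidently wrong answer is penalized.
print(rlcr_reward(True, 0.9))   # 0.99
print(rlcr_reward(False, 0.9))  # -0.81
print(rlcr_reward(False, 0.1))  # -0.01
```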
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3753. ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ Efficient large vision-language model Reinforcement learning Multimodal reasoning Reasoning for efficiency
Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation) that performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3× inference speedup.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3754. TangoFlux: Text to Audio Generation with CLAP-Ranked Preference Optimization
๐ฌ 28
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ text to audio flow matching preference optimization
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in 3.7 seconds on an A40 GPU. A key challenge in aligning TTA models lies in creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We show that the audio preference dataset generated using CRPO outperforms the static alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. https://tangoflux56.github.io/TangoFlux/ holds the model-generated audio samples for comparison.
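An editor-added sketch of the kind of preference-pair construction the abstract describes: several candidate audios are generated for a prompt, ranked by a CLAP text-audio similarity score, and the best and worst become the chosen/rejected pair. `generate_audio` and `clap_score` are placeholder interfaces, not the paper's implementation.

```python
# Sketch: CLAP-ranked preference-pair construction with toy stand-in models.
import random

def build_preference_pair(prompt, generate_audio, clap_score, num_candidates=5):
    candidates = [generate_audio(prompt) for _ in range(num_candidates)]
    ranked = sorted(candidates, key=lambda audio: clap_score(prompt, audio))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

# Toy stand-ins: an "audio" is just a random vector, and the score is its mean.
generate_audio = lambda prompt: [random.random() for _ in range(4)]
clap_score = lambda prompt, audio: sum(audio) / len(audio)
print(build_preference_pair("rain on a tin roof", generate_audio, clap_score))
```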
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3755. Enhancing Communication Compression via Discrepancy-aware Calibration for Federated Learning
๐ฌ 21
โญ 5.00
๐ optimization
๐ท๏ธ Federated Learning; Communication Compression
Federated Learning (FL) offers a privacy-preserving paradigm for distributed model training by enabling clients to collaboratively learn a shared model without exchanging their raw data. However, the communication overhead associated with exchanging model updates remains a critical challenge, particularly for devices with limited bandwidth and battery resources.
Existing communication compression methods largely rely on simple heuristics based on magnitude or randomness.
For example, Top-k drops the elements with small magnitude, while low-rank methods such as ATOMO and PowerSGD truncate singular values with small magnitude.
However, these rules do not account for the discrepancy between the compressed and the original outputs, which can lead to the loss of important information.
To address this issue, we propose a novel discrepancy-aware communication compression method that enhances performance under severely constrained communication conditions.
Each client uses a small subset of its local data as calibration data to directly measure the output discrepancy induced by dropping candidate compression units and uses it as a compression metric to guide the selection.
By integrating this strategy, we can enhance existing mainstream compression schemes, enabling more efficient communication.
Empirical results across multiple datasets and models show that our method achieves a significant improvement in accuracy under stringent communication constraints, notably an $18.9\\%$ relative accuracy improvement at a compression ratio of $0.1$, validating its efficacy for scalable and communication-efficient FL.
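A minimal editor-added sketch contrasting magnitude-based Top-k with the discrepancy-aware idea described above: each candidate unit (here, a row of a weight update) is scored by the output change it would cause on a small calibration batch if dropped. Shapes and names are illustrative; the paper applies the idea to existing compressors such as Top-k and low-rank methods.

```python
# Sketch: rank candidate units by calibration-set output discrepancy vs. plain magnitude.
import numpy as np

rng = np.random.default_rng(0)
update = rng.normal(size=(32, 16))     # local model update (rows = candidate units)
calib = rng.normal(size=(8, 16))       # small calibration batch held by the client
full_out = calib @ update.T

def discrepancy_if_dropped(row_idx):
    pruned = update.copy()
    pruned[row_idx] = 0.0
    return np.linalg.norm(full_out - calib @ pruned.T)

scores = np.array([discrepancy_if_dropped(i) for i in range(update.shape[0])])
keep = 8
kept_by_discrepancy = np.argsort(scores)[-keep:]                       # most output-critical rows
kept_by_magnitude = np.argsort(np.linalg.norm(update, axis=1))[-keep:]  # plain Top-k baseline
print(sorted(kept_by_discrepancy), sorted(kept_by_magnitude))
```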
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3756. Repurposing Synthetic Data for Fine-grained Search Agent Supervision
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Search Agent Web Agent
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
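An editor-added sketch of the dense reward described above: correct rollouts get the full outcome reward, while incorrect rollouts receive partial credit proportional to the fraction of ground-truth entities mentioned in their reasoning. The scaling factor `alpha` and the simple substring matching are assumptions for illustration.

```python
# Sketch: entity-aware partial reward for "near-miss" rollouts.
def entity_aware_reward(is_correct, reasoning, gold_entities, alpha=0.5):
    if is_correct:
        return 1.0
    if not gold_entities:
        return 0.0
    matched = sum(e.lower() in reasoning.lower() for e in gold_entities)
    return alpha * matched / len(gold_entities)

# An incorrect answer that still surfaced 2 of 3 gold entities earns partial credit.
print(entity_aware_reward(False, "The film was directed by Nolan in 2010.",
                          ["Nolan", "2010", "Warner Bros"]))
```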
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3757. Multi-Condition Conformal Selection
๐ฌ 20
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Uncertainty Quantification Conformal Inference False Discovery Control
Selecting high-quality candidates from large-scale datasets is critically important in resource-constrained applications such as drug discovery, precision medicine, and the alignment of large language models. While conformal selection methods offer a rigorous solution with False Discovery Rate (FDR) control, their applicability is confined to single-threshold scenarios (i.e., y > c), overlooking practical needs for multi-condition selection, such as conjunctive or disjunctive conditions. In this work, we propose the Multi-Condition Conformal Selection (MCCS) algorithm, which extends conformal selection to scenarios with multiple conditions. In particular, we introduce a novel nonconformity score with regional monotonicity for conjunctive conditions and a global Benjamini-Hochberg (BH) procedure for disjunctive conditions, thereby establishing finite-sample FDR control with theoretical guarantees. The integration of these components enables the proposed method to achieve rigorous FDR-controlled selection in various multi-condition environments. Extensive experiments validate the superiority of MCCS over baselines, its generalizability across diverse condition combinations, different real-world modalities, and multi-task scalability.
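For context, here is an editor-added sketch of the standard Benjamini-Hochberg step that conformal selection methods build on: given conformal p-values for the candidates, BH selects the largest set whose ordered p-values stay under the FDR budget. The construction of MCCS's region-monotone scores and its global BH variant for disjunctive conditions are not reproduced.

```python
# Sketch: the Benjamini-Hochberg procedure over a list of (conformal) p-values.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])   # largest rank passing the BH condition
    return order[: k + 1]              # indices of selected candidates

p_vals = [0.001, 0.20, 0.03, 0.04, 0.70, 0.005]
print(benjamini_hochberg(p_vals, alpha=0.1))
```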
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3758. Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D scene generation; Text-to-3D scene; Agentic framework; Visual guidance; Physical plausibility
Generating interactive 3D scenes from text requires not only synthesizing assets but arranging them with spatial intelligence: support, affordances, and plausibility. However, training data for interactive scenes is dominated by a few indoor datasets, so learning-based methods overfit to in-distribution layouts and struggle to compose diverse arrangements (e.g., outdoor settings and small-on-large relations). Meanwhile, LLM-based layout planners can propose diverse arrangements, but the lack of visual grounding often yields implausible placements that violate commonsense physics. We propose Scenethesis, a training-free, agentic framework that couples LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first drafts a coarse layout with an LLM; a vision module refines the layout and extracts scene structure to capture inter-object relations. A novel optimization stage enforces pose alignment and physical plausibility, and a final judge verifies spatial coherence and triggers targeted repair when needed. Across indoor and outdoor prompts, Scenethesis produces realistic, relation-rich, and physically plausible 3D interactive scenes, reducing collisions and stability failures compared to SOTA methods, making it practical for virtual content creation, simulation, and embodied AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3759. MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal RAG Vision-Language Agent Benchmark
With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3760. FlowSearcher: Synthesizing Memory-Guided Agentic Workflows for Web Information Seeking
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Model Reasoning Structured Planning Task Adaptability
Web search is a cornerstone for deep research agents, enabling them to acquire and reason over knowledge beyond static corpora. Yet most existing systems follow rigid ReAct-style tool chains locked into fixed workflow structures, which hinders their ability to flexibly handle diverse query types and tool-use strategies. We introduce $\textbf{FlowSearcher}$, a novel web search framework built on agentic workflow synthesis. FlowSearcher decomposes queries into sub-goals, each orchestrated by a tailored workflow graph that adapts the depth and order of tool use, giving the system structural flexibility to handle diverse sub-goals ranging from simple lookups and focused navigation to multi-hop information synthesis. Complementing this, a hierarchical memory distills past workflows into structured experience, providing reusable context that improves orchestration and guides tool use on new queries. This shift from reactive tool calls to memory-driven workflow design and execution marks a principled step toward deliberative web research. Empirical results on GAIA, BrowseComp, and GPQA show that our memory-driven, training-free workflow synthesis consistently matches or exceeds the performance of RLHF-trained systems, pointing toward a new direction of agent design grounded in memory-enhanced structural planning rather than parameter fine-tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3761. Speculative Speculative Decoding
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ inference large language models speculative decoding
Autoregressive decoding is bottlenecked by its *sequential* nature. Speculative decoding has become a standard way to accelerate inference by using a fast *draft model* to predict upcoming tokens from a slower *target model*, and then verifying them *in parallel* with a single target model forward pass. However, speculative decoding itself relies on a *sequential* dependence between speculation and verification. We introduce *speculative speculative decoding* (SSD) to parallelize these operations. While a verification is ongoing, the draft model *predicts* likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, thereby eliminating all speculation overhead. We identify three key challenges presented by speculative speculative decoding, and put forth principled methods to solve each after theoretical analysis. The result is **Saguaro**, an optimized SSD algorithm which is up to twice as fast as optimized speculative decoding baselines and up to 5× faster than autoregressive decoding with open source inference engines. **Saguaro** can be combined with existing methods like EAGLE and token tree speculation for further gains, and permits scaling draft compute to better predict verification outcomes, introducing new tradeoffs between compute and latency.
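An editor-added toy sketch of the sequential speculate-then-verify loop that SSD removes from the critical path. For simplicity it uses greedy acceptance (keep a drafted token only if the target model agrees argmax-wise) rather than the full rejection-sampling rule; `draft_next` and `target_logits_batch` are placeholder model interfaces.

```python
# Sketch: one draft-then-verify step of (simplified, greedy-acceptance) speculative decoding.
import numpy as np

def speculative_step(prefix, draft_next, target_logits_batch, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2) Verify all k positions with a single batched target-model call.
    logits = target_logits_batch(prefix, drafted)        # shape (k, vocab)
    accepted = []
    for i, t in enumerate(drafted):
        if int(np.argmax(logits[i])) == t:
            accepted.append(t)
        else:
            accepted.append(int(np.argmax(logits[i])))   # correct the token and stop
            break
    return accepted

# Toy models over a 5-token vocabulary: the draft repeats the last token,
# the target always continues with (last token + 1) mod 5.
draft_next = lambda ctx: ctx[-1]
def target_logits_batch(prefix, drafted):
    seq = list(prefix) + list(drafted)
    out = np.zeros((len(drafted), 5))
    for i in range(len(drafted)):
        out[i, (seq[len(prefix) + i - 1] + 1) % 5] = 1.0
    return out

print(speculative_step([0, 1], draft_next, target_logits_batch))
```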
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3762. Reasoning-Driven Multimodal LLM for Domain Generalization
๐ฌ 5
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Machine Learning (ML) -> ML: Transfer Domain Adaptation Multi-Task Learning
This paper addresses the domain generalization (DG) problem in deep learning.
While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derive image categories to achieve more robust predictions under domain shift.
To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of DomainBed dataset, in which each sample is paired with class-relevant reasoning chains.
Our analysis reveals two key challenges: (i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; and (ii) mismatches in reasoning patterns between supervision signals and fine-tuned MLLMs lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative).
To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: (i) MTCT (Multi-Task Cross-Training), which introduces an additional direct classification pathway to guide reasoning supervision; and (ii) SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling.
Experiments on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraIncognita) demonstrate that RD-MLDG achieves state-of-the-art performance, highlighting reasoning as a promising complementary signal for robust out-of-domain generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3763. Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
๐ฌ 17
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Sparse Autoencoder Model Interpretability Retrieval-augmented Generation LLM Hallucination RAG Faithfulness
Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3764. HeurekaBench: A Benchmarking Framework for AI Co-scientist
๐ฌ 17
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ agents benchmarks for agents LLMs single-cell biology
LLM-based reasoning models have enabled the development of agentic systems that act as co-scientists, assisting in multi-step scientific analysis. However, evaluating these systems is challenging, as it requires realistic, end-to-end research scenarios that integrate data analysis, interpretation, and the generation of new insights from the experimental data. To address this limitation, we introduce HeurekaBench, a framework to create benchmarks with exploratory, open-ended research questions for experimental datasets. Each such question is grounded in a scientific study and its corresponding code repository, and is created using a semi-automated pipeline that leverages multiple LLMs to extract insights and generate candidate workflows, which are then verified against reported findings. We instantiate the framework in single-cell biology to obtain the sc-HeurekaBench benchmark and use it to compare state-of-the-art single-cell agents. We further showcase the benefits of our benchmark for quantitatively analyzing current design choices in agentic systems. We find that the addition of a critic module can improve ill-formed responses for open-source LLM-based agents by up to 22% and close the gap with their closed-source counterparts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3765. Graph Tokenization for Bridging Graphs and Transformers
๐ฌ 15
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph BPE Tokenizer
The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge.
In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs).
To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens.
Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications.
The proposed approach achieves state-of-the-art results on 12 benchmark datasets and frequently outperforms both graph neural networks and specialized graph transformers. This work bridges the gap between graph-structured data and the ecosystem of sequence models.
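An editor-added sketch of the two ingredients combined above: (i) a reversible serialization of a graph into a token sequence and (ii) standard BPE merges over a corpus of such sequences. The substructure-guided serialization order used in the paper is not reproduced; edges here are simply emitted in sorted order.

```python
# Sketch: serialize graphs as edge-token sequences, then apply a few BPE merges.
from collections import Counter

def serialize(edges):
    # Reversible, order-normalized edge-list serialization: "u-v" tokens.
    return [f"{min(u, v)}-{max(u, v)}" for u, v in sorted(edges)]

def merge_pair(seq, a, b):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + " " + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_merges(corpus, num_merges=3):
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        corpus = [merge_pair(seq, a, b) for seq in corpus]
    return corpus, merges

graphs = [[(0, 1), (1, 2), (2, 3)], [(0, 1), (1, 2), (2, 4)]]
tokenized, merges = bpe_merges([serialize(g) for g in graphs], num_merges=2)
print(merges)
print(tokenized)
```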
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3766. Optimizing Canaries for Privacy Auditing with Metagradient Descent
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ differential privacy auditing metagradient optimization
In this work we study black-box privacy auditing, where the goal is to lower bound the privacy parameter of a differentially private learning algorithm using only the algorithm's outputs (i.e., the final trained model).
For DP-SGD (the most successful method for training differentially private deep learning models), the canonical approach to auditing uses membership inference: an auditor constructs a small set of special "canary" examples, inserts a random subset of them into the training set, and then tries to discern which of the canaries were included in the training set (typically via a membership inference attack).
The auditor's success rate then provides a lower bound on the privacy parameters of the learning algorithm.
Our main contribution is a method for optimizing the auditor's canary set to improve privacy auditing, leveraging recent work on metagradient optimization.
Our empirical evaluation demonstrates that by using such optimized canaries, we can improve empirical lower bounds for differentially private image classification models by over 2x in certain instances.
Furthermore, we demonstrate that our method is transferable and efficient: canaries optimized for non-private SGD with a small model architecture remain effective when auditing larger models trained with DP-SGD.
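An editor-added sketch of how a canary-based audit converts membership-inference success into a privacy lower bound: for an (epsilon, delta)-DP mechanism any attack obeys TPR <= exp(epsilon) * FPR + delta, so observed rates give epsilon >= log((TPR - delta) / FPR). Real audits, including the one above, additionally use confidence intervals on TPR/FPR; that step is omitted, and the operating points below are hypothetical.

```python
# Sketch: empirical epsilon lower bound from a membership-inference attack's TPR/FPR.
import math

def empirical_epsilon_lower_bound(tpr, fpr, delta=1e-5):
    if fpr <= 0 or tpr <= delta:
        return 0.0
    return max(0.0, math.log((tpr - delta) / fpr))

# Better canaries push TPR up at a fixed FPR, which tightens the bound.
print(empirical_epsilon_lower_bound(tpr=0.30, fpr=0.01))
print(empirical_epsilon_lower_bound(tpr=0.60, fpr=0.01))
```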
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3767. Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
๐ฌ 13
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Foundation Models based Robot Manipulation Vision-based Robotics Generative Video Models 6D Pose Estimation
This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks, such as pouring, wiping, and mixing, purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3768. Diversity-Incentivized Exploration for Versatile Reasoning
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM reasoning Reinforcement learning with verifiable rewards efficient exploration diversity
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In this paper, we propose **DIVER** (**D**iversity-**I**ncentivized Exploration for **V**ersatil**E** **R**easoning), an innovative framework that highlights the pivotal role of global sequence-level diversity to incentivize deep exploration for versatile reasoning. We first conduct a primary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve optimal policy invariance and design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations.
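An editor-added sketch of the standard potential-based shaping mechanism the abstract invokes: adding F = gamma * Phi(s') - Phi(s) to the reward leaves the optimal policy unchanged (Ng et al., 1999). Here Phi is a placeholder for the global diversity potential, which is not reproduced.

```python
# Sketch: potential-based reward shaping with a placeholder diversity potential Phi.
def shaped_reward(reward, phi_s, phi_s_next, gamma=1.0):
    return reward + gamma * phi_s_next - phi_s

# Example: a sparse outcome reward of 0 still yields a positive learning signal
# when the rollout moves toward a region the potential rates as more diverse.
print(shaped_reward(reward=0.0, phi_s=0.2, phi_s_next=0.5))
```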
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3769. Anchored Supervised Fine-Tuning
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ SFT
Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost.
Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks.
We provide an analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT.
However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability.
To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT's reweighting with lightweight KL regularization to preserve tightness while ensuring stability.
Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead.
Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.
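An editor-added sketch of the two ingredients discussed above: a DFT-style token loss that reweights log-likelihood by the (stop-gradient) token probability, plus a light KL anchor to a reference model, which is the stabilizing term ASFT adds. The exact weighting and KL formulation of ASFT are assumptions here; `beta` is an assumed coefficient.

```python
# Sketch: probability-reweighted SFT loss with a KL anchor to a reference model.
import torch
import torch.nn.functional as F

def anchored_dft_loss(logits, ref_logits, targets, beta=0.1):
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # (batch, seq)
    weight = token_logp.detach().exp()                                 # p(y_t), no gradient
    dft_term = -(weight * token_logp).mean()
    kl_term = (logp.exp() * (logp - ref_logp)).sum(-1).mean()          # KL(policy || reference)
    return dft_term + beta * kl_term

logits = torch.randn(2, 5, 100, requires_grad=True)
ref_logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print(anchored_dft_loss(logits, ref_logits, targets).item())
```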
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3770. PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks. Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead. However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited. In this work, we demonstrate empirically that Prefix-Tuning underperforms on LLMs because of an inherent tradeoff between the contribution of input prompt and parameterized prefix within the attention head. This motivates us to introduce PrefixMemory-Tuning, an architecture that generalizes the principles of Prefix-Tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself and improving its expressiveness. Our experiments show that, across diverse benchmarks, PrefixMemory-Tuning consistently outperforms existing Prefix-Tuning methods. Notably, it achieves competitive performance with modern PEFTs on several general benchmarks, highlighting a potential extension of Prefix-Tuning approaches to become state-of-the-art. Our findings suggest that by overcoming its inherent limitations, Prefix-Tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3771. Learning Ordinal Probabilistic Reward from Preferences
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reward Modeling Large Language Models RLHF
Reward models are crucial for aligning large language models (LLMs) with human values and intentions.
Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations:
GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation.
To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM).
Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response.
To make this paradigm practical, we present its closed-form, discrete realization: the **Ordinal Probabilistic Reward Model** (OPRM), which discretizes the quality score into a finite set of ordinal ratings.
Building on OPRM, we propose a data-efficient training strategy called **Region Flooding Tuning** (RgFT).
It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions.
Experiments on various reward model benchmarks show that our method improves accuracy by **2.9%–7.4%** compared to prior reward models, demonstrating strong performance and data efficiency.
Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
Our models, data, and code will be released and open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3772. VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution
๐ฌ 15
โญ 5.00
๐ generative models
๐ท๏ธ visual autoregressive model image restoration
Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in the ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2\% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on the DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3773. From "Sure" to "Sorry": Detecting Jailbreak in Large Vision Language Model via JailNeurons
๐ฌ 24
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Vision Language Model Jailbreak Detection
Large Vision-Language Models (LVLMs) are vulnerable to jailbreak attacks that can generate harmful content. Existing detection methods are either limited to detecting specific attack types or are too time-consuming, making them impractical for real-world deployment. To address these challenges, we propose \textbf{JDJN} (\textbf{J}ailbreak \textbf{D}etection via \textbf{J}ail\textbf{N}eurons), a novel jailbreak detection method for LVLMs. Specifically, we focus on \textbf{JailNeurons}, which are key neurons related to jailbreak at each model layer. Unlike the "SafeNeurons", which explain why aligned models can reject ordinary harmful queries, JailNeurons capture how jailbreak prompts circumvent safety mechanisms. They provide an important and previously underexplored complement to existing safety research. We design a neuron localization algorithm to detect these JailNeurons and then aggregate them across layers to train a generalizable detector. Experimental results demonstrate that our method effectively extracts jailbreak-related information from high-dimensional hidden states. As a result, our approach achieves the highest detection success rate with exceptionally low false positive rates. Furthermore, the detector exhibits strong generalizability, maintaining high detection success rates across unseen benign datasets and attack types. Finally, our method is computationally efficient, with low training costs and fast inference speeds, highlighting its potential for real-world deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3774. ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
๐ฌ 43
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Large Language Model Knowledge Editing Multi-hop Factual Recall Mechanism Interpretability
LLMs require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi-hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic that prior KE work has overlooked. Guided by this insight, we propose ACE (Attribution-Controlled Knowledge Editing), a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3775. Astra: General Interactive World Model with Autoregressive Denoising
๐ฌ 12
โญ 5.00
๐ generative models
๐ท๏ธ world model video generation
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3776. More Than What Was Chosen: LLM-based Explainable Recommendation Beyond Noisy User Preferences
๐ฌ 18
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ LLM-based Recommendation Rationale Revealed Preference Explainable Recommender
Recommender systems traditionally rely on the principle of Revealed Preference (RP), which assumes that observed user behaviors faithfully reflect underlying interests.
While effective at scale, this assumption is fragile in practice, as real-world choices are often noisy and inconsistent.
Thus, even LLM-based recommendation models (LLM-Rec) equipped with advanced reasoning capabilities may fail to capture genuine user preferences and often produce rationales of limited persuasiveness.
To address this issue, we introduce the concept of Coherent Preference (CP), which complements RP by favoring items that are logically and causally coherent with user interaction history.
Building on this perspective, we propose Conflict-Aware Direct Preference Optimization (C-APO), an LLM-Rec framework that jointly optimizes RP and CP while adaptively reconciling their agreement and conflict, delivering robust recommendation performance and logically consistent rationales.
We construct a unified ordering approach that combines the RP signal, based on chosen versus unobserved items, with the CP signal, which ranks items by their logical consistency with past interaction history.
In this unified preference ordering, we dynamically adjust the influence of each signal depending on whether RP and CP agree or conflict, allowing the model to better capture user intent and generate more plausible recommendations.
On the Amazon Review dataset, our approach consistently outperforms approximately 20 state-of-the-art baseline models in both recommendation performance and rationale quality, achieving a 1.65$\times$ relative improvement in click-through rate during deployment, thereby demonstrating its practical utility.
The code and dataset are available at https://anonymous.4open.science/r/C-APO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3777. UltraGauss: Ultrafast Gaussian Reconstruction of 3D Ultrasound Volumes
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Ultrasound 3D Reconstruction Gaussian
Ultrasound imaging is widely used due to its safety, affordability, and real-time capabilities, but its 2D interpretation is highly operator-dependent, leading to variability and increased cognitive demand.
We present $\textbf{UltraGauss}$: an ultrasound-specific Gaussian Splatting framework that serves as an efficient approximation to acoustic image formation. Unlike projection-based splatting, UltraGauss renders by $\textit{probe-plane intersection}$ with in-plane aggregation, aligning with plane-based echo sampling while remaining fast and memory-efficient. A stable parameterisation and compute-aware GPU rasterisation make this method practical at scale. On clinical datasets, UltraGauss delivers state-of-the-art 2D-to-3D reconstructions in minutes on a single GPU (reaching 0.99 SSIM within $\sim$20 minutes), and a clinical expert survey rates its reconstructions the most realistic among competing methods. To our knowledge, this is the first Gaussian Splatting approach tailored to ultrasound 2D-to-3D reconstruction. Code will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3778. Unifying Stable Optimization and Reference Regularization in RLHF
๐ฌ 18
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ RLHF LLM Alignment
Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and stable optimization. Current solutions independently address these issues through separate regularization strategies, specifically a KL-divergence penalty against a supervised fine-tuned model ($\pi_0$) to mitigate reward hacking, and policy ratio clipping towards the current policy ($\pi_t$) to promote stable alignment. However, the implicit trade-off arising from simultaneously regularizing towards both $\pi_0$ and $\pi_t$ remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, which demonstrably improves both alignment results and implementation complexity. Extensive experiments across diverse benchmarks validate that our method consistently outperforms RLHF and online preference learning methods, achieving enhanced alignment performance and stability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3779. Reliable Fine-Grained Evaluation of Natural Language Math Proofs
๐ฌ 28
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ automated proof evaluation; LLM-as-a-judge; LLM-generated math proofs; rubric-guided grading; prompt optimization; expert-annotated proof dataset; evaluator reliability; reward modeling
Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers while generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap.
To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs.
To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-Pro, o3, and DeepSeek-R1.
Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow.
Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines.
Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14/7, closing 78\% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3780. Temporal Slowness in Central Vision Drives Semantic Object Learning
๐ฌ 21
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ bio-inspired learning human vision egocentric learning self-supervised learning
Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations in humans. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects.
These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience. Our code will be made public upon acceptance.
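Temporal-slowness learning of this kind is commonly implemented as a time-contrastive InfoNCE objective in which a central-vision crop and a crop from a nearby frame form a positive pair. The sketch below assumes PyTorch and precomputed crop embeddings; it illustrates the general objective rather than the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_t, z_tp, temperature=0.1):
    """InfoNCE-style loss where the positive for each central-vision crop at
    time t is the crop from a temporally adjacent frame (t + delta).

    z_t, z_tp: (B, D) embeddings of anchor crops and their temporal positives.
    """
    z_t = F.normalize(z_t, dim=-1)
    z_tp = F.normalize(z_tp, dim=-1)
    logits = z_t @ z_tp.T / temperature              # (B, B) cosine similarities
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)
```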
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3781. Plan-Answer-Refine-on-Graph: Structured Planning and Self-Refinement for Large Language Model Reasoning on Knowledge Graphs
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Knowledge Graphs Large Language Models Question Answering
Incorporating knowledge graphs (KGs) into large language model (LLM) reasoning has shown promise in alleviating hallucinations and factual errors. Although existing paradigms of KG-augmented LLMs have achieved encouraging results, they still exhibit notable limitations when handling multi-hop reasoning and complex logical queries: (1) search space truncation bias: current methods generate linear entity-relation reasoning paths, which can prune correct candidates prematurely during iterative exploration; and (2) entity error amplification: existing methods typically follow the retrieve-and-answer paradigm which causes LLMs to over-rely on retrieved evidence, exacerbating the impact of incorrect entities during reasoning. To alleviate the existing challenges, we propose Plan-Answer-Refine-on-Graph (PARoG), a novel framework for LLM reasoning on knowledge graphs. First, PARoG leverages SPARQL queries from KG data as references, decomposing them into structured step-by-step plans. We further train LLMs to construct such structured plans, which improves the logical consistency of reasoning, ensures uniform step granularity, and facilitates effective execution on the graph. Second, during reasoning over KGs, PARoG adopts a plan-answer-refine paradigm: the model first attempts to answer each sub-query independently, and then refines its prediction by integrating evidence retrieved from the KG. This process mitigates knowledge conflicts between LLM and KG, substantially reducing hallucinations. Experimental results on multiple KG reasoning benchmarks demonstrate that PARoG significantly outperforms state-of-the-art approaches, achieving especially superior accuracy on multi-hop and logically complex queries.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3782. Graph-based Nearest Neighbors with Dynamic Updates via Random Walk-Based Analysis
๐ฌ 14
โญ 5.00
๐ optimization
๐ท๏ธ nearest neighbor search graph random walk
Approximate nearest neighbor search (ANN) is a common way to retrieve relevant search results, especially now in the context of large language models and retrieval-augmented generation. One of the most widely used algorithms for ANN is based on constructing a multi-layer graph over the dataset, called the Hierarchical Navigable Small World (HNSW). While this algorithm supports insertion of new data, it does not support deletion of existing data. Moreover, deletion algorithms described by prior work come at the cost of increased query latency, decreased recall, or prolonged deletion time. In this paper, we propose a new theoretical framework for graph-based ANN based on random walks. We then utilize this framework to analyze a randomized deletion approach that preserves hitting time statistics compared to the graph before deleting the point. We then turn this theoretical framework into a \emph{deterministic} deletion algorithm, and show that it provides a better tradeoff between query latency, recall, deletion time, and memory usage through an extensive collection of experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3783. TEDM: Time Series Forecasting with Elucidated Diffusion Models
๐ฌ 12
โญ 5.00
๐ generative models
๐ท๏ธ Score-based generative models Diffusion models Stochastic Differential Equations Time-series forecasting
Score-based generative modeling through differential equations has driven breakthroughs in high-fidelity image synthesis, offering modular model design and efficient sampling. However, this success has not yet been widely translated to time-series forecasting. This gap stems from the sequential nature of time series, in contrast to the unordered structure of images. Here, we extend the theoretical formulation used for images to explicitly address sequential structures. We propose a diffusion-based forecasting framework (TEDM) that adapts score estimation to temporal settings and elucidates its design space. Such a design allows empirical computation of noise and signal scaling directly from data, avoiding external schedules. Notably, this reduces sampling complexity to linear in the forecast horizon. Without elaborate preprocessing, TEDM sets new state-of-the-art results on multiple forecasting benchmarks. These results illustrate the growing potential of diffusion models beyond vision. TEDM generates low-latency forecasts using a lightweight architecture, making it ideal for real-time deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3784. Seeing What's Wrong: A Trajectory-Guided Approach to Caption Error Detection
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image-Caption Alignment Error Detection Caption Trajectory
Error detection is critical for enhancing multimodal dataset reliability and downstream model performance. Existing error filters, while increasingly powerful, typically rely on a single similarity score per image-caption pair. This is limiting: captions with subtle errors (e.g., mislabeled objects, incorrect colors, or negations) can still score highly, while correct but imprecisely worded captions may score poorly. To address this, we introduce the notion of a caption trajectory: an ordered sequence of captions produced by iteratively editing a caption to maximize an image-text relevance score. This trajectory carries rich signals for error detection. Correct captions typically stabilize after minor edits, while erroneous captions undergo substantial improvements. Building on these insights, we introduce TRACED, a cost-efficient and model-agnostic framework that leverages trajectory statistics for more accurate caption error detection. Beyond detection, TRACED also serves as an interpretable tool for identifying the origins of errors. We further demonstrate that, in the case of error correction, this interpretable token-level error information can be provided to VLMs to enhance the alignment scores of the generated captions. On MS COCO and Flickr30k, TRACED achieves up to a 2.8% improvement in accuracy for error detection across three noise types.
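A hypothetical sketch of the trajectory idea, assuming a black-box relevance scorer and a caption editor (both `relevance` and `edit_candidates` are assumed interfaces, not the paper's components): greedily edit the caption while the score improves, then summarize the trajectory with simple statistics such as total score gain.

```python
def caption_trajectory(image, caption, relevance, edit_candidates, max_steps=5):
    """Greedily edit a caption to increase an image-text relevance score.

    relevance(image, caption) -> float and edit_candidates(caption) -> list[str]
    are assumed interfaces; the paper's actual editor and scorer may differ.
    Returns the ordered sequence of (caption, score) pairs.
    """
    traj = [(caption, relevance(image, caption))]
    for _ in range(max_steps):
        best_cap, best_score = traj[-1]
        for cand in edit_candidates(best_cap):
            s = relevance(image, cand)
            if s > best_score:
                best_cap, best_score = cand, s
        if best_cap == traj[-1][0]:      # trajectory has stabilized
            break
        traj.append((best_cap, best_score))
    return traj

def trajectory_features(traj):
    """Simple trajectory statistics: total score gain and number of accepted
    edits. Large gains suggest the original caption contained an error."""
    scores = [s for _, s in traj]
    return {"score_gain": scores[-1] - scores[0], "num_edits": len(traj) - 1}
```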
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3785. Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models
๐ฌ 32
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Remote Sensing Geospatial AI Vision Language Model
Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Reward Policy Optimization (GRPO) to refine the modelโs reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3786. Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design
๐ฌ 22
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Biomolecular Design Diffusion Models
We address the problem of fine-tuning diffusion models for reward-guided generation in biomolecular design. While diffusion models have proven highly effective in modeling complex, high-dimensional data distributions, real-world applications often demand more than high-fidelity generation, requiring optimization with respect to potentially non-differentiable reward functions such as physics-based simulation or rewards based on scientific knowledge. Although RL methods have been explored to fine-tune diffusion models for such objectives, they often suffer from instability, low sample efficiency, and mode collapse due to their on-policy nature. In this work, we propose an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our method casts the problem as policy distillation: it collects off-policy data during the roll-in phase, simulates reward-based soft-optimal policies during roll-out, and updates the model by minimizing the KL divergence between the simulated soft-optimal policy and the current model policy. Our off-policy formulation, combined with KL divergence minimization, enhances training stability and sample efficiency compared to existing RL-based methods. Empirical results demonstrate the effectiveness and superior reward optimization of our approach across diverse tasks in protein, small molecule, and regulatory DNA design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3787. Soft-Di[M]O: Improved one-step Image Discrete Model
๐ฌ 17
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Distillation; Discrete Diffusion; Mask Diffusion; One-step Generation
One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis.
However, they suffer two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO).
In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution.
Soft embeddings preserve representation fidelity for the one-step discrete generator and provide a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders while introducing minimal bias.
Integrating soft embeddings into the Di[M]O \citep{zhu2025di} distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO.
Empirically, across multiple MDM teachers (e.g., MaskBit \citep{weber2024maskbit}, MaskGen \citep{kim2025democratizing}), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, GenEval \citep{ghosh2023geneval} and HPS \citep{wu2023human} scores on text-to-image that exceed the teacher's after reward fine-tuning, and further gains from TTEO.
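The soft-embedding relaxation itself is compact: instead of committing to discrete tokens, the generator's categorical output distribution is used to form expected token embeddings, which keeps the pipeline differentiable. A minimal sketch, assuming PyTorch; the temperature argument is an illustrative extra, not necessarily part of Soft-Di[M]O.

```python
import torch

def soft_embeddings(logits, embedding_table, temperature=1.0):
    """Replace hard token outputs with the expected embedding under the
    generator's categorical output distribution, keeping gradients intact.

    logits:          (B, L, V) one-step generator outputs over the vocabulary
    embedding_table: (V, D) token embeddings of the teacher / decoder
    returns:         (B, L, D) differentiable continuous surrogate tokens
    """
    probs = torch.softmax(logits / temperature, dim=-1)   # (B, L, V)
    return probs @ embedding_table                        # expected embedding
```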
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3788. Constraint-guided Hardware-aware NAS through Gradient Modification
๐ฌ 16
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Neural Architecture Search Hardware-aware NAS Constraint-aware Optimization Edge Machine Learning
Neural Architecture Search (NAS), particularly gradient-based techniques, has proven highly effective in automating the design of neural networks. Recent work has extended NAS to hardware-aware settings, aiming to discover architectures that are both accurate and computationally efficient. Many existing methods integrate hardware metrics into the optimization objective as regularization terms, which introduces differentiability requirements and hyperparameter tuning challenges. This can either result in overly penalizing resource-intensive architectures or architectures failing to meet the hardware constraints of the target device. To address these challenges, we propose ConNAS, a novel gradient-based NAS framework that enforces hardware constraints directly through gradient modification. This approach eliminates the need for differentiable hardware metrics and regularization weights. The novelty in ConNAS lies in modifying gradients with respect to architectural choices, steering the search away from infeasible architectures while ensuring constraint satisfaction. Evaluations on the NATS-Bench benchmark demonstrate that ConNAS consistently discovers architectures that meet the imposed hardware constraints while achieving performance within just 0.14% of the optimal feasible architecture. Additionally, in a practical deployment scenario, ConNAS outperforms handcrafted architectures by up to 1.55% in accuracy under tight hardware budgets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3789. Group Verification-based Policy Optimization for Interactive Coding Agents
๐ฌ 12
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model Tool Learning Reinforcement Learning
Recent advancements in reinforcement learning from verifiable rewards (RLVR), particularly through Group Relative Policy Optimization (GRPO), have significantly improved the capabilities of large language models (LLMs) for interactive coding agents.
However, these methods overlook process-verifiable environment feedback (e.g., code execution failures), leading to inaccurate advantage estimation at each reasoning step and insufficient learning.
To address this issue, we propose Group Verification-based Policy Optimization (GVPO), a novel RL algorithm that introduces an advantage shaping framework integrating both outcome-verifiable and process-verifiable signals.
While outcome-verifiable rewards ensure alignment with long-term task objectives, process-verifiable feedback derived from intermediate execution traces (e.g., syntax errors, runtime exceptions) serves as corrective shaping terms at the step level.
By jointly leveraging these two forms of verifiability, GVPO achieves more accurate credit assignment, balancing short-term process guidance with long-term outcome alignment.
This unified formulation yields more stable optimization, faster convergence, and stronger generalization in complex interactive environments.
A 32B-parameter agent trained with GVPO in the AppWorld environment outperforms OpenAI's o1 agent by 12.6\% on the more challenging Test-C split and surpasses the strongest 32B RL-trained state-of-the-art baseline by 3.6\%.
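An illustrative sketch of the advantage-shaping idea, under the assumption that outcome rewards are normalized group-relatively (GRPO-style) and that step-level execution feedback enters as an additive corrective term; the paper's exact shaping function may differ.

```python
import numpy as np

def shaped_advantages(outcome_rewards, process_flags, beta=0.5):
    """Hypothetical advantage shaping in the spirit of GVPO: a group-relative
    outcome advantage is shared by all steps of a rollout and corrected by
    step-level process-verifiable feedback such as execution failures.

    outcome_rewards: (G,) final verifiable reward per rollout in the group
    process_flags:   list of G arrays, process_flags[i][t] = 1.0 if step t of
                     rollout i executed cleanly, 0.0 on syntax/runtime errors
    """
    r = np.asarray(outcome_rewards, dtype=float)
    group_adv = (r - r.mean()) / (r.std() + 1e-8)         # GRPO-style baseline
    shaped = []
    for adv, flags in zip(group_adv, process_flags):
        flags = np.asarray(flags, dtype=float)
        # penalize steps whose execution feedback signals an error
        shaped.append(adv + beta * (flags - 1.0))
    return shaped
```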
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3790. Fairness via Independence: A General Regularization Framework for Machine Learning
๐ฌ 4
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Bias Mitigation Statistical Independence Fairness in Machine Learning
Fairness in machine learning has emerged as a central concern, as predictive models frequently inherit or even amplify biases present in training data. Such biases often manifest as unintended correlations between model outcomes and sensitive attributes, leading to systematic disparities across demographic groups. Existing approaches to fair learning largely fall into two directions: incorporating fairness constraints tailored to specific definitions, which limits their generalizability, or reducing the statistical dependence between predictions and sensitive attributes, which is more flexible but highly sensitive to the choice of distance measure. The latter strategy in particular raises the challenge of finding a principled and reliable measure of dependence that can perform consistently across tasks. In this work, we present a general and model-agnostic approach to address this challenge. The method is based on encouraging independence between predictions and sensitive features through an optimization framework that leverages the Cauchy-Schwarz (CS) Divergence as a principled measure of dependence. Prior studies suggest that CS Divergence provides a tighter theoretical bound compared to alternative distance measures used in earlier fairness methods, offering a stronger foundation for fairness-oriented optimization. Our framework, therefore, unifies prior efforts under a simple yet effective principle and highlights the value of carefully chosen statistical measures in fair learning. Through extensive empirical evaluation on four tabular datasets and one image dataset, we show that our approach consistently improves multiple fairness metrics while maintaining competitive accuracy.
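One common way to turn Cauchy-Schwarz dependence into a differentiable regularizer is the kernel-based CS quadratic mutual information estimator from information-theoretic learning. The sketch below uses that standard estimator with an RBF kernel; the paper's exact estimator and bandwidth rule may differ.

```python
import torch

def rbf_gram(x, sigma=1.0):
    """RBF Gram matrix for a batch of (possibly 1-D or categorical) variables."""
    x = x.reshape(x.shape[0], -1).float()
    d2 = torch.cdist(x, x) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def cs_qmi(pred, sensitive, sigma=1.0):
    """Kernel estimator of Cauchy-Schwarz quadratic mutual information between
    predictions and a sensitive attribute (a standard ITL estimator, used here
    as a plausible stand-in for the paper's regularizer). A value of zero
    corresponds to statistical independence, so the quantity can be added to
    the task loss as a fairness penalty."""
    n = pred.shape[0]
    K, L = rbf_gram(pred, sigma), rbf_gram(sensitive, sigma)
    joint = (K * L).sum() / n**2                     # estimates \int p(x,y)^2
    marg = K.sum() * L.sum() / n**4                  # estimates \int p(x)^2 p(y)^2
    cross = (K.sum(dim=1) * L.sum(dim=1)).sum() / n**3   # cross term
    return torch.log(joint) + torch.log(marg) - 2 * torch.log(cross)
```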
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3791. All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation
๐ฌ 23
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Tensor Decomposition Vision-and-Language Navigation Lifelong Learning
Deploying vision-and-language navigation (VLN) agents requires adaptation across diverse scenes and environments, but fine-tuning on a specific scenario often causes catastrophic forgetting in others, which severely limits flexible long-term deployment. We formalize this challenge as the all-day multi-scenes lifelong VLN (AML-VLN) problem. Existing parameter-efficient adapters (e.g., LoRA and its variants) are limited by their two-dimensional matrix form, which fails to capture the multi-hierarchical navigation knowledge spanning multiple scenes and environments. To address this, we propose Tucker Adaptation (TuKA), which represents the multi-hierarchical navigation knowledge as a high-order tensor and leverages Tucker decomposition to decouple the knowledge into shared subspaces and scenario-specific experts. We further introduce a decoupled knowledge incremental learning strategy to consolidate shared subspaces while constraining specific experts for decoupled lifelong learning. Building on TuKA, we also develop a VLN agent named AlldayWalker, which continually learns across multiple navigation scenarios, achieving all-day multi-scenes navigation. Extensive experiments show that AlldayWalker consistently outperforms state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3792. Consis-GCPO: Consistency-Preserving Group Causal Preference Optimization for Vision Customization
๐ฌ 28
โญ 5.00
๐ generative models
๐ท๏ธ Multi-Subject Personalized Generation Diffusion Model Reinforcement Learning
Subject-driven generation faces a fundamental challenge: achieving high subject fidelity while maintaining semantic alignment with textual descriptions. While recent GRPO-based approaches have shown promise in aligning generative models with human preferences, they apply uniform optimization across all denoising timesteps, ignoring the temporal dynamics of how textual and visual conditions influence generation. We present Consis-GCPO, a causal reinforcement learning framework that reformulates multi-modal condition generation through discrete-time causal modeling. Our key insight is that different conditioning signals exert varying influence throughout the denoising process: text guides semantic structure in early steps, while visual references anchor details in later stages. By introducing decoupled causal intervention trajectories, we quantify instantaneous causal effects at each timestep, transforming these measurements into temporally-weighted advantages for targeted optimization. This approach enables precise tracking of textual and visual contributions, ensuring accurate credit assignment for each conditioning modality. Extensive experiments demonstrate that Consis-GCPO significantly advances personalized generation, achieving superior subject consistency while preserving strong text-following capabilities, particularly excelling in complex multi-subject scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3793. $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization
๐ฌ 20
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ grokking gradient dynamics generalization memorization modular addition scaling laws
While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of features will emerge, how and under which conditions this happens, while staying closely connected with the gradient dynamics of training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) \underline{\textbf{L}}azy learning, (II) \underline{\textbf{i}}ndependent feature learning and (III) \underline{\textbf{i}}nteractive feature learning. At the lazy learning stage, the top layer overfits to the random hidden representations and the model appears to memorize. During lazy learning, the \emph{backpropagated gradient} $G_F$ from the top layer carries information about the target label, with a specific structure that enables each hidden node to learn its representation \emph{independently}. Interestingly, the independent dynamics follows exactly the \emph{gradient ascent} of an energy function $\mathcal{E}$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of feature emergence, memorization and generalization, and reveals, from the first principles of gradient dynamics, why recent optimizers such as Muon can be effective. Our analysis can be extended to multi-layer architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3794. Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures
๐ฌ 18
โญ 5.00
๐ learning theory
๐ท๏ธ In-context learning Gaussian Mixture Models Theory
The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the ability to implicitly learn an internal model at inference time is widely believed to play a key role in the understanding of pre-trained large language models. However, most recent work has focused on supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored.
This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem through the lens of statistical estimation.
We propose a transformer-based learning framework called Transformer for Gaussian Mixture Models (TGMM) that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, while at the same time exhibiting reasonable robustness to distribution shifts.
Theoretically, we prove that transformers can efficiently approximate both the Expectation-Maximization (EM) algorithm and a core component of spectral methods, namely cubic tensor power iterations. These results not only improve upon prior work on approximating the EM algorithm,
but also provide, to our knowledge, the first theoretical guarantee that transformers can approximate high-order tensor operations.
Our study bridges the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3795. CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design
๐ฌ 11
โญ 5.00
๐ generative models
๐ท๏ธ Text-to-Image generation Controllable Image Generation Diffusion Models
Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (\eg images, layouts, and texts), which pose multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3796. What happens when generative AI models train recursively on each others' outputs?
๐ฌ 13
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ model collapse generative AI
The internet serves as a common source of training data for generative AI (genAI) models but is increasingly populated with AI-generated content. This duality raises the possibility that future genAI models may be trained on other models' generated outputs. Prior work has studied consequences of models training on their own generated outputs, but limited work has considered what happens if models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding such data-mediated model interactions is critical. This work provides empirical evidence for how data-mediated interactions might unfold in practice, develops a theoretical model for this interactive training process, and experimentally validates the theory. We find that data-mediated interactions can benefit models by exposing them to novel concepts perhaps missed in original training data, but also can homogenize their performance on shared tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3797. Conformalized Hierarchical Calibration for Uncertainty-Aware Adaptive Hashing
๐ฌ 23
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Adaptive Hashing Retrieval Hashing Retrieval Unsupervised Domain Adaptation
Unsupervised domain adaptive hashing transfers knowledge from labeled source domains to unlabeled target domains, addressing domain shift challenges in real-world retrieval tasks. Existing methods face two critical limitations: target domain noise severely misleads model training, and indiscriminate domain alignment strategies treat all target samples equally, potentially distorting essential feature structures. We propose an uncertainty-aware adaptive hashing approach that addresses these challenges through a hierarchical conformal calibration framework. At the semantic level, we employ conformal inference to generate confidence prediction sets, replacing single pseudo-labels with set-based predictions whose sizes directly quantify sample reliability for weighted pseudo-label learning and domain alignment. This enables the model to focus on reliable samples while suppressing noise. At the representation level, we predict the stability of individual hash bits, where bit-level confidence guides a robust weighted quantization loss and enables dynamic weighted Hamming distance during retrieval, fundamentally enhancing hash code quality and retrieval robustness. Through this hierarchical calibration mechanism, our method achieves more adaptive and robust cross-domain knowledge transfer. Extensive experiments on multiple benchmark datasets demonstrate significant improvements over existing approaches, validating the effectiveness and superiority of our method.
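The semantic-level idea, turning conformal prediction-set sizes into sample reliability weights, can be sketched with a plain split-conformal procedure. This is an illustrative simplification (single calibration split, inverse set size as the weight), not the paper's full hierarchical calibration.

```python
import numpy as np

def conformal_reliability_weights(cal_probs, cal_labels, tgt_probs, alpha=0.1):
    """Split-conformal sketch: calibrate a nonconformity threshold on held-out
    source data, form prediction sets for target samples, and turn set sizes
    into reliability weights, where smaller sets mean more confident pseudo-labels.

    cal_probs: (n, C) softmax outputs on calibration data with labels cal_labels
    tgt_probs: (m, C) softmax outputs on unlabeled target data
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]     # nonconformity scores
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)                         # calibrated threshold
    pred_sets = (1.0 - tgt_probs) <= q                     # (m, C) membership
    set_sizes = np.maximum(pred_sets.sum(axis=1), 1)
    return 1.0 / set_sizes                                 # reliability weights
```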
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3798. Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data
๐ฌ 4
โญ 5.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian models amortized inference robust inference self-consistency semi-supervised learning
Amortized Bayesian inference (ABI) with neural networks can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, ABI is not yet sufficiently robust for widespread and safe application. When performing inference on observations outside the scope of the simulated training data, posterior approximations are likely to become highly biased, which cannot be corrected by additional simulations due to the bad pre-asymptotic behavior of current neural posterior estimators. In this paper, we propose a semi-supervised approach that enables training not only on labeled simulated data generated from the model, but also on unlabeled data originating from any source, including real data. To achieve this, we leverage Bayesian self-consistency properties that can be transformed into strictly proper losses that do not require knowledge of ground-truth parameters. We test our approach on several real-world case studies, including applications to high-dimensional time-series and image data. Our results show that semi-supervised learning with unlabeled data drastically improves the robustness of ABI in the out-of-simulation regime. Notably, inference remains accurate even when evaluated on observations far away from the labeled and unlabeled data seen during training.
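A minimal sketch of a Bayesian self-consistency penalty on unlabeled observations: for several posterior draws, the likelihood-plus-prior-minus-posterior log-density should agree on the same marginal likelihood, so its spread across draws is penalized. The exact loss form and number of draws here follow common practice and are assumptions, not the paper's specification.

```python
import torch

def self_consistency_loss(log_lik, log_prior, log_post):
    """For K parameter draws theta_k ~ q(theta | y) on an unlabeled observation
    y, the quantity log p(y|theta_k) + log p(theta_k) - log q(theta_k|y)
    estimates the same marginal likelihood log p(y), so its variance across
    draws is a self-consistency penalty that requires no ground-truth theta.

    log_lik, log_prior, log_post: (K,) tensors evaluated at the K draws.
    """
    log_marginal = log_lik + log_prior - log_post
    return log_marginal.var()
```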
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3799. Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
๐ฌ 30
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models (LLMs) Reinforcement Learning RLVR Math Reasoning Diversity
RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce \underline{\textbf{R}}andom P\underline{\textbf{o}}licy \underline{\textbf{V}}aluation for Div\underline{\textbf{e}}rse \underline{\textbf{R}}easoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both \textbf{quality} (\textbf{+8.2} on pass@1, \textbf{+16.8} on pass@256) and \textbf{diversity} (\textbf{+20.5\%}), despite its radical simplification compared to strong, complicated existing methods.
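The core sampling rule can be stated in a few lines: draw the next action from a softmax over Q-values of a fixed uniformly random policy. How those Q-values are estimated at scale is the substance of the method; the estimate here is assumed to be given, so this sketch only illustrates the decision rule.

```python
import numpy as np

def rover_policy(q_uniform, temperature=1.0):
    """Minimal sketch of the ROVER sampling rule: actions (next tokens) are
    drawn from a softmax over Q-values of a *fixed uniformly random* policy,
    bypassing generalized policy iteration. Estimating q_uniform (e.g., from
    rollouts under the base model) is assumed here, not prescribed.

    q_uniform: (A,) estimated uniform-policy Q-values for the candidate actions
    """
    z = (q_uniform - q_uniform.max()) / temperature
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(q_uniform), p=probs)
```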
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3800. Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts
๐ฌ 8
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Test-Time Training Domain Adaptation Electronic Health Records Invasive Mechanical Ventilation Prediction
Accurate prediction of the need for invasive mechanical ventilation (IMV) in intensive care unit (ICU) patients is crucial for timely interventions and resource allocation. However, variability in patient populations, clinical practices, and electronic health record (EHR) systems across institutions introduces domain shifts that degrade the generalization performance of predictive models during deployment. Test-Time Training (TTT) has emerged as a promising approach to mitigate such shifts by adapting models dynamically during inference without requiring labeled target-domain data. In this work, we introduce Adaptive Test-Time Training (AdaTTT), an enhanced TTT framework tailored for EHR-based IMV prediction in ICU settings. We begin by deriving information-theoretic bounds on the test-time prediction error and demonstrate that it is constrained by the uncertainty between the main and auxiliary tasks. To enhance their alignment, we introduce a self-supervised learning framework with pretext tasks: reconstruction and masked feature modeling optimized through a dynamic masking strategy that emphasizes features critical to the main task. Additionally, to improve robustness against domain shifts, we incorporate prototype learning and employ Partial Optimal Transport (POT) for flexible, partial feature alignment while maintaining clinically meaningful patient representations. Experiments across multi-center ICU cohorts demonstrate competitive classification performance on different test-time adaptation benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3801. Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
๐ฌ 17
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning Behavior Cloning Flow Matching
Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks.
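A sketch of the weighted behavior-cloning component under common assumptions (linear-interpolation flow-matching path, exponential advantage weights with clipping, an assumed `velocity_net(states, x_t, t)` signature); GFP's actual coupling between the multi-step flow policy and the distilled one-step actor is richer than this.

```python
import torch

def weighted_flow_bc_loss(velocity_net, states, actions, advantages, beta=3.0):
    """Illustrative value-weighted flow-matching behavior-cloning loss:
    dataset actions with higher critic advantage get larger cloning weight,
    so the flow policy imitates high-value transitions rather than all
    state-action pairs equally.
    """
    b = actions.shape[0]
    t = torch.rand(b, 1, device=actions.device)        # flow time in [0, 1]
    noise = torch.randn_like(actions)
    x_t = (1 - t) * noise + t * actions                # linear interpolation path
    target_v = actions - noise                         # conditional target velocity
    pred_v = velocity_net(states, x_t, t)              # assumed network signature
    per_sample = ((pred_v - target_v) ** 2).mean(dim=-1)
    weights = torch.clamp(torch.exp(beta * advantages), max=100.0).detach()
    return (weights * per_sample).mean()
```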
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3802. MotionGPT3: Human Motion as a Second Modality
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3d motion text-driven motion generation text-to-motion human motion synthesis motion caption
With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizes optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2$\times$ faster convergence in training loss and up to 4$\times$ faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3803. GNN-as-Judge: Unleashing the Power of LLMs for Graph Few-shot Semi-supervised Learning with GNN Feedback
๐ฌ 19
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Large Language Models Graph Neural Networks Graph Few-shot Semi-supervised Learning
Large Language Models (LLMs) have shown strong performance on text-attributed graphs (TAGs) due to their superior semantic understanding ability on textual node features. However, their effectiveness as predictors in the few-shot semi-supervised setting, where labeled nodes are rather limited, remains constrained since fine-tuning LLMs usually requires sufficient labeled data, especially when the TAG shows complex structural patterns. In essence, this paper targets two key challenges: (i) the difficulty of generating and selecting reliable pseudo labels on TAGs for LLMs, and (ii) the need to mitigate potential label noise when fine-tuning LLMs with pseudo labels. To counter the challenges, we propose a new framework, GNN-as-Judge, which can unleash the power of LLMs for few-shot semi-supervised learning on TAGs by incorporating the structural inductive bias of Graph Neural Networks (GNNs). Specifically, GNN-as-Judge introduces a collaborative pseudo-labeling strategy that first identifies the most influenced unlabeled nodes from labeled nodes, then exploits both the agreement and disagreement patterns between LLMs and GNNs to generate reliable labels. Furthermore, we develop a weakly-supervised LLM fine-tuning algorithm that can distill the knowledge from informative pseudo labels while mitigating the potential label noise. Experiments on different TAG datasets demonstrate that GNN-as-Judge significantly outperforms existing methods, especially under low-resource regimes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3804. UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
๐ฌ 17
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Dataset Agent Information Seeking
Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel multi-agent framework that incorporates dual-mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small $\sim$30B-parameter backbone LLM optimized using SFT and RFT training strategies, UIS-Digger sets a strong baseline at 26.36\%, outperforming systems integrating sophisticated LLMs such as O3 and GPT-4.1. This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information-seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information-seeking systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3805. Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space
๐ฌ 19
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Semantic Navigation Natural Language Processing Human Cognition Text Embeddings
Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, we bridge cognitive modeling with learned representations and establish a pipeline for quantifying semantic representation dynamics, with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3806. Random Label Prediction Heads for Studying and Controlling Memorization in Deep Neural Networks
๐ฌ 11
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Memorization Random Labels Overfitting Generalization Regularization
We introduce a straightforward yet effective method to empirically measure and regularize memorization in deep neural networks for classification tasks.
Our approach augments each training sample with auxiliary random labels, which are then predicted by a random label prediction head (RLP-head).
RLP-heads can be attached at arbitrary depths of a network, predicting random labels from the corresponding intermediate representation and thereby enabling analysis of how memorization capacity evolves across layers.
By interpreting the RLP-head performance as an empirical estimate of Rademacher complexity, we obtain a direct measure of both sample-level memorization and model capacity.
We leverage this random label accuracy metric to analyze generalization and overfitting in different models and datasets.
Building on this approach, we further propose a novel regularization technique based on the output of the RLP-head, which demonstrably reduces memorization.
Interestingly, our experiments reveal that reducing memorization can either improve or impair generalization, depending on the dataset and training setup.
These findings challenge the traditional assumption that overfitting is equivalent to memorization and suggest new hypotheses to reconcile these seemingly contradictory results.
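A minimal sketch of a random label prediction head, assuming PyTorch; the layer size, the number of random classes, and the way the auxiliary loss is combined with the task loss are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RLPHead(nn.Module):
    """Random label prediction head: each training sample is paired with a
    fixed random auxiliary label, and a small head attached to an intermediate
    representation tries to predict it. Its accuracy serves as an empirical
    memorization / capacity probe, and its loss can be monitored (to measure)
    or added as a penalty with a small coefficient (to regularize memorization).
    feature_dim is the dimensionality of the flattened intermediate features.
    """

    def __init__(self, feature_dim, num_random_classes=100):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_random_classes)

    def forward(self, features, random_labels):
        logits = self.head(features.flatten(1))
        loss = nn.functional.cross_entropy(logits, random_labels)
        acc = (logits.argmax(dim=1) == random_labels).float().mean()
        return loss, acc

# usage sketch: assign each training sample a fixed random label once, e.g.
# rlp_labels = torch.randint(0, 100, (num_train_samples,))
```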
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3807. PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra
๐ฌ 28
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Personality Control Representation Engineering Model Steering Inference-Time Adaptation Compositionality
Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appear as extractable, approximately orthogonal directions in the model's representation space that support algebraic operations. The framework operates through three stages: Persona-Base extracts orthogonal trait vectors via contrastive activation analysis; Persona-Algebra enables precise control through vector arithmetic (scalar multiplication for intensity, addition for composition, subtraction for suppression); and Persona-Flow achieves context-aware adaptation by dynamically composing these vectors during inference. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates. On our proposed Persona-Evolve benchmark for dynamic personality adaptation, we achieve up to 91% win rates across diverse model families. These results provide evidence that aspects of LLM personality are mathematically tractable, opening new directions for interpretable and efficient behavioral control.
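The extraction and steering steps can be sketched as plain activation arithmetic. Here `encode` is an assumed helper that returns a hidden activation for a prompt at a chosen layer, and the orthogonalization across traits described in the abstract is omitted; this illustrates contrastive activation steering in general, not the full PERSONA pipeline.

```python
import torch

@torch.no_grad()
def extract_trait_vector(model, layer, pos_prompts, neg_prompts, encode):
    """Contrastive activation sketch: the trait direction is the mean
    activation difference between prompts exhibiting a trait and its opposite.

    encode(model, layer, prompt) -> (D,) activation at `layer` for `prompt`
    is an assumed helper, not a real library call.
    """
    pos = torch.stack([encode(model, layer, p) for p in pos_prompts]).mean(0)
    neg = torch.stack([encode(model, layer, p) for p in neg_prompts]).mean(0)
    v = pos - neg
    return v / v.norm()

def steer(hidden, trait_vectors, alphas):
    """Vector algebra at inference: add scaled trait directions to the hidden
    state (negative alpha suppresses a trait, sums compose several traits)."""
    for v, a in zip(trait_vectors, alphas):
        hidden = hidden + a * v
    return hidden
```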
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3808. RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning
๐ฌ 17
โญ 5.00
๐ generative models
๐ท๏ธ Reasoning; T2I; RL
Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results. Code: \url{https://anonymous.4open.science/r/RePrompt-CD21}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3809. FrugalRAG: Less is More in RL Finetuning for Multi-hop Question Answering
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-Hop RAG Efficiency Reasoning SLMs
Reinforcement learning (RL) based on the final answer's reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains, often trailing supervised or prompting-only baselines. Instead, we argue that a viable path for RL in multi-hop QA is to use test-time scaling judiciously, for optimizing both the final answer accuracy and the efficiency in reaching that answer.
We propose FrugalRAG, a two-stage finetuning framework that adaptively _reduces_ the number of retrieval steps based on a question's difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches requiring 10$\times$ more data, our method achieves competitive performance with only ~1,000 examples. On HotPotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency-accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it surpasses SLM-based and other baselines after training on only 200 examples. These results demonstrate the use of RL, not to increase reasoning steps but to reduce them, as an effective solution for scalable, efficient RAG.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3810. Matting Anything 2: Towards Video Matting for Anything
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Matting
Video matting is a crucial task for many applications, but existing methods face significant limitations. They are often domain-specific, focusing primarily on human portraits, and rely on a first-frame mask, which is challenging to acquire for transparent or intricate objects like fire or smoke. To address these challenges, we introduce Matting Anything 2 (MAM2), a versatile and robust video matting model that handles diverse objects using flexible user prompts such as points, boxes, or masks. We first propose Promptable Dual-mode Decoder (PDD), an effective structure that simultaneously predicts a segmentation mask and a corresponding high-quality trimap, leveraging trimap-based guidance to improve generalization. To tackle prediction instability for transparent objects across video frames, we further propose a Memory-Separable Siamese (MSS) mechanism. MSS employs a recurrent approach that isolates trimap prediction from potentially interfering mask memory, significantly enhancing temporal consistency. To validate our method's performance on diverse objects, we introduce the Natural Video Matting dataset, a new benchmark with substantially greater diversity. Extensive experiments show that MAM2 possesses exceptional matting accuracy and generalization capabilities. We believe MAM2 demonstrates a significant leap forward in creating a video matting method for anything.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3811. Advancing End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-Training
๐ฌ 16
โญ 5.00
๐ generative models
๐ท๏ธ Image Generation Diffusion models Pixel-space generation
Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap.
In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models.
In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution.
In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models.
Our framework achieves state-of-the-art empirical results on ImageNet. Specifically, our diffusion model reaches an FID of 1.58 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods and VAE-based counterparts by a large margin in both generation quality and training efficiency. In a direct efficiency comparison, our model significantly outperforms DiT while using only around 30\% of its training compute.
Furthermore, our one-step consistency model achieves an impressive FID of 8.82 on ImageNet-256, significantly outperforming its latent-space counterparts. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3812. Transitive RL: Value Learning via Divide and Conquer
๐ฌ 13
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning goal-conditioned rl offline rl
In this work, we present Transitive Reinforcement Learning (TRL), a new value learning algorithm based on a divide-and-conquer paradigm. TRL is designed for offline goal-conditioned reinforcement learning (GCRL) problems, where the aim is to find a policy that can reach any state from any other state in the smallest number of steps. TRL converts a triangle inequality structure present in GCRL into a practical divide-and-conquer value update rule. This has several advantages compared to alternative value learning paradigms. Compared to temporal difference (TD) methods, TRL suffers less from bias accumulation, as in principle it only requires $O(\log T)$ recursions (as opposed to $O(T)$ in TD learning) to handle a length-$T$ trajectory. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming. Experimentally, we show that TRL achieves the best performance in highly challenging, long-horizon benchmark tasks compared to previous offline GCRL algorithms.
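The triangle-inequality update behind the divide-and-conquer rule is easiest to see in a tabular sketch: pairwise steps-to-go estimates are tightened by relaxing through intermediate waypoints, which roughly halves the effective horizon with each composition. TRL itself operates on learned value functions rather than an explicit table; the code below is only the tabular analogue.

```python
import numpy as np

def transitive_value_update(dist, num_sweeps=10):
    """Tabular sketch of the divide-and-conquer idea: goal-conditioned costs
    obey a triangle inequality, so a pairwise cost estimate can be tightened
    by relaxing through intermediate waypoints.

    dist: (S, S) current estimate of steps-to-go between all state pairs,
          initialized with 1 for observed transitions, 0 on the diagonal,
          and a large constant elsewhere.
    """
    for _ in range(num_sweeps):
        # dist[s, g] <- min_w dist[s, w] + dist[w, g]
        dist = np.minimum(dist, (dist[:, :, None] + dist[None, :, :]).min(axis=1))
    return dist
```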
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3813. Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness
๐ฌ 25
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Spurious Correlation Representation Learning Embedding Regularization Domain Generalization
Deep learning models achieve strong performance across various domains but often rely on spurious correlations, making them vulnerable to distribution shifts. This issue is particularly severe in subpopulation shift scenarios, where models struggle in underrepresented groups. While existing methods have made progress in mitigating this issue, their performance gains are still constrained. They lack a theoretical motivation connecting the embedding-space representations with worst-group error.
To address this limitation, we propose Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness (SCER), a novel approach that directly regularizes feature representations to suppress spurious cues. We theoretically show that worst-group error is influenced by how strongly the classifier relies on spurious versus core directions, as identified from differences in group-wise mean embeddings across domains and classes.
By imposing theoretical constraints at the embedding level, SCER encourages models to focus on core features while reducing sensitivity to spurious patterns. Through systematic evaluation on multiple vision and language tasks, we show that SCER outperforms prior state-of-the-art methods in worst-group accuracy. Our code is available at \href{https://anonymous.4open.science/r/SCER_repo/}{https://anonymous.4open.science/r/SCER}.
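A minimal sketch of the kind of embedding-level regularization described above, assuming the spurious direction is estimated from group-wise mean embedding differences within each class; SCER's exact estimator and penalty are not reproduced here.

```python
import torch

def spurious_direction(emb, group, label):
    """Estimate a spurious direction from group-wise mean embeddings.

    Within each class, the difference between group-mean embeddings points
    along the spurious attribute rather than the core (class) signal; we
    average these per-class differences. The exact estimator used by SCER
    is an assumption here.
    """
    dirs = []
    for y in label.unique():
        groups = group[label == y].unique()
        means = [emb[(label == y) & (group == g)].mean(dim=0) for g in groups]
        if len(means) >= 2:
            dirs.append(means[0] - means[1])
    v = torch.stack(dirs).mean(dim=0)
    return v / v.norm()

def scer_penalty(emb, v_spur):
    """Penalize the energy of embeddings along the estimated spurious direction."""
    return (emb @ v_spur).pow(2).mean()

# usage inside a training step (names are illustrative):
# loss = task_loss + lam * scer_penalty(features, v_spur.detach())
```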
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3814. Any-step Generation via N-th Order Recursive Consistent Velocity Field Estimation
๐ฌ 4
โญ 5.00
๐ generative models
๐ท๏ธ Generative Models
Recent advances in few-step generative models (typically $1$-$8$ steps), such as consistency models, have yielded impressive performance. However, their broader adoption is hindered by significant challenges, including substantial computational overhead, the reliance on complex multi-component loss functions, and intricate multi-stage training strategies that lack end-to-end simplicity. These limitations impede their scalability and stability, especially when applied to large-scale models.
To address these issues, we introduce **$N$-th order Recursive Consistent velocity field estimation for Generative Modeling (RCGM)**, a novel framework that unifies many existing approaches. Within this framework, we reveal that conventional one-step methods, such as consistency and MeanFlow models, are special cases of 1st-order RCGM. This insight enables a natural extension to higher-order scenarios ($N \geq 2$), which exhibit markedly improved training stability and achieve state-of-the-art (SOTA) performance.
For instance, on ImageNet $256\times256$, RCGM enables a $675\text{M}$ parameter diffusion transformer to achieve a $1.48$ FID score in just $2$ sampling steps. Crucially, RCGM facilitates the stable full-parameter training of a large-scale ($20\textrm{B}$) unified multi-modal model, attaining a $0.86$ GenEval score in $2$ steps. In contrast, conventional 1st-order approaches, such as consistency and MeanFlow models, typically suffer from training instability, model collapse, or memory constraints under comparable settings.
*Code will be publicly available.*
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3815. Beyond Student: An Asymmetric Network for Neural Network Inheritance
๐ฌ 21
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Knowledge Distillation
Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap often limits the performance of student networks. Inspired by the expressiveness of pretrained teacher networks, a compelling research question arises: is there a type of network that can not only inherit the teacher's structure but also maximize the inheritance of its knowledge? Furthermore, how does the performance of such an inheriting network compare to that of student networks, all benefiting from the same teacher network? To further explore this question, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher's weights and reconstructs a lightweight yet expressive network without significant architectural disruption. By leveraging Singular Value Decomposition (SVD) for initialization to ensure the inheritance of principal knowledge, InherNet effectively balances depth, width, and compression efficiency. Experimental results across unimodal and multimodal tasks demonstrate that InherNet achieves higher performance compared to student networks of similar parameter sizes. Our findings reveal a promising direction for future research in efficient model compression beyond traditional distillation.
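The SVD-based initialization step can be illustrated with a short sketch; the factor shapes, their placement in the inheriting network, and any subsequent finetuning are assumptions, as the abstract only states that SVD is used to inherit principal knowledge.

```python
import torch

def svd_lowrank_inherit(W_teacher, rank):
    """Initialize an asymmetric low-rank factorization W ~ A @ B from a teacher
    weight matrix via truncated SVD, so the inheriting layer starts from the
    teacher's principal components (illustrative sketch only).
    """
    U, S, Vh = torch.linalg.svd(W_teacher, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    A = U[:, :rank] * sqrt_s           # (out_dim, rank)
    B = sqrt_s[:, None] * Vh[:rank]    # (rank, in_dim)
    return A, B

W = torch.randn(512, 768)              # hypothetical teacher linear layer
A, B = svd_lowrank_inherit(W, rank=64)
print((W - A @ B).norm() / W.norm())   # relative error of the rank-64 reconstruction
```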
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3816. From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance
๐ฌ 22
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Humanoid locomotion; Reinforcement Learning; Motion Generation
Natural language offers a natural interface for humanoid robots, but existing text-to-motion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion design further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision-language-action humanoid systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3817. Robust LLM Unlearning via Post Judgment and Multi-round Thinking
๐ฌ 30
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Unlearning; Adversarial Robustness; AI Safety
The unlearning capability of LLMs is vital for ensuring compliance and safety, especially when removing sensitive knowledge from deployed models. Pre-filtering methods, enabling rapid deployment without parameter changes, are a prominent unlearning approach. However, they exhibit significant robustness deficiencies against adversarial attacks: in the worst case, simple prefix attacks can induce up to a 1,150-fold surge in information leakage for fictitious entity knowledge, while composite question attacks can cause accuracy on hazardous knowledge to rebound from the 25% random-guess baseline to as high as 67.0%. To address this, we propose a new unlearning framework via post judgment and multi-round thinking (PoRT), which consists of three key modules. First, a data cleaning module compiles a dynamic few-shot prompt that instructs the LLM to simultaneously generate both a cleaned version of the user's query and a corresponding initial response, supported by an extensible demonstration library for adaptive defense. Second, unlike existing pre-filtering methods that typically judge based solely on prompts, our post-judgment module jointly evaluates cleaned prompts and their corresponding responses to better detect non-compliant outputs. Finally, a selective multi-round thinking process is employed to trigger the LLM's self-correction for low-confidence outputs, enhancing reliability and result quality. Extensive experiments on benchmarks demonstrate PoRT's superior robustness against adversarial attacks and strong unlearning effectiveness without compromising general model utility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3818. RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model
๐ฌ 35
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Compression Pruning
In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs).
LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space.
In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable.
Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights.
To overcome this difficulty, we update the pruned parameters under a rotation constraint.
This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs.
Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult.
Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model.
By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner.
In the experiments, we apply the proposed method to Llama-7B and Llama-2-13B, and evaluate it on WikiText-2 and multiple language understanding benchmarks.
The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.
Codes are available at \url{https://anonymous.4open.science/r/anonymous-llm-pruning-D884/}.
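The rotation-constrained least-squares step described above is closely related to the orthogonal Procrustes problem, which has a closed-form SVD solution; whether RCPU applies the rotation to activations or folds it into weights is not stated, so the sketch below only illustrates the constrained fitting step under that assumption.

```python
import numpy as np

def rotation_compensation(X_pruned, Y_orig):
    """Orthogonal-Procrustes sketch of a rotation-constrained update:
    find the rotation R minimizing ||X_pruned @ R - Y_orig||_F subject to
    R^T R = I, so norms and inner products of the representation are preserved.
    """
    M = X_pruned.T @ Y_orig              # cross-covariance on the calibration set
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt                        # closest orthogonal matrix

# toy check: the rotation is recovered exactly when outputs differ by a rotation
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
R = rotation_compensation(X, X @ Q)
print(np.allclose(R, Q, atol=1e-6))
```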
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3819. Dual Distillation for Few-Shot Anomaly Detection
๐ฌ 26
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ anomaly detection few-shot learning knowledge distillation
Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D$^2$4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D$^2$4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3820. Riemannian Federated Learning via Averaging Gradient Streams
๐ฌ 16
โญ 5.00
๐ optimization
๐ท๏ธ Riemannian federated learning Averaging gradient streams Partial participation Heterogeneity data Riemannian distributed optimization
Federated learning (FL) as a distributed learning paradigm has a significant advantage in addressing large-scale machine learning tasks.
In the Euclidean setting, FL algorithms have been extensively studied with both theoretical and empirical success. However, few works investigate federated learning algorithms in the Riemannian setting. In particular, critical challenges such as partial participation and data heterogeneity among agents have not been explored in the Riemannian federated setting. This paper presents and analyzes a Riemannian FL algorithm, called RFedAGS, based on averaging gradient streams, a new efficient server aggregation scheme that can simultaneously handle partial participation and data heterogeneity. We theoretically show that the proposed RFedAGS achieves global convergence with a sublinear rate under decaying step sizes, and converges sublinearly/linearly to a neighborhood of a stationary point/solution under fixed step sizes. These analyses rest on a vital and non-trivial assumption induced by partial participation, which is shown to hold with high probability. Extensive experiments conducted on synthetic and real-world data demonstrate the good performance of RFedAGS.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3821. Improving Semantic Proximity in English-Centric Information Retrieval through Cross-Lingual Alignment
๐ฌ 12
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Cross-Lingual Alignment Information Retrieval Multilingual Embedding Cross-Lingual Information Retrieval
With the increasing accessibility and utilization of multilingual documents, Cross-Lingual Information Retrieval (CLIR) has emerged as an important research area. Conventionally, CLIR tasks have been conducted under settings where the language of documents differs from that of queries, and typically, the documents are composed in a single coherent language. In this paper, we highlight that in such a setting, the cross-lingual alignment capability may not be evaluated adequately. Specifically, we observe that, in a document pool where English documents coexist with another language, most multilingual retrievers tend to prioritize unrelated English documents over the related document written in the same language as the query. To rigorously analyze and quantify this phenomenon, we introduce various scenarios and metrics designed to evaluate the cross-lingual alignment performance of multilingual retrieval models. Furthermore, to improve cross-lingual performance under these challenging conditions, we propose a novel training strategy aimed at enhancing cross-lingual alignment. Using only a small dataset consisting of 2.8k samples, our method significantly improves the cross-lingual retrieval performance while simultaneously mitigating the English inclination problem. Extensive analyses demonstrate that the proposed method substantially enhances the cross-lingual alignment capabilities of most multilingual embedding models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3822. SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
๐ฌ 16
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ jailbreak attack multi-turn reinforcement learning large language model
Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average ASR@1 of 80.1% across three closed-source and open-source victim models on AdvBench, 33.9% above the previous SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3823. Measurement Score-Based Diffusion Model
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ diffusion models generative models inverse problems learning from measurements learning without ground-truth
Diffusion models have achieved remarkable success in tasks ranging from image generation to inverse problems. However, training diffusion models typically requires clean ground-truth images, which are unavailable in many applications. We introduce the Measurement Score-based diffusion Model (MSM), a novel framework that learns partial measurement scores directly from noisy and subsampled measurements.
By aggregating these scores in expectation, MSM synthesizes fully sampled measurements without requiring access to clean images.
To make this practical, we develop a stochastic sampling variant of MSM that approximates the expectation efficiently and analyze its asymptotic equivalence to the exact formulation. We further extend MSM to posterior sampling for linear inverse problems, enabling accurate image reconstruction directly from partial scores.
Experiments on natural images and multi-coil MRI demonstrate that MSM achieves state-of-the-art performance in unconditional generation and inverse problem solving---all while being trained exclusively on degraded measurements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3824. Causality ≠ Invariance: Function vs Concept Vectors in LLMs
๐ฌ 19
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability LLMs attention heads in-context learning concept invariance
Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We introduce Concept Vectors (CVs) which produce more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the head selection is optimized via Representational Similarity Analysis (RSA) to encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3825. Faster Vision Transformers with Adaptive Patches
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ efficient vision vision transformers
Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones.
APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40\% on ViT-L and 50\% on ViT-H while maintaining downstream performance.
It can be applied to a previously fine-tuned ViT and converges in as little as 1 epoch.
It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30\% faster training and inference in visual QA, object detection, and semantic segmentation.
We will release all code and trained models.
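A minimal sketch of the content-adaptive patch assignment idea: the concrete splitting criterion used by APT is not given in the abstract, so per-patch pixel variance and the specific threshold below are assumptions.

```python
import torch
import torch.nn.functional as F

def assign_patch_sizes(img, base=16, thresh=0.02):
    """Keep each coarse 2*base patch as a single token if its pixel variance is
    low (homogeneous region); otherwise split it into four base-size patches.
    Illustrative criterion only; APT's actual rule may differ.
    """
    coarse = 2 * base
    x = img.unsqueeze(0)
    var = F.avg_pool2d(x ** 2, coarse) - F.avg_pool2d(x, coarse) ** 2
    var = var.mean(dim=1).squeeze(0)          # per-coarse-patch variance, (H/coarse, W/coarse)
    split = var > thresh                      # True -> use four smaller patches
    n_tokens = int((~split).sum()) + int(split.sum()) * 4
    return split, n_tokens

img = torch.rand(3, 224, 224)
split, n_tokens = assign_patch_sizes(img)
print(n_tokens, "tokens vs", (224 // 16) ** 2, "for uniform 16x16 patches")
```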
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3826. AP-OOD: Attention Pooling for Out-of-Distribution Detection
๐ฌ 16
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ out-of-distribution detection attention pooling nlp language models
Out-of-distribution (OOD) detection, which maps high-dimensional data into a scalar OOD score, is critical for the reliable deployment of machine learning models. A key challenge in recent research is how to effectively leverage and aggregate token embeddings from language models to obtain the OOD score. In this work, we propose AP-OOD, a novel OOD detection method for natural language that goes beyond simple average-based aggregation by exploiting token-level information. AP-OOD is a semi-supervised approach that flexibly interpolates between unsupervised and supervised settings, enabling the use of limited auxiliary outlier data. Empirically, AP-OOD sets a new state of the art in OOD detection for text: in the unsupervised setting, it reduces the FPR95 (false positive rate at 95% true positives) from 27.77% to 5.91% on XSUM summarization, and from 75.19% to 68.13% on WMT15 En-Fr translation.
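For intuition, here is a minimal attention-pooling module of the kind described: a learned query aggregates token embeddings and a linear head maps the pooled vector to a scalar score. AP-OOD's actual objective, and how it mixes unsupervised and auxiliary-outlier signals, is not reproduced here.

```python
import torch
import torch.nn as nn

class AttentionPoolOOD(nn.Module):
    """Learned-query attention pooling over token embeddings, followed by a
    linear OOD score head (illustrative sketch, not the paper's exact model)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens, mask=None):          # tokens: (B, T, D)
        att = tokens @ self.query / tokens.shape[-1] ** 0.5
        if mask is not None:
            att = att.masked_fill(~mask, float("-inf"))
        w = att.softmax(dim=-1)                     # (B, T) token weights
        pooled = torch.einsum("bt,btd->bd", w, tokens)
        return self.score(pooled).squeeze(-1)       # one scalar score per sequence

scores = AttentionPoolOOD(dim=768)(torch.randn(4, 128, 768))
print(scores.shape)  # torch.Size([4])
```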
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3827. Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models AI Safety Jailbreaks Guardrails Frozen Model adaptation
As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose Sysformer, a transformer model that updates an initial system prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on 5 LLMs from different families and 2 recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, leading to up to an 80% gain in the refusal rate on harmful prompts while enhancing compliance with safe prompts by up to 90%. Results also generalize well to sophisticated jailbreaking attacks, making LLMs up to 100% more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3828. CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models mathematical reasoning conceptual understanding fine-tuning knowledge distillation robustness
Large language models (LLMs) often solve drill-style math exercises yet fail to apply the concept correctly when the problem requires genuine understanding. Popular outcome-based RL pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than concept selection and use. We introduce $\textit{CORE}$ (Concept-Oriented REinforcement), an algorithm-agnostic training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions while missing concept-linked quizzes, quantifying the conceptual reasoning gap. $\textit{CORE}$ then (i) synthesizes additional concept-aligned quizzes, (ii) injects concept snippets into rollouts, and (iii) reinforces trajectories that correctly apply the injected concept while constraining drift with a lightweight divergence penalty; the procedure is compatible with standard policy-gradient methods (e.g., GRPO). On a 7B-class model, $\textit{CORE}$ yields consistent gains over a vanilla baseline and reinforcement-only training across in-domain concept-exercise suites and diverse out-of-domain math benchmarks (GSM8K, SVAMP, MAWPS, SAT-Math, OlympiadBench, Gaokao, Minerva-Math, CounterMath, TheoremQA). Improvements are largest on concept-heavy categories while maintaining or modestly improving drill performance. $\textit{CORE}$ demonstrates that concept-injected, outcome-regularized rollouts supply the missing fine-grained supervision needed to bridge drill competence and true conceptual reasoning, without committing to a particular RL algorithm or certain process-based verifiers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3829. Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks
๐ฌ 25
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Information Retrieval Relevant Assessment Benchmark
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. Code and data will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3830. Fracture-GS: Dynamic Fracture Simulation with Physics-Integrated Gaussian Splatting
๐ฌ 23
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ 3D vision Physics-based Simulation
This paper presents a unified framework for simulating and visualizing dynamic fracture phenomena in extreme mechanical collisions using multi-view image inputs. While existing methods primarily address elastic deformations at contact surfaces, they fail to capture the complex physics of extreme collisions, often producing non-physical artifacts and material adhesion at fracture interfaces. Our approach integrates two key innovations: (1) an enhanced Collision Material Point Method (Collision-MPM) with momentum-conserving interface forces derived from normalized mass distributions, which effectively eliminates unphysical adhesion in fractured solids; and (2) a fracture-aware 3D Gaussian continuum representation that enables physically plausible rendering without post-processing. The framework operates through three main stages: First, performing implicit reconstruction of collision objects from multi-view images while sampling both surface and internal particles and simultaneously learning surface particle Gaussian properties via splatting; Second, high-fidelity collision resolution using our improved Collision-MPM formulation; Third, dynamic fracture tracking with Gaussian attribute optimization for fracture surfaces rendering. Through comprehensive testing, our framework demonstrates significant improvements over existing methods in handling diverse scenarios, including homogeneous materials, heterogeneous composites, and complex multi-body collisions. The results confirm superior physical accuracy, while maintaining computational efficiency for rendering.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3831. SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models; Supervised Finetuning; Domain-specific SFT
Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) using a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is further desired, adopt TALR as an effective strategy.
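The abstract does not spell out TALR's weighting rule, so the sketch below only shows the generic shape of a token-adaptive reweighted SFT loss; the source of the per-token weights (for example, a statistic derived from the base model's confidence on each target token) is an assumption left as an input.

```python
import torch
import torch.nn.functional as F

def token_adaptive_loss(logits, labels, token_weights):
    """Reweight the per-token cross-entropy of SFT by externally supplied
    token weights (shape (B, T)); the weighting rule itself is not specified
    here and must be provided by the caller.
    """
    ce = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels)
    w = token_weights / token_weights.sum().clamp_min(1e-8)
    return (w * ce).sum()

# usage sketch (uniform weights recover plain SFT):
B, T, V = 2, 8, 100
logits, labels = torch.randn(B, T, V), torch.randint(0, V, (B, T))
loss = token_adaptive_loss(logits, labels, torch.ones(B, T))
```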
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3832. Learning Self-Critiquing Mechanisms for Region-Guided Chest X-Ray Report Generation
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ radiology report generation x-ray report generation self-critiquing mechanism
Automatic radiology reporting assists radiologists in diagnosing abnormalities in radiology images, where grounding the automatic diagnosis in abnormality locations is important for report interpretability. However, existing supervised-learning methods tend to learn superficial statistical correlations between images and reports, lacking the multi-faceted reasoning needed to critique the relevant regions on which radiologists would focus. Recently, self-critical reasoning has been investigated in test-time scaling approaches to alleviate LLM hallucinations, at the cost of increased time complexity. In this work, we focus on chest X-ray report generation with particular emphasis on clinical accuracy, and instead introduce self-critical reasoning into the model architecture and training objective, which is preferable for real-time automatic reporting systems. In particular, three types of self-critical reasoning are proposed to critique the hypotheses of grounded abnormalities against i) alternative abnormalities, ii) an alternative patient's X-ray image, and iii) potential false-negative abnormalities. To realize this, we propose a novel Radiology Self-Critiquing Reporting (RadSCR) framework, which constructs abnormality proposals for each localized abnormality region and verifies them with the proposed self-critiquing mechanisms. The critiqued abnormality proposals are then integrated to generate the complete report with an interpretable diagnostic process. Our experiments show that RadSCR achieves state-of-the-art performance in grounded report generation and diagnosis critiquing, demonstrating its effectiveness in generating clinically accurate reports.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3833. PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient Finetuning of Large Language Models;LoRA;
Low-Rank Adaptation is a widely used finetuning method for large models. Its small memory footprint allows practitioners to adapt large models to specific tasks at a fraction of the cost of full finetuning. Different modifications have been proposed to enhance its efficiency by, for example, setting the learning rate, the rank, and the initialization. Another improvement axis is the adapter placement strategy: when using LoRA, practitioners usually pick \emph{module types} to adapt with LoRA, such as Query and Key modules. Few works have studied the problem of adapter placement, with inconclusive results: the original LoRA paper suggested placing adapters in attention modules, while other works suggested placing them in MLP modules. Through an intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a lightweight method that automatically identifies the module types where LoRA adapters should be placed, given a pretrained model and a finetuning task. We demonstrate that PLoP consistently outperforms, or in the worst case matches, commonly used placement strategies through comprehensive experiments on supervised finetuning and reinforcement learning for reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3834. High Probability Bounds for Non-Convex Stochastic Optimization with Momentum
๐ฌ 12
โญ 5.00
๐ learning theory
๐ท๏ธ Momentum nonconvex learning generalization
Stochastic gradient descent with momentum (SGDM) is widely used in machine learning, yet high-probability learning bounds for SGDM in non-convex settings remain scarce. In this paper, we provide high-probability convergence bounds and generalization bounds for SGDM. First, we establish such bounds for the gradient norm in the general non-convex case. The resulting convergence bounds are tighter than existing theoretical results, and to the best of our knowledge, the obtained generalization bounds are the first ones for SGDM. Next, under the Polyak-{\L}ojasiewicz condition, we derive bounds for the function-value error instead of the gradient norm, and the corresponding learning rates are faster than in the general non-convex case. Finally, by additionally assuming a mild Bernstein condition on the gradient, we obtain even sharper generalization bounds whose learning rates can reach $\widetilde{\mathcal{O}}(1/n^2)$ in the low-noise regime, where $n$ is the sample size. Overall, we provide a systematic study of high-probability learning bounds for non-convex SGDM.
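For reference, the SGDM update analyzed here is the standard heavy-ball recursion; the sketch below shows it in its plain form, without the specific step-size schedules or noise assumptions used in the bounds.

```python
import torch

def sgdm_step(params, grads, bufs, lr=0.01, beta=0.9):
    """Standard SGD-with-momentum update:
        m_t = beta * m_{t-1} + g_t
        x_t = x_{t-1} - lr * m_t
    """
    for p, g, m in zip(params, grads, bufs):
        m.mul_(beta).add_(g)
        p.sub_(lr * m)

# toy usage on f(x) = ||x||^2 / 2, whose gradient is x
x = torch.tensor([5.0, -3.0])
m = torch.zeros_like(x)
for _ in range(200):
    sgdm_step([x], [x.clone()], [m])
print(x)  # close to the minimizer at the origin
```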
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3835. Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
๐ฌ 20
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Image Quality Assessment Low Level Vision Multimodal Large Language Model
Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored in current research. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in certain scenarios. Through extensive experiments, this paper verifies and elaborates on how, through RL training, MLLMs leverage their reasoning capability to convert redundant visual representations into compact, cross-domain-aligned text representations. This conversion is precisely the source of the generalization exhibited by these reasoning-based IQA models. Building on this fundamental insight, we propose a novel algorithm, RALI, which employs contrastive learning to directly align images with these generalizable text representations learned by RL. This approach eliminates the reliance on reasoning processes and even obviates the need to load an LLM. For the quality scoring task, this framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.
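A minimal sketch of a contrastive image-to-text alignment loss of the kind described; how RALI constructs the text targets and negatives is not detailed in the abstract, so the one-to-one pairing below is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, text_emb, tau=0.07):
    """InfoNCE-style alignment: pull each image embedding toward its paired
    text representation and push it away from the others in the batch
    (illustrative recipe, not RALI's exact objective)."""
    img = F.normalize(img_emb, dim=-1)         # (B, D)
    txt = F.normalize(text_emb, dim=-1)        # (B, D), row i is the target of image i
    logits = img @ txt.t() / tau               # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```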
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3836. GOLDILOCS: GENERAL OBJECT-LEVEL DETECTION AND LABELING OF CHANGES IN SCENES
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Change detection scene change detection semantic change detection
We propose GOLDILOCS: a novel zero-shot, pose-agnostic method for object-level semantic change detection in the wild. While supervised Scene Change Detection (SCD) methods achieve impressive results on curated datasets, these models do not generalize and performance drops on out-of-domain data. Recent Zero-Shot SCD methods introduced a more robust approach with foundational models as backbone, yet they neglect the 3D aspect of the task and remain constrained to the image-pair setting. Conversely, 3D-centric SCD methods based on 3D Gaussian Splatting (3DGS) or NeRFs require multi-view inputs, but cannot operate on an image pair. Our key insight is that SCD can be reformulated as a 3D reconstruction problem over time, where geometric inconsistencies naturally indicate change. Although previous work considered viewpoint difference a challenge, we recognize the additional geometric information as an advantage. GOLDILOCS uses dense stereo reconstruction to estimate camera parameters and generate a pointmap of the commonalities between input images by filtering geometric inconsistencies. Rendering the canonical scene representation from multiple viewpoints yields reference images that exclude changed or occluded content. Rigid object changes are then detected through mask tracking, while nonrigid transformations are identified using SSIM heatmaps. We evaluate our method on a variety of datasets, covering both pairwise and multi-view cases in binary and multi-class settings, and demonstrate superior performance over prior work, including supervised methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3837. Robust Preference Alignment via Directional Neighborhood Consensus
๐ฌ 30
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Preference Alignment Inference-Time Method
Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short on specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not generalize to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing that, under mild conditions where (i) nearby preference directions correspond to better-trained regions of the model and (ii) the reward-model scores change smoothly with small angular changes in the preference vector, our neighborhood generation strategy yields a higher expected best score than a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically grounded solution for enhancing the reliability of preference-aligned models.
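A minimal sketch of the generate-from-neighbors, select-against-original-intent loop: the neighborhood size, the angular sampler, and the `generate`/`score` interfaces below are illustrative assumptions rather than RPS's exact construction.

```python
import numpy as np

def neighborhood_directions(pref, n=8, angle_deg=15.0, rng=None):
    """Sample unit preference vectors within a small angular neighborhood of the
    user's preference direction (illustrative sampler)."""
    if rng is None:
        rng = np.random.default_rng()
    pref = pref / np.linalg.norm(pref)
    dirs = []
    for _ in range(n):
        noise = rng.normal(size=pref.shape)
        noise -= noise.dot(pref) * pref              # component orthogonal to pref
        noise /= np.linalg.norm(noise)
        theta = np.radians(angle_deg) * rng.uniform()
        dirs.append(np.cos(theta) * pref + np.sin(theta) * noise)
    return dirs

def robust_preference_selection(pref, generate, score):
    """Generate one candidate per neighboring direction, then keep the response
    that scores best under the *original* preference. `generate(direction)` and
    `score(response, pref)` are user-supplied callables (hypothetical interfaces)."""
    candidates = [generate(d) for d in neighborhood_directions(pref)]
    return max(candidates, key=lambda resp: score(resp, pref))
```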
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3838. Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM
๐ฌ 24
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ SLAM 3DGS 3D Reconstruction 3D Foundation Model
Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming $\textit{Train-from-Scratch}$ optimization and the lack of inter-frame scale consistency from single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We trained a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly predicting Gaussian attributes, our method bypasses the burdensome per-frame optimization required in optimization-based GS-SLAM, achieving a $\textbf{10x}$ speedup while ensuring high-quality rendering. The power of our recurrent architecture extends beyond efficient prediction. The hidden states act as compact submap descriptors, facilitating efficient loop closure and global $\mathrm{Sim}(3)$ optimization to mitigate the long-standing challenge of drift. For enhanced geometric fidelity, we replace conventional 3D Gaussian ellipsoids with 2D Gaussian surfels. Extensive experiments demonstrate that Flash-Mono achieves state-of-the-art performance in both tracking and mapping quality, highlighting its potential for embodied perception and real-time reconstruction applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3839. Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multi-resolution hash encoding implicit neural representations neural fields point spread function spatial kernel analysis anisotropy resolution limit FWHM hash collisions signal-to-noise ratio NeRF
Multi-Resolution Hash Encoding (MHE), the foundational technique behind Instant Neural Graphics Primitives, provides a powerful parameterization for neural fields. However, its spatial behavior lacks rigorous understanding from a physical systems perspective, leading to reliance on heuristics for hyperparameter selection. This work introduces a novel analytical approach that characterizes MHE by examining its Point Spread Function (PSF), which is analogous to the Green's function of the system. This methodology enables a quantification of the encoding's spatial resolution and fidelity. We derive a closed-form approximation for the collision-free PSF, uncovering inherent grid-induced anisotropy and a logarithmic spatial profile. We establish that the idealized spatial bandwidth, specifically the Full Width at Half Maximum (FWHM), is determined by the average resolution, $N_{\text{avg}}$. This leads to a counterintuitive finding: the effective resolution of the model is governed by the broadened empirical FWHM (and therefore $N_{\text{avg}}$), rather than the finest resolution $N_{\max}$, a broadening effect we demonstrate arises from optimization dynamics. Furthermore, we analyze the impact of finite hash capacity, demonstrating how collisions introduce speckle noise and degrade the Signal-to-Noise Ratio (SNR). Leveraging these theoretical insights, we propose Rotated MHE (R-MHE), an architecture that applies distinct rotations to the input coordinates at each resolution level. R-MHE mitigates anisotropy while maintaining the efficiency and parameter count of the original MHE. This study establishes a methodology based on physical principles that moves beyond heuristics to characterize and optimize MHE.
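The R-MHE modification can be pictured as a per-level change of coordinates before the usual grid/hash lookup; how the rotations are chosen is not stated in the abstract, so random orthogonal matrices are an assumption in this sketch.

```python
import numpy as np

def level_rotations(num_levels, dim=3, seed=0):
    """One fixed rotation per resolution level, so the grid-induced anisotropy
    is oriented differently at each level (illustrative choice: random QR)."""
    rng = np.random.default_rng(seed)
    rots = []
    for _ in range(num_levels):
        Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
        rots.append(Q)
    return rots

def rotate_per_level(x, rots):
    """x: (N, dim) input coordinates -> list of rotated coordinates, one per
    level, to be fed to that level's usual hash-grid interpolation."""
    return [x @ Q.T for Q in rots]

coords = np.random.rand(1024, 3)
per_level = rotate_per_level(coords, level_rotations(num_levels=16))
print(len(per_level), per_level[0].shape)  # 16 levels, each (1024, 3)
```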
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3840. Learning Facts at Scale with Active Reading
๐ฌ 10
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ factuality tail knowledge synthetic data synthetic continued pretraining
LLMs are known to store vast amounts of knowledge in their parametric memory.
However, learning and recalling facts from this memory is unreliable, depending largely on the prevalence of particular facts in the training data and on other, poorly understood factors.
Practitioners lack tools that ensure models learn a given body of knowledge reliably and consistently.
To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies.
First, we demonstrate that models trained with Active Reading on expert domains absorb significantly more knowledge than models trained with vanilla finetuning or other data augmentations.
We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark.
Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models.
As a demonstration of this, we release WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3841. SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling
๐ฌ 8
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Vision 3D Generative Models Geometric Primitives
Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are cumbersome to edit. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation. Our approach accepts diverse geometric inputs, from coarse primitives to detailed meshes, and conditions a powerful pre-trained generative model without additional training. A controllable parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality.
Finally, we present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets, facilitating practical deployment in creative workflows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3842. Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
๐ฌ 21
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Gaussian Splatting Feed-Forward 3D Reconstruction
Existing feed-forward 3D Gaussian Splatting methods typically rely on pixel-aligned primitives, which makes scaling to higher resolutions (e.g., 4K) prohibitive as the number of Gaussians grows quadratically with image resolution. We introduce LGTM (Less Gaussians, Texture More), a feed-forward and pose-free framework that predicts both compact geometric primitives and associated per-primitive texture maps in a single forward pass without per-scene optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3843. R-WoM: Retrieval-augmented World Model For Computer-use Agents
๐ฌ 23
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Model Computer-use Agent World Model
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency to hallucinate and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models (future state prediction and reward estimation) through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3844. InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
๐ฌ 9
โญ 5.00
๐ generative models
๐ท๏ธ talking person video generation multi-concept video customization
End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject the local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables high-quality generation of human dialogue videos involving two or three people, as well as video customization from multiple reference images. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3845. Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Audio Language Models Audio-Contribution Post-Training
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2\% on MMAU-test-mini, 75.6\% on MMAU, 67.0\% on MMAR, and 71.7\% on MMSU, establishing new state-of-the-art performance across these benchmarks.
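The filtering step lends itself to a very small sketch: probe each question with the audio withheld and route it to the weak or strong subset depending on whether text alone suffices. The probing model and any voting scheme are assumptions not given in the abstract.

```python
def audio_contribution_filter(dataset, answer_without_audio):
    """Partition multiple-choice examples by audio contribution: if a text-only
    probe answers correctly without the audio, the audio contribution is weak;
    otherwise it is strong (illustrative reading of the described filter).
    """
    weak, strong = [], []
    for ex in dataset:
        pred = answer_without_audio(ex["question"], ex["choices"])
        (weak if pred == ex["answer"] else strong).append(ex)
    return weak, strong

# usage sketch: SFT on the weak (or mixed) subset first, then RL on the strong
# subset, matching the Weak-to-Strong / Mixed-to-Strong paradigms above.
```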
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3846. MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ text-to-speech (TTS) speech synthesis voice cloning Mamba state space models (SSM) diffusion TTS prosody modeling streaming/low-latency
MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and recurrence across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an expressive Mamba with AdaLN modulation, yielding linear-time $\mathcal{O}(T)$ conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F$_0$ RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by $1.6\times$. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.
Code: \url{https://github.com/aiai-9/MVC}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3847. Fast and Stable Riemannian Metrics on SPD Manifolds via Cholesky Product Geometry
๐ฌ 20
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Cholesky Decomposition Symmetric Positive Definite (SPD) SPD Manifold Riemannian Metrics SPD Neural Networks
Recent advances in Symmetric Positive Definite (SPD) matrix learning show that Riemannian metrics are fundamental to effective SPD neural networks. Motivated by this, we revisit the geometry of the Cholesky factors and uncover a simple product structure that enables convenient metric design. Building on this insight, we propose two fast and stable SPD metrics, Power--Cholesky Metric (PCM) and Bures--Wasserstein--Cholesky Metric (BWCM), derived via Cholesky decomposition. Compared with existing SPD metrics, the proposed metrics provide closed-form operators, computational efficiency, and improved numerical stability. We further apply our metrics to construct Riemannian Multinomial Logistic Regression (MLR) classifiers and residual blocks for SPD neural networks. Experiments on SPD deep learning, numerical stability analyses, and tensor interpolation demonstrate the effectiveness, efficiency, and robustness of our metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3848. On Measuring Influence in Avoiding Undesired Future
๐ฌ 22
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ avoiding undesired future rehearsal learning causal learning
When a predictive model anticipates an undesired future event, a question arises: what can we do to avoid it? Resolving this forward-looking challenge requires determining the variables that positively influence the future, moving beyond statistical correlations typically exploited for prediction. In this paper, we introduce a novel framework for evaluating the influence of actionable variables in successfully avoiding the undesired future. We quantify influence as the degree to which the probability of success can be increased by altering variables based on the principle of maximum expected utility. While closely related to causal effects, our analysis reveals a counterintuitive insight: influential variables may not necessarily be those with intrinsically strong causal effects on the target. In fact, due to the dynamics of the decision process, it can be highly beneficial to alter a weak causal factor, or even a variable that is not an intrinsic factor at all. We provide a practical implementation for computing the proposed quantity using observational data and demonstrate its utility through empirical studies on synthetic and real-world applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3849. Concepts' Information Bottleneck Models
๐ฌ 13
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Concept bottleneck models Information bottleneck Variational Inference
Concept Bottleneck Models (CBMs) aim to deliver interpretable predictions by routing decisions through a human-understandable concept layer, yet they often suffer reduced accuracy and concept leakage that undermines faithfulness. We introduce an explicit Information Bottleneck regularizer on the concept layer that penalizes $I(X;C)$ while preserving task-relevant information in $I(C;Y)$, encouraging minimal-sufficient concept representations. We derive two practical variants (a variational objective and an entropy-based surrogate) and integrate them into standard CBM training without architectural changes or additional supervision. Evaluated across six CBM families and three benchmarks, the IB-regularized models consistently outperform their vanilla counterparts. Information-plane analyses further corroborate the intended behavior. These results indicate that enforcing a minimal-sufficient concept bottleneck improves both predictive performance and the reliability of concept-level interventions. The proposed regularizer offers a theoretically grounded, architecture-agnostic path to more faithful and intervenable CBMs, resolving prior evaluation inconsistencies by aligning training protocols and demonstrating robust gains across model families and datasets.
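A minimal sketch of what a variational IB penalty on a concept layer can look like, assuming a Gaussian concept encoder and a standard-normal prior; the function names and hyperparameters are illustrative, not the paper's code.

```python
# Sketch (PyTorch): task loss + concept loss + beta * KL(q(c|x) || N(0, I)).
# The KL term variationally upper-bounds I(X; C), pushing the concept layer
# toward a minimal-sufficient representation.
import torch
import torch.nn.functional as F

def ib_cbm_loss(concept_mu, concept_logvar, concept_logits, concept_targets,
                task_logits, task_targets, beta=1e-3):
    task_loss = F.cross_entropy(task_logits, task_targets)
    concept_loss = F.binary_cross_entropy_with_logits(concept_logits,
                                                      concept_targets)
    kl = 0.5 * (concept_mu.pow(2) + concept_logvar.exp()
                - concept_logvar - 1.0).sum(dim=1).mean()
    return task_loss + concept_loss + beta * kl
```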
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3850. Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
๐ฌ 17
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Large Language Models; Knowledge Attribution; Interpretability and explainable AI; Citations
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable due to hallucinations. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during (continual) pretraining, without test-time retrieval, by revising the training process. To study this, we construct **CitePretrainBench**, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel, unseen documents and probes both short-form (single fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) Continual-pretraining to index factual knowledge by binding it to persistent document identifiers; (2) Instruction tuning to elicit citation behavior. We introduce **Active Indexing** for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source$\to$fact and fact$\to$source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B and 3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2\% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count.
Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3851. APC-RL: Exceeding data-driven behavior priors with adaptive policy composition
๐ฌ 22
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Normalizing Flows Demonstrations Data Behavior Prior Learned action space
Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task.
In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when these demonstrations are integrated into RL.
We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors.
Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration.
Moreover, APC either refines useful priors, or sidesteps misaligned ones when necessary to optimize downstream reward.
Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding performance ceilings caused by overly strict adherence to suboptimal demonstrations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3852. Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
๐ฌ 14
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety and Robustness Safety Alignment Interpretability Refusal out-of-distribution Representation Engineering Jailbreaking Multimodal Alignment Concept Vectors
Multimodal Language Models (MMLMs) typically undergo post-training alignment to prevent harmful content generation. However, these alignment stages focus primarily on the *assistant* role, leaving the *user* role unaligned and assuming a fixed input prompt structure of special tokens, which makes the model vulnerable when inputs deviate from these expectations. We introduce Role-Modality Attacks (RMA), a novel class of adversarial attacks that exploit role confusion between the *user* and *assistant* and alter the position of the *image* token to elicit harmful outputs. Unlike existing attacks that modify query content, RMAs manipulate the input structure without altering the query itself. We systematically evaluate these attacks across multiple Vision Language Models (VLMs) in eight distinct settings, showing that they can be composed to create stronger adversarial prompts, as also evidenced by their increased projection in the negative refusal direction in the residual stream, a property observed in prior successful attacks. Finally, for mitigation, we propose an adversarial training approach that makes the model robust against input prompt perturbations. By training the model on a range of harmful and benign prompts, all perturbed with different RMA settings, the model loses its sensitivity to Role Confusion and Modality Manipulation attacks and is trained to only pay attention to the query content in the input prompt structure, effectively reducing Attack Success Rate (ASR) while preserving the model's general utility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3853. Zero-Sacrifice Lifelong Adversarial Defense for Pre-Trained Encoders
๐ฌ 28
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ adversarial robustness self-supervised learning pretrained models security vulnerability
The widespread use of publicly available pre-trained encoders from self-supervised learning (SSL) has exposed a critical vulnerability: their susceptibility to downstream-agnostic adversarial examples (DAEs), which are crafted without knowledge of the downstream tasks but are capable of misleading downstream models. While several defense methods have been explored recently, they rely primarily on task-specific adversarial fine-tuning, which inevitably limits generalizability, causes catastrophic forgetting, and degrades benign performance.
Unlike previous works, we propose a more rigorous defense goal that requires only a single tuning for diverse downstream tasks to defend against DAEs and preserve benign performance.
To achieve this defense goal, we introduce **Ze**ro-Sacrifice
**P**ersistent-Robustness **A**dversarial **D**efense (**ZePAD**), which is inspired by the inherent sensitivity of neural networks to data characteristics. Specifically, ZePAD is a dual-branch structure: a Multi-Pattern Adversarial Enhancement Branch (MPAE-Branch) that uses two adversarially fine-tuned encoders to strengthen adversarial resistance, and a Benign Memory Preservation Branch (BMP-Branch) trained on local data to ensure adversarial robustness does not compromise benign performance. Surprisingly, we find that ZePAD can directly detect DAEs by evaluating branch confidence, without introducing any adversarial example identification task during training. Notably, by enriching feature diversity, our method enables a single adversarial fine-tuning to defend against DAEs across downstream tasks, thereby achieving persistent robustness. Extensive experiments on 11 SSL methods and 6 datasets validate its effectiveness. In certain cases, it achieves a 29.20\% improvement in benign performance and a 73.86\% gain in adversarial robustness, highlighting its zero-sacrifice property.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3854. FACET: A Fragment-Aware Conformer Ensemble Transformer
๐ฌ 26
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ molecular properties prediction 3D conformers graph transformer 2D-3D fusion fragment aware module Fused Gromov-Wasserstein distance
Accurately predicting molecular properties requires effective integration of structural information from both 2D molecular graphs and their corresponding equilibrium conformer ensembles. In this work, we propose FACET, a scalable Structure-Aware Graph Transformer that efficiently aggregates features from multiple 3D conformers while incorporating fragment-level information from 2D graphs. Unlike prior methods that rely on static geometric solvers or rigid fusion strategies, our approach utilizes a differentiable graph transformer to theoretically approximate the computationally expensive Fused Gromov-Wasserstein (FGW) distance, enabling dynamic and scalable fusion of 2D and 3D structural information. We further enhance this mechanism by injecting fragment-specific structural priors into the attention layers, enabling the model to capture fine-grained molecular details. This unified design scales to large datasets, handling up to 75,000 molecules and hundreds of thousands of conformers, and provides over a 6x speedup compared to geometry-aware FGW-based baselines. Our method also achieves state-of-the-art results in molecular property prediction, Boltzmann-weighted ensemble modeling, and reaction-level tasks, and is particularly effective on chemically diverse compounds, including organocatalysts and transition-metal complexes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3855. Resurfacing the Instance-only Dependent Label Noise Model through Loss Correction
๐ฌ 29
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ label noise loss correction instance-dependence risk equivalence
We investigate the label noise problem in supervised binary classification settings and resurface the underutilized instance-_only_ dependent noise model through loss correction. On the one hand, based on risk equivalence, the instance-aware loss correction scheme completes the bridge from _empirical noisy risk minimization_ to _true clean risk minimization_ provided the base loss is classification calibrated (e.g., cross-entropy). On the other hand, the instance-only dependent modeling of the label noise at the core of the correction enables us to estimate a single value per instance instead of a matrix. Furthermore, the estimation of the transition rates becomes a very flexible process, for which we offer several computationally efficient ways. Empirical findings over different dataset domains (image, audio, tabular) with different learners (neural networks, gradient-boosted machines) validate the promised generalization ability of the method.
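A minimal sketch of the classic backward loss correction for binary labels, specialized to an instance-only dependent flip rate (one scalar per example rather than a transition matrix); the estimator follows the standard unbiased-correction form, and the variable names are ours rather than the paper's.

```python
# Sketch (NumPy): corrected logistic loss under instance-dependent label noise.
import numpy as np

def corrected_logistic_loss(scores, noisy_labels, rho):
    """scores: model outputs f(x); noisy_labels in {-1, +1}; rho(x) in [0, 0.5)."""
    def logistic(z):
        return np.log1p(np.exp(-z))
    loss_obs = logistic(noisy_labels * scores)      # loss against the observed label
    loss_flip = logistic(-noisy_labels * scores)    # loss against the flipped label
    # Unbiased correction: in expectation over the noise this equals the clean loss.
    return ((1.0 - rho) * loss_obs - rho * loss_flip) / (1.0 - 2.0 * rho)
```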
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3856. From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking
๐ฌ 14
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ agents research agents coding agents benchmark
Recent progress in autonomous code generation has fueled excitement around AI agents capable of accelerating scientific discovery by running experiments. However, there is currently no benchmark that evaluates whether such agents can implement scientific ideas when given varied amounts of code as a starting point, interpolating between reproduction (running code) and from-scratch replication (fully re-implementing and running code). We introduce AutoExperiment, a benchmark that evaluates AI agents' ability to implement and run machine learning experiments based on natural language descriptions in research papers. In each task, agents are given a research paper, a codebase with key functions masked out, and a command to run the experiment. The goal is to generate the missing code, execute the experiment in a sandboxed environment, and reproduce the results. AutoExperiment scales in difficulty by varying the number of missing functions $n$, ranging from partial reproduction to full replication. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Agents that can dynamically interact with the environment (e.g. to debug their code) can outperform agents in fixed ``agentless'' harnesses, and there exists a significant gap between single-shot and multi-trial success rates (Pass@1 vs. Pass@5), motivating verifier approaches to our benchmark. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution, establishing AutoExperiment as a new benchmark for evaluating progress in AI-driven scientific experimentation. Our data and code are open-sourced at https://anonymous.4open.science/r/AutoExperiment-5C54.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3857. Tight Bounds for Schrodinger Potential Estimation in Unpaired Data Translation
๐ฌ 14
โญ 5.00
๐ learning theory
๐ท๏ธ Learning theory stochastic optimal control Schrodinger potential non-asymptotic bounds
Modern methods of generative modelling and unpaired data translation based on Schrodinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired data translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrodinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrodinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3858. Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
๐ฌ 22
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Virtual Try-Off Fashion Generation Diffusion Transformer
While virtual try-on (VTON) systems aim to render a garment onto a target person, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent output format, typically a flat, lay-down style, making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) they are fundamentally constrained by their exclusive reliance on ambiguous visual information from the source image, and (ii) they frequently produce images with severely degraded details, preventing their use in practical applications. To overcome these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone. To resolve visual ambiguity, our model leverages a modified multimodal attention mechanism that incorporates information from images, text, and masks, enabling robust feature extraction in a multi-category setting. To explicitly mitigate detail degradation, we propose an additional alignment module that refines the generated visual details to achieve high fidelity. Experiments on the VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments. Our code and models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3859. Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns
๐ฌ 22
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Soft Actor-Critic (SAC) Transformer-based Critic Sequence Chunking N-step Returns Critic Alignment Double Q-Learning Deep Reinforcement Learning
We introduce a sequence-conditioned critic for Soft Actor--Critic (SAC) that models trajectory context with a lightweight Transformer and trains on aggregated $N$-step targets. Unlike prior approaches that (i) score state--action pairs in isolation or (ii) rely on actor-side action chunking to handle long horizons, our method strengthens the critic itself by conditioning on short trajectory segments and integrating multi-step returns---without importance sampling (IS). The resulting sequence-aware value estimates capture temporal structure critical for extended-horizon and sparse-reward problems. On local-motion benchmarks, we further show that freezing critic parameters for several steps makes our update compatible with CrossQ's core idea, enabling stable training without a target network. Despite its simplicity---a 2-layer Transformer with 128--256 hidden units and a maximum update-to-data ratio (UTD) of $1$---the approach consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control. These results highlight the value of sequence modeling and $N$-step bootstrapping on the critic side for long-horizon reinforcement learning.
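A minimal sketch of the aggregated $N$-step soft target that such a critic could be trained against, using one common maximum-entropy form over a trajectory segment; the paper's sequence-conditioned Transformer critic itself is not reproduced here, and the exact aggregation may differ.

```python
# Sketch (NumPy): N-step soft bootstrap target for one trajectory segment.
import numpy as np

def n_step_soft_target(rewards, next_value, log_probs, gamma=0.99, alpha=0.2):
    """Backward recursion: target_k = (r_k - alpha*log pi_k) + gamma * target_{k+1},
    seeded with gamma^0 applied to the bootstrap value V(s_N)."""
    target = next_value
    for k in reversed(range(len(rewards))):
        target = rewards[k] - alpha * log_probs[k] + gamma * target
    return target
```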
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3860. Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences
๐ฌ 12
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ ASR multimodal LLM speech processing TTS datasets
Conversion of spoken mathematical expressions is a challenging task that involves transcribing speech into a strictly structured symbolic representation while addressing the ambiguity inherent in the pronunciation of equations. Although significant progress has been achieved in automatic speech recognition (ASR) and language models (LM), the problem of converting spoken mathematics into LaTeX remains underexplored. This task directly applies to educational and research domains, such as lecture transcription or note creation. Prior work, based on ASR post-correction, requires two transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in English and Russian, drawn from diverse scientific domains. In addition to the ASR post-correction models and few-shot prompting, we apply audio language models, demonstrating comparable character error rate (CER) results on the MathSpeech benchmark (28\% vs. 30\%) for equation conversion. In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 36 percentage points, even after accounting for LaTeX formatting artifacts (27\% vs. 64\%). We establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40\%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3861. Random Controlled Differential Equations
๐ฌ 12
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ random features time-series path signatures CDEs RDEs reservoir computing kernels
We introduce a training-efficient framework for time-series learning that combines random features with controlled differential equations (CDEs). In this approach, large randomly parameterized CDEs act as continuous-time reservoirs, mapping input paths to rich representations. Only a linear readout layer is trained, resulting in fast, scalable models with strong inductive bias. Building on this foundation, we propose two variants: (i) Random Fourier CDEs (RF-CDEs): these lift the input signal using random Fourier features prior to the dynamics, providing a kernel-free approximation of RBF-enhanced sequence models; (ii) Random Rough DEs (R-RDEs): these operate directly on rough-path inputs via a log-ODE discretisation, using log-signatures to capture higher-order temporal interactions while remaining stable and efficient. We prove that in the infinite-width limit, these models induce the RBF-lifted signature kernel and the rough signature kernel, respectively, offering a unified perspective on random-feature reservoirs, continuous-time deep architectures, and path-signature theory.
We evaluate both models across a range of time-series benchmarks, demonstrating competitive or state-of-the-art performance. These methods provide a practical alternative to explicit signature computations, retaining their inductive bias while benefiting from the efficiency of random features. Code is publicly available at: \url{https://anonymous.4open.science/r/RandomSigJax-C768/}
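A minimal sketch of the reservoir idea in this family of models: a randomly parameterized state update driven by path increments, with only a linear (ridge) readout trained. The discretization and nonlinearity below are illustrative assumptions, not the paper's exact RF-CDE or R-RDE model.

```python
# Sketch (NumPy): random CDE-style reservoir features plus a trained linear readout.
import numpy as np

def random_cde_features(path, state_dim=64, seed=0):
    """path: (T, d) array of observations; returns the final reservoir state."""
    rng = np.random.default_rng(seed)
    d = path.shape[1]
    A = rng.normal(size=(d, state_dim, state_dim)) / np.sqrt(state_dim)
    h = np.ones(state_dim) / np.sqrt(state_dim)
    for dx in np.diff(path, axis=0):
        drive = np.tensordot(dx, A, axes=1)   # sum_i dx_i * A_i, a (state_dim, state_dim) matrix
        h = h + np.tanh(drive @ h)            # one Euler step of the randomly driven dynamics
    return h

def fit_linear_readout(features, targets, ridge=1e-3):
    # Ridge regression: the only trained component.
    X, y = np.asarray(features), np.asarray(targets)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
```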
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3862. InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
๐ฌ 23
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ thinking with images o3 visual search multi-agent framework reinforcement learning
The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspects that are crucial for real-world tasks like analyzing documents with dense charts/diagrams or navigating maps. To address this gap, we first introduce o3-bench, a new benchmark designed to evaluate multimodal reasoning while attending to visual details. O3-bench features challenging questions that require agents to gather subtle visual information from multiple distinct areas of an image while performing complex, interleaved reasoning using the gathered information. These tasks are highly challenging even for frontier systems like OpenAI o3, which only obtains 42.8% accuracy on o3-bench. To tackle these tasks, we propose InSight-o3, a multi-agent framework that divides labor between a visual reasoning agent (vReasoner) and a visual search agent (vSearcher). As a concrete first step towards o3-like systems, we focus on the latter (i.e., vSearcher) in this paper, for which we introduce the task of generalized visual search---locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent that can be directly called by other agents, our vSearcher significantly improves the performance of existing frontier multimodal models by empowering them with generalized visual search on a wide range of benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3863. Learning to Reason via Mixture-of-Thought for Logical Reasoning
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Logical Reasoning Self-evolving Training Large Language Models Parallel Scaling Test time scaling
Human beings naturally utilize multiple reasoning modalities to learn and solve logical problems, i.e., different representational formats such as natural language, code, and symbolic logic. In contrast, most existing LLM-based approaches operate with a single reasoning modality during training, typically natural language. Although some methods explored modality selection or augmentation at inference time, the training process remains modality-blind, limiting synergy among modalities. To fill in this gap, we propose Mixture-of-Thought (MoT), a framework that enables LLMs to reason across three complementary modalities: natural language, code, and a newly introduced symbolic modality, truth-table, which systematically enumerates logical cases and partially mitigates key failure modes in natural language reasoning. MoT adopts a two-phase design: (1) **self-evolving MoT training**, which jointly learns from filtered, self-generated rationales across modalities; and (2) **MoT inference**, which fully leverages the synergy of three modalities to produce better predictions. Experiments on logical reasoning benchmarks including FOLIO and ProofWriter demonstrate that our MoT framework consistently and significantly outperforms strong LLM baselines with single-modality chain-of-thought approaches,
achieving up to **+11.7pp** average accuracy gain.
Further analyses show that our MoT framework benefits both training and inference stages; that it is particularly effective on harder logical reasoning problems; and that different modalities contribute complementary strengths, with truth-table reasoning helping to overcome key bottlenecks in natural language inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3864. SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks
๐ฌ 34
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Jailbreak Adversarial Attack Safe AI
As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens to the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate slots, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the slots. Based on these findings, we introduce the Vulnerable Slot Score (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14% higher Attack Success Rates (ASR) over GCG-based attacks, converges faster, and shows superior robustness against defense methods with 42% higher ASR than baseline approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3865. ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models
๐ฌ 6
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Tool-Augmented LLMs Scalable Tool Use Tool Learning Collaborative Semantics
Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic to the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool's intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically-aware foundation for advanced tool-augmented agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3866. Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
๐ฌ 26
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ RLVR LLM Reasoning
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings, where stale data from past policies are used for training, improves sample efficiency but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios, including sample replay and partial rollout, BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
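A minimal sketch of a PPO-style token loss with asymmetric clipping bounds that can be adjusted separately on the lower and upper side, which is the kind of mechanism the abstract attributes to BAPO; the rule for adapting the bounds themselves is not shown, and the names are illustrative.

```python
# Sketch (PyTorch): clipped policy-gradient loss with asymmetric, tunable bounds.
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, clip_low=0.2, clip_high=0.28):
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    # Pessimistic (min) objective, negated so it can be minimized.
    loss = -torch.minimum(ratio * advantages, clipped * advantages)
    return loss.mean()
```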
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3867. PetaGAIL++: Utility Optimized Private Trajectory Generation with Imitation Learning
๐ฌ 9
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Differential Privacy Imitation Learning
Human mobility trajectory data supports a wide range of applications, including urban planning, intelligent transportation systems, and public safety monitoring. However, large-scale, high-quality mobility datasets are difficult to obtain due to privacy concerns. Raw trajectory data may reveal sensitive user information, such as home addresses, routines, or social relationships, making it crucial to develop privacy-preserving alternatives. Recent advances in deep generative modeling have enabled synthetic trajectory generation, but existing methods either lack formal privacy guarantees or suffer from reduced utility and scalability. Differential Privacy (DP) has emerged as a rigorous framework for data protection, and recent efforts such as PATE-GAN and PATEGAIL integrate DP with generative adversarial learning. While promising, these methods struggle to generalize across diverse trajectory patterns and often incur significant utility degradation. In this work, we propose a new framework that builds on PATEGAIL by introducing a sensitivity-aware noise injection module that dynamically adjusts privacy noise based on sample-level sensitivity. This design significantly improves trajectory fidelity, downstream task performance, and scalability under strong privacy guarantees. We evaluate our method on real-world mobility datasets and demonstrate its superiority over state-of-the-art baselines in terms of privacy-utility trade-off.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3868. The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
๐ฌ 16
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ safety alignment multi-agent reinforcement learning
Harnessing the power of LLMs requires a delicate dance between being helpful and harmless, leading to two critical challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses.
At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3869. Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Unified Multimodal Language Models
Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned.
To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers.
Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks.
These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling.
Our code is available at https://anonymous.4open.science/r/Uni-X-Code-E5CD.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3870. ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL
๐ฌ 18
โญ 5.00
๐ reinforcement learning
๐ท๏ธ RL POMDP Memory Transformer Robotics
Real-world robotic agents must act under partial observability and long horizons, where key cues may appear long before they affect decision making. However, most modern approaches rely solely on instantaneous information, without incorporating insights from the past. Standard recurrent or transformer models struggle with retaining and leveraging long-term dependencies: context windows truncate history, while naive memory extensions fail under scale and sparsity.
We propose $\textbf{ELMUR}$ ($\textbf{E}$xternal $\textbf{L}$ayer $\textbf{M}$emory with $\textbf{U}$pdate/$\textbf{R}$ewrite), a transformer architecture with structured external memory. Each layer maintains memory embeddings, interacts with them via bidirectional cross-attention, and updates them through an $\textbf{L}$east $\textbf{R}$ecently $\textbf{U}$sed $\textbf{(LRU)}$ memory module using replacement or convex blending.
ELMUR extends effective horizons up to 100,000 times beyond the attention window and achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps. In POPGym, it outperforms baselines on more than half of the tasks. On MIKASA-Robo sparse-reward manipulation tasks with visual observations, it nearly doubles the performance of strong baselines. These results demonstrate that structured, layer-local external memory offers a simple and scalable approach to decision making under partial observability.
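A minimal sketch of a least-recently-used external memory with convex blending, illustrating the kind of layer-local update the abstract describes; slot selection and the blending gate are simplified stand-ins for the paper's learned components.

```python
# Sketch (NumPy): LRU memory slots updated by replacement or convex blending.
import numpy as np

class LRUMemory:
    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.slots = rng.normal(size=(num_slots, dim)) * 0.01
        self.last_used = np.zeros(num_slots)   # higher value = used more recently
        self.step = 0

    def update(self, candidate, gate):
        """Write `candidate` into the least recently used slot.

        gate in [0, 1]: 1.0 replaces the slot outright, smaller values blend
        (convex combination) the candidate with the old slot content."""
        self.step += 1
        idx = int(np.argmin(self.last_used))
        self.slots[idx] = gate * candidate + (1.0 - gate) * self.slots[idx]
        self.last_used[idx] = self.step
```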
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3871. Fed-Duet: Dual Expert-Orchestrated Framework for Continual Federated Vision-Language Learning
๐ฌ 29
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Learning Federated Continual Learning Prompt learning Vision-Language model
Pretrained vision-language models (VLMs), such as CLIP, have shown promise in federated learning (FL) by bringing strong multimodal representations to edge devices. However, continual adaptation remains a core challenge in practical federated settings, where task distributions evolve over time and data remain non-IID across clients. In this emerging area, recent works adopt parameter-efficient fine-tuning (PEFT) as a lightweight way to reduce communication overhead, yet they fail to preserve satisfactory performance under continual learning conditions. Meanwhile, traditional federated continual learning (FCL) methods lack the capacity to maintain cross-modal alignment crucial to VLM performance. We introduce Fed-Duet, a novel Dual Expert-orchestrated framework for efficient federated continual learning in vision-language models. Fed-Duet features a dual-expert adaptation mechanism, combining server-coordinated semantic prompts with client-personalized modular adapters. These pathways are dynamically fused via a cross-attention mechanism, enabling effective knowledge transfer while preserving multimodal alignment and mitigating forgetting. We evaluate Fed-Duet across multiple challenging continual learning tasks in federated vision-language settings and demonstrate that it achieves superior performance and stability compared to existing approaches. Our work highlights the importance of coordinated expert composition in enabling scalable and robust multimodal continual learning. The code is available at https://anonymous.4open.science/r/FedDuet-0426/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3872. HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Fine-Tuning Multilingual Multitask
Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3873. To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
๐ฌ 28
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Compression LLM
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision \emph{floating-point} formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an \emph{exponent concentration} phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose \textbf{Exponent-Concentrated FP8 (ECF8)}, a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9\% memory savings and 177.1\% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
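A minimal sketch of the quantity the abstract centers on: the empirical entropy of the FP32 exponent fields in a weight tensor, which is what makes lossless entropy coding of exponents attractive when it is low. The sample distribution in the example is an assumption for illustration, not the paper's data.

```python
# Sketch (NumPy): entropy (in bits) of the 8-bit exponent field of FP32 weights.
import numpy as np

def exponent_entropy_bits(weights):
    bits = np.asarray(weights, dtype=np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF            # extract the 8-bit exponent field
    _, counts = np.unique(exponents, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())      # entropy per exponent symbol

# A heavy-tailed weight sample typically yields entropy well below the 8-bit field width.
print(exponent_entropy_bits(np.random.standard_t(df=3, size=100_000) * 0.02))
```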
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3874. Learn to Guide Your Diffusion Model
๐ฌ 20
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion models Classifier-free Guidance Conditional sampling generative mode
Classifier-free guidance (CFG) is a widely used technique for improving the perceptual quality of samples from conditional diffusion models. It operates by linearly combining conditional and unconditional score estimates using a *guidance weight* $\omega$. While a large, static weight can markedly improve visual results, this often comes at the cost of poorer distributional alignment.
In order to better approximate the target conditional distribution,
we instead learn *guidance weights* $\omega_{c,(s,t)}$, which are continuous functions of the conditioning $c$, the time $t$ from which we denoise, and the time $s$ towards which we denoise.
We achieve this by minimizing the distributional mismatch between noised samples from the true conditional distribution and samples from the guided diffusion process. We extend our framework to reward-guided sampling, enabling the model to target distributions tilted by a reward function $R(x_0,c)$, defined on clean data and a conditioning $c$. We demonstrate the effectiveness of our methodology on low-dimensional toy examples and high-dimensional image settings, where we observe improvements in Fréchet inception distance (FID) for image generation. In text-to-image applications, we observe that employing a reward function given by the CLIP score leads to guidance weights that improve image-prompt alignment.
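A minimal sketch of classifier-free guidance with a learned, condition- and time-dependent weight $\omega_{c,(s,t)}$ instead of a fixed scalar; the small MLP and its inputs are illustrative assumptions, not the paper's architecture.

```python
# Sketch (PyTorch): learned guidance weight and the standard CFG combination.
import torch
import torch.nn as nn

class GuidanceWeight(nn.Module):
    def __init__(self, cond_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + 2, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, cond_emb, t, s):
        # omega depends on the conditioning c and the denoising times (t -> s).
        ts = torch.stack([t, s], dim=-1)
        return self.net(torch.cat([cond_emb, ts], dim=-1))

def guided_score(eps_cond, eps_uncond, omega):
    # Linear CFG combination, now with a per-sample learned weight.
    return eps_uncond + omega * (eps_cond - eps_uncond)
```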
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3875. Learning Human Habits with Rule-Guided Active Inference
๐ฌ 17
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Human Behavior Modeling Active Inference Logic Rule Wake-Sleep Inference
Humans navigate daily life by combining two modes of behavior: deliberate planning in novel situations and fast, automatic responses in familiar ones. Modeling human decision-making therefore requires capturing how people switch between these modes. We present a framework for learning human habits with rule-guided active inference, extending the view of the brain as a prediction machine that minimizes mismatches between expectations and observations to the computational modeling of human(-like) behavior and habits. In our approach, habits emerge as symbolic rules that serve as compact, interpretable shortcuts for action. To learn these rules alongside the human models, we design a biologically inspired wake--sleep algorithm. In the wake phase, the agent engages in active inference on real trajectories: reconstructing states, updating beliefs, and harvesting candidate rules that reliably reduce free energy. In the sleep phase, the agent performs generative replay with its world model, refining parameters and consolidating or pruning rules by minimizing joint free energy. This alternating rule-model consolidation lets the agent build a reusable habit library while preserving the flexibility to plan. Experiments on basketball player movements, car-following behavior, medical diagnosis, and visual game strategy demonstrate that our framework improves predictive accuracy and efficiency compared to logic-based, deep learning, LLM-based, model-based RL, and prior active inference baselines, while producing interpretable rules that mirror human-like habits.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3876. Latent Adaptation of Foundation Policies for Sim-to-Real Transfer
๐ฌ 18
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Sim-to-Real Domain Adaptation Foundation Policy
The sim-to-real problem remains a critical challenge in the real-world application of reinforcement learning (RL). The conventional sim-to-real methods heavily rely on resource-intensive re-training of the policy network to adapt to new domains, which limits the flexibility of the deployment of RL policies in ever-changing environments. Inspired by human locomotion, where individuals adjust their gait to new surface conditions without relearning the skill of walking, we introduce Latent Adaptation of Foundation Policies (Found-adapt), a framework that decouples this problem into skill acquisition and environment adaptation. Our method first pretrains a foundation policy on unlabeled offline trajectories from the source simulator, capturing diverse long-horizon behaviors as reusable skills. At deployment, instead of retraining the policy, we perform efficient latent space adaptation: a small amount of target-domain data is collected to refine a latent representation through an adapter network that incorporates parameter-efficient alignment, which produces a task-ready controller under various system dynamics. This adaptation occurs entirely in the latent space, avoiding costly policy optimization while enabling robust transfer. Empirical results across multiple locomotion tasks and dynamic variations demonstrate that our method significantly reduces the sim-to-real gap. Further sensitivity analysis provides interesting insights into the requirements for data quality and applicable situations. These findings highlight how foundation policies with latent adaptation could serve as a general and efficient paradigm for real-world RL deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3877. Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
๐ฌ 12
โญ 5.00
๐ optimization
๐ท๏ธ Stochastic Differential Equations Differential Privacy
Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with *adaptivity* in optimization through the lens of *stochastic differential equations*, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges at a privacy-utility trade-off $O(1/\varepsilon^2)$ with speed independent of $\varepsilon$, while DP-SignSGD converges at a speed *linear* in $\varepsilon$ with a $O(1/\varepsilon)$ trade-off, dominating in high-privacy or high-noise regimes. Under optimal learning rates, both methods reach comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of DP-SignSGD is essentially $\varepsilon$-independent. This makes adaptive methods far more practical, as their hyperparameters transfer across privacy levels with little or no re-tuning. Empirical results confirm our theory across training and test metrics, and extend from DP-SignSGD to DP-Adam.
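A minimal sketch of one private update step with per-example clipping, contrasting a DP-SGD-style update with a DP-SignSGD-style update; noise-scale calibration to a target $(\varepsilon, \delta)$ and the paper's SDE analysis are omitted, and the names are illustrative.

```python
# Sketch (NumPy): per-example clipping, Gaussian noise, then either a plain or a sign update.
import numpy as np

def private_step(per_example_grads, clip_norm, noise_std, lr, sign=False):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noisy_mean = (np.mean(clipped, axis=0)
                  + np.random.normal(scale=noise_std, size=clipped[0].shape)
                  / len(clipped))
    update = np.sign(noisy_mean) if sign else noisy_mean   # DP-SignSGD vs. DP-SGD
    return -lr * update
```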
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3878. ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
๐ฌ 11
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Agent
While reasoning models trained with reinforcement learning (RL) excel in reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic code-augmented long-form reasoning data for cold-start training. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in performance and efficiency. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals generalization to broader tool-use scenarios and emergent behaviors such as code self-correction, signaling an ''aha moment'' in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3879. Activation Steering for LLM Alignment via a Unified ODE-Based Framework
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM alignment Representation Engineering Activation Steering ODE-based Framework Barrier Functions
Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: \textit{(i)} the lack of a unified theoretical framework for guiding the design of steering directions, and \textit{(ii)} an over-reliance on \textit{one-step steering} that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based \textit{theoretical} framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a \textit{barrier function} from control theory. Derived from this framework, we introduce \textsc{Bodes} (\textbf{B}arrier function-guided \textbf{ODE} \textbf{S}teering), which shows \textit{empirical} advancement in LLM alignment. \textsc{Bodes} identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for \textit{multi-step and adaptive} steering. Compared to state-of-the-art activation steering methods, \textsc{Bodes} achieves consistent empirical improvements on diverse LLM alignment benchmarks, with a notable 7\% improvement on TruthfulQA, 2\% on RealToxicityPrompts, and 2\% on UltraFeedback. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed \textsc{Bodes} method. We will release our source code after the paper is published.
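A minimal sketch of multi-step activation steering as Euler integration of an ODE whose drift is the gradient of a log-density ratio, here using isotropic Gaussian fits to positive and negative activations; this is our reading of the abstract, not the authors' implementation. With a shared covariance the direction reduces to a constant activation-addition vector, which matches the abstract's observation that activation addition is a first-order special case.

```python
# Sketch (NumPy): Euler steps along grad log p_pos(h) - grad log p_neg(h).
import numpy as np

def gaussian_logpdf_grad(h, mean, var):
    return -(h - mean) / var

def steer(h, pos_mean, neg_mean, var, step_size=0.1, n_steps=5):
    """Move an activation h toward the positive-activation region over several steps."""
    for _ in range(n_steps):
        direction = (gaussian_logpdf_grad(h, pos_mean, var)
                     - gaussian_logpdf_grad(h, neg_mean, var))
        h = h + step_size * direction
    return h
```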
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3880. MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
๐ฌ 22
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Understanding Multimodal LLMs Fine-grained Motion
Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential in static images, its application to videos' temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether inherent capability can be unlocked to boost MLLMs' motion perception and enable distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce $\mathtt{MotionSight}$, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated $\mathtt{MotionVid-QA}$, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, $\Theta{(40K)}$ video clips and $\Theta{(87K)}$ QAs. Experiments show $\mathtt{MotionSight}$ achieves state-of-the-art open-source performance and competitiveness with commercial models. Using $\mathtt{MotionVid-QA}$, we fine-tuned $\mathtt{MotionChat}$ on Qwen2.5VL-7B, which attains 48.3\% overall accuracy on FAVOR-Bench that is comparable to Qwen2.5VL-72B's 48.1\%. In summary, we present a novel zero-shot method and a large-scale, high-quality dataset specifically for fine-grained motion understanding. All the code and annotations will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3881. S3OD: Towards Generalizable Salient Object Detection with Synthetic Data
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ salient object detection sod segmentation diffusion models image generation synthetic data
Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3882. Deep SPI: Safe Policy Improvement via World Models
๐ฌ 5
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning guarantees representation learning model-based
Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, ''deep'' analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on-policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE-57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3883. Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Unified Multimodal Large Models Text-to-image generation Reasoning Models
Current research in multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify its primary cause as a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This algorithm re-frames the single-step generation task into a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improving the understanding abilities related to the generation process. This offers valuable insights for designing next-generation unified multimodal models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3884. Long-tailed Test-Time Adaptation for Vision-Language Models
๐ฌ 30
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Test-Time Adaptation; Vision-Language models; CLIP; Long-tailed Learning
Test-Time Adaptation (TTA) aims to further adapt models to unlabeled test sets arriving in a sequential datastream, thereby progressively strengthening the model's generalization ability. While existing TTA methods for Vision-Language Models (VLMs) are primarily designed and evaluated on (nearly) balanced dataset configurations, real-world test sets may exhibit a long-tailed distribution where major classes dominate the decision boundaries of minor classes, presenting unique challenges. As the first attempt to solve this problem, this paper proposes Long-tailed Test-Time Adaptation (dubbed L-TTA), which consists of three co-designed mechanisms: Synergistic Prototypes (SyPs), Rebalancing Shortcuts (RSs), and Balanced Entropy Minimization (BEM). SyPs introduce two fine-grained prototypes to enrich tail classes with extra inter-class knowledge; RSs employ learnable shortcuts for adaptation, regularized by a class re-allocation loss that enforces distinct feature clustering; BEM restrains excessive entropy minimization of confident classes with an extra penalty term, supported by theoretical propositions that justify its rebalancing capabilities. Extensive experiments over 15 datasets under various long-tailed settings highlight the superior performance of L-TTA in both accuracy and class balancing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3885. Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence
๐ฌ 6
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Polysemanticity Sparse Autoencoders Interpretability Behavior Intervention
Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. Leveraging sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) to identify SAE feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four loci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that polysemanticity is purely stochastic, demonstrating instead that interference structures generalize across scale and family. Such generalization suggests a convergent, higher-order organization of internal representations, which is only weakly aligned with intuition and structured by latent regularities, offering new possibilities for both black-box control and theoretical insight into human and artificial cognition.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3886. Critique-RL: Training Critiquing Language Models Through Two-Stage RL for Improved Discrimination and Constructive Feedback
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ large language model Critique models LLM reasoning
Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a $9.02\%$ gain on in-domain tasks and a $5.70\%$ gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3887. UNITE: Universal kNowledge Integration from Task-specific Experts
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LLMs Mixture-of-Experts Universal Knowledge Extraction
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve strong performance under sparse activation. However, their expertise is often fragmented across experts and redundant across layers. Prior studies primarily diagnosed redundancy or parameter importance, revealing overlaps but lacking mechanisms to transform them into reusable knowledge. In contrast, human learning succeeds not by memorizing isolated facts but by reusing shared strategies across domains, which motivates the question: do MoE models similarly encode universal knowledge that can be systematically extracted and reused? We propose Universal kNowledge Integration from Task-specific Experts (UNITE), a framework that consolidates experts through Fisher-weighted fusion and then applies Tucker decomposition to disentangle shared low-rank input/output subspaces as universal knowledge from layer-specific variations. This universal component provides a compact basis for reconstructing target models with flexible depth, enabling lightweight yet competitive adaptation across tasks. To assess effectiveness, we evaluate data efficiency, convergence speed, and generalization across multiple MoE-based LLMs and diverse datasets. The results show that UNITE not only extracts universal knowledge, but also flexibly recomposes it into once-for-all target models that generalize across domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3888. Disentangling the Factors of Convergence between Brains and Computer Vision Models
๐ฌ 14
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ NeuroAI; Brain-AI alignment; Representational alignment; Hierarchy alignment; Emergence; Vision transformers; Self-supervised learning; fMRI; MEG; Temporal dynamics; Spatial dynamics; Cortical hierarchy; Development
Many AI models trained on natural images develop representations that resemble those of the human brain. However, the factors that drive this brain-model similarity remain poorly understood. To disentangle how model, training, and data independently lead a neural network to develop brain-like representations, we trained a family of self-supervised vision transformers (DINOv3) that systematically varied these different factors. We compare their representations of images to those of the human brain recorded with both fMRI and MEG, providing high resolution in both spatial and temporal analyses. We assess the brain-model similarity with three complementary metrics focusing on overall representational similarity, topographical organization, and temporal dynamics. We show that all three factors (model size, training amount, and image type) independently and interactively impact each of these brain similarity metrics. In particular, the largest DINOv3 models trained with the most human-centric images reach the highest brain similarity. This emergence of brain-like representations in AI models follows a specific chronology during training: models first align with the early representations of the sensory cortices, and only align with the late and prefrontal representations of the brain with considerably more training. Finally, this developmental trajectory is indexed by both structural and functional properties of the human cortex: the representations that are acquired last by the models specifically align with the cortical areas with the largest developmental expansion, greatest thickness, least myelination, and slowest timescales. Overall, these findings disentangle the interplay between architecture and experience in shaping how artificial neural networks come to see the world as humans do, thus offering a promising framework to understand how the human brain comes to represent its visual world.
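For readers unfamiliar with the representational-similarity family of metrics mentioned above, here is a minimal, self-contained sketch (with random arrays standing in for DINOv3 features and fMRI responses) of how a model-brain similarity score can be computed by correlating representational dissimilarity matrices; the paper's actual pipeline and metrics are more elaborate.

```python
# Minimal representational-similarity sketch (assumed setup, not the paper's pipeline).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    """Condensed RDM: pairwise correlation distance between stimulus representations."""
    return pdist(features, metric="correlation")

rng = np.random.default_rng(0)
n_images = 50
model_feats = rng.normal(size=(n_images, 768))   # e.g., model embeddings per image
brain_resps = rng.normal(size=(n_images, 1000))  # e.g., fMRI voxel responses per image

similarity, _ = spearmanr(rdm(model_feats), rdm(brain_resps))
print(f"model-brain RDM correlation: {similarity:.3f}")
```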
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3889. CodeBrain: Towards Decoupled Interpretability and Multi-Scale Architecture for EEG Foundation Model
๐ฌ 34
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ EEG foundation model Vector Quantization State Space Model
Electroencephalography (EEG) provides real-time insights into brain activity and supports diverse applications in neuroscience. While EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models, current approaches still yield clinically uninterpretable and weakly discriminative representations, inefficiently capture global dependencies, and neglect important local neural events. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific representation-level interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain's small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across 8 downstream tasks and 10 datasets under distribution shifts, supported by comprehensive ablations, scaling-law analyses, and interpretability evaluations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3890. Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving
๐ฌ 32
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reinforcement Learning Evolution Strategies Scientific Discovery
Large language models (LLMs) with reasoning abilities have demonstrated growing promise for tackling complex scientific problems. Yet such tasks are inherently domain-specific, unbounded and open-ended, demanding exploration across vast and flexible solution spaces. Existing approaches, whether purely learning-based or reliant on carefully designed workflows, often suffer from limited exploration efficiency and poor generalization. To overcome these challenges, we present **HELIX**---a **H**ierarchical **E**volutionary reinforcement **L**earning framework with **I**n-context e**X**periences. HELIX introduces two key novelties: (i) a diverse yet high-quality pool of candidate solutions that broadens exploration through in-context learning, and (ii) reinforcement learning for iterative policy refinement that progressively elevates solution quality. This synergy enables the discovery of more advanced solutions. On the circle packing task, HELIX achieves a new state-of-the-art with a sum of radii of 2.635983 using only a 14B model. Across standard machine learning benchmarks, HELIX further surpasses GPT-4o with a carefully engineered pipeline, delivering an average F1 improvement of 5.95 points on the Adult and Bank Marketing datasets and a 40.5\% reduction in RMSE on Boston Housing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3891. What Matters for Bioacoustic Encoding
๐ฌ 24
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ bioacoustics evaluation benchmarks audio sound classification detection clustering
Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and newly proposed benchmarks.
We also identify *what matters* for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find that self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3892. InfGen: Scenario Generation as Next Token Group Prediction
๐ฌ 17
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Autonomous driving Closed-loop simulation Scenario generation
Realistic and interactive traffic simulation is essential for training and evaluating autonomous driving systems. However, most existing data-driven simulation methods rely on static initialization or log-replay data, limiting their ability to model dynamic, long-horizon scenarios with evolving agent populations. We propose InfGen, a scenario generation framework that outputs agent states and trajectories in an autoregressive manner. InfGen represents the entire scene as a sequence of tokens, including traffic light signals, agent states, and motion vectors, and uses a transformer model to simulate traffic over time. This design enables InfGen to continuously insert new agents into traffic, supporting infinite scene generation. Experiments demonstrate that InfGen produces realistic, diverse, and adaptive traffic behaviors. Furthermore, reinforcement learning policies trained in InfGen-generated scenarios achieve superior robustness and generalization, validating its utility as a high-fidelity simulation environment for autonomous driving. Code and models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3893. Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human Motion tracking Inverse problems Wearable sensors
Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs.
The problem is challenging in practical settings where the number of body sensors is limited.
Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both location and rotation measurements $\langle$location, rotation$\rangle$ from the sensors.
Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user.
In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization.
Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations.
Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.
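The sketch below illustrates the general recipe of likelihood-guided diffusion sampling that the abstract builds on: a rotation-conditioned prior proposes a clean pose, and a gradient of the location data-fidelity term corrects the sample. The `denoiser` and `forward_kinematics` functions are toy stand-ins and the update rule is deliberately simplified; none of this is InPose's actual implementation.

```python
# Hedged toy sketch of likelihood-guided diffusion sampling for pose estimation.
import torch

def denoiser(x_t, t, rotations):
    """Toy stand-in for a pre-trained rotation-conditioned diffusion prior."""
    return x_t - 0.1 * t * torch.tanh(x_t) + 0.01 * rotations.sum()

def forward_kinematics(pose):
    """Toy stand-in mapping a pose vector to predicted sensor locations."""
    return pose[:6] * 0.5

def guided_step(x_t, t, rotations, measured_locs, zeta=0.5):
    """One reverse step: prior denoising followed by a likelihood-guidance correction."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t, rotations)                  # prior estimate of the clean pose
    residual = ((forward_kinematics(x0_hat) - measured_locs) ** 2).sum()
    grad = torch.autograd.grad(residual, x_t)[0]          # gradient of the data-fidelity term
    x_prev = x0_hat + 0.05 * torch.randn_like(x0_hat)     # toy stand-in for the DDPM update
    return (x_prev - zeta * grad).detach()

x = torch.randn(24)          # pose latent
rot = torch.randn(3, 3)      # rotation measurements
y = torch.randn(6)           # sparse location measurements
for step_t in torch.linspace(1.0, 0.0, 10):
    x = guided_step(x, step_t, rot, y)
```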
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3894. Decision-Theoretic Approaches for Improved Learning-Augmented Algorithms
๐ฌ 14
โญ 5.00
๐ optimization
๐ท๏ธ Learning-augmented algorithms online algorithms competitive analysis performance evaluation metrics decision theory
We initiate the systematic study of decision-theoretic metrics in the design and analysis of algorithms with machine-learned predictions. We introduce approaches based on both deterministic measures such as distance-based evaluation, that help us quantify how close the algorithm is to an ideal solution, and stochastic measures that balance the trade-off between the algorithm's performance and the risk associated with the imperfect oracle. These approaches allow us to quantify the algorithm's performance across the full spectrum of the prediction error, and thus choose the best algorithm within an entire class of otherwise incomparable ones. We apply our framework to three well-known problems from online decision making, namely ski-rental, one-max search, and contract scheduling.
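As background for one of the problems studied (ski rental), the snippet below sketches a standard learning-augmented rule in which a trust parameter trades consistency against robustness; it illustrates the class of algorithms being compared, not the paper's proposed metrics or analysis.

```python
# Illustrative learning-augmented ski-rental rule (not from the paper): a prediction of
# the season length decides when to buy, and `lam` trades consistency for robustness.
def ski_rental_cost(true_days: int, predicted_days: int, buy_price: int, lam: float) -> int:
    """Total cost of renting until a buy threshold chosen from the prediction."""
    if predicted_days >= buy_price:
        threshold = max(1, int(lam * buy_price))        # trust the "buy" prediction sooner
    else:
        threshold = max(1, int(buy_price / lam))        # rent longer before committing
    if true_days < threshold:
        return true_days                                # rented every day, never bought
    return (threshold - 1) + buy_price                  # rented, then bought on day `threshold`

opt = 10  # optimal offline cost = min(true_days, buy_price)
print(ski_rental_cost(true_days=30, predicted_days=40, buy_price=10, lam=0.5) / opt)  # accurate prediction
print(ski_rental_cost(true_days=30, predicted_days=3,  buy_price=10, lam=0.5) / opt)  # wrong prediction
```

The competitive ratio stays small when the prediction is accurate and degrades gracefully, but remains bounded, when it is wrong, which is exactly the trade-off the distance-based and risk-based metrics above are designed to quantify.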
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3895. FoNE: Precise Single-Token Number Embeddings via Fourier Features
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs Arithmetic Embedding Numbers
Language models treat numbers in the same way as ordinary word tokens, which introduces two major issues: (1) embeddings of numerical tokens primarily reflect their frequency in text corpora rather than their inherent numerical properties, leading to frequency bias, and (2) numbers are often split into multiple tokens, forcing the model to aggregate these pieces to recover their values. Inspired by the observation that pre-trained Large Language Models (LLMs) internally learn Fourier-like features for number tokens, we propose **Fo**urier **N**umber **E**mbedding **(FoNE)**, a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation.
Compared to traditional subword and digit-wise embeddings, FoNE achieves higher accuracy on arithmetic tasks, requires significantly less training data, and offers more efficient training and inference.
A $38$M-parameter Transformer trained from scratch with FoNE outperforms a fine-tuned Llama-3.2-1B model on addition, subtraction, and multiplication. FoNE is also the only method that achieves $100\%$ accuracy on over 100,000 test examples across these tasks. On 6-digit decimal addition, FoNE needs 64$\times$ less data than subword and digit-wise embeddings to reach $\ge 99\%$ accuracy, while using 3$\times$ and 6$\times$ fewer tokens per number, respectively.
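A minimal sketch of the Fourier-feature idea follows, with illustrative periods and digit counts (the paper's exact parameterization may differ): each digit position contributes one (cos, sin) pair, so a whole number occupies a single fixed-size embedding.

```python
# Hedged sketch of a Fourier-style number embedding: two dimensions per digit position.
import numpy as np

def fone_embed(x: float, num_digits: int = 6) -> np.ndarray:
    feats = []
    for k in range(num_digits):                  # periods 10, 100, ..., 10**num_digits
        period = 10.0 ** (k + 1)
        angle = 2.0 * np.pi * x / period
        feats.extend([np.cos(angle), np.sin(angle)])   # one (cos, sin) pair per digit
    return np.array(feats)

# The value is not fragmented across tokens: changing a number slightly shows up most
# strongly in the low-period components of its single embedding vector.
print(np.round(fone_embed(123.0) - fone_embed(124.0), 3))
```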
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3896. PatchDNA: A Flexible and Biologically-Informed Alternative to Tokenization for DNA
๐ฌ 17
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ DNA DNA language model gLM tokenization genomic sequence representation
DNA language models are emerging as powerful tools for representing genomic sequences, with recent progress driven by self-supervised learning. However, performance on downstream tasks is sensitive to tokenization strategies reflecting the complex encodings in DNA, where both regulatory elements and single-nucleotide changes can be functionally significant. Yet existing models are fixed to their initial tokenization strategy; single-nucleotide encodings result in long sequences that challenge transformer architectures, while fixed multi-nucleotide schemes like byte pair encoding struggle with character level modeling. Drawing inspiration from the Byte Latent Transformer's combining of bytes into patches, we propose that 'patching' provides a competitive and more efficient alternative to tokenization for DNA sequences. Furthermore, patching eliminates the need for a fixed vocabulary, which offers unique advantages to DNA. Leveraging this, we propose a biologically informed strategy, using evolutionary conservation scores as a guide for 'patch' boundaries. By prioritizing conserved regions, our approach directs computational resources to the most functionally relevant parts of the DNA sequence. We show that models up to an order of magnitude smaller surpass current state-of-the-art performance in existing DNA benchmarks. Importantly, our approach provides the flexibility to change patching without retraining, overcoming a fundamental limitation of current tokenization methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3897. SliderQuant: Accurate Post-Training Quantization for LLMs
๐ฌ 6
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language models post-training quantization low-bit neural networks model compression
In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs should operate on multiple levels instead of a single level shared across all layers. Motivated by this, we propose a new PTQ framework termed **Sliding**-lay**er** **Quant**ization (SliderQuant) that relies on a simple adaptive sliding quantization concept facilitated by a few learnable parameters. The base component of SliderQuant is called inter-layer sliding quantization, which incorporates three types of novel sliding window designs tailored for addressing the varying quantization sensitivity of shallow, intermediate and deep layers. The other component is called intra-layer sliding quantization, which leverages an incremental strategy to quantize each window. As a result, SliderQuant has a strong ability to reduce quantization errors across layers. Extensive experiments on basic language generation, zero-shot commonsense reasoning and challenging math and code tasks with various LLMs, including Llama/Llama2/Llama3/Qwen2.5 model families, DeepSeek-R1 distilled models and large MoE models, show that our method outperforms existing PTQ methods (including the latest PTQ methods using rotation transformations) for both weight-only quantization and weight-activation quantization under diverse bit-width settings. Code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3898. CoMem: Compositional Concept-Graph Memory for Continual Vision-Language Learning
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ VLM Vision Language Learning Continual Learning
Continual vision-language learning is crucial for multimodal tasks such as image-text retrieval, visual question answering, and grounded reasoning in dynamic environments, yet deployed systems must learn from non-stationary streams under strict privacy and memory budgets, where naïve finetuning forgets and harms transfer. We aim to sustain stable yet plastic capability in this setting without storing raw data, enabling reuse and recombination across domains and tasks. We present CoMem, a framework that treats compositional structure as the unit of memory and rehearsal: it incrementally organizes knowledge into a compact graph of concepts and relations and rehearses directly in feature space by conditioning practice signals on sampled subgraphs. A lightweight compositional consistency objective keeps part-whole predictions coherent, while teacher-informed, uncertainty-aware filtering limits off-manifold drift. Across cross-domain retrieval, structured concept learning, and continual multimodal VQA, CoMem achieves state-of-the-art retention and transfer alongside consistent gains on SVLC and VQACL/CLOVE under matched memory and parameter budgets. By casting structure as memory and rehearsing where learning happens (feature space), CoMem provides a privacy-friendly and testable paradigm for reliable continual adaptation without raw exemplars.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3899. SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization
๐ฌ 20
โญ 5.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ LLM Quantization PTQ LoRA Error Reconstruction
Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce precision in weights and activations by mitigating quantization errors caused by channel-wise outlier activations (e.g., pre-quantization scaling, online transformations, or low-rank error reconstruction). Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during inference and thereby limiting low-precision efficiency. In this work, we propose SERQ, a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix. SERQ preserves efficient 4-bit matrix multiplication in linear layers by jointly mitigating quantization errors arising from both activation and weight saliency through three stages: (1) static activation flattening, (2) saliency-aware error reconstruction, and (3) offline weight permutation. The method incurs additional computation only for low-rank error reconstruction via a single decomposition, while all other operations are performed offline, thereby keeping latency overhead minimal. Empirically, SERQ outperforms prior error reconstruction methods under both W4A8 and W4A4 settings, and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches, while substantially reducing calibration complexity.
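For context, the snippet below sketches the conventional low-rank quantization-error reconstruction that this line of work starts from: quantize a weight matrix, then keep a rank-r factorization of the residual as a high-precision side path. SERQ's saliency-aware, single-matrix design and its activation handling are not reproduced here; this is only the baseline idea.

```python
# Generic sketch of low-rank error reconstruction around a quantized weight (baseline idea).
import numpy as np

def quantize_symmetric(w, bits=4):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale).clip(-2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
W_q = quantize_symmetric(W, bits=4)

# Rank-r SVD of the quantization error becomes the compensation term.
U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
r = 32
L = U[:, :r] * S[:r]              # (512, r) left factor scaled by singular values
R = Vt[:r]                        # (r, 512) right factor

x = rng.normal(size=512)
y_full = W @ x
y_comp = W_q @ x + L @ (R @ x)    # low-bit matmul plus a thin correction path
print(np.linalg.norm(y_full - W_q @ x), np.linalg.norm(y_full - y_comp))
```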
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3900. Beyond the Heatmap: A Rigorous Evaluation of Component Impact in MCTS-Based TSP Solvers
๐ฌ 14
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Travelling Salesman Problem Heatmap Monte Carlo Tree Search Combinatorial optimization
The ``Heatmap + Monte Carlo Tree Search (MCTS)'' paradigm has recently emerged as a prominent framework for solving the Travelling Salesman Problem (TSP). While considerable effort has been devoted to enhancing heatmap sophistication through advanced learning models, this paper rigorously examines whether this emphasis is justified, critically assessing the relative impact of heatmap complexity versus MCTS configuration. Our extensive empirical analysis across diverse TSP scales, distributions, and benchmarks reveals two pivotal insights: \textbf{1}) The configuration of MCTS strategies significantly influences solution quality, underscoring the importance of meticulous tuning to achieve optimal results and enabling valid comparisons among different heatmap methodologies. \textbf{2}) A rudimentary, parameter-free heatmap based on the intrinsic $k$-nearest neighbor structure of TSP instances, when coupled with an optimally tuned MCTS, can match or surpass the performance of more sophisticated, learned heatmaps, demonstrating robust generalizability on problem scale and distribution shift. To facilitate rigorous and fair evaluations in future research, we introduce a streamlined pipeline for standardized MCTS hyperparameter tuning. Collectively, these findings challenge the prevalent assumption that heatmap complexity is the primary determinant of performance, advocating instead for a balanced integration and comprehensive evaluation of both learning and search components within this paradigm.
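One reading of the parameter-free heatmap described above is sketched below: each city places uniform prior mass on its k nearest neighbors and the matrix is symmetrized. The exact normalization used in the paper may differ; this is purely illustrative.

```python
# Sketch of a parameter-free k-nearest-neighbor heatmap for TSP edge priors.
import numpy as np

def knn_heatmap(coords: np.ndarray, k: int = 10) -> np.ndarray:
    n = len(coords)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    heat = np.zeros((n, n))
    neighbors = np.argsort(dists, axis=1)[:, :k]
    for i, nbrs in enumerate(neighbors):
        heat[i, nbrs] = 1.0 / k          # uniform prior over the k nearest neighbors
    return np.maximum(heat, heat.T)      # symmetrize so (i, j) and (j, i) agree

coords = np.random.default_rng(0).random((200, 2))   # a random 200-city instance
H = knn_heatmap(coords, k=10)
print(H.shape, (H > 0).sum(axis=1).mean())           # average candidate edges per city
```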
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3901. RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ MLLMs Fine-grained Visual Reasoning Visual Understanding Reinforcement Learning
Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3902. ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
๐ฌ 14
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ web agents safety trustworthiness benchmark policy compliance enterprise workflows Completion Under Policy CuP Risk Ratio human-in-the-loop policy hierarchy robustness error handling evaluation agentic systems LLM-based agents autonomous browsing
Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce \textbf{\textsc{ST-WebAgentBench}}, a configurable and extensible framework designed as a first step toward enterprise-grade evaluation.
Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the \textit{Completion Under Policy} (\textit{CuP}) metric, which credits only completions that respect all applicable policies, and the \textit{Risk Ratio}, which quantifies ST breaches across dimensions.
Evaluating three open state-of-the-art agents shows their average CuP is less than two-thirds of their nominal completion rate, revealing substantial safety gaps. To support growth and adaptation to new domains, ST-WebAgentBench provides modular code and extensible templates that enable new workflows to be incorporated with minimal effort, offering a practical foundation for advancing trustworthy web agents at scale.
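A minimal sketch of how the two aggregate metrics could be computed from per-task records follows; the field names and record schema are illustrative assumptions rather than the benchmark's actual format.

```python
# Illustrative computation of Completion Under Policy (CuP) and a per-dimension Risk Ratio.
def completion_under_policy(records):
    """CuP: fraction of tasks completed with zero violations of any attached policy."""
    ok = sum(1 for r in records if r["completed"] and not r["violations"])
    return ok / len(records)

def risk_ratio(records, dimension):
    """Violations of one ST dimension per attempted task."""
    hits = sum(1 for r in records if dimension in r["violations"])
    return hits / len(records)

records = [
    {"completed": True,  "violations": []},
    {"completed": True,  "violations": ["user_consent"]},
    {"completed": False, "violations": []},
]
print(completion_under_policy(records))       # 1/3: only the first run counts toward CuP
print(risk_ratio(records, "user_consent"))    # 1/3
```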
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3903. PMDformer: Patch-Mean Decoupling Transformer for Long-term Forecasting
๐ฌ 18
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ Long-term Time Series Forecasting; Non-stationary; Patch-Mean Decoupled
Long-term time series forecasting (LTSF) plays a crucial role in fields such as energy management, finance, and traffic prediction. Transformer-based models have adopted patch-based strategies to capture long-range dependencies, but accurately modeling shape similarities across patches and variables remains challenging due to scale differences.
To address this, we introduce patch-mean decoupling (PMD), which separates the trend and residual shape information by subtracting the mean of each patch, preserving the original structure and ensuring that the attention mechanism captures true shape similarities.
Furthermore, to more effectively model long-range dependencies and capture cross-variable relationships, we propose Trend Restoration Attention (TRA) and Proximal Variable Attention (PVA). The former reintegrates the decoupled trend from PMD when computing the attention output, while the latter focuses cross-variable attention on the most relevant, recent time segments to avoid overfitting on outdated correlations. Combining these components, we propose PMDformer, a model designed to effectively capture shape similarity in long-term forecasting scenarios. Extensive experiments indicate that PMDformer outperforms existing state-of-the-art methods in stability and accuracy across multiple LTSF benchmarks.
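The core decoupling step is simple enough to sketch directly: split the series into patches and subtract each patch's mean, leaving a scale-free shape component and a per-patch trend level. The patch length and array shapes below are illustrative.

```python
# Minimal numpy sketch of patch-mean decoupling for a univariate series.
import numpy as np

def patch_mean_decouple(series: np.ndarray, patch_len: int):
    n_patches = len(series) // patch_len
    patches = series[: n_patches * patch_len].reshape(n_patches, patch_len)
    means = patches.mean(axis=1, keepdims=True)   # per-patch trend level
    residual = patches - means                    # scale-free shape used for attention
    return residual, means.squeeze(-1)

series = np.cumsum(np.random.default_rng(0).normal(size=96)) + 50.0
shape, trend = patch_mean_decouple(series, patch_len=16)
print(shape.shape, trend.shape, np.allclose(shape.mean(axis=1), 0.0))
```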
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3904. HiDivDrop: Vision Token Reduction in MLLMs via Late Injection and Differentiable Top-K
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ MLLMs Vision Token Pruning Efficiency and Compression Interpretability and Analysis
The computational cost of Multimodal Large Language Models (MLLMs), driven by the quadratic complexity of processing vision tokens, remains a significant barrier to their widespread adoption. While progressive vision token pruning is a promising solution, we find that its full potential remains unrealized due to two key limitations: it misinterprets the role of shallow layers as being crucial for fusion and employs overly rigid, non-adaptive pruning schedules. To address these flaws, we introduce HiDivDrop, a framework that tailors token pruning to the true hierarchical function of MLLM layers. HiDivDrop incorporates two key innovations: (1) a Late Injection strategy that bypasses passive shallow layers, introducing visual tokens directly where active fusion begins; and (2) a Concave Pyramid Pruning scheme with an Early Exit mechanism that dynamically adjusts the pruning rate throughout the middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-$k$ operator. Extensive experiments show that HiDivDrop compresses $\sim$90\% of visual tokens while matching the original performance and accelerating training by 1.72$\times$. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3905. Decoupling Primitive with Experts: Dynamic Feature Alignment for Compositional Zero-Shot Learning
๐ฌ 18
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Compositional zero-shot learning; Multi-modal learning
Compositional Zero-Shot Learning (CZSL) investigates compositional generalization capacity to recognize unknown state-object pairs based on learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets.
Moreover, the one-to-all cross-modal primitive matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts framework for semantic variant alignment. Specifically, we introduce domain-expert adaptation, leveraging multiple experts to achieve token-aware learning and model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment to select semantically relevant representations for image-primitive matching.
Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed insight.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3906. Representation Alignment for Diffusion Transformers without External Components
๐ฌ 19
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Transformers Self-Representation Alignment
Recent studies have demonstrated that learning a meaningful internal representation can accelerate generative training. However, existing approaches need to either introduce an off-the-shelf external representation task or rely on a large-scale, pre-trained external representation encoder to provide representation guidance during the training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We propose Self-Representation Alignment (SRA), a simple yet effective method that obtains representation guidance using the internal representations of the diffusion transformer being learned. SRA aligns the latent representation of the diffusion transformer in an earlier layer, conditioned on higher noise, to that in a later layer, conditioned on lower noise, to progressively enhance the overall representation learning during the training process only. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements and largely outperforms approaches relying on an auxiliary representation task. Our approach achieves performance comparable to methods that depend on an external pre-trained representation encoder, demonstrating the feasibility of acceleration with representation alignment in diffusion transformers themselves.
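A hedged sketch of what such an alignment term could look like in code is given below: the early-layer, high-noise representation is regressed onto a stop-gradient copy of the later-layer, low-noise representation. The tensor shapes, optional projection head, and choice of MSE loss are assumptions, not the authors' implementation.

```python
# Hedged sketch of a self-representation alignment loss between two transformer blocks.
import torch
import torch.nn.functional as F

def sra_loss(h_early: torch.Tensor, h_late: torch.Tensor, proj=None) -> torch.Tensor:
    """Align the early-layer / high-noise representation to the later-layer / low-noise one."""
    target = h_late.detach()                 # stop-gradient: the later layer acts as the teacher
    if proj is not None:
        h_early = proj(h_early)              # lightweight head keeps the backbone unchanged
    return F.mse_loss(h_early, target)

# Toy usage with random hidden states standing in for two transformer blocks.
h_early = torch.randn(8, 256, 768, requires_grad=True)   # (batch, tokens, dim), noisier input
h_late = torch.randn(8, 256, 768)                         # cleaner input, deeper block
loss = sra_loss(h_early, h_late)
loss.backward()
```

In training, this term would be added to the usual denoising objective with a weighting coefficient.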
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3907. BRIDGE: Bi-level Reinforcement Learning for Dynamic Group Structure in Coalition Formation Games
๐ฌ 19
โญ 5.00
๐ reinforcement learning
๐ท๏ธ coalition formation games Bi-level reinforcement learning multi-agent reinforcement learning
The challenge of coalition formation games lies in efficiently navigating the exponentially large space of possible coalitions to identify the optimal partition. While existing approaches to solve coalition formation games either provide optimal solutions with limited scalability or approximate solutions without quality guarantees, we propose a novel scalable and sample-efficient approximation method based on deep reinforcement learning. Specifically, we model the coalition formation problem as a finite Markov decision process and use a deep neural network to approximate optimal coalition structures within the full and abstracted coalition space. Moreover, our method is applicable to bi-level optimization problems in which coalition values are determined by the policies of individual agents at a lower decision-making level. This way, our approach facilitates dynamic, adaptive adjustments to coalition value assessments as they evolve over time. Empirical results demonstrate our algorithm's effectiveness in approximating optimal coalition structures in both normal-form and sequential mixed-motive games.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3908. There Was Never a Bottleneck in Concept Bottleneck Models
๐ฌ 20
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ concept bottleneck models information bottleneck representation learning variational inference
Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations.
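The variational regularization term described above is, in spirit, a per-component KL penalty toward a simple prior. The sketch below shows one common instantiation (a Gaussian posterior against a unit-Gaussian prior); the paper's exact objective may differ.

```python
# Hedged sketch of an information-bottleneck style KL regularizer per concept component.
import torch

def ib_regularizer(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Mean KL( N(mu, sigma^2) || N(0, 1) ) over concept components."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return kl.mean()

# Toy usage: per-concept means/log-variances from an encoder head (batch of 32, 10 concepts).
mu = torch.randn(32, 10, requires_grad=True)
logvar = torch.zeros(32, 10, requires_grad=True)
# total_loss = task_loss + concept_loss + beta * ib_regularizer(mu, logvar)
print(ib_regularizer(mu, logvar).item())
```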
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3909. From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning
๐ฌ 22
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Agent-centric benchmark Language model assessment Textual anomaly detection Adaptive benchmarks
The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in difficulty automatically as more capable agents are substituted into any role, enabling progressive evaluation of large language models without manually curated datasets. Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors that conventional benchmarks fail to reveal. We further advocate evaluating systems along several complementary axes including cross-model pairwise performance and progress between the initial and orchestrator-finalized problems. By shifting the focus from fixed datasets to dynamic protocols, our approach offers a sustainable direction for evaluating ever-evolving language models and introduces a research agenda centered on the co-evolution of agent-centric benchmarks.
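The control flow of the proposed protocol can be sketched as a short loop; the `teacher`, `orchestrator`, and `student` objects below are hypothetical stand-ins for LLM-backed agents, and only the orchestration logic is shown.

```python
# Skeleton of the teacher / orchestrator / student loop (agent objects are hypothetical).
def run_protocol(teacher, orchestrator, student, rounds=5, max_revisions=3):
    history = []
    problem = teacher.generate()
    for _ in range(rounds):
        # The orchestrator validates the problem, sending it back for revision if needed.
        revisions = 0
        while not orchestrator.is_valid(problem) and revisions < max_revisions:
            problem = teacher.revise(problem, orchestrator.feedback(problem))
            revisions += 1
        answer = student.solve(problem)
        solved = orchestrator.is_correct(problem, answer)
        history.append({"problem": problem, "answer": answer, "solved": solved})
        # Difficulty scales automatically: solved problems trigger harder variants.
        problem = teacher.harder_variant(problem) if solved else teacher.generate()
    return history
```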
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3910. MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation
๐ฌ 23
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ molecule-language multimodal benchmark molecular structure recognition language-prompted molecule editing and generation
Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (GPT-5) achieves $86.2$\% and $85.5$\% accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only $43.0$\% accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3911. HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
๐ฌ 21
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Pruning; MoE
Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of $20\% \sim 25\%$ in most models, while also reducing FLOPs by nearly $20\%$. The code can be found at \href{https://anonymous.4open.science/r/anonymous-code-B927/}{anonymous-code-B927}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3912. VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
๐ฌ 12
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Neuro-symbolic methods Large Language Models Chain-of-Thought Reasoning verification Formal logic
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT's verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3913. ReactID: Synchronizing Realistic Actions and Identity in Personalized Video Generation
๐ฌ 12
โญ 5.00
๐ generative models
๐ท๏ธ Video Generation; Identity Preserving; Diffusion Models
Personalized video generation faces a fundamental trade-off between identity consistency and action realism: overly rigid identity preservation often leads to unnatural motion, while emphasis on action dynamics can compromise subject fidelity. This tension stems from three interrelated challenges: imprecise subject-video alignment, unstable training due to varying sample difficulties, and inadequate modeling of fine-grained actions. To address this, we propose ReactID, a comprehensive framework that harmonizes identity accuracy and motion naturalness through coordinated advances in data, training, and action modeling. First, we construct ReactID-Data, a large-scale dataset annotated with a high-precision pipeline combining vision-based entity label extraction, MLLM-based subject detection, and post-verification to ensure reliable subject-video correspondence. Second, we analyze learning difficulty along dimensions such as subject size, appearance similarity, and sampling strategy, and devise a progressive training curriculum that evolves from easy to hard samples, ensuring stable convergence while avoiding identity overfitting and copy-paste artifacts. Third, ReactID introduces a novel timeline-based conditioning mechanism that supplements monolithic text prompts with structured multi-action sequences. Each sub-action is annotated with precise timestamps and descriptions, and integrated into the diffusion model via two novel components: subject-aware cross-attention module to bind sub-action to the specific subject of interest and temporally-adaptive RoPE to embed the rescaled temporal coordinates invariant to action duration. Experiments show that ReactID achieves state-of-the-art performance in both identity preservation and action realism, effectively balancing the two objectives.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3914. Distractor-free Generalizable 3D Gaussian Splatting
๐ฌ 25
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Distractor-free Generalizable 3D Gaussian Splatting training stability
We present DGGS, a novel framework that addresses the previously unexplored challenge: \textbf{Distractor-free Generalizable 3D Gaussian Splatting} (3DGS). Previous generalizable 3DGS works are often limited to static scenes, struggling to mitigate distractor impacts in training and inference phases, which leads to training instability and inference artifacts. To address this new challenge, we propose a distractor-free generalizable training paradigm and corresponding inference framework, which can be directly integrated into existing Generalizable 3DGS frameworks. Specifically, in our training paradigm, DGGS proposes a feed-forward mask prediction and refinement module based on the 3D consistency of references and semantic priors, effectively eliminating the impact of distractors on the training loss. Based on these masks, we combat distractor-induced artifacts and holes at inference time through a novel two-stage inference framework for reference scoring and re-selection, complemented by a distractor pruning mechanism that further removes residual distractor 3DGS-primitive influences. Extensive feed-forward experiments on real data and our synthetic data show DGGS's reconstruction capability when dealing with novel distractor scenes. Moreover, our feed-forward mask prediction even achieves an accuracy superior to scene-specific distractor-free methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3915. RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM RL Agent
The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps, such as planning, exploration, and reflection, and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are combined with the final outcome signal and optimized using a critic-free policy gradient method. On the challenging ALFWorld and ScienceWorld benchmarks, RLVMR achieves new state-of-the-art results, with our 7B model reaching an 83.6% success rate on the most difficult unseen task split. Our analysis confirms these gains stem from improved reasoning quality, including significant reductions in redundant actions and enhanced error recovery, leading to more robust, efficient, and interpretable agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3916. Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
๐ฌ 18
โญ 5.00
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ On-device VLM Efficient Inference Software-Hardware Co-Design Quantization NPU GPU
Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware-software co-design inference framework for Large Multimodal Models (LMMs) that breaks large models into modular "bricks" (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on-device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%. This enables a battery-powered device to run LLaVA-OneVision-Qwen2-0.5B with a camera for nearly 20.8 hours.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3917. Distribution-Aware Multi-Granularity Phase Coding: Towards Lower Conversion Error for Spike-Driven Large Language Models
๐ฌ 29
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Spiking Neural Network Optimization
Spiking large language models (LLMs) offer significant advantages on neuromorphic hardware, yet training them from scratch remains prohibitively expensive. A promising alternative is ANN-to-SNN conversion, which reuses pretrained ANN weights while minimizing conversion error.
However, existing conversion frameworks neglect activation distributions: SNN neurons with rate or temporal coding map activations to uniformly spaced rather than distribution-aligned discrete values, which introduces latent conversion error arising from this distribution misalignment.
To tackle this problem, we propose a distribution-aware multi-granularity phase coding approach, which achieves reasonable discrete value allocation by minimizing conversion error relative to activation distributions.
Specifically, multi-granularity phase coding extends conventional phase coding with multiple learnable bases, incorporating representational capacity across different granularities.
Building on this coding scheme, we further propose a novel ANN-to-SNN conversion paradigm designed towards lower conversion error.
In particular, our paradigm utilizes the activation distributions of hidden layers to sample data for cost-efficient neuron training, without requiring fine-tuning of model weights.
Theoretically, we provide a convergence guarantee for the neuron training algorithm.
Extensive experiments on the LLaMA model confirm the effectiveness of both our coding scheme and conversion paradigm.
Concretely, our spiking LLM attains the lowest perplexity with ANN-level accuracy, accompanied by a 42\% reduction in energy consumption of MAC and AC operations.
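For intuition, the contrast between conventional and multi-granularity phase coding can be written compactly. The notation below ($T$ phases, binary spikes $s_k$, learnable bases $\beta_k$, activation density $p(a)$) is illustrative and not the authors' exact formulation; it is only a minimal sketch of the distribution-aware objective the abstract describes.

```latex
% Illustrative notation only; not the authors' exact formulation.
% Conventional phase coding with fixed power-of-two bases:
a \;\approx\; \sum_{k=1}^{T} 2^{-k} s_k , \qquad s_k \in \{0,1\}.
% Multi-granularity phase coding learns the bases and weights the conversion
% error by the activation distribution:
\min_{\beta_1,\dots,\beta_T} \; \mathbb{E}_{a \sim p(a)}\!\left[ \Big( a - \sum_{k=1}^{T} \beta_k\, s_k(a) \Big)^{2} \right].
```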
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3918. Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning
๐ฌ 21
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-Agent Adaptive Collaboration Metacognitive
While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain ``closed-world'' systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Learning to Intervene via Metacognitive Adaptation (LIMA) framework, a principled paradigm for human--agent collaboration. LIMA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization (DLPO), which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that LIMA, equipped with Dual-Loop Optimization, consistently outperforms state-of-the-art MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3919. Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Language Models Reinforcement Learning
The extension of diffusion models to language tasks has shown promising results, but their post-training methods remain largely unexplored. We highlight the importance of aligning a diffusion language model's preference-inference trajectory with its post-training objective. To this end, we propose TraceRL, a trajectory-aware reinforcement learning framework for DLMs that incorporates information from inference trajectories into post-training and is applicable to both full-attention and block-attention diffusion models. We also introduce a diffusion-based value model that enhances training stability and naturally accommodates process rewards. We demonstrate TraceRL's superiority in enhancing a model's reasoning ability on complex math and coding tasks, as well as its applicability in scaling block diffusion models to larger block sizes. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than Qwen2.5-7B-Instruct, TraDo-4B-Instruct consistently outperforms it on complex math reasoning tasks. TraDo-8B-Instruct achieves 4.5% higher accuracy on MATH500 than Qwen2.5-7B-Instruct and 6.6% higher accuracy on LiveCodeBench-V2 than Llama3.1-8B-Instruct. Through curriculum learning, we also develop the first 8B-scale long-CoT diffusion language model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3920. A Federated Generalized Expectation-Maximization Algorithm for Mixture Models with an Unknown Number of Components
๐ฌ 14
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ clustering gaussian mixture model federated learning
We study the problem of federated clustering when the total number of clusters $K$ across clients is unknown, and the clients have heterogeneous but potentially overlapping cluster sets in their local data. To that end, we develop FedGEM: a federated generalized expectation-maximization algorithm for the training of mixture models with an unknown number of components. Our proposed algorithm relies on each of the clients performing EM steps locally, and constructing an uncertainty set around the maximizer associated with each local component. The central server utilizes the uncertainty sets to learn potential cluster overlaps between clients, and infer the global number of clusters via closed-form computations. We perform a thorough theoretical study of our algorithm, presenting probabilistic convergence guarantees under common assumptions. Subsequently, we study the specific setting of isotropic GMMs, providing tractable, low-complexity computations to be performed by each client during each iteration of the algorithm, as well as rigorously verifying assumptions required for algorithm convergence. We perform various numerical experiments, where we empirically demonstrate that our proposed method achieves comparable performance to centralized EM, and that it outperforms various existing federated clustering methods.
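A minimal sketch of the client/server pattern described above, under strong simplifying assumptions: 1-D isotropic components, normal-approximation uncertainty intervals around component means, and a greedy interval-merging server. The function names and the interval radius formula are illustrative stand-ins, not FedGEM's actual procedure.

```python
# Hedged sketch: clients run local EM and report uncertainty intervals around
# component means; the server merges overlapping intervals to estimate global K.
import numpy as np
from sklearn.mixture import GaussianMixture

def local_em_step(x, k_local, z=2.0):
    """Client: fit a local GMM and wrap each component mean in an interval."""
    gmm = GaussianMixture(n_components=k_local, covariance_type="spherical").fit(x)
    resp = gmm.predict_proba(x).sum(axis=0)          # effective points per component
    radius = z * np.sqrt(gmm.covariances_ / resp)    # ~ z * standard error of the mean
    means = gmm.means_.ravel()
    return [(m - r, m + r) for m, r in zip(means, radius)]

def merge_components(intervals):
    """Server: greedily merge overlapping uncertainty sets to estimate global K."""
    intervals = sorted(intervals)                    # sort by lower endpoint
    merged = [intervals[0]]
    for lo, hi in intervals[1:]:
        if lo <= merged[-1][1]:                      # overlap -> same global cluster
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return len(merged)

rng = np.random.default_rng(0)
client_a = rng.normal([-4.0, 0.0], 0.5, size=(200, 2)).reshape(-1, 1)  # clusters near -4 and 0
client_b = rng.normal([0.0, 5.0], 0.5, size=(200, 2)).reshape(-1, 1)   # clusters near 0 and 5
intervals = local_em_step(client_a, 2) + local_em_step(client_b, 2)
print("estimated global K:", merge_components(intervals))              # typically 3
```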
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3921. Flow Along the $K$-Amplitude for Generative Modeling
๐ฌ 5
โญ 5.00
๐ generative models
๐ท๏ธ generative models frequency transformation image generation ai for science molecule assembly
In this work, we propose K-Flow, a novel generative learning paradigm that flows along the $K$-amplitude domain, where $K$ is a scaling parameter that organizes projected coefficients (frequency bands), and amplitude refers to the norm of such coefficients. We instantiate K-Flow with three concrete $K$-amplitude transformations: Fourier transformation, Wavelet transformation, and PCA. By incorporating the $K$-amplitude transformations, K-Flow enables flow matching across the scaling parameter as time. We discuss six properties of K-Flow, covering its theoretical foundations, energy and temporal dynamics, and practical applications. Specifically, from the perspective of practical usage, K-Flow allows for steerable generation by controlling the information at different scales. To demonstrate the effectiveness of K-Flow, we conduct experiments on both unconditional and conditional image generation tasks, showing that K-Flow achieves competitive performance. Furthermore, we perform three ablation studies to illustrate how K-Flow leverages the scaling parameter for controlled image generation. Additional results, including scientific applications, are also provided.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3922. The Overthinking Predicament: When Reasoning Hurts Ranking
๐ฌ 21
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ information retrieval reranking reasoning
Document reranking is a key component in information retrieval (IR), aimed at refining initial retrieval results to improve ranking quality for downstream tasks. Recent studies, motivated by large reasoning models (LRMs), have begun incorporating explicit chain-of-thought (CoT) reasoning into LLM-based rerankers. However, the effectiveness of such reasoning for ranking tasks remains underexplored. In this work, we present the first systematic study of reasoning in reranking across both logits-based pointwise and listwise settings, under both supervised fine-tuning and reinforcement learning. Using diverse benchmarks, including reasoning-intensive datasets (BRIGHT) and standard IR benchmarks (BEIR), we find that reasoning-augmented rerankers consistently underperform their direct counterparts that predict rankings without CoT, despite substantially higher inference costs. Our analysis reveals three core limitations: (i) in pointwise rerankers, reasoning breaks calibration and biases models toward the positive class, raising TPR but lowering TNR, which inflates false positives and degrades ranking in negative-dominant pools; (ii) in listwise rerankers, explicit reasoning improves the fit during training but leads to higher variance and fails to improve performance in both in-domain and out-of-domain evaluations, even when reinforcement learning shortens rationales; and (iii) overall, directly fine-tuned rerankers remain more stable, effective, and robust. These findings challenge the assumption that explicit reasoning is universally beneficial for reranking. We conclude by highlighting future directions, including calibration-aware scoring for pointwise rerankers and the design of concise, targeted reasoning strategies to mitigate overfitting and overthinking in listwise rerankers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3923. Learning residue level protein dynamics with multiscale Gaussians
๐ฌ 17
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ protein dynamics flexibility ensembles
Many methods have been developed to predict static protein structures; however, understanding the \textit{dynamics} of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the \textit{in silico} gold standard, their high computational cost limits scalability. We present \textsc{DynaProt}, a lightweight, SE(3)-invariant framework that predicts rich descriptors of protein dynamics directly from static structures. By casting the problem through the lens of multivariate Gaussians, \textsc{DynaProt} estimates dynamics at two complementary scales: (1) per-residue marginal anisotropy as
covariance matrices capturing local flexibility, and (2) joint scalar covariances encoding pairwise dynamic coupling across residues. From these dynamics outputs, \textsc{DynaProt} achieves high accuracy in predicting residue-level flexibility (RMSF) and, remarkably, enables reasonable reconstruction of the full covariance matrix for fast ensemble generation. Notably, it does so using orders of magnitude fewer parameters than prior methods. Our results highlight the potential of direct protein dynamics prediction as a scalable alternative to existing methods.
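As a reminder of why per-residue covariances are a convenient prediction target, the standard relation (not specific to this paper) between a residue's 3x3 positional covariance $\Sigma_i$ and its root-mean-square fluctuation is:

```latex
% Standard relation between a residue's positional covariance and its RMSF:
\mathrm{RMSF}_i \;=\; \sqrt{\mathbb{E}\big[\lVert \mathbf{x}_i - \bar{\mathbf{x}}_i \rVert^{2}\big]}
\;=\; \sqrt{\operatorname{tr}(\Sigma_i)},
% so predicting per-residue covariances yields RMSF directly, while the pairwise
% covariances encode dynamic coupling between residues.
```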
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3924. OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text
๐ฌ 18
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Composed Video Retrieval; Multimodal Benchmark; Audio-Visual Queries
Composed video retrieval presents a complex challenge: retrieving a target video based on a source video and a textual modification instruction. This task demands fine-grained reasoning over multimodal transformations. However, existing benchmarks predominantly focus on vision-text alignment, largely overlooking the rich semantic signals embedded in audio, such as speech, music, and environmental sounds, which are often decisive for comprehensive video understanding. To bridge this gap, we introduce **OmniCVR**, a large-scale benchmark for omni-composed video retrieval that establishes vision, audio, and text as first-class modalities. OmniCVR is constructed via a scalable, automated pipeline integrating content-aware segmentation, omni-modal annotation, and a rigorous dual-validation protocol involving both large language models and human experts. The benchmark comprises vision-centric, audio-centric, and integrated queries, with the latter forming the majority to accurately reflect real-world multimodal complexity. Furthermore, we propose **AudioVLM2Vec**, an audio-aware extension of VLM2Vec. By incorporating explicit audio semantics, AudioVLM2Vec achieves state-of-the-art performance, highlighting fundamental limitations in the audio reasoning capabilities of current multimodal retrieval systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3925. EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Egocentric Vision Diffusion Models
Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as the necessity of an initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce *EgoWorld*, a novel two-stage framework that reconstructs an egocentric view from rich exocentric observations, including projected point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. Evaluated on 4 datasets (i.e., H2O, TACO, Assembly101, and Ego-Exo4D), *EgoWorld* achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. Moreover, *EgoWorld* exhibits robustness on in-the-wild examples, underscoring its practical applicability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3926. COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
๐ฌ 15
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM alignment general preference Nash learning from human feedback last-iterate convergence
Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is not always sufficient to capture the full range and complexity of general human preferences. We explore RLHF under a general preference framework by modeling the alignment problem as a two-player zero-sum game in a game-theoretic framework, where the Nash equilibrium policy guarantees a 50\% win rate against any competing policy. However, previous self-play algorithms for finding the Nash policy either diverge or only converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50\% win rate guarantee against all other policies. We propose a meta-algorithm, **Co**nvergent **M**eta **Al**ignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. We provide theoretical analysis that our meta-algorithm converges to an exact Nash policy in the last iterate and demonstrate its effectiveness on a range of synthetic and preference optimization datasets. COMAL is simple and can be integrated with many existing methods designed for preference optimization with minimal changes, and empirically it consistently maintains above 60.2\% and 56.8\% win rates, when applied to Llama-3-8B-Instruct and Qwen2.5-7B, against all compared algorithms under controlled evaluations.
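To illustrate the last-iterate property COMAL targets (without reproducing its meta-algorithm), the toy sketch below runs optimistic multiplicative weights on a small zero-sum matrix game and reports the exploitability of the final strategies. The payoff matrix, step size, and step count are arbitrary illustrative choices.

```python
# Toy illustration of last-iterate convergence in a two-player zero-sum game.
# This is a generic convergent algorithm (optimistic multiplicative weights),
# NOT the paper's COMAL procedure; it only demonstrates the target property.
import numpy as np

A = np.array([[0.0, -1.0, 2.0],   # row player's payoffs (a biased rock-paper-scissors)
              [1.0, 0.0, -1.0],
              [-2.0, 1.0, 0.0]])

def omwu(A, steps=5000, eta=0.05):
    x = np.ones(A.shape[0]) / A.shape[0]   # row player's mixed strategy
    y = np.ones(A.shape[1]) / A.shape[1]   # column player's mixed strategy
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(steps):
        gx, gy = A @ y, A.T @ x            # current payoff gradients
        # optimistic step: extrapolate with the previous gradient
        x = x * np.exp(eta * (2 * gx - gx_prev)); x /= x.sum()
        y = y * np.exp(-eta * (2 * gy - gy_prev)); y /= y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y

x, y = omwu(A)
gap = (A @ y).max() - (A.T @ x).min()      # exploitability; 0 at an exact Nash equilibrium
print("last iterate:", np.round(x, 3), np.round(y, 3), "gap:", round(float(gap), 4))
```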
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3927. Context Tokens are Anchors: Understanding the Repetition Curse in Diffusion MLLMs from an Information Flow Perspective
๐ฌ 23
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Multimodal Large Language Models; Information flow
Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the Repeat Curse. To better investigate the underlying mechanism behind this issue, we analyze repetition generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model's growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present CoTA, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code will be made available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3928. Learning Collective Variables from BioEmu with Time-Lagged Generation
๐ฌ 15
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ collective variables molecular dynamics protein enhanced samplings
Molecular dynamics is crucial for understanding molecular systems but its applicability is often limited by the vast timescales of rare events like protein folding. Enhanced sampling techniques overcome this by accelerating the simulation along key reaction pathways, which are defined by collective variables (CVs). However, identifying effective CVs that capture the slow, macroscopic dynamics of a system remains a major bottleneck. This work proposes a novel framework coined BioEmu-CV that learns these essential CVs automatically from BioEmu, a recently proposed foundation model for generating protein equilibrium samples. In particular, we re-purpose BioEmu to learn time-lagged generation conditioned on the learned CV, i.e., predict the distribution of molecular states after a certain amount of time. This training process promotes the CV to encode only the slow, long-term information while disregarding fast, random fluctuations. We validate our learned CV on fast-folding proteins with two key applications: (1) estimating free energy differences using on-the-fly probability enhanced sampling and (2) sampling transition paths with steered molecular dynamics. Our empirical study also serves as a new systematic and comprehensive benchmark for MLCVs on fast-folding proteins larger than Alanine Dipeptide.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3929. OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
๐ฌ 4
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Omni-modal models Multimodal LLMs Large Language Models
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings.
We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, improves over Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6× reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3930. VisualPRM400K: An Effective Dataset for Training Multimodal Process Reward Models
๐ฌ 9
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Multimodal Process Reward Model
We construct VisualPRM400K, a dataset comprising about 400K multimodal process supervision data. Building upon this dataset, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) capable of estimating the value score of each step during the reasoning process.
Under the Best-of-N evaluation setting, our model improves the reasoning performance of three types of MLLMs and four different model scales.
Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.
Experimental results show that the PRM model trained on our VisualPRM400K exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation.
To further facilitate the development of multimodal PRMs, we construct VisualProcessBench, a benchmark designed to measure the abilities of PRMs and MLLMs to detect incorrect steps in multimodal reasoning tasks.
We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark will be released.
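A minimal sketch of how step-level PRM scores can drive Best-of-N selection: each candidate reasoning chain receives one score per step, and the chain with the best aggregated score is kept. The aggregation rule (worst step vs. mean) and the toy scorer below are illustrative assumptions, not VisualPRM's exact recipe.

```python
# Hedged sketch of Best-of-N selection with step-level process reward scores.
from typing import Callable, List

def best_of_n(candidates: List[List[str]],
              prm_score: Callable[[List[str], int], float],
              aggregate: str = "min") -> int:
    """Return the index of the best candidate under step-level PRM scores."""
    agg = min if aggregate == "min" else (lambda s: sum(s) / len(s))
    best_idx, best_val = -1, float("-inf")
    for i, steps in enumerate(candidates):
        step_scores = [prm_score(steps, t) for t in range(len(steps))]
        value = agg(step_scores)
        if value > best_val:
            best_idx, best_val = i, value
    return best_idx

# toy PRM: pretend later steps of the second candidate are increasingly risky
toy_candidates = [["step A1", "step A2"], ["step B1", "step B2", "step B3"]]
toy_prm = lambda steps, t: 0.9 if steps[0].startswith("step A") else 0.9 - 0.2 * t
print(best_of_n(toy_candidates, toy_prm))   # -> 0
```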
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3931. Understanding VLMs Spatial Mental Modeling Capability from Limited Views
๐ฌ 26
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Language Models VLMs Multi Modal Language Models Spatial Intelligence Spatial Reasoning
Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3932. Transferable and Stealthy Adversarial Attacks on Large Vision-Language Models
๐ฌ 15
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial Attacks Robustness
Existing adversarial attacks on large Vision-Language Models (VLMs) often struggle with limited transferability to black-box models or produce perceptible artifacts that are easily detected. This paper presents Progressive Semantic Infusion (PSI), a diffusion-based attack that progressively aligns and infuses natural target semantics. To improve transferability, PSI leverages diffusion priors to better align adversarial examples with the natural image distribution and employs progressive alignment to mitigate overfitting on a single fixed surrogate objective. To enhance stealthiness, PSI embeds source-aware cues during denoising to preserve visual fidelity and avoid detectable artifacts. Experiments show that PSI effectively attacks open-source, adversarially trained, and commercial VLMs, including GPT-5 and Grok-4, surpassing existing methods in both transferability and stealthiness. Our findings highlight a critical vulnerability in modern vision-language systems and offer valuable insights towards building more robust and trustworthy multimodal models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3933. Flatness Guided Test-Time Adaptation for Vision-Language Models
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language Models Generalization Loss landscape
Test-time adaptation (TTA) of Vision-Language Models (VLMs) has emerged as a technique for tackling distribution shifts during the test time. Recent research indicates that the test-time adaptation is intrinsically linked to the model's training history. However, existing TTA methods, such as Test-time Prompt Tuning, often design adaptation strategies in isolation from the models' training characteristics, which degrade their performance. This paper argues that the flatness acquired via sharpness-aware training is an efficient clue for the test-time adaptation of VLMs. Built on this insight, this paper proposes a novel Flatness-Guided Adaptation framework (FGA) for VLMs to cohesively unify training and test-time procedures. Its core idea is to leverage the alignment between the training minimum and test loss flat regions to guide the adaptation process. Specifically, our FGA consists of a prompt-tuning stage and a test-time adaptation stage. In the tuning stage, a Sharpness-Aware Prompt Tuning method is utilized to identify the training flat minimum, offering a geometric clue of flatness for subsequent adaptation. In the test stage, a Sharpness-based Test Sample Selection approach is proposed to ensure the alignment of flat minima between the training and each augmented test sample's loss landscape. In comparison to existing TTA methods, our FGA avoids the expensive prompt parameter updates during test time, and substantially reduces the computation overhead. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that our FGA achieves superior performance over prevalent TTA methods. Notably, FGA even surpasses SOTA performance by 4.55\% on ImageNet-A, when using a ViT-B/16 image encoder. Our code will be available soon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3934. Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
๐ฌ 20
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Models Distillation Consistency Models Few-Step Generation
This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3935. Towards Sustainable Investment Policies Informed by Opponent Shaping
๐ฌ 9
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Climate Change Social Dilemmas Multi Agent Reinforcement Learning Opponent Shaping Policy Making
Addressing climate change requires global coordination, yet rational economic actors often prioritize immediate gains over collective welfare, resulting in social dilemmas. InvestESG is a recently proposed multi-agent simulation that captures the dynamic interplay between investors and companies under climate risk. We provide a formal characterization of the conditions under which InvestESG exhibits an intertemporal social dilemma, deriving theoretical thresholds at which individual incentives diverge from collective welfare. Building on this, we apply Advantage Alignment, a scalable opponent shaping algorithm shown to be effective in general-sum games, to influence agent learning in InvestESG. We offer theoretical insights into why Advantage Alignment systematically favors socially beneficial equilibria by biasing learning dynamics toward cooperative outcomes. Our results demonstrate that strategically shaping the learning processes of economic agents can result in better outcomes that could inform policy mechanisms to better align market incentives with long-term sustainability goals.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3936. Diversity-Aware Online Prompt Assignment to Generative Models
๐ฌ 13
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Generative Models Online Learning Diversity
The expansion of generative AI services highlights the growing need for adaptive mechanisms to select an appropriate generative model for responding to a user's prompts. Recent works have proposed offline and online learning formulations to identify the optimal generative model for an input prompt, based solely on maximizing prompt-based fidelity evaluation scores, e.g., CLIP-Score in text-to-image generation. However, such fidelity-based selection methods overlook the diversity of generated outputs, and hence, they can fail to address potential diversity shortcomings in the generated responses. In this paper, we introduce the *Diversity-Aware Kernelized Upper Confidence Bound (DAK-UCB)* method as a contextual bandit algorithm for the online selection of generative models with diversity considerations. The proposed DAK-UCB method incorporates both fidelity and diversity-related metrics into the selection process. We design this framework based on prompt-aware diversity score functions that decompose into a two-sample-based expectation over prompt-output pairs from previous generation rounds. Specifically, we illustrate the application of our framework using joint kernel distance and kernel entropy measures. Our experimental results demonstrate the effectiveness of DAK-UCB in promoting diversity-aware model selection while maintaining fidelity in the generations for a sequence of prompts.
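A heavily simplified sketch of the idea: treat each generative model as an arm, score each pull by fidelity plus a kernel-distance diversity bonus over that arm's past outputs, and explore with a UCB term. This toy version drops the kernelized contextual machinery of DAK-UCB; `fidelity`, `generate`, and the weight `lam` are illustrative placeholders.

```python
# Toy diversity-aware UCB over generative models (intuition only, not DAK-UCB).
import numpy as np

def rbf_distance(z, history, gamma=1.0):
    """Diversity of output z w.r.t. an arm's past outputs: mean (1 - RBF kernel)."""
    if not history:
        return 1.0
    h = np.stack(history)
    k = np.exp(-gamma * np.sum((h - z) ** 2, axis=1))
    return float(np.mean(1.0 - k))

def select_arm(means, counts, t, c=1.0):
    """UCB rule over per-arm running means of the combined score."""
    bonus = c * np.sqrt(np.log(t + 1) / np.maximum(counts, 1))
    bonus[counts == 0] = np.inf                           # pull every arm at least once
    return int(np.argmax(means + bonus))

def run(models, prompts, fidelity, generate, lam=0.5):
    means, counts = np.zeros(len(models)), np.zeros(len(models))
    history = [[] for _ in models]
    for t, prompt in enumerate(prompts):
        a = select_arm(means, counts, t)
        out = generate(models[a], prompt)                 # output embedding of the chosen model
        score = fidelity(prompt, out) + lam * rbf_distance(out, history[a])
        history[a].append(out)
        counts[a] += 1
        means[a] += (score - means[a]) / counts[a]        # incremental mean update
    return means, counts

rng = np.random.default_rng(0)
models = ["model_a", "model_b"]
prompts = [rng.normal(size=4) for _ in range(50)]
fake_generate = lambda m, p: p + (0.1 if m == "model_a" else 0.5) * rng.normal(size=4)
fake_fidelity = lambda p, out: -float(np.sum((p - out) ** 2))
print(run(models, prompts, fake_fidelity, fake_generate))
```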
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3937. LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena
๐ฌ 22
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Forecasting LLM Benchmark LLM-as-a-Prophet LLM Evaluation
With the rapid progress of large language models (LLMs) trained on every available piece of data, it becomes increasingly challenging to reliably evaluate their intelligence due to potential data contamination and benchmark overfitting. To overcome these challenges, we investigate a new angle of benchmarking LLMs' intelligence by evaluating their capability in forecasting real-world future events, a paradigm we call "LLM-as-a-Prophet". Such forecasting tasks require combination of sophisticated capabilities while remaining free from data contamination or overfitting. To systematically evaluate such predictive intelligence of LLMs, we introduce $\texttt{Prophet Arena}$, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, supporting our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks even in frontier models, such as inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3938. Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings
๐ฌ 11
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ multimodal concepts sparse autoencoder modality gap applications of interpretability
Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood.
To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities.
We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction.
We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis.
Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not.
Applied to foundational VLMs, our framework reveals a clear structure with practical consequences:
**(*i*)** sparse *bimodal* atoms carry the entire *cross-modal* alignment signal;
**(*ii*)** *unimodal* atoms act as *modality-specific* biases and fully explain the modality gap;
**(*iii*)** removing unimodal atoms collapses the gap without harming performance;
**(*iv*)** restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval.
These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
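A minimal PyTorch sketch of the iso-energy idea: a shared sparse autoencoder over image and text embeddings with an extra penalty that ties each atom's average activation energy across modalities. Layer sizes and loss weights are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: sparse autoencoder with an energy-consistency (iso-energy) penalty.
import torch
import torch.nn.functional as F

class AlignedSAE(torch.nn.Module):
    def __init__(self, d_model=512, d_atoms=4096):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_atoms)
        self.dec = torch.nn.Linear(d_atoms, d_model, bias=False)

    def forward(self, x):
        a = F.relu(self.enc(x))          # sparse atom activations
        return self.dec(a), a

def aligned_sae_loss(model, img_emb, txt_emb, l1=1e-3, iso=1e-2):
    (img_rec, a_img), (txt_rec, a_txt) = model(img_emb), model(txt_emb)
    recon = F.mse_loss(img_rec, img_emb) + F.mse_loss(txt_rec, txt_emb)
    sparsity = a_img.abs().mean() + a_txt.abs().mean()
    # per-atom average energy in each modality; the penalty pulls them together
    e_img, e_txt = (a_img ** 2).mean(dim=0), (a_txt ** 2).mean(dim=0)
    iso_energy = F.mse_loss(e_img, e_txt)
    return recon + l1 * sparsity + iso * iso_energy

model = AlignedSAE()
img_emb, txt_emb = torch.randn(64, 512), torch.randn(64, 512)
loss = aligned_sae_loss(model, img_emb, txt_emb)
loss.backward()
print(float(loss))
```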
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3939. Toward Enhancing Representation Learning in Federated Multi-Task Settings
๐ฌ 26
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Contrastive learning federated learning knowledge transfer multi-task learning mutual information representation learning
Federated multi-task learning (FMTL) seeks to collaboratively train customized models for users with different tasks while preserving data privacy. Most existing approaches assume model congruity (i.e., the use of fully or partially homogeneous models) across users, which limits their applicability in realistic settings. To overcome this limitation, we aim to learn a shared representation space across tasks rather than shared model parameters. To this end, we propose *Muscle loss*, a novel contrastive learning objective that simultaneously aligns representations from all participating models. Unlike existing multi-view or multi-model contrastive methods, which typically align models pairwise, Muscle loss can effectively capture dependencies across tasks because its minimization is equivalent to the maximization of mutual information among all the models' representations. Building on this principle, we develop *FedMuscle*, a practical and communication-efficient FMTL algorithm that naturally handles both model and task heterogeneity. Experiments on diverse image and language tasks demonstrate that FedMuscle consistently outperforms state-of-the-art baselines, delivering substantial improvements and robust performance across heterogeneous settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3940. FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion
๐ฌ 13
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Model Diffusion Language Model Inference Speed Language Models Efficient Inference
Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose \texttt{FreeCache}, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference.
Second, we introduce \texttt{Guided Diffusion}, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality.
We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver an average of \textbf{12.14}$\times$ end-to-end speedup across various tasks with negligible accuracy degradation. For the first time, diffusion language models achieve a comparable and even faster latency as the widely adopted autoregressive models. Our work successfully paved the way for scaling up the diffusion language model to a broader scope of applications across different domains. Our code and implementation are anonymously available at https://anonymous.4open.science/r/anon-flash-dlm-A42B/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3941. Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks
๐ฌ 22
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Survival Analysis Consistency Multi-turn Dialogue Time-to-Event Modeling LLM Robustness
Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present a large-scale survival analysis of conversational robustness, modeling failure as a time-to-event process over 36,951 turns from 9 state-of-the-art LLMs on the MT-Consistency benchmark. Our framework combines Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest models with simple semantic drift features. We find that abrupt prompt-to-prompt semantic drift sharply increases the hazard of inconsistency, whereas cumulative drift is counterintuitively \emph{protective}, suggesting adaptation in conversations that survive multiple shifts. AFT models with model-drift interactions achieve the best combination of discrimination and calibration, and proportional hazards checks reveal systematic violations for key drift covariates, explaining the limitations of Cox-style modeling in this setting. Finally, we show that a lightweight AFT model can be turned into a turn-level risk monitor that flags most failing conversations several turns before the first inconsistent answer while keeping false alerts modest. These results establish survival analysis as a powerful paradigm for evaluating multi-turn robustness and for designing practical safeguards for conversational AI systems.
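A toy sketch of the survival framing on synthetic data, assuming the `lifelines` package is available: each conversation contributes a duration (turns until the first inconsistent answer, or censoring at the dialogue length) and a drift covariate. The column names and the synthetic failure model are illustrative; the paper additionally fits AFT and random survival forest models.

```python
# Hedged sketch: Cox regression of time-to-inconsistency on a semantic-drift covariate.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
drift = rng.uniform(0.0, 1.0, n)                      # prompt-to-prompt semantic drift
# toy assumption: higher drift -> shorter time to first inconsistency
time_to_fail = rng.exponential(scale=20.0 * np.exp(-1.5 * drift))
observed = time_to_fail <= 30                         # conversations end after 30 turns
df = pd.DataFrame({
    "turns": np.minimum(time_to_fail, 30).round().clip(1),
    "failed": observed.astype(int),
    "drift": drift,
})

cph = CoxPHFitter()
cph.fit(df, duration_col="turns", event_col="failed")
cph.print_summary()     # expect a positive coefficient on drift (higher hazard)
```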
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3942. A Study of Posterior Stability in Time-Series Latent Diffusion
๐ฌ 8
โญ 5.00
๐ generative models
๐ท๏ธ Latent Diffusion Time Series Posterior Collapse
Latent diffusion has achieved remarkable success in image generation, with high sampling efficiency. However, this framework might suffer from posterior collapse when applied to time series. In this work, we first show that latent diffusion with a collapsed posterior degenerates into a much weaker generative model: the variational autoencoder (VAE). This finding highlights the significance of addressing the problem. We then introduce a principled method: dependency measures, which quantify the sensitivity of a recurrent decoder to input variables. Through this method, we confirm that posterior collapse seriously affects latent time-series diffusion on real time series. For example, the latent variable has an exponentially decreasing impact on the decoder over time. Building on our theoretical and empirical studies, we finally introduce a new framework: posterior-stable latent diffusion, which interprets the diffusion process as a type of variational inference. In this way, it eliminates the use of risky KL regularization and penalizes decoder insensitivity. Extensive experiments on multiple real time-series datasets show that our new framework maintains a highly stable posterior and notably outperforms previous baselines in time series synthesis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3943. ProSafePrune: Projected Safety Pruning for Mitigating Over-Refusal in LLMs
๐ฌ 19
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Safety Over-Refusal Alignment
Large Language Models (LLMs) excel in various domains, but their safe deployment faces the challenge of balancing safety and utility. Existing alignment strategies often strengthen refusal mechanisms to reduce harmful outputs, but harmless instructions containing superficially risky words are mistakenly rejected, a failure known as over-refusal.
This work first reveals that over-refusal stems from a cognitive bias in the model's internal representation space: LLMs naturally encode safety attributes in hidden states, and pseudo-harmful instructions overlap with harmful features, causing over-harmful encoding.
To address this, we propose ProSafePrune, a subspace-projected low-rank parameter pruning framework for mitigating LLM over-refusal. By projecting pseudo-harmful features into subspaces and removing low-rank directions corresponding to harmful components in the most discriminative layers, we significantly reduce over-refusal while preserving the model's ability to reject genuinely harmful requests and improving performance on general tasks. In experiments across different models, our method significantly lowers the average false rejection rate while slightly improving general task performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3944. Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
๐ฌ 14
โญ 5.00
๐ optimization
๐ท๏ธ optimization direct preference optimization sharpness-aware minimization learning dynamics
Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dynamics in the logit space. Our analysis reveals that negative-gradient updates cause residuals to expand rapidly along high-curvature directions, which underlies the squeezing effect, whereas Sharpness-Aware Minimization (SAM) can suppress this behavior through its curvature-regularization effect. Building on this insight, we investigate logits-SAM, a computationally efficient variant that perturbs only the output layer with negligible overhead. Extensive experiments on Pythia-2.8B and Mistral-7B across multiple datasets demonstrate that logits-SAM consistently improves the effectiveness of DPO.
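A hedged sketch of SAM restricted to the output (logit) layer around a DPO-style loss: perturb only the logit-layer weights along their gradient, backpropagate the perturbed loss through all parameters, then undo the perturbation before the optimizer step. The stand-in policy and preference loss below are illustrative, not the paper's setup, and the exact logits-SAM variant may differ in details.

```python
# Hedged sketch of a SAM step applied only to the output (logit) layer of a policy
# trained with a DPO-like preference loss.
import torch

def logits_sam_step(policy, output_layer, compute_dpo_loss, optimizer, rho=0.05):
    params = list(output_layer.parameters())
    # 1) preference loss and its gradient w.r.t. the output layer only
    loss = compute_dpo_loss(policy)
    grads = torch.autograd.grad(loss, params)
    # 2) SAM ascent step: move the logit-layer weights to a nearby worst case
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [g * (rho / norm) for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)
    # 3) gradient of the perturbed loss w.r.t. ALL parameters, then undo the perturbation
    optimizer.zero_grad()
    compute_dpo_loss(policy).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    return float(loss)

# toy usage: the final Linear layer plays the role of the logit layer
policy = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
x_w, x_l = torch.randn(32, 8), torch.randn(32, 8)   # "chosen" / "rejected" inputs
dpo_like = lambda m: -torch.nn.functional.logsigmoid(m(x_w)[:, 0] - m(x_l)[:, 0]).mean()
print(logits_sam_step(policy, policy[2], dpo_like, opt))
```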
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3945. Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
๐ฌ 13
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ mutual information unlearnable examples
The volume of freely scraped data on the Internet has driven the tremendous success of deep learning. Along with this comes the rising concern about data privacy and security. Numerous methods for generating unlearnable examples have been proposed to prevent data from being illicitly learned by unauthorized deep models by impeding generalization. However, the existing approaches primarily rely on
empirical heuristics, making it challenging to enhance unlearnable examples with solid explanations. In this paper, we analyze and improve unlearnable examples from a novel perspective: mutual information reduction. We demonstrate that effective unlearnable examples always decrease mutual information between clean features and poisoned features, and as the network gets deeper, unlearnability improves in tandem with lower mutual information. Further, we prove from a covariance reduction perspective that minimizing the conditional covariance of intra-class poisoned features reduces the mutual information between distributions. Based on the theoretical results, we propose a novel unlearnable method called Mutual Information Unlearnable Examples (MI-UE) that reduces covariance by maximizing the cosine similarity among intra-class features, thus effectively impeding generalization. Extensive experiments demonstrate that our approach significantly outperforms previous state-of-the-art methods, even under defense mechanisms.
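A minimal sketch of the covariance-reduction objective described above: maximize the intra-class cosine similarity of (poisoned) features, i.e. minimize the loss below. The surrounding perturbation-optimization loop and feature extractor are omitted; only the loss term is shown.

```python
# Hedged sketch: intra-class cosine-similarity loss (0 when all class features align).
import torch
import torch.nn.functional as F

def intra_class_cosine_loss(features, labels):
    """features: (N, D) poisoned features; labels: (N,) class ids."""
    loss, n_classes = 0.0, 0
    for c in labels.unique():
        f = F.normalize(features[labels == c], dim=1)
        if f.shape[0] < 2:
            continue
        sim = f @ f.t()                                  # pairwise cosine similarities
        off_diag = sim.sum() - sim.diagonal().sum()
        pairs = f.shape[0] * (f.shape[0] - 1)
        loss = loss + (1.0 - off_diag / pairs)
        n_classes += 1
    return loss / max(n_classes, 1)

feats = torch.randn(16, 32, requires_grad=True)
labels = torch.randint(0, 4, (16,))
print(float(intra_class_cosine_loss(feats, labels)))
```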
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3946. Contact-guided Real2Sim from Monocular Video with Planar Scene Primitives
๐ฌ 14
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human Scene Interaction 4D human motion reconstruction Physics-based simulation for control
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human--scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion-tracking policies with scene interactions to fail. In contrast, our key insight is to fit simulation-ready convex planar primitives to a depth-based point cloud reconstruction of the scene via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we use human--scene contact modeling (e.g., using human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion-tracking failure rates from 55.2\% to 6.9\% on human-centric video benchmarks (EMDB, PROX), while delivering 43\% faster RL simulation throughput. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale, advancing real-to-sim applications for robotics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3947. Elastic Optimal Transport: Theory, Application, and Empirical Evaluation
๐ฌ 4
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ optimal transport domain adaptation transfer learning
Classical optimal transport, such as Kantorovich's optimal transport and partial optimal transport, can be too restrictive in applications due to its full-mass or fixed-mass preservation constraints. To remedy this limitation, we propose elastic optimal transport (ELOT), which is distinctive from classical optimal transport in its ability to preserve mass adaptively. It addresses the question of how to transport probability mass adaptively between probability distributions, a fundamental topic in various areas of artificial intelligence. The strength of elastic optimal transport is its capability to transport adaptive mass in light of the geometric structure of the problem itself. As an application example in machine learning, we apply elastic optimal transport to both unsupervised domain adaptation and partial domain adaptation tasks. It adaptively transports masses from the source domain to the target domain by taking domain shift into consideration and respecting the ubiquity of noise or outliers in the data, in order to improve generalization performance. The experimental results on the benchmarks show that ELOT significantly outperforms state-of-the-art methods. As a powerful distribution matching tool, elastic optimal transport might be of interest to broad areas such as artificial intelligence, healthcare, physics, operations research, urban science, etc. The source code is available in the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3948. ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ Controllable Video Generation Cartoon Generation
Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, we propose a novel cartoon adaptation method with the spatial low-rank adapter (SLRA) to effectively tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3949. ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework
๐ฌ 20
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Human Mobility Generation Large Language Models Event-Driven Mobility Urban Computing
Human mobility generation aims to synthesize plausible trajectory data, which is widely used in urban system research. While Large Language Model-based methods excel at generating routine trajectories, they struggle to capture deviated mobility during large-scale societal events. This limitation stems from two critical gaps: (1) the absence of event-annotated mobility datasets for design and evaluation, and (2) the inability of current frameworks to reconcile the competition between users' habitual patterns and event-imposed constraints when making trajectory decisions. This work addresses these gaps with a twofold contribution. First, we construct the first event-annotated mobility dataset covering three major events: Typhoon Hagibis, COVID-19, and the Tokyo 2021 Olympics. Second, we propose ELLMob, a self-aligned LLM framework that first extracts competing rationales between habitual patterns and event constraints, based on Fuzzy-Trace Theory, and then iteratively aligns them to generate trajectories that are both habitually grounded and event-responsive. Extensive experiments show that ELLMob outperforms state-of-the-art baselines across all events, demonstrating its effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3950. Towards Revealing the Effect of Batch Size Scheduling on Pre-training
๐ฌ 18
โญ 5.00
๐ learning theory
๐ท๏ธ Batch Size Scheduling; Training Dynamics
Training large-scale foundation models relies on effective parallelism strategies, especially batch size scheduling. However, despite its widespread practical use, the influence of batch size scheduling on training dynamics remains poorly understood. In this work, we first investigate this through a simple two-stage batch size schedule.
Specifically, we train language models with a constant learning rate under three batch size schedules: i) a small constant batch size, ii) a large constant batch size, and iii) a schedule that switches from the small batch size (i) to the large one (ii) at some switching point. We observe two notable behaviors: (1) **sudden drop**: at the switching point, the loss drops sharply relative to the small-batch trajectory; (2) **final merge**: the loss then gradually converges to the large-batch trajectory.
To understand the underlying mechanism behind these phenomena, we then provide a theoretical analysis from the perspective of a power-law kernel regression setup. We leverage the **Functional Scaling Law (FSL)** introduced in the recent work by Li et al. (2025), which provides a theoretical framework for analyzing LLM pre-training dynamics. Our analysis shows that increasing the batch size provably leads to a sudden loss drop by reducing SGD noise and guarantees convergence to the large-batch trajectory at the same step count. In the data-limited regime, our analysis further reveals a trade-off between intrinsic optimization time and SGD noise in the choice of switching point, predicting that the optimal switching point scales as a power law with total data size. Finally, we empirically validate these theoretical findings through language model pre-training experiments up to 1.1B parameters and 1T tokens, confirming the consistency of our theoretical insights.
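A minimal sketch of the two-stage schedule itself, assuming a generic PyTorch toy-regression loop; the function names, batch sizes, and switching point are hypothetical, and nothing here reproduces the paper's experiments or theory.

```python
import torch

def two_stage_batch_schedule(step, switch_step, small_bs=64, large_bs=512):
    """Two-stage schedule: small batch before the switching point, large after."""
    return small_bs if step < switch_step else large_bs

def train(model, dataset, total_steps=10_000, switch_step=4_000, lr=3e-4):
    """dataset is assumed to be a pair of tensors (X, Y)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # constant learning rate
    losses = []
    for step in range(total_steps):
        bs = two_stage_batch_schedule(step, switch_step)
        idx = torch.randint(len(dataset[0]), (bs,))
        x, y = dataset[0][idx], dataset[1][idx]
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    # one would look for a sharp loss drop near `switch_step`, then merging
    # with the large-batch trajectory, as described in the abstract
    return losses
```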
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3951. GoalRank: Group-Relative Optimization for a Large Ranking Model
๐ฌ 26
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Recommender System Re-ranking Large Ranking Model Group-Relative Optimization
Mainstream ranking approaches typically follow a Generator-Evaluator two-stage paradigm, where a generator produces candidate lists and an evaluator selects the best one. Recent work has attempted to enhance performance by expanding the number of candidate lists, for example, through multi-generator settings. However, because ranking involves selecting a recommendation list from a combinatorially large space, simply enlarging the candidate set remains ineffective and performance gains quickly saturate. At the same time, recent advances in large recommendation models have shown that end-to-end one-stage models can achieve promising performance and are expected to follow scaling laws. Motivated by this, we revisit ranking from a generator-only, one-stage perspective. We theoretically prove that, for any (finite Multi-)Generator-Evaluator model, there always exists a generator-only model that achieves strictly smaller approximation error to the optimal ranking policy, while also enjoying a scaling law as its size increases. Building on this result, we derive an evidence upper bound of the one-stage optimization objective, from which we find that one can leverage a reward model trained on real user feedback to construct a reference policy in a group-relative manner. This reference policy serves as a practical surrogate for the optimal policy, enabling effective training of a large generator-only ranker. Based on these insights, we propose GoalRank, a generator-only ranking framework. Extensive offline experiments on public benchmarks and large-scale online A/B tests demonstrate that GoalRank consistently outperforms state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3952. Cross-Timestep: 3D Diffusion Model with Trans-temporal Memory LSTM and Adaptive Priori Decoding Strategy for Medical Segmentation
๐ฌ 19
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Models; Medical Image Segmentation; LSTM
Diffusion models have recently demonstrated significant robustness in medical image segmentation, effectively accommodating variations across different imaging styles. However, their applications remain limited due to: (i) current successes being primarily confined to 2D segmentation tasks (we observe that diffusion models tend to collapse at an early stage when applied to 3D medical tasks); and (ii) the inherently isolated iteration along timesteps during training and inference. To tackle these limitations, we propose a novel framework named Cross-Timestep, which incorporates two key innovations: an Adaptive Priori Decoding Strategy (APDS) and a trans-temporal memory LSTM (tLSTM) mechanism. (i) The APDS provides prior guidance during the diffusion process by employing a Priori Decoder (PD) that focuses solely on the conditional branch, successfully stabilizing the reverse diffusion process. (ii) The tLSTM integrates convolution and linear layers into the LSTM gating structure, and enhances the memory cell mechanism to retain temporal state, explicitly preserving and propagating continuous temporal states across timesteps. Experimental results demonstrate that Cross-Timestep performs favorably on heterogeneous 3D medical datasets. Three experiments further analyze the collapse phenomenon in 3D medical diffusion models and validate that APDS effectively prevents initial-stage collapse without excessively constraining the model, while tLSTM facilitates the performance and scalability of diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3953. Data-to-Energy Stochastic Dynamics
๐ฌ 12
โญ 5.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Schrödinger bridge Bayesian posterior inference stochastic differential equations Iterative Proportional Fitting
The Schrödinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost.
This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences.
However, existing algorithms can infer such dynamics only when samples from both distributions are available.
In this paper, we propose the first general method for modelling Schrödinger bridges when one or both distributions are given only by their unnormalised densities, with no access to data samples.
Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers.
We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions.
As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics.
Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method.
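As background for readers unfamiliar with iterative proportional fitting, the sketch below is the classical discrete IPF (Sinkhorn) procedure for the static bridge between two empirical marginals; the paper's data-free, data-to-energy generalisation is not shown, and the toy marginals, cost, and hyperparameters are assumptions.

```python
import numpy as np

def ipf_static_bridge(mu, nu, C, eps=0.1, n_iters=200):
    """Classical IPF / Sinkhorn: alternately rescale a Gibbs kernel so the
    resulting coupling matches each marginal constraint in turn."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)        # project onto the first-marginal constraint
        v = nu / (K.T @ u)      # project onto the second-marginal constraint
    return u[:, None] * K * v[None, :]

mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.4, 0.4, 0.2])
C = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))
P = ipf_static_bridge(mu, nu, C)
print(P.sum(axis=1), P.sum(axis=0))   # approximately mu and nu after convergence
```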
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3954. Bandits with Single-Peaked Preferences and Limited Resources
๐ฌ 5
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ social choice single-peaked preferences bandits matching
We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences---a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem and leverage it to obtain an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3955. Tackling Heavy-Tailed Q-Value Bias in Offline-to-Online Reinforcement Learning with Laplace-Robust Modeling
๐ฌ 23
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Offline-to-Online Q-value Estimation Laplace Distribution
Offline-to-online reinforcement learning (O2O RL) aims to improve the performance of offline pretrained agents through online fine-tuning. Existing O2O RL methods have made advances in mitigating the overestimation of Q-value biases (i.e., biases in estimated cumulative rewards), improving performance. However, in this paper, we are the first to reveal that the Q-value biases of these methods often follow a heavy-tailed distribution during online fine-tuning. Such biases induce high estimation variance and hinder performance improvement.
To address this challenge, we propose a Laplace-based robust offline-to-online RL (LAROO) approach. LAROO introduces a parameterized Laplace-distributed noise and transfers the heavy-tailed nature of Q-value biases into this noise, alleviating the heavy-tailedness of the biases and improving training stability and performance. Specifically, (1) since the Laplace distribution is well suited to modeling heavy-tailed data, LAROO introduces a parameterized Laplace-distributed noise that can adaptively capture the heavy-tailedness of the data. (2) By combining estimated Q-values with the noise to approximate true Q-values, LAROO transfers the heavy-tailed nature of the biases into the noise, reducing estimation variance. (3) LAROO employs conservative ensemble-based estimates to re-center Q-value biases, shifting their mean towards zero. Based on (2) and (3), LAROO transforms heavy-tailed Q-value biases into a standardized form, improving training stability and performance. Extensive experiments demonstrate that LAROO achieves significant performance improvements, outperforming several state-of-the-art O2O RL baselines.
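A minimal sketch of the basic ingredient, a Q estimate paired with a learnable Laplace-distributed noise term, assuming a PyTorch critic over continuous observations and actions; this is not the full LAROO pipeline (no ensemble re-centering, no O2O training loop), and the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class LaplaceNoisyQ(nn.Module):
    """Q estimate plus a learnable Laplace noise term (simplified sketch).

    The noise scale is a learned parameter, so heavy-tailed residuals in the
    Q-value bias can be absorbed by the noise rather than by the Q network."""
    def __init__(self, q_net):
        super().__init__()
        self.q_net = q_net
        self.log_scale = nn.Parameter(torch.zeros(1))   # Laplace scale b = exp(log_scale)

    def forward(self, obs, act):
        q = self.q_net(torch.cat([obs, act], dim=-1))
        noise_dist = torch.distributions.Laplace(loc=torch.zeros_like(q),
                                                 scale=self.log_scale.exp())
        noise = noise_dist.rsample()    # reparameterized sample, differentiable w.r.t. the scale
        return q + noise, noise_dist
```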
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3956. BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
๐ฌ 21
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ debiasing large language models bias mitigation social bias
Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance, leading to inconsistent comparisons among them. Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading model responses and expect fair and safe outputs rather than LLMs' probabilities. To enable consistent evaluation across debiasing methods and bridge this gap, we introduce **BiasFreeBench**, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (covering four prompting-based and four training-based methods) on two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We further introduce a response-level metric, **Bias-Free Score**, to measure the extent to which LLM responses are fair, safe, and anti-stereotypical. Debiasing performances are systematically compared and analyzed across key dimensions: the prompting vs. training paradigm, model size, and generalization of different training strategies to unseen bias types. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3957. Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization
๐ฌ 19
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning offline reinforcement learning batch reinforcement learning deep reinforcement learning combinatorial action spaces structured action spaces discrete action spaces representation learning
Reinforcement learning in combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions when coordination is required, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging DM Control benchmarks, SPIN improves average return by up to $39\%$ over the state of the art while reducing time to convergence by up to $12.8\times$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3958. Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training
๐ฌ 9
โญ 5.00
๐ learning theory
๐ท๏ธ GRPO off-policy
We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
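For readers who want the core objects in code: the group-relative advantage (each response's reward normalized by the mean and standard deviation of its own group of samples) and a PPO-style clipped surrogate, as in the standard GRPO formulation. This is a generic sketch, not the paper's specific on-/off-policy variants, and tensor shapes are assumptions.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for rewards of shape (num_prompts, G):
    normalize each response's reward by its own group's mean and std."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped surrogate with the group-relative advantage; returned as a
    loss to minimize (the negative of the objective to maximize)."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv
    return -torch.minimum(unclipped, clipped).mean()
```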
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3959. DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision
๐ฌ 7
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Preference Alignment in 3D Preference Alignment Human Preference Alignment 3D Generation
While text-to-3D generation has attracted growing interest, existing methods often struggle to produce 3D assets that align well with human preferences. Current preference alignment techniques for 3D content typically rely on hard-to-collect, preference-paired multi-view 2D images to train 2D reward models, which then guide 3D generation, leading to geometric artifacts due to their inherent 2D bias. To address these limitations, we construct 3D-MeshPref, the first large-scale unpaired 3D preference dataset, featuring diverse 3D meshes annotated by a large language model and refined by human evaluators. We then develop RewardCS, the first reward model trained directly on unpaired 3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling effective learning of human-aligned 3D geometric preferences without requiring paired comparisons. Building on this, we propose DreamCS, a unified framework that integrates RewardCS into text-to-3D pipelines, enhancing both implicit and explicit 3D generation with human preference feedback. Extensive experiments show DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred. Code and models will be released publicly.
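To make the divergence concrete, here is a standard kernel-density estimator of the Cauchy-Schwarz divergence between two unpaired samples. It illustrates why a CS-divergence objective can compare distributions without paired comparisons; it is not the paper's RewardCS training loss, and the Gaussian kernel and bandwidth are assumptions.

```python
import torch

def gaussian_gram(x, y, sigma=1.0):
    """Pairwise Gaussian kernel values between two sample sets."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def cauchy_schwarz_divergence(x, y, sigma=1.0):
    """Empirical Cauchy-Schwarz divergence with Parzen density estimates:
        D_CS(p, q) = -log( <p, q>^2 / (<p, p> <q, q>) ),
    which is zero iff the two densities coincide and needs no paired samples."""
    pq = gaussian_gram(x, y, sigma).mean()
    pp = gaussian_gram(x, x, sigma).mean()
    qq = gaussian_gram(y, y, sigma).mean()
    return -torch.log(pq ** 2 / (pp * qq))

x = torch.randn(256, 8)          # e.g. features of preferred samples
y = torch.randn(256, 8) + 0.5    # e.g. features of generated samples
print(cauchy_schwarz_divergence(x, y).item())
```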
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3960. Rethinking LLM Reasoning: From Explicit Trajectories to Latent Representations
๐ฌ 5
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model efficient reasoning
Large Language Models (LLMs) have achieved impressive performance on complex tasks by generating human-like, step-by-step rationales, referred to as \textit{reasoning trajectory}, before arriving at final answers. However, the length of these reasoning trajectories often far exceeds that of the final answers, which incurs substantial inference costs even for relatively simple tasks. Advanced methods typically attempt to compress reasoning trajectory length through post-training, but they remain decoding-intensive and fail to inherently mitigate the efficiency challenge. In this work, we challenge the necessity of generating full reasoning trajectories and empirically demonstrate that LLMs can generate accurate answers using only fragmental reasoning paths, without relying on complete token-by-token sequences. To this end, we propose a novel \textbf{Latent Reasoning Tuning (LRT)} framework, which empowers LLMs to perform reasoning using implicit, compact, learnable representations instead of explicit textual trajectories. Technically, LRT replaces the costly autoregressive generation of reasoning steps with a single forward pass through a lightweight reasoning network, which generates latent vectors that encapsulate the necessary reasoning logic and condition the LLM to produce the final answer. Experiments on mathematical and out-of-domain benchmarks demonstrate that our LRT consistently outperforms relevant efficient reasoning methods. Moreover, by transforming explicit reasoning into latent reasoning, our approach surpasses the state-of-the-art Qwen3 hybrid reasoning framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3961. ICPO: Provable and Practical In-Context Policy Optimization for Test-Time Scaling
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ in-context learning self-reflection policy optimization FTRL bandits large language models reasoning
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters.
To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that, at inference time, iteratively refines its response in context using its previous responses and their self-assessed rewards.
By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting.
Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling for mathematical reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3962. Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders
๐ฌ 22
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Long Video Understanding Keyframe Selection Keyframe Narratives
Employing Multimodal Large Language Models (MLLMs) for long video understanding remains a challenging problem due to the dilemma between the substantial number of video frames (i.e., visual tokens) versus the limited context length of language models. Traditional uniform sampling often leads to selection of irrelevant content, while post-training MLLMs on thousands of frames imposes a substantial computational burden. In this paper, we propose _Narrating KeyFrames Capturing_ (Nar-KFC), a plug-and-play module to facilitate effective and efficient long video perception. Nar-KFC generally involves two collaborative steps. First, we formulate the _keyframe_ selection process as an integer quadratic programming problem, jointly optimizing query-relevance and frame-diversity. To avoid its computational complexity, a customized greedy search strategy is designed as an efficient alternative. Second, to mitigate the temporal discontinuity caused by sparse keyframe sampling, we further introduce interleaved textual _narratives_ generated from non-keyframes using off-the-shelf captioners. These narratives are inserted between keyframes based on their true temporal order, forming a coherent and compact representation. Nar-KFC thus serves as a temporal- and content-aware compression strategy that complements visual and textual modalities. Experimental results on multiple long-video benchmarks demonstrate that Nar-KFC significantly improves the performance of popular MLLMs. Code will be made publicly available.
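A minimal greedy sketch of trading off query-relevance against frame-diversity when picking K keyframes, in the spirit of the greedy alternative mentioned above; the exact objective, similarity measure, and trade-off weight used by Nar-KFC are not specified here and should be treated as assumptions.

```python
import numpy as np

def greedy_keyframes(frame_feats, query_feat, k, lam=0.5):
    """Greedily pick k frames maximizing query relevance minus redundancy
    with already-selected frames (a simple relevance/diversity trade-off)."""
    F = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    relevance = F @ q                              # cosine similarity to the query
    selected = []
    for _ in range(k):
        if selected:
            redundancy = (F @ F[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(F))
        score = relevance - lam * redundancy
        score[selected] = -np.inf                  # never pick the same frame twice
        selected.append(int(score.argmax()))
    return sorted(selected)                        # keep temporal order

feats = np.random.default_rng(0).normal(size=(120, 64))   # hypothetical frame features
query = np.random.default_rng(1).normal(size=64)
print(greedy_keyframes(feats, query, k=8))
```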
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3963. Towards Self-Evolving Agent Benchmarks: Validatable Agent Trajectory via Test-Time Exploration
๐ฌ 32
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark Evolution Agent Evaluation Test-Time Exploration Multi-Agent Systems Large Language Models Dynamic Task Generation
Recent advances in large language models (LLMs) and agent system designs have empowered agents with unprecedented levels of capability. However, existing agent benchmarks are showing a trend of rapid ceiling-hitting by newly developed agents, making it difficult to meet the demands for evaluating agent abilities. To address this problem, we propose the Trajectory-based Reproducible Agent-benchmark Complexity Evolution (TRACE) framework. This framework takes an original task from an existing benchmark and encourages agents to freely explore and evolve it into a new task with higher difficulty while recording traceable agent trajectories. The framework proceeds in three stages: (1) evolutionary proposal mining, which provides task evolution proposals through preliminary exploration and divergent thinking; (2) problem formation and free exploration, where proposals are conceptualized into feasible problem candidates and the agents then explore them freely while recording their execution trajectories; and (3) multi-level validation, which ensures that the evolved tasks are accompanied by validatable and reproducible trajectories. Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently enhances task complexity while improving the reliability of correctness through validatable execution trajectories. This work marks a paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing a sustainable and challenging runway for agent development.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3964. Generalizable Heuristic Generation Through LLMs with Meta-Optimization
๐ฌ 20
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Combinatorial Optimization Large Language Models Heuristic Generation
Heuristic design with large language models (LLMs) has emerged as a promising approach for tackling combinatorial optimization problems (COPs). However, existing approaches often rely on manually predefined evolutionary computation (EC) heuristic-optimizers and single-task training schemes, which may constrain the exploration of diverse heuristic algorithms and hinder the generalization of the resulting heuristics. To address these issues, we propose Meta-Optimization of Heuristics (MoH), a novel framework that operates at the optimizer level, discovering effective heuristic-optimizers through the principle of meta-learning. Specifically, MoH leverages LLMs to iteratively refine a meta-optimizer that autonomously constructs diverse heuristic-optimizers through (self-)invocation, thereby eliminating the reliance on a predefined EC heuristic-optimizer. These constructed heuristic-optimizers subsequently evolve heuristics for downstream tasks, enabling broader heuristic exploration. Moreover, MoH employs a multi-task training scheme to promote its generalization capability. Experiments on classic COPs demonstrate that MoH constructs an effective and interpretable meta-optimizer, achieving state-of-the-art performance across various downstream tasks, particularly in cross-size settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3965. Poly-attention: a general scheme for higher-order self-attention
๐ฌ 17
โญ 5.00
๐ learning theory
๐ท๏ธ computational complexity polynomial method fine-grained complexity communication complexity tensor generalizations
The self-attention mechanism, at the heart of the transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention (Sanford et al., 2023) and Strassen attention (Kozachinskiy et al., 2025), which can perform some of these polyadic tasks in exchange for slower, superquadratic running times.
In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the aforementioned alternatives as special cases. We then systematically study their computational complexity and representational strength, including giving new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix exactly as well as approximately, and tightly determining which polyadic tasks they can each perform. Our results give interesting tradeoffs between different desiderata for these mechanisms, including a tight relationship between how expressive a mechanism is, and how large the coefficients in the model may be so that the mechanism can be approximated in almost-linear time.
Notably, we give a new attention mechanism which can be computed exactly in quadratic time, and which can perform function composition for any fixed number of functions. Prior mechanisms, even for just composing two functions, could only be computed in superquadratic time, and our new lower bounds show that faster algorithms for them are not possible.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3966. Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders
๐ฌ 24
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language Model MLLMs Vision Encoder Machine Learning
Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals.
However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy.
To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder's marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model.
Using these tools, we observe: (i) strong specialization on tasks like OCR \& Chart, where a single encoder can dominate with a CUR >90%, (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable, (iii) instances of detrimental encoders with negative CUR.
Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and 3.6% overall performance boost compared to the full model.
Furthermore, single- and dual-encoder variants recover over 90% of baseline performance on most non-OCR tasks. Our analysis challenges the "more encoders are better" heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3967. LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel
๐ฌ 30
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Linear Attention Transformer Kernel learning
The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton--Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness. Our anonymous repository is at \href{https://anonymous.4open.science/r/sdfasfsdgsfgdrf}{\textcolor{black}{https://anonymous.4open.science/r/sdfasfsdgsfgdrf}}.
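For reference, the Newton--Schulz iteration mentioned above is the classical inverse-free scheme X_{k+1} = X_k (2I - A X_k); the sketch below is the generic textbook version with a standard convergent initialization, not the paper's CUDA solver.

```python
import numpy as np

def newton_schulz_inverse(A, n_iters=30):
    """Newton-Schulz iteration X_{k+1} = X_k (2I - A X_k), converging to A^{-1}
    whenever the initial iterate satisfies ||I - A X_0|| < 1."""
    n = A.shape[0]
    # classical initialization X_0 = A^T / (||A||_1 ||A||_inf), which guarantees
    # convergence for any nonsingular A
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(n_iters):
        X = X @ (2 * I - A @ X)
    return X

A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(np.allclose(newton_schulz_inverse(A) @ A, np.eye(2), atol=1e-6))   # True
```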
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3968. Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale
๐ฌ 10
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ AI-Generated Video Detection Video Generation AIGC Detection
The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They often rely on preprocessing operations like fixed-resolution resizing and cropping, which not only discard subtle, high-frequency forgery artifacts but can also cause distortion and significant information loss. Furthermore, these methods are frequently trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce two key contributions: a new large-scale dataset and benchmark, as well as a novel detection framework. We present a comprehensive dataset of over 140K videos from 15 state-of-the-art open-source and leading commercial generators. Specifically, we curate the Magic Videos Testset, featuring ultra-realistic videos produced by six latest generators through a meticulous generation and filtering pipeline. In addition, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which processes videos at their native spatial resolution and temporal duration. This native-scale approach preserves high-frequency details and spatiotemporal inconsistencies that are often lost during conventional preprocessing. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks. Our work underscores the importance of native-scale processing and establishes a robust new baseline for AI-generated video detection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3969. Out-of-Distribution Graph Models Merging
๐ฌ 15
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Graph Models Merging Source-Free Domain Generalization Graph Neural Networks
This paper studies a novel problem of out-of-distribution graph models merging, which aims to construct a generalized model from multiple graph models pre-trained on different domains with distribution discrepancy. This problem is challenging because of the difficulty in learning domain-invariant knowledge implicitly in model parameters and consolidating expertise from potentially heterogeneous GNN backbones. In this work, we propose a graph generation strategy that instantiates the mixture distribution of multiple domains. Then, we merge and fine-tune the pre-trained graph models via a MoE module and a masking mechanism for generalized adaptation. Our framework is architecture-agnostic and can operate without any source/target domain data. Both theoretical analysis and experimental results demonstrate the effectiveness of our approach in addressing the model generalization problem.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3970. TrajTok: What makes for a good trajectory tokenizer in behavior generation?
๐ฌ 15
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ behavior generation tokenizer autonomous driving
Behavior generation in autonomous driving aims to simulate dynamic driving scenarios from recorded driving logs. A popular approach is to apply next-token-prediction with discrete trajectory tokenization. In this work, we explore what makes a good trajectory tokenizer from the perspective of logged data usage. We first analyze the four properties (coverage, utilization, symmetry and robustness) of vocabularies of data-driven and rule-based trajectory tokenizers and their impact on performance and generalization. Data-driven tokenizers often build vocabularies with better utilization but suffer from insufficient coverage and sensitivity to noise, while rule-based methods have better coverage but contain too many useless tokens. With these insights, we propose TrajTok, a trajectory tokenizer that combines the two methods with rule-based vocabulary candidate setup and data-driven filtering and selection processes. The tokenizer has balanced coverage and utilization as well as good symmetry and robustness. Furthermore, we propose a spatial-aware label smoothing method for the cross-entropy loss to better model the similarities between the trajectory tokens. Our method wins first place in the 2025 Waymo Open Sim Agents Challenge.
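A hedged sketch of what spatially aware label smoothing can look like: probability mass is spread over trajectory tokens according to how close their endpoints are to the ground-truth token, and the soft targets are used in a cross-entropy loss. The 2-D endpoint summary, the Gaussian-style kernel, and the temperature are assumptions; TrajTok's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def spatial_label_smoothing(target_idx, token_coords, tau=0.5):
    """Soft target distribution over trajectory tokens: tokens whose (x, y)
    endpoints lie near the ground-truth token receive more probability mass.
    token_coords: (V, 2) endpoint per vocabulary token; target_idx: (B,)."""
    d2 = ((token_coords[target_idx][:, None, :] - token_coords[None, :, :]) ** 2).sum(-1)
    return F.softmax(-d2 / tau, dim=-1)            # (B, V) soft labels

def soft_cross_entropy(logits, soft_targets):
    """Cross entropy against the smoothed (soft) target distribution."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

vocab_xy = torch.randn(512, 2)                     # hypothetical token endpoints
targets = torch.randint(512, (32,))
logits = torch.randn(32, 512)
loss = soft_cross_entropy(logits, spatial_label_smoothing(targets, vocab_xy))
```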
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3971. NextQuill: Causal Preference Modeling for Enhancing LLM Personalization
๐ฌ 16
โญ 5.00
๐ generative models
๐ท๏ธ Personalized text generation Large Language Models LLM Personalization
Personalizing large language models (LLMs) is increasingly important as they are progressively integrated into real-world applications to support users' daily lives. However, existing approaches often fail to distinguish which components of the model's predicted responses and of the ground-truth responses in the training data truly reflect user preferences, resulting in shallow personalization alignment. In this paper, we introduce NextQuill, a novel LLM personalization alignment framework grounded in causal preference modeling. We approach personalization from a causal perspective, recognizing that model-predicted responses (model side) and user-written ground-truth responses (data side) are both outcomes shaped by user history (characteristics) and other contextual factors. To better capture user preferences, we define causal preference effects as the causal effect of the user history/characteristics on outcomes from the model/data side. Building on this foundation, NextQuill introduces two complementary alignment strategies: (1) aligning model-side causal preference effects (on predictions) with those of ground-truth data, rather than indiscriminately aligning all predictions, and (2) emphasizing learning the preference-driven ground-truth tokens, identified via data-side causal preference effects, rather than treating all tokens equally. As such, NextQuill shifts the alignment process toward learning from causal preference effects, facilitating more effective and personalized LLM adaptation. Experiments on multiple personalization benchmarks demonstrate that NextQuill substantially improves personalization quality. Code is available at \url{https://anonymous.4open.science/r/NextQuill-1E4E}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3972. GARLIC: Graph Attention-based Relational Learning of Multivariate Time Series in Intensive Care
๐ฌ 18
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ irregular multivariate time series graph neural network deep learning for health intensive care unit explainability
ICU (Intensive Care Unit) records comprise heterogeneous multivariate time series sampled at irregular intervals with pervasive missingness, yet clinical applications demand predictive models that are both accurate and interpretable. We present our Graph Attention-based Relational Learning for Intensive Care (GARLIC) model, a novel neural network architecture that imputes missing data through a learnable exponential-decay encoder, captures inter-sensor dependencies through time-lagged summary graphs, and fuses global patterns with cross-dimensional sequential attention. All attention weights and graph edges are learned end-to-end to serve as built-in observation-, signal-, and edge-level explanations. To reconcile auxiliary reconstruction and primary classification objectives, we develop an alternating decoupled optimization scheme that stabilizes training. On three ICU benchmarks (PhysioNet 2012 \& 2019, MIMIC-III) for outcome prediction, GARLIC sets a new state of the art, significantly improving AUROC and AUPRC over the best-performing baselines at comparable computational cost. Ablation studies confirm each module's contribution, and feature-removal trials validate importance attribution fidelity through a monotonic performance drop (full > top 50\% > random 50\% > bottom 50\%). We also demonstrate GARLIC's superiority in data imputation and classification on various datasets beyond the ICU domain, showing its generalizability and applicability to broader tasks and datasets. Finally, real-time case studies demonstrate actionable risk warnings with transparent explanations, marking a significant advancement toward accurate, explainable deep learning for irregularly sampled ICU time series data.
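A minimal sketch of a learnable exponential-decay imputer for irregularly sampled signals, in the spirit of GRU-D-style decay mechanisms; GARLIC's actual encoder may differ, and the per-channel parameterization, shapes, and names below are assumptions.

```python
import torch
import torch.nn as nn

class DecayImputer(nn.Module):
    """Learnable exponential-decay imputation for irregular time series.

    Missing values decay from the last observed value toward the empirical
    channel mean, with a per-channel decay rate learned end-to-end."""
    def __init__(self, n_channels):
        super().__init__()
        self.decay_rate = nn.Parameter(torch.zeros(n_channels))

    def forward(self, x, mask, delta_t, channel_mean, last_obs):
        # gamma is close to 1 right after an observation and shrinks toward 0
        # as the gap since the last observation (delta_t) grows
        gamma = torch.exp(-torch.relu(self.decay_rate) * delta_t)
        imputed = gamma * last_obs + (1 - gamma) * channel_mean
        return mask * x + (1 - mask) * imputed    # keep observed values as-is
```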
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3973. Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation
๐ฌ 18
โญ 5.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Tokenization likelihood scoring language models
Computing the next-token likelihood ratio between two language models (LMs) is a standard task in training paradigms like knowledge distillation. Since this task requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model's native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only $\mathcal{O}(1)$ model evaluations per token. When used for distillation, this yields up to a 12\% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance by up to 4\% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages the recursive structure of BPE, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3974. Pursuing Minimal Sufficiency in Spatial Reasoning
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ spatial reasoning agent VLM
Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: \textit{inadequate} 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by \textit{redundant} 3D information.
To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a \textit{compact} selection of 3D perception results from \textit{expert models}. We introduce \textbf{MSSR} (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A \textit{Perception Agent} programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel \textbf{SOG} (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A \textit{Reasoning Agent} then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated.
Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3975. BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models
๐ฌ 23
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ benchmarking evaluation large language models reasoning
Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce $\textbf{BeyondBench}$, an evaluation framework that avoids this problem by using $\textbf{algorithmic problem generation}$. Unlike traditional benchmarks that risk contamination from internet-scale training data, $\textbf{BeyondBench}$ creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers $\textbf{44 algorithmic tasks}$ with a total of $\textbf{117 variations}$, grouped into three difficulty levels: the $\textit{Easy Suite}$ (29 tasks) for basic arithmetic and statistics, the $\textit{Medium Suite}$ (5 tasks, 49 variations) for sequence patterns and reasoning, and the $\textit{Hard Suite}$ (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than $10^{15}$ unique instances, with solutions verified deterministically by mathematical proofs. We evaluated $\textbf{101 language models}$, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. All evaluations use three-fold cross-validation to ensure statistical robustness. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a $\textbf{decline}$ of 16.81%, 28.05%, and 47.59% in accuracy on the Hard Suite. The contamination resistance of $\textbf{BeyondBench}$ rests on three guarantees: (i) the problem space is vastly larger than any static dataset, (ii) every instance has a unique, verifiable solution, and (iii) isomorphic transformations generate semantically equivalent but syntactically new problems. $\textbf{BeyondBench}$ redefines reasoning evaluation through genuine algorithmic problem-solving capability, ensuring a fair and meaningful evaluation.
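To illustrate the idea of contamination-resistant, deterministically verifiable problem generation, here is a minimal toy generator and verifier for a hypothetical modular-arithmetic task; it is not one of the benchmark's 44 tasks and does not reproduce its verification harness.

```python
import random

def generate_modular_arithmetic_problem(seed=None):
    """Generate a fresh, deterministically verifiable problem instance.
    The instance space is huge, so a static answer key cannot be memorized."""
    rng = random.Random(seed)
    a = rng.randint(10**6, 10**9)
    b = rng.randint(10**6, 10**9)
    m = rng.randint(10**3, 10**6)
    prompt = f"Compute ({a} * {b}) mod {m}. Answer with a single integer."
    answer = (a * b) % m
    return prompt, answer

def verify(model_output: str, answer: int) -> bool:
    """Deterministic verification: parse the model's integer and compare."""
    try:
        return int(model_output.strip()) == answer
    except ValueError:
        return False

prompt, answer = generate_modular_arithmetic_problem(seed=42)
print(prompt)
print(verify(str(answer), answer))   # True
```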
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3976. Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding
๐ฌ 19
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Retrieval-Augmented Generation Language Models Long-Context
Retrieval-Augmented Generation (RAG) is a framework for grounding Large Language Models (LLMs) in external, up-to-date information. However, recent advancements in context window size allow LLMs to process inputs of up to 128K tokens or more, offering an alternative strategy: supplying the full document context directly to the model, rather than relying on RAG to retrieve a subset of contexts. Nevertheless, this emerging alternative strategy has notable limitations: (i) it is token-inefficient to handle large and potentially redundant contexts; (ii) it exacerbates the "lost in the middle" phenomenon; and (iii) under limited model capacity, it amplifies distraction, ultimately degrading LLM output quality. In this paper, we propose LDAR (Learning Distraction-Aware Retrieval), an adaptive retriever that learns to retrieve contexts in a way that mitigates interference from distracting passages, thereby achieving significantly higher performance with reduced token usage compared to long-context approaches. Extensive experiments across diverse LLM architectures and six knowledge-intensive benchmarks demonstrate the effectiveness and robustness of our approach, highlighting the importance of balancing the trade-off between information coverage and distraction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3977. Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization
๐ฌ 17
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Proteinโprotein interaction Kernel methods Random Fourier features
Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models consider residue-residue interactions across two proteins, which can yield accurate interaction scores but are too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly *all possible protein pairs*. For $N$ proteins of average length $L$, exhaustive all-against-all search requires $\mathcal{O}(N^2L^2)$ computation, rendering conventional approaches computationally impractical. We introduce RaftPPI, a scalable framework that approximates residue-level PPI modeling while enabling efficient large-scale retrieval. RaftPPI represents residue interactions with a Gaussian kernel, approximated efficiently via structured random Fourier features, and applies a low-rank factorized attention mechanism that admits pooling into a compact embedding per protein. Each protein is encoded once into an indexable embedding, allowing approximate nearest-neighbor search to replace exhaustive pairwise scoring, reducing proteome-wide retrieval from *months* to *minutes* on a single GPU. On the human proteome with the D-SCRIPT dataset, RaftPPI retrieves the top 20\% candidate pairs ($\sim$200M) in 6 GPU minutes, covering 75.1\% of the true interacting pairs,
compared to 4.9 GPU months for the best prior method (61.2\%). Across seven benchmarks with sequence- and degree-controlled splits, RaftPPI achieves state-of-the-art PPI classification and retrieval performance, while enabling residue-aware, retrieval-friendly screening at proteome scale.
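For readers unfamiliar with the kernel approximation mentioned above, this is the standard (unstructured) random Fourier feature map for a Gaussian kernel, where the inner product of the features approximates exp(-gamma * ||x - y||^2). The structured variant and the factorized attention used by RaftPPI are not shown; sizes and the bandwidth are assumptions.

```python
import numpy as np

def rff_map(X, n_features=256, gamma=0.5, seed=0):
    """Random Fourier features z(x) with E[z(x) . z(y)] = exp(-gamma * ||x-y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))  # Bochner sampling
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
x, y = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
Z = rff_map(np.vstack([x, y]), n_features=4096)
approx = float(Z[0] @ Z[1])
exact = float(np.exp(-0.5 * np.sum((x - y) ** 2)))
print(approx, exact)   # close for a large number of random features
```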
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3978. STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization
๐ฌ 14
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Quantization Efficient Deep Learning Activation Quantization
Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models.
However, accuracy often degrades sharply when activations are at low bit widths.
Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization, by reparameterizing feature channels and weights.
In this paper, we propose Sequence Transformation and Mixed Precision (STaMP) quantization, a novel strategy that applies linear transformations along the sequence dimension to exploit the strong local correlation in language and visual data.
By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activation bit-widths.
We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit width activation quantization and complements established activation and weight quantization methods including recent feature transformations.
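A hedged sketch of the mixed-precision half of the idea: fake-quantize activations token-by-token and keep a small fraction of tokens at a higher bit-width. Selecting the largest-norm tokens is an assumption made for illustration, the sequence transformation itself is omitted, and nothing here reproduces STaMP.

```python
import torch

def fake_quant(x, n_bits):
    """Symmetric per-token fake quantization to n_bits."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def mixed_precision_tokens(act, low_bits=4, high_bits=8, high_frac=0.05):
    """Quantize activations of shape (seq_len, hidden) token-by-token; keep the
    `high_frac` largest-norm tokens at `high_bits`, the rest at `low_bits`."""
    seq_len = act.shape[0]
    n_high = max(1, int(high_frac * seq_len))
    norms = act.norm(dim=-1)
    high_idx = norms.topk(n_high).indices
    out = torch.stack([fake_quant(tok, low_bits) for tok in act])
    out[high_idx] = torch.stack([fake_quant(act[i], high_bits) for i in high_idx])
    return out

acts = torch.randn(128, 256)                     # hypothetical activations
print(mixed_precision_tokens(acts).shape)        # torch.Size([128, 256])
```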
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3979. GeoDiv: Framework for Measuring Geographical Diversity in Text-to-Image Models
๐ฌ 26
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ geographical diversity
Text-to-image (T2I) models are rapidly gaining popularity, yet their outputs often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Given their broad reach, it is critical to rigorously evaluate how these models portray the world. Existing diversity metrics either rely on curated datasets or focus on surface-level visual similarity, limiting interpretability. We introduce GeoDiv, a framework leveraging large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), capturing economic and condition-related cues, and the Visual Diversity Index (VDI), measuring variation in primary entities and backgrounds. Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across $10$ entities and $16$ countries, GeoDiv reveals a consistent lack of diversity and identifies fine-grained attributes where models default to biased portrayals. Strikingly, depictions of India, Nigeria, and Colombia are disproportionately impoverished and worn, reflecting underlying socio-economic biases. These results highlight the need for greater geographical nuance in generative models. GeoDiv provides the first systematic, interpretable framework for measuring such biases, marking a step toward fairer and more inclusive generative systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3980. Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
๐ฌ 20
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Unsupervised Learning Representation Space Geometry
Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to "variables" used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3981. Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
๐ฌ 12
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Continuous-time multi-agent reinforcement learning physics-informed neural networks
Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing discrete-time Bellman recursion with differentiable value functions defined as viscosity solutions of the Hamilton-Jacobi-Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional methods for solving HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with learning-based approaches to alleviate the CoD, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient accuracy, in turn yielding more precise value approximations and stronger policy learning. We evaluate our method using continuous-time variants of standard benchmarks, including multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results demonstrate that our approach consistently outperforms existing continuous-time RL baselines and scales to complex cooperative multi-agent dynamics.
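A minimal sketch of the PINN ingredient: penalizing the HJB residual of a value network with automatic differentiation. The Hamiltonian below assumes single-agent, single-integrator dynamics dx/dt = u with quadratic cost, which admits the closed-form residual V_t + q||x||^2 - ||grad V||^2 / (4r); this is a toy stand-in, not the paper's multi-agent formulation or its VGI module.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))

def hjb_residual(value_net, t, x, q=1.0, r=1.0):
    """HJB residual for dx/dt = u with running cost q||x||^2 + r||u||^2,
    whose minimizing control is u* = -grad_x V / (2r)."""
    t = t.requires_grad_(True)
    x = x.requires_grad_(True)
    V = value_net(t, x)
    V_t, V_x = torch.autograd.grad(V.sum(), (t, x), create_graph=True)
    return V_t + q * (x ** 2).sum(-1, keepdim=True) - (V_x ** 2).sum(-1, keepdim=True) / (4 * r)

# the PINN loss is the mean squared residual over sampled (t, x) collocation points
net = ValueNet(state_dim=2)
t = torch.rand(128, 1)
x = torch.randn(128, 2)
loss = hjb_residual(net, t, x).pow(2).mean()
loss.backward()
```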
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3982. GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ GUI Visual Grounding Reinforcement Fine-Tuning
Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), demanding extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, the recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. However, despite its promise, the optimal manner of RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement-learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. Preliminarily, we find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration of RFT. First, we decompose RFT into its core components and analyze the optimal formulation of each. Second, as part of this exploration, we propose a novel Adversarial KL Factor that dynamically stabilizes training to mitigate reward over-optimization. Third, we further explore the training configurations of RFT to enhance effectiveness. Extensive experiments show that GuirlVG, with only 5.2K training samples, outperforms SFT methods trained on over 10M samples, achieving a +7.7% improvement on ScreenSpot, a +17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3983. DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ multimodal large language model visual tokenizer
The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level visual appearance, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives creates conflicts, leading to degraded performance in both reconstruction fidelity and semantic accuracy. Instead of forcing a single codebook to capture both visual appearance and semantics, DualToken disentangles them by introducing separate codebooks for high-level semantics and low-level visual details, effectively turning their inherent conflict into a synergistic relationship. As a result, DualToken sets a new record of 0.25 rFID and 82.0% zero-shot accuracy on ImageNet, and demonstrates strong effectiveness in downstream MLLM tasks for both understanding and generation. Specifically, our method surpasses VILA-U by 5.8 points on average across ten visual understanding benchmarks and delivers a 13% improvement on GenAI-Bench, attaining state-of-the-art performance among existing autoregressive unified models. Notably, incorporating dual visual tokens consistently outperforms using a single token type on both understanding and generation tasks. We hope our research offers a new perspective on leveraging dual visual vocabularies for building unified visionโlanguage models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3984. EasyCreator: Empowering 4D Creation through Video Inpainting
๐ฌ 18
โญ 5.00
๐ generative models
๐ท๏ธ diffusion model; 4D video generation and editing
We introduce EasyCreator, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite masks dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the modelโs generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3985. Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
๐ฌ 4
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark MLLM Reasoning
Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of tasks. This progression parallels children's development, which shifts from language-only reasoning to integrated vision-language reasoning.
Inspired by the Wechsler Intelligence Scales, we introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: execution, perception reasoning, learning, memory, and planning.
The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring more accurate and robust evaluation of MLLM capabilities.
KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community.
Through evaluation of state-of-the-art MLLMs using KidGym, we gain significant insights into model capabilities and reveal important limitations of current models. We release our benchmark at: https://kidgym.github.io/KidGym-Website/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3986. FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
๐ฌ 24
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Efficient Reasoning Attention Outlier Reasoning
We propose **FROST**, an attention-aware method for efficient reasoning.
Unlike traditional approaches, FROST leverages attention weights to prune non-critical reasoning paths, yielding shorter and more reliable reasoning trajectories.
Methodologically, we introduce the concept of reasoning outliers and design an attention-based mechanism to remove them.
Theoretically, FROST preserves and enhances the model's reasoning capacity while eliminating outliers at the sentence level.
Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-oss-20B), outperforming state-of-the-art methods such as TALE and ThinkLess.
Notably, FROST achieves an average **58.72%** reduction in token usage and a **10.64%** improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm $\lVert \mathbf{x} \rVert_{\infty}$
by **15.97%** and the average kurtosis by **91.09%** compared to the base model.
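*Illustrative sketch (not from the paper):* FROST is described as scoring and removing sentence-level reasoning outliers using attention weights. One hypothetical way to implement attention-mass scoring and outlier pruning is sketched below; the scoring rule, z-score threshold, and sentence spans are assumptions.

```python
import torch

def sentence_attention_scores(attn, spans):
    """attn: (seq, seq) attention weights averaged over heads/layers (assumed given).
    spans: list of (start, end) token index ranges, one per sentence.
    Score each sentence by the average attention it receives from all query positions.
    """
    scores = []
    for start, end in spans:
        received = attn[:, start:end].sum(dim=-1)   # attention mass flowing into the span
        scores.append(received.mean().item())
    return scores

def prune_outlier_sentences(sentences, scores, z_thresh=-1.0):
    """Drop sentences whose score is far below the mean (an illustrative outlier rule)."""
    s = torch.tensor(scores)
    z = (s - s.mean()) / (s.std() + 1e-8)
    return [sent for sent, zi in zip(sentences, z) if zi.item() > z_thresh]

# Toy usage with random attention over 12 tokens split into 3 "sentences".
attn = torch.softmax(torch.randn(12, 12), dim=-1)
spans = [(0, 4), (4, 8), (8, 12)]
sentences = ["step 1 ...", "step 2 ...", "step 3 ..."]
kept = prune_outlier_sentences(sentences, sentence_attention_scores(attn, spans))
print(kept)
```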
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3987. Watermarking Diffusion Language Models
๐ฌ 4
โญ 5.00
๐ generative models
๐ท๏ธ Watermarks Diffusion Language Models LLM
We introduce the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially. While there has been much work in ARLM watermarking, a key challenge when attempting to apply these schemes directly to the DLM setting is that they rely on previously generated tokens, which are not always available with DLM generation. In this work we address this challenge by: (i) applying the watermark in expectation over the context even when some context tokens are yet to be determined, and (ii) promoting tokens which increase the watermark strength when used as context for other tokens. This is accomplished while keeping the watermark detector unchanged. Our experimental evaluation demonstrates that the DLM watermark leads to a >99\% true positive rate with minimal quality impact and achieves similar robustness to existing ARLM watermarks, enabling for the first time reliable DLM watermarking.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3988. WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control
๐ฌ 17
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Model-based RL
Model-based reinforcement learning promises strong sample efficiency but often underperforms in practice due to compounding model error, unimodal world models that average over multi-modal dynamics, and overconfident predictions that bias learning. We introduce WIMLE, a model-based method that extends Implicit Maximum Likelihood Estimation (IMLE) to the model-based RL framework to learn stochastic, multi-modal world models without iterative sampling and to estimate predictive uncertainty via ensembles and latent sampling. During training, WIMLE weights each synthetic transition by its predicted confidence, preserving useful model rollouts while attenuating bias from uncertain predictions and enabling stable learning. Across 40 continuous-control tasks spanning DeepMind Control, MyoSuite, and HumanoidBench, WIMLE achieves superior sample efficiency and competitive or better asymptotic performance than strong model-free and model-based baselines. Notably, on the challenging Humanoid-run task, WIMLE improves sample efficiency by over 50\% relative to the strongest competitor, and on HumanoidBench it solves 8 of 14 tasks (versus 4 for BRO and 5 for SimbaV2). These results highlight the value of IMLE-based multi-modality and uncertainty-aware weighting for stable model-based RL.
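*Illustrative sketch (not from the paper):* the abstract mentions weighting each synthetic transition by its predicted confidence, estimated from ensembles. A rough sketch of a disagreement-based confidence weight applied to a per-transition loss; the ensemble, the exponential mapping from disagreement to weight, and the placeholder TD errors are assumptions.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def confidence_weights(ensemble, s, a, temperature=1.0):
    """Confidence from ensemble disagreement: low disagreement -> weight near 1."""
    with torch.no_grad():
        preds = torch.stack([m(s, a) for m in ensemble], dim=0)   # (E, B, obs)
        disagreement = preds.std(dim=0).mean(dim=-1)              # (B,)
        return torch.exp(-disagreement / temperature)             # (B,) in (0, 1]

# Toy usage: down-weight a per-transition critic loss by model confidence.
obs_dim, act_dim, batch = 8, 2, 32
ensemble = [DynamicsModel(obs_dim, act_dim) for _ in range(5)]
s, a = torch.randn(batch, obs_dim), torch.randn(batch, act_dim)
per_transition_loss = torch.randn(batch).abs()       # placeholder TD errors
w = confidence_weights(ensemble, s, a)
weighted_loss = (w * per_transition_loss).mean()
print(float(weighted_loss))
```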
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3989. Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
๐ฌ 5
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Vision-Language Models Chart Parsing
While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3990. FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
๐ฌ 19
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Molecular Graph Generation Discrete Flow Matching Fragment-Based Drug Discovery Natural Product
We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate modern molecular graph generative models' ability to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a FragFM comparative study against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3991. Can Vision-Language Models Answer Face to Face Questions in the Real-World?
๐ฌ 13
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Situated Dataset Multi-Modal Dataset Vision Language Models
AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources for the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3992. Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
๐ฌ 8
โญ 5.00
๐ generative models
๐ท๏ธ Large Vision-Language Model Visual Question Answering
Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit *human uncertainty* (**HU**), i.e., variation in human confidence across annotations, yet standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: *How does HU affect SFT*, and *how can HU be effectively leveraged in training?* In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little to, or even degrade, model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce **HaDola**, a **h**uman uncertainty-**a**ware **d**ata selection and aut**o**matic **la**beling framework. HaDola operates in four stages: **discriminate**, **self-annotate**, **error trigger**, and **training**, to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines, with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
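*Illustrative sketch (not from the paper):* one simple way to quantify per-sample human uncertainty from raw annotations is the normalized entropy of the answer distribution, which can then drive filtering and seed-set selection. The threshold, seed fraction, and data format below are hypothetical, not HaDola's actual pipeline.

```python
import math
from collections import Counter

def human_uncertainty(annotations):
    """Normalized entropy of the human answer distribution: 0 = full agreement, 1 = uniform."""
    counts = Counter(annotations)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    if len(counts) <= 1:
        return 0.0
    return entropy / math.log(len(counts))

def select_low_hu(dataset, hu_threshold=0.6, seed_fraction=0.05):
    """Keep a small seed of the most-agreed-upon samples, drop high-HU ones."""
    scored = sorted(dataset, key=lambda ex: human_uncertainty(ex["answers"]))
    seed_size = max(1, int(seed_fraction * len(scored)))
    seed = scored[:seed_size]
    kept = [ex for ex in scored if human_uncertainty(ex["answers"]) <= hu_threshold]
    return seed, kept

# Toy usage.
data = [
    {"question": "q1", "answers": ["cat", "cat", "cat", "cat"]},
    {"question": "q2", "answers": ["red", "blue", "red", "green"]},
    {"question": "q3", "answers": ["yes", "yes", "no", "yes"]},
]
seed, kept = select_low_hu(data)
print([ex["question"] for ex in kept])
```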
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3993. WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
๐ฌ 13
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Web Agent
To significantly advance the capabilities of open-source web agents, we present WebSailor-V2, a complete post-training pipeline encompassing data construction, Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL). Our methodology features two key innovations: (1) On the data front, we developed SailorFog-QA-2, a novel dataset built from a densely interconnected knowledge graph that introduces a wide variety of uncertainties beyond simple obfuscation, fostering more sophisticated reasoning. (2) For training, we engineered a dual-environment RL framework, combining a high-fidelity simulator for rapid, low-cost algorithmic iteration with a robust, managed real-world environment for stable final policy training, all integrated within a symbiotic data-policy feedback loop. Trained on the Qwen3-30B-A3B model, WebSailor-V2 achieves state-of-the-art results, scoring 35.3 on BrowseComp-EN, 44.1 on BrowseComp-ZH, and 30.6 on Humanity's Last Exam (HLE). Notably, our 30B-A3B MOE agent significantly outperforms all existing open-source agents and surpasses even the 671B DeepSeek-V3.1, demonstrating performance competitive with leading proprietary systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3994. Adaptive Acquisition Selection for Bayesian Optimization with Large Language Models
๐ฌ 12
โญ 5.00
๐ optimization
๐ท๏ธ Bayesian Optimization Large Language Models
Bayesian Optimization critically depends on the choice of acquisition function, but no single strategy is universally optimal; the best choice is non-stationary and problem-dependent.
Existing adaptive portfolio methods often base their decisions on past function values while ignoring richer information like remaining budget or surrogate model characteristics.
To address this, we introduce LMABO, a novel framework that casts a pre-trained Large Language Model (LLM) as a zero-shot, online strategist for the BO process.
At each iteration, LMABO uses a structured state representation to prompt the LLM to select the most suitable acquisition function from a diverse portfolio.
In an evaluation across 50 benchmark problems, LMABO demonstrates a significant performance improvement over strong static, adaptive portfolio, and other LLM-based baselines.
We show that the LLM's behavior constitutes a comprehensive strategy that adapts to real-time progress, indicating that its advantage stems from its ability to process and synthesize the complete optimization state into an effective, adaptive policy.
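*Illustrative sketch (not from the paper):* the described loop, i.e., summarizing the optimization state into a structured prompt and letting an LLM pick an acquisition function from a portfolio each iteration, could be wired up roughly as below. The prompt fields, the portfolio, and the `query_llm` / `bo_step_with` stubs are placeholders.

```python
import random

PORTFOLIO = ["EI", "UCB", "PI", "Thompson"]

def build_state_prompt(step, budget, best_value, recent_values, surrogate_lengthscale):
    # A structured textual summary of the optimization state (fields are illustrative).
    return (
        f"Step {step}/{budget}. Best value so far: {best_value:.4f}. "
        f"Last observed values: {recent_values}. "
        f"Surrogate lengthscale: {surrogate_lengthscale:.3f}. "
        f"Choose one acquisition function from {PORTFOLIO} and answer with its name only."
    )

def query_llm(prompt):
    """Placeholder for an LLM call (e.g., via an API client); here we pick randomly."""
    return random.choice(PORTFOLIO)

def bo_step_with(acquisition_name, observed_x, observed_y):
    """Placeholder for one Bayesian-optimization step using the chosen acquisition."""
    return random.uniform(0.0, 1.0)

def objective(x):
    return -(x - 0.3) ** 2  # toy 1-D objective

observed_x, observed_y = [0.5], [objective(0.5)]
budget = 10
for step in range(1, budget + 1):
    prompt = build_state_prompt(step, budget, max(observed_y),
                                [round(v, 3) for v in observed_y[-3:]], 0.2)
    acq = query_llm(prompt)
    x_next = bo_step_with(acq, observed_x, observed_y)
    observed_x.append(x_next)
    observed_y.append(objective(x_next))
print("best:", max(observed_y))
```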
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3995. PINFDiT: Energy-Based Physics-Informed Diffusion Transformers for General-purpose Time Series Tasks
๐ฌ 18
โญ 5.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Diffusion; Transformer; Time Series; Physics Informed Machine Learning; Physics-Guided Inference in Time Series Diffusion Transformers
Time series analysis underpins scientific advances. While specialized models have advanced various time series tasks, scientific domains face unique challenges: limited samples with complex physical dynamics, missing observations, multi-resolution sampling, and requirements for physical consistency. With the increasing demands on generative modeling capabilities, we introduce PINFDiT, a diffusion transformer-based model with physics injection during inference. Our approach combines a transformer backbone for capturing temporal dependencies with a comprehensive masking strategy that addresses imperfect data. The diffusion framework enables high-quality sample generation with inherent generative capability. In addition, our model-free physics-guided correction steers generated samples toward physically consistent solutions using calibrated Langevin dynamics, which balances distribution fidelity and physical law adherence without architectural modifications or retraining. Our evaluation demonstrates PINFDiT's effectiveness across multivariate forecasting with imperfect data, physics knowledge incorporation in data-limited scenarios, and zero-shot and fine-tuning performance across diverse domains, establishing it as a proto-foundation model that bridges the gap between general-purpose and domain-specific models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3996. Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance
๐ฌ 28
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ vision visual invariance feature visualization human-AI alignment deep convolutional neural networks NeuroAI psychophysics computational neuroscience robustness adversarial attacks evolutionary algorithm gradient-free optimization machine learning
Uncovering which feature combinations are encoded by visual units is critical to understanding how images are transformed into representations that support recognition. While existing feature visualization approaches typically infer a unit's most exciting images, this is insufficient to reveal the manifold of transformations under which responses remain invariant, which is critical to generalization in vision.
Here we introduce Stretch-and-Squeeze (SnS), an unbiased, model-agnostic, and gradient-free framework to systematically characterize a unit's maximally invariant stimuli, and its vulnerability to adversarial perturbations, in both biological and artificial visual systems. SnS frames these transformations as bi-objective optimization problems. To probe invariance, SnS seeks image perturbations that maximally alter (stretch) the representation of a reference stimulus in a given processing stage while preserving unit activation downstream (squeeze). To probe adversarial sensitivity, stretching and squeezing are reversed to maximally perturb unit activation while minimizing changes to the upstream representation.
Applied to CNNs, SnS revealed invariant transformations that were farther from a reference image in pixel-space than those produced by affine transformations, while more strongly preserving the target unit's response. The discovered invariant images differed depending on the stage of the image representation used for optimization: pixel-level changes primarily affected luminance and contrast, while stretching mid- and late-layer representations mainly altered texture and pose.
By measuring how well the hierarchical invariant images obtained for $L_2$-robust (i.e., adversarially trained) networks were classified by humans and other observer networks, we discovered a substantial drop in their interpretability when the representation was stretched in deep layers, while the opposite trend was found for standard (i.e., not robustified) models. This indicates that $L_2$ adversarial training fails to increase the interpretability of high-level invariances, despite good perceptual alignment between humans and robustified models at the pixel level. This demonstrates how SnS can be used as a powerful new tool to measure the alignment between artificial and biological vision.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3997. Inpainting-Guided Policy Optimization for Diffusion Large Language Models
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Large Language Models Reinforcement Learning Inpainting Group Relative Policy Optimization
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity: their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning.
We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across four mathematical benchmarks (GSM8K, Math500, AMC, and Minerva), achieving new state-of-the-art results for full-attention masked dLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3998. Improving LLM-based Global Optimization with Search Space Partitioning
๐ฌ 14
โญ 5.00
๐ optimization
๐ท๏ธ global optimization LLMs bayesian optimization
Large Language Models (LLMs) have recently emerged as effective surrogate models and candidate generators within global optimization frameworks for expensive black-box functions. Despite promising results, LLM-based methods often struggle in high-dimensional search spaces or when lacking domain-specific priors, leading to sparse or uninformative suggestions. To overcome these limitations, we propose HOLLM, a novel global optimization algorithm that enhances LLM-driven sampling by partitioning the search space into promising subregions. Each subregion acts as a "meta-arm" selected via a bandit-inspired scoring mechanism that effectively balances exploration and exploitation. Within each selected subregion, an LLM then proposes high-quality candidate points, without any explicit domain knowledge. Empirical evaluation on standard optimization benchmarks shows that HOLLM consistently matches or surpasses leading global optimization methods, while substantially outperforming traditional global LLM-based sampling strategies.
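*Illustrative sketch (not from the paper):* a minimal version of the described loop, i.e., score subregions as bandit "meta-arms", pick one, and let an LLM (stubbed here) propose a candidate inside it. The UCB-style score, region grid, and toy objective are assumptions.

```python
import math
import random

def region_score(region_best, n_region, n_total, explore_weight=1.0):
    """UCB-style score for a subregion ("meta-arm"): exploit its best value, explore rarely sampled regions."""
    bonus = explore_weight * math.sqrt(math.log(n_total + 1) / (n_region + 1))
    return region_best + bonus

def propose_in_region(region, history):
    """Placeholder for the LLM proposing a high-quality candidate inside the chosen region."""
    lo, hi = region
    return random.uniform(lo, hi)

def objective(x):
    return 1.0 - (x - 0.37) ** 2  # toy 1-D objective to maximize

regions = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
stats = {r: {"best": 0.0, "n": 0} for r in regions}
history = []

for _ in range(30):
    n_total = sum(s["n"] for s in stats.values())
    chosen = max(regions, key=lambda r: region_score(stats[r]["best"], stats[r]["n"], n_total))
    x = propose_in_region(chosen, history)
    y = objective(x)
    history.append((x, y))
    stats[chosen]["n"] += 1
    stats[chosen]["best"] = max(stats[chosen]["best"], y)

print("best:", max(y for _, y in history))
```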
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
3999. Benchmarking Overton Pluralism in LLMs
๐ฌ 17
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Pluralism Overton pluralism pluralistic alignment benchmark
We introduce a novel framework for measuring Overton pluralism in LLMs: the extent to which diverse viewpoints are represented in model outputs. We (i) formalize Overton pluralism as a set-coverage metric (OVERTONSCORE), (ii) conduct a large-scale US-representative human study (N=1209; 60 questions; 8 LLMs), and (iii) develop an automated benchmark that closely reproduces human judgments. On average, models achieve OVERTONSCOREs of 0.35-0.41, with Deepseek V3 performing best; yet all models remain far below the theoretical maximum of 1.0, revealing substantial headroom for improvement. Because repeated large-scale human studies are costly and slow, scalable evaluation tools are essential for model development. Hence, we propose an automated benchmark that achieves high rank correlation with human judgments ($\rho=0.88$), providing a practical proxy while not replacing human assessment.
By turning pluralistic alignment from a normative aim into a measurable benchmark, our work establishes a foundation for systematic progress toward more pluralistic LLMs.
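*Illustrative sketch (not from the paper):* the abstract defines OVERTONSCORE as a set-coverage metric over viewpoints. A minimal version of such a coverage score; the exact-match `covers` stub stands in for the human- or LLM-judged matching the paper relies on.

```python
def overton_coverage(response_viewpoints, reference_viewpoints, covers=None):
    """Fraction of reference viewpoints that the model response covers (set coverage)."""
    if covers is None:
        covers = lambda ref, resp: ref == resp   # stub matcher; replace with a judge
    covered = {
        ref for ref in reference_viewpoints
        if any(covers(ref, resp) for resp in response_viewpoints)
    }
    return len(covered) / max(1, len(reference_viewpoints))

# Toy usage: 2 of 4 reference viewpoints appear in the response -> 0.5
reference = {"pro-regulation", "anti-regulation", "cost concerns", "privacy concerns"}
response = {"pro-regulation", "privacy concerns"}
print(overton_coverage(response, reference))
```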
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4000. Linear Mechanisms for Spatiotemporal Reasoning in Vision Language Models
๐ฌ 16
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ interpretability vision VLMs visual reasoning spatial understanding temporal understanding video
Spatio-temporal reasoning is a remarkable capability of Vision Language Models (VLMs), but the underlying mechanisms of such abilities remain largely opaque. We postulate that visual/geometrical and textual representations of spatial structure must be combined at some point in VLM computations. We search for such confluence, and ask whether the identified representation can causally explain aspects of input-output model behavior through a linear model. We show empirically that VLMs encode object locations by linearly binding \textit{spatial IDs} to textual activations, then perform reasoning via language tokens. Through rigorous causal interventions we demonstrate that these IDs, which are ubiquitous across the model, can systematically mediate model beliefs at intermediate VLM layers. Additionally, we find that spatial IDs serve as a diagnostic tool for identifying limitations and bottlenecks in existing VLMs. We extend our analysis to video VLMs and identify an analogous linear temporal ID mechanism. By characterizing our proposed spatiotemporal ID mechanism, we elucidate a previously underexplored internal reasoning process in VLMs, toward improved interpretability and the principled design of more aligned and capable models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4001. Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Speculative Decoding Joint Intractability Lossless Verification
Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient mass across accessible branches. Through extensive large-scale experiments, we show that HSD consistently improves acceptance rates, especially with longer draft sequences. Its strong explainability and generality further highlight the potential for integration into a wide range of speculative decoding frameworks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4002. On the Sample Complexity of GNNs
๐ฌ 21
โญ 5.00
๐ learning theory
๐ท๏ธ Graph Neural Networks (GNNs) sample complexity lower bounds
Graph Neural Networks (GNNs) achieve strong empirical performance across domains, yet their fundamental statistical behavior remains poorly understood. This paper develops a minimax analysis of ReLU message-passing GNNs with explicit architectural assumptions, in both inductive (graph-level) and transductive (node-level) settings. For arbitrary graphs without structural constraints, we show that the worst-case generalization error scales as $\sqrt{\log d / n}$ with sample size $n$ and input dimension $d$, matching the $1/\sqrt{n}$ behavior of feed-forward networks. Under a spectral--homophily condition combining strong label homophily and bounded spectral expansion, we prove a stronger minimax lower bound of $d/\log n$ for transductive node prediction. We complement these results with a systematic empirical study on three large-scale benchmarks (ogbn\_arxiv, ogbn\_products\_50k, Reddit\_50k) and two controlled synthetic datasets representing the worst-case and structured regimes of our theory.
All real graphs satisfy the spectral--homophily condition, and ratio-based scaling tests show error decay consistent with the $d/\log n$ rate in real and structured settings, while the worst-case synthetic dataset follows the $\sqrt{\log d / n}$ curve. Together, these results indicate that practical GNN tasks often operate in the spectral--homophily regime, where our lower bound $d/\log n$ is tight and effective sample complexity is driven by graph topology rather than universal $1/\sqrt{n}$ behavior.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4003. VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Vision Language Model Multi-turn
Reinforcement learning finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, multi-turn self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on original image inputs, falling short of incorporating visual reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms.
We introduce VTool-R1, the first RFT framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards, our approach elicits strategic visual tool use for multi-modal reasoning without relying on process-based supervision. Extensive experiments on structured visual reasoning over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools. To support future research in multi-turn multi-modal reasoning, we will release our full training and evaluation code as open source.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4004. Planning with an Embodied Learnable Memory
๐ฌ 12
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied Memory Planning Reinforcement Learning
We develop a novel memory representation for embodied planning models performing long-horizon mobile manipulation in dynamic, large-scale indoor environments. Prior memory representations fall short in this setting, as they struggle with object movements, suffer from computational deficiencies, and often depend on the heuristic integration of multiple models. To overcome these limitations, we present the Embodied Perception Memory (EMP), a learnable memory designed for embodied planning. EMP is implemented as a unified Vision-Language Model (VLM) that uses egocentric vision to maintain and update a textual environment representation. We further introduce two complementary methods for training planners to leverage the EMP: an imitation strategy that uses human trajectories for natural exploration and interaction, and a novel reinforcement learning approach, Dynamic Difficulty-Aware Fine-Tuning (DDAFT), which improves planning performance via difficulty-aware exploration. Our memory representation, when integrated with our planning training methods, leads to significant improvements on planning tasks, showing up to a 55% increase in success rate on the PARTNR benchmark compared to strong baselines. Also, our planning method outperforms these baselines even when they have access to groundtruth perception.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4005. SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization
๐ฌ 14
โญ 5.00
๐ learning theory
๐ท๏ธ contrastive learning; audio-text retrieval
Contrastive language-audio pretraining, which aims to unify multimodal representations in a shared embedding space, serves as a cornerstone for building a wide range of applications, from cross-modal retrieval to cutting-edge multimodal large language models. However, we find that the perpendicular component of the pushing force from negative samples in contrastive learning is a double-edged sword: it contains rich supplementary information from negative samples, yet its unconstrained nature causes optimization trajectory drift and training instability. To address this, we propose Support Vector Regularization (SVR), a method that introduces an auxiliary support vector to control this perpendicular component, aiming to harness its rich information while mitigating the associated trajectory drift. The efficacy of SVR is critically governed by its semantic radius, for which we explore two unsupervised modeling strategies: direct parameterization and an adaptive radius predictor module enhanced with constraints to improve its predicting accuracy. Extensive experimental results demonstrate that our method surpasses widely used baselines like InfoNCE and SigLIP loss across classification, monolingual retrieval, and multilingual retrieval on standard audio-text datasets. Both the theoretical analysis and the experimental results on optimizing trajectory drift validate the correctness and effectiveness of our SVR method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4006. A Graph Meta-Network for Learning on Kolmogorov-Arnold Networks
๐ฌ 18
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Weight Space Kolmogorov Arnold Networks Graph Neural Networks Symmetries Equivariance
Weight-space models learn directly from the parameters of neural networks, enabling tasks such as predicting their accuracy on new datasets. Naive methods -- like applying MLPs to flattened parameters -- perform poorly, making the design of better weight-space architectures a central challenge. While prior work leveraged permutation symmetries in standard networks to guide such designs, no analogous analysis or tailored architecture yet exists for Kolmogorov-Arnold Networks (KANs). In this work, we show that KANs share the same permutation symmetries as MLPs, and propose the KAN-graph, a graph representation of their computation.
Building on this, we develop WS-KAN, the first weight-space architecture that learns on KANs, which naturally accounts for their symmetry. We analyze WS-KAN's expressive power, showing it can replicate an input KAN's forward pass, a standard approach for assessing expressiveness in weight-space architectures. We construct a comprehensive ``zoo'' of trained KANs spanning diverse tasks, which we use as benchmarks to empirically evaluate WS-KAN. Across all tasks, WS-KAN consistently outperforms structure-agnostic baselines, often by a substantial margin.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4007. StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning
๐ฌ 5
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Video Class-Incremental Learning Frame-Shared Semantics Distillation Temporal Decomposition-based Mixture-of-Experts
Video Class-Incremental Learning (VCIL) seeks to develop models that continuously learn new action categories over time without forgetting previously acquired knowledge. Unlike traditional Class-Incremental Learning (CIL), VCIL introduces the added complexity of spatiotemporal structures, making it particularly challenging to mitigate catastrophic forgetting while effectively capturing both frame-shared semantics and temporal dynamics. Existing approaches either rely on exemplar rehearsal, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling. To address these limitations, we propose Spatiotemporal Preservation and Routing (StPR), a unified and exemplar-free VCIL framework that explicitly disentangles and preserves spatiotemporal information. We begin by introducing Frame-Shared Semantics Distillation (FSSD), which identifies semantically stable and meaningful channels by jointly considering channel-wise sensitivity and classification contribution. By selectively regularizing these important semantic channels, FSSD preserves prior knowledge while allowing for adaptation. Building on this preserved semantic space, we further design a Temporal Decomposition-based Mixture-of-Experts (TD-MoE), which dynamically routes task-specific experts according to temporal dynamics, thereby enabling inference without task IDs or stored exemplars. Through the synergy of FSSD and TD-MoE, StPR progressively leverages spatial semantics and temporal dynamics, culminating in a unified, exemplar-free VCIL framework. Extensive experiments on UCF101, HMDB51, SSv2 and Kinetics400 show that our method outperforms existing baselines while offering improved interpretability and efficiency in VCIL. Code is available in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4008. Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter
๐ฌ 24
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Data Generation Eigenvalue Problem AI4PDE
Eigenvalue problems are among the most important topics in many scientific disciplines. With the recent surge and development of machine learning, neural eigenvalue methods have attracted significant attention as a forward pass of inference requires only a tiny fraction of the computation time compared to traditional solvers.
However, a key limitation is the requirement for large amounts of labeled data in training, including operators and their eigenvalues.
To tackle this limitation, we propose a novel method, named **S**orting **C**hebyshev **S**ubspace **F**ilter (**SCSF**), which significantly accelerates eigenvalue data generation by leveraging similarities between operators---a factor overlooked by existing methods.
Specifically, SCSF employs truncated fast Fourier transform (FFT) sorting to group operators with similar eigenvalue distributions and constructs a Chebyshev subspace filter that leverages eigenpairs from previously solved problems to assist in solving subsequent ones, reducing redundant computations.
To the best of our knowledge, SCSF is the first method to accelerate eigenvalue data generation.
Experimental results show that SCSF achieves up to a $3.5\times$ speedup compared to various numerical solvers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4009. Attention Is All You Need for KV Cache in Diffusion LLMs
๐ฌ 20
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion LLMs Attention-aware KV Cache Update Layer-aware KV Cache Update
This work studies how to adaptively recompute keyโvalue (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
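*Illustrative sketch (not from the paper):* the key observation is that the most-attended token exhibits the smallest KV drift, so its drift can serve as a conservative trigger for refreshing deeper-layer caches. A hypothetical "when to refresh" test built on that idea; tensor shapes, the relative-drift measure, and the tolerance are assumptions.

```python
import torch

def most_attended_index(attn_weights):
    """attn_weights: (num_queries, num_keys) for the current step; return the key receiving most attention."""
    return int(attn_weights.sum(dim=0).argmax())

def kv_drift(cached_k, cached_v, fresh_k, fresh_v, idx):
    """Relative change of the cached K/V for one token vs. freshly recomputed K/V."""
    dk = (fresh_k[idx] - cached_k[idx]).norm() / (cached_k[idx].norm() + 1e-8)
    dv = (fresh_v[idx] - cached_v[idx]).norm() / (cached_v[idx].norm() + 1e-8)
    return torch.maximum(dk, dv)

def should_refresh(attn_weights, cached_k, cached_v, fresh_k, fresh_v, tol=0.05):
    """Refresh deeper-layer caches only if even the most-attended token has drifted past tol."""
    idx = most_attended_index(attn_weights)
    return bool(kv_drift(cached_k, cached_v, fresh_k, fresh_v, idx) > tol)

# Toy usage with random tensors (seq_len=16, head_dim=8).
torch.manual_seed(0)
attn = torch.softmax(torch.randn(16, 16), dim=-1)
cached_k, cached_v = torch.randn(16, 8), torch.randn(16, 8)
fresh_k = cached_k + 0.01 * torch.randn_like(cached_k)   # small drift
fresh_v = cached_v + 0.01 * torch.randn_like(cached_v)
print(should_refresh(attn, cached_k, cached_v, fresh_k, fresh_v))
```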
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4010. Enhancing LLMs for Knowledge Base Question Answering by Chain-of-Decomposition
๐ฌ 12
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ LLMs; LoRA; KBQA
Large language models (LLMs) have demonstrated remarkable success across diverse domains through in-context learning or fine-tuning. However, adapting LLMs to Knowledge Base Question Answering (KBQA) remains challenging, as KBQA necessitates multi-step reasoning over large-scale structured knowledge bases. Directly prompting LLMs with entire knowledge bases incurs prohibitive computational costs, while existing methods provide limited guidance on effectively fine-tuning LLMs for such complex reasoning tasks. In this work, we propose Chain-of-Decomposition (\texttt{CoD}), a novel framework that decomposes KBQA into three modular steps: (1) an LLM-free retrieval module to extract query-relevant subgraphs from the knowledge base, (2) a parameter-free reformulation step that transforms retrieved contexts into structured reasoning paths, and (3) a lightweight LLM-based reasoning module trained to evaluate the logical validity of each path. By isolating computation-heavy retrieval and rule-based reformulation from LLM reasoning, \texttt{CoD} reduces task complexity and enables efficient fine-tuning focused solely on the final verification step. Comprehensive experiments demonstrate that Llama-2 7B, fine-tuned with the proposed \texttt{CoD} surpasses strong baselines, including GPT-4 augmented with retrieved knowledge, achieving state-of-the-art performance on WebQSP and MetaQA benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4011. LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation
๐ฌ 20
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Efficiency Key-Value Cache Compression Long-Context LLM Inference Optimization
Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work improves eviction quality by "glimpsing into the future", in which a low-cost draft generator first produces a surrogate response that mimics the target model's true response, which is subsequently used to estimate the importance scores of cached KV. In this paper, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of a surrogate future response without the need for costly draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms a recent competitive baseline on long-context understanding tasks by $25$\%, but also reduces the eviction cost by up to $14.5\times$, leading to significantly faster time-to-first-token (TTFT).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4012. Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models
๐ฌ 29
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Modality Missing Plug-and-play module multimodal imbalanced learning
Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a **F**requency **R**atio **M**etric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a **M**ultimodal **W**eight **A**llocation **M**odule (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also yields further performance improvements for state-of-the-art methods addressing the missing-modality problem.
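*Illustrative sketch (not from the paper):* the abstract does not give the FRM formula, but a generic frequency-domain energy ratio between modalities, used to re-weight per-branch losses, conveys the idea. All quantities below are assumptions for illustration.

```python
import torch

def frequency_energy(features):
    """Spectral energy of a batch of 1-D feature vectors via the real FFT."""
    spectrum = torch.fft.rfft(features, dim=-1)
    return spectrum.abs().pow(2).sum(dim=-1).mean()

def modality_weights(feat_a, feat_b, eps=1e-8):
    """Illustrative re-balancing: give the weaker (lower-energy) modality a larger weight."""
    e_a, e_b = frequency_energy(feat_a), frequency_energy(feat_b)
    ratio = e_a / (e_b + eps)                  # >1 means modality A dominates under this proxy
    w_a = 1.0 / (1.0 + ratio)
    w_b = 1.0 - w_a
    return w_a, w_b, ratio

# Toy usage: weight per-modality branch losses during training.
image_feat = torch.randn(32, 256) * 2.0        # "dominant" modality
text_feat = torch.randn(32, 256) * 0.5
w_img, w_txt, frm = modality_weights(image_feat, text_feat)
loss_img, loss_txt = torch.tensor(0.8), torch.tensor(1.4)   # placeholder branch losses
total = w_img * loss_img + w_txt * loss_txt
print(float(frm), float(w_img), float(w_txt))
```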
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4013. Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
๐ฌ 16
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Chain-of-Thought; Reasoning Robustness
Existing research indicates that the output of **Chain-of-Thought (CoT)** is significantly affected by input perturbations.
Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research.
This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods.
Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs.
We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, and we prove that:
- *i)* This upper bound is **positively correlated** with the number of reasoning steps in the CoT;
- *ii)* Even an infinitely long reasoning process **cannot eliminate** the impact of input perturbations.
We then apply these conclusions to the **Linear Self-Attention (LSA)** model, which can be viewed as a simplified version of Transformer.
For the LSA model, we prove that the upper bound for input perturbation is **negatively correlated** with the norms of the input embedding and hidden state vectors.
To validate this theoretical analysis, we conduct experiments on **three mainstream datasets** and **four mainstream models**.
The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4014. MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ Flow Matching Model Acceleration Caching Mechanism Training-Free
We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian--vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves $4.12\times$, $4.56\times$, and $3.59\times$ acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.
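*Illustrative sketch (not from the paper):* the abstract describes constructing interval-average velocities from instantaneous velocities via Jacobian-vector products. A first-order version of that idea using `torch.func.jvp` is sketched below; the placeholder velocity field, the Taylor-style averaging, and the step schedule are assumptions, not MeanCache's actual caching mechanism.

```python
import torch

def velocity(x, t):
    """Placeholder instantaneous-velocity field v(x, t); a trained flow model in practice."""
    return -x * torch.sigmoid(t)

def average_velocity_step(x, t, dt):
    """First-order estimate of the interval-average velocity using one JVP.

    v_avg ~ v(x, t) + 0.5 * dt * dv/dt along the trajectory, where the total derivative
    dv/dt = (dv/dx) v + dv/dt is obtained as a Jacobian-vector product with tangent (v, 1).
    """
    v, dv_dt = torch.func.jvp(velocity, (x, t), (velocity(x, t), torch.ones_like(t)))
    v_avg = v + 0.5 * dt * dv_dt
    return x + dt * v_avg

# Toy usage: integrate a few coarse steps with the averaged velocity.
x = torch.randn(4, 8)
t = torch.zeros(4, 1)
dt = 0.25
for _ in range(4):
    x = average_velocity_step(x, t, dt)
    t = t + dt
print(x.norm().item())
```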
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4015. Bradley-Terry and Multi-Objective Reward Modeling Are Complementary
๐ฌ 28
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models; Reward Models
Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior.
Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution.
In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley-Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward functionโs ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline.
Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
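*Illustrative sketch (not from the paper):* a minimal version of jointly training a Bradley-Terry head and a multi-attribute regression head on a shared embedding, as the abstract describes. The encoder stand-in, loss weighting `lam`, and data format are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRewardModel(nn.Module):
    """Shared embedding with two heads: a scalar BT reward and K attribute regressors."""
    def __init__(self, emb_dim=768, num_attributes=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU())  # stand-in for an LLM encoder
        self.bt_head = nn.Linear(256, 1)
        self.attr_head = nn.Linear(256, num_attributes)

    def forward(self, emb):
        h = self.backbone(emb)
        return self.bt_head(h).squeeze(-1), self.attr_head(h)

def joint_loss(model, emb_chosen, emb_rejected, attr_chosen, attr_rejected, lam=1.0):
    r_c, a_c = model(emb_chosen)
    r_r, a_r = model(emb_rejected)
    bt = -F.logsigmoid(r_c - r_r).mean()                                  # Bradley-Terry preference loss
    reg = F.mse_loss(a_c, attr_chosen) + F.mse_loss(a_r, attr_rejected)   # multi-attribute regression
    return bt + lam * reg

# Toy usage with random "response embeddings" and attribute scores.
model = JointRewardModel()
emb_c, emb_r = torch.randn(16, 768), torch.randn(16, 768)
attr_c, attr_r = torch.rand(16, 4), torch.rand(16, 4)
loss = joint_loss(model, emb_c, emb_r, attr_c, attr_r)
loss.backward()
```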
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4016. OPRIDE: Efficient Offline Preference-based Reinforcement Learning via In-Dataset Exploration
๐ฌ 20
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Multi-Agent Systems Partial Observability Diffusion Models
Preference-based reinforcement learning (PbRL) can help avoid sophisticated reward designs and align better with human intentions, showing great promise in various real-world applications. However, obtaining human feedback for preferences can be expensive and time-consuming, which forms a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL, pinpointing two primary reasons: inefficient exploration and overoptimization of learned reward functions. In response to these challenges, we propose a novel algorithm, Offline PbRL via In-Dataset Exploration (OPRIDE), designed to enhance the query efficiency of offline PbRL. OPRIDE consists of two key features: a principled exploration strategy that maximizes the informativeness of the queries and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Through empirical evaluations, we demonstrate that OPRIDE significantly outperforms prior methods, achieving strong performance with notably fewer queries. Moreover, we provide theoretical guarantees of the algorithm's efficiency. Experimental results across various locomotion, manipulation, and navigation tasks underscore the efficacy and versatility of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4017. Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning
๐ฌ 25
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Inductive Knowledge Graph Reasoning Large Language Model Knowledge Graph Foundation Model
Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4018. Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning
๐ฌ 30
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Continual Learning Parameter-Efficient Fine-Tuning Full Fine-Tuning Catastrophic Forgetting Singular Value Decomposition Geometric Constraints Orthogonal Subspaces Low-Rank Subspaces Constrained Optimization
Continual learning in large language models (LLMs) is prone to catastrophic forgetting, where adapting to new tasks significantly degrades performance on previously learned ones. Existing parameter-efficient methods often limit model expressivity or introduce new parameters per task, creating scalability issues. To address these limitations, we introduce **Orthogonal Subspace Fine-Tuning (OSFT)**, a novel parameter-efficient approach for continual learning. OSFT leverages adaptive singular value decomposition (SVD) to dynamically identify and preserve critical, high-rank parameter subspaces that encode prior knowledge. All updates for new tasks are constrained to be strictly orthogonal to these preserved subspaces, which minimizes interference while maintaining a fixed parameter count and avoiding the need to store task-specific gradients. We extensively evaluate OSFT on standard continual learning benchmarks using both encoder-decoder (T5-Large) and decoder-only (LLaMA-2 7B, Mistral-7B) models across diverse tasks. Empirically, our method achieves a state-of-the-art trade-off between learnability and knowledge retention, dominating the Pareto frontier, with **up to 7\% higher** average accuracy than recent baselines like O-LoRA, and **reduces forgetting to near-negligible levels**. It notably maintains the model's general linguistic capabilities, instruction-following, and safety throughout the learning process. OSFT provides a practical, theoretically grounded, and scalable solution that effectively balances model plasticity and knowledge retention for continual learning in LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
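A minimal sketch of the orthogonal-update idea described in the OSFT entry above, assuming plain PyTorch and a fixed preserved rank k (the paper's adaptive SVD-based rank selection is not reproduced; names and shapes here are illustrative):

```python
import torch

def orthogonal_update(weight: torch.Tensor, update: torch.Tensor, k: int) -> torch.Tensor:
    """Remove the component of `update` that lies in the span of the top-k left
    singular vectors of `weight`, i.e. the subspace assumed to hold prior-task knowledge."""
    U, _, _ = torch.linalg.svd(weight, full_matrices=False)
    U_k = U[:, :k]                              # preserved subspace, shape [d_out, k]
    return update - U_k @ (U_k.T @ update)      # project out the preserved directions

W = torch.randn(256, 512)                       # a weight matrix carrying prior-task knowledge
grad = torch.randn(256, 512)                    # candidate update for a new task
grad_orth = orthogonal_update(W, grad, k=32)

# Sanity check: the constrained update has (numerically) no overlap with the preserved subspace.
U = torch.linalg.svd(W, full_matrices=False)[0]
print((U[:, :32].T @ grad_orth).abs().max())
```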
4019. Improving LLM Alignment with References
๐ฌ 16
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Alignment; LLM-as-a-Judge; Alignment Evaluation; Preference Optimization
The evaluation of LLM alignment is typically conducted in a reference-free manner that does not rely on reference outputs. This prevents the direct adaptation of recent LLM training methods that are based on verifiable metrics or rewards which rely on ground-truth outputs.
In this work, we investigate whether reference outputs can be effectively leveraged for improving LLM alignment. To this end, we first design evaluation methods that enhance LLM-based evaluators with high-quality reference outputs. Through comprehensive experiments, we show that the reference-guided evaluation method substantially improves the performance of less capable LLM-evaluators, using references generated by frontier LLMs. Moreover, strong LLM-evaluators can be further enhanced by human-written references. We then demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. The results show that reference-based LLMs-as-Judges yield clear gains over reference-free baselines in this semi-self-improvement setting, and achieve performance comparable to training with finetuned reward models. In particular, reference-guided self-improvement achieves scores of 73.1\% and 58.7\% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0\% and 74.1\% with Qwen2.5-7B. These results highlight the great potential of leveraging references for LLM training in non-verifiable domains using reference-guided LLM-based evaluators.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4020. Influence-Preserving Proxies for Gradient-Based Data Selection in LLM FineTuning
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Data Selection Large Language Models
Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits the model's downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce IProX, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model's influence. Experimental results across diverse LLM families and evaluation tasks show that IProX consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with IProX achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, IProX achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that IProX provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4021. Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models
๐ฌ 19
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ emotions latent space
This work investigates how large language models (LLMs) internally represent emotion by analyzing the geometry of their hidden-state space. Using a synthetic dataset of emotionally rewritten sentences, we identify a low-dimensional emotional manifold via singular value decomposition and show that emotional representations are directionally encoded, distributed across layers, and aligned with interpretable dimensions. These structures are stable across depth and generalize to eight real-world emotion datasets spanning five languages. Cross-domain alignment yields low error and strong linear probe performance, indicating a universal emotional subspace. Within this space, internal emotion perception can be steered while preserving semantics using a learned intervention module, with especially strong control for basic emotions across languages. These findings reveal a consistent and manipulable affective geometry in LLMs and offer insight into how they internalize and process emotion.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
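As a rough illustration of the direction-finding idea in the entry above (not the paper's pipeline), one can take the top singular direction of paired hidden-state differences as an "emotion axis" and nudge a hidden state along it; the array shapes and synthetic data below are assumptions:

```python
import numpy as np

def emotion_axis(h_emotional: np.ndarray, h_neutral: np.ndarray) -> np.ndarray:
    """Top right-singular direction of the centered hidden-state differences."""
    diff = h_emotional - h_neutral
    _, _, vt = np.linalg.svd(diff - diff.mean(axis=0, keepdims=True), full_matrices=False)
    return vt[0]                                   # unit vector in hidden space

def steer(hidden: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the emotion axis while leaving other directions untouched."""
    return hidden + alpha * axis

rng = np.random.default_rng(0)
h_neutral = rng.normal(size=(200, 768))                     # hidden states of neutral rewrites
h_emotional = h_neutral + 0.5 * rng.normal(size=(768,))     # synthetic shared emotional offset
axis = emotion_axis(h_emotional, h_neutral)
steered = steer(h_neutral[0], axis, alpha=2.0)
```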
4022. Searching for Privacy Risks in LLM Agents via Simulation
๐ฌ 21
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Agent Privacy Search AI Risk
The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. However, the evolving nature of such dynamic dialogues makes it challenging to anticipate emerging vulnerabilities and design effective defenses. To tackle this problem, we present a search-based framework that alternates between improving attack and defense strategies through the simulation of privacy-critical agent interactions. Specifically, we employ LLMs as optimizers to analyze simulation trajectories and iteratively propose new agent instructions. To explore the strategy space more efficiently, we further utilize parallel search with multiple threads and cross-thread propagation. Through this process, we find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery, while defenses evolve from simple rule-based constraints to robust identity-verification state machines. The discovered attacks and defenses transfer across diverse scenarios and backbone models, demonstrating strong practical utility for building privacy-aware agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4023. Towards Quantization-Aware Training for Ultra-Low-Bit Reasoning LLMs
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Quantization
Large language models (LLMs) have achieved remarkable performance across diverse reasoning tasks, yet their deployment is hindered by prohibitive computational and memory costs. Quantization-aware training (QAT) enables ultra-low-bit compression (<4 bits per weight), but existing QAT methods often degrade reasoning capability, partly because complex knowledge structures are introduced during the post-training process in LLMs. In this paper, through a systematic investigation of how quantization affects different data domains, we find that its impact on pre-training and reasoning capabilities differs. Building on this insight, we propose a novel two-stage QAT pipeline specifically designed for reasoning LLMs. In the first stage, we quantize the model using mixed-domain calibration data to preserve essential capabilities across domains; in the second stage, we fine-tune the quantized model with a teacher-guided reward-rectification loss to restore reasoning capability. We first demonstrate that mixed-domain calibration outperforms single-domain calibration by up to 2.74% on average over six tasks, including reasoning and pre-trained tasks. Subsequent experiments on five reasoning benchmarks show that our 2-bit-quantized Qwen3-8B outperforms post-training quantization (PTQ) baselines by 50.45% on average. Moreover, compared to ultra-low-bit-specialized models such as BitNet-2B4T, our pipeline achieves about 2% higher mathematical-reasoning accuracy using only 40K training sequences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4024. SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ molecule generation flowmatching diffusion chemistry synthesizable
Ensuring synthesizability in generative small molecule design remains a major challenge. While recent developments in synthesizable molecule generation have demonstrated promising results, these efforts have been largely confined to 2D molecular graph representations, limiting the ability to perform geometry-based conditional generation. In this work, we present SYNCOGEN (Synthesizable Co-Generation), a single framework that combines simultaneous masked graph diffusion and flow matching for synthesizable 3D molecule generation. SYNCOGEN samples from the joint distribution of molecular building blocks, chemical reactions, and atomic coordinates. To train the model, we curated SYNSPACE, a dataset containing over 600K synthesis-aware building block graphs and 3.3M conformers. SYNCOGEN achieves state-of-the-art performance in unconditional small molecule graph and conformer generation, and the model delivers competitive performance in zero-shot molecular linker design and pharmacophore conditioning for protein ligand generation in drug discovery. Overall, this multimodal formulation represents a foundation for future applications enabled by non-autoregressive molecular generation, including analog expansion, lead optimization, and direct structure conditioning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4025. SparseD: Sparse Attention for Diffusion Language Models
๐ฌ 14
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Language Models Sparse Attention
While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, restricting computation to sparse attention patterns that retain only the most important pairs offers an effective solution. This type of method is widely used in ARs, where the attention mechanism exhibits clear and fixed sparse patterns. In DLMs, our analysis also reveals the presence of sparse patterns and further highlights three unique observations: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These unique findings render well-studied fixed sparse attention methods in ARs largely incompatible with DLMs, as their fixed patterns fail to capture head-specific patterns in DLMs, and sparse attention applied in the early steps can lead to degradation in generation. To address these challenges, we propose **SparseD**, a novel sparse attention method for DLMs.
Leveraging the observations in DLMs, SparseD pre-computes and selects the most important query-key pairs only once, as head-specific sparse patterns that are reused across denoising steps. This design handles head-specific patterns without incurring the high latency of recomputing sparse patterns at each denoising step. Meanwhile, SparseD skips sparse attention and uses full attention in the early steps to preserve generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps. Anonymous code is available at https://anonymous.4open.science/r/SparseD-8C76/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
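A minimal sketch of the caching scheme described in the SparseD entry above, with assumed tensor shapes and a toy denoising loop (the real method operates inside a DLM's attention layers):

```python
import torch

def head_specific_mask(q, k, keep_ratio=0.1):
    """q, k: [heads, seq, dim]. Keep the highest-scoring key positions per (head, query)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    n_keep = max(1, int(keep_ratio * scores.shape[-1]))
    idx = scores.topk(n_keep, dim=-1).indices
    return torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, idx, True)

def attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

heads, seq, dim = 8, 128, 64
q, k, v = (torch.randn(heads, seq, dim) for _ in range(3))
cached_mask = None
for step in range(16):
    if step < 4:                       # early denoising steps: full attention preserves quality
        out = attention(q, k, v)
    else:
        if cached_mask is None:        # head-specific pattern computed once ...
            cached_mask = head_specific_mask(q, k)
        out = attention(q, k, v, cached_mask)   # ... and reused for the remaining steps
```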
4026. InfoBridge: Mutual Information estimation via Bridge Matching
๐ฌ 20
โญ 5.00
๐ generative models
๐ท๏ธ Mutual Information Diffusion Bridge Models Bridge Matching
Diffusion bridge models have recently become a powerful tool in the field of generative modeling. In this work, we leverage their power to address another important problem in machine learning and information theory, the estimation of the mutual information (MI) between two random variables. Neatly framing MI estimation as a domain transfer problem, we construct an unbiased estimator for data posing difficulties for conventional MI estimators. We showcase the performance of our estimator on three standard MI estimation benchmarks, i.e., low-dimensional, image-based and high MI, and on real-world data, i.e., protein language model embeddings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4027. Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation
๐ฌ 19
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Benchmark Dataset Policy Science
We propose Sci2Pol-Bench and Sci2Pol-Corpus, the first benchmark and training dataset for evaluating and fine-tuning large language models (LLMs) on policy brief generation from a scientific paper.
We build Sci2Pol-Bench on a five-stage taxonomy to mirror the human writing process:
(i) Autocompletion, (ii) Understanding, (iii) Summarization, (iv) Generation, and (v) Verification.
It features 18 tasks in multiple-choice and open-ended formats.
Specifically, for the Generation stage, we show that BERTScore and ROUGE scores fail to capture the quality of brief writing, and introduce a new LLM-based evaluation metric aligned with expert judgement.
Using this benchmark, we evaluate 13 leading open-source and commercial LLMs to uncover key limitations.
To improve LLM performance on brief writing, we curate the Sci2Pol-Corpus for fine-tuning.
We start by linking each cited scientific paper to its corresponding policy document, drawn from 5.6 million policy records.
This produces 140,000 candidate pairs.
We then employ an LLM-as-a-judge to filter high-quality examples, followed by in-context polishing using three expert-written samples as references.
This process yields a final set of 639 new pairs.
Finally, we fine-tune three models on Sci2Pol-Corpus: LLaMA-3.1-8B, Gemma-12B, and Gemma-27B.
Fine-tuning leads to consistent performance improvements across Sci2Pol-Bench.
Notably, after fine-tuning, Gemma-27B surpasses the much larger GPT-4o and DeepSeek-V3 (671B).
These results demonstrate the effectiveness of our corpus in bridging the gap between science and policy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4028. ScaleCap: Scalable Image Captioning via Dual-Modality Debiasing
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Vision Language Model Image Caption
This paper presents ScaleCap, a scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4029. Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion LLM Efficient Inference
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by $6\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a $10\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
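The draft-and-verify idea above can be pictured with a small helper that drafts every masked position and re-masks the ones a verification pass finds suspicious; the tensors and the 0.9 threshold below are placeholders, not the authors' settings:

```python
import torch

def wino_step(logits_draft, logits_verify, masked, threshold=0.9):
    """One revokable decoding step.
    logits_draft / logits_verify: [seq, vocab] model outputs before and after conditioning
    on the drafted tokens; masked: bool[seq] marking still-masked positions."""
    draft = logits_draft.argmax(-1)                                   # aggressive parallel draft
    verify_prob = logits_verify.softmax(-1).gather(-1, draft[:, None]).squeeze(-1)
    accept = masked & (verify_prob > threshold)                       # keep confident tokens
    revoke = masked & ~accept                                         # re-mask suspicious ones
    return draft, accept, revoke

seq, vocab = 16, 100
masked = torch.ones(seq, dtype=torch.bool)
draft, accept, revoke = wino_step(torch.randn(seq, vocab), torch.randn(seq, vocab), masked)
print(accept.sum().item(), "accepted,", revoke.sum().item(), "re-masked")
```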
4030. Stacked from One: Multi-Scale Self-Injection for Context Window Extension
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ long-context modeling; continual pretraining; extrapolation
The limited context window of contemporary large language models (LLMs) hinders broader application. In this work, we present SharedLLM, a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval. SharedLLM is composed of two short-context LLMs: a lower model (compressor) and an upper model (decoder). The lower model compresses context information, while the upper model processes the compressed context information from the lower model and performs context-aware modeling. Information transfer between the compressor and decoder occurs only at the lowest layers to reduce redundant computation. Based on this architecture, we introduce a specialized tree-style data structure to efficiently encode, store and retrieve multi-grained contextual information from text chunks. This entire process, wherein the sender and receiver are derived from the same LLM layer, is referred to as self-injection. In our evaluation on long-context modeling and understanding tasks, SharedLLM achieves superior or comparable results to several strong baselines, striking an effective balance between efficiency and performance. Meanwhile, with the aforementioned design choices, SharedLLM can greatly reduce memory consumption, and demonstrates substantial speed-ups over other advanced baselines. The core code of our implementation along with training and evaluation is available in the appendix and supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4031. Efficient Submodular Maximization for Sums of Concave over Modular Functions
๐ฌ 13
โญ 5.00
๐ optimization
๐ท๏ธ Submodular maximization Sums of Concave over Modular functions Accelerated Approximate Projected Gradient Ascent Randomized rounding GPU-parallel optimization
Submodular maximization has broad applications in machine learning, network design, and data mining. However, classical algorithms often suffer from prohibitively high computational costs, which severely limit their scalability in practice. In this work, we focus on maximizing Sums of Concave over Modular functions (SCMs), an important subclass of submodular functions, under three fundamental constraints: cardinality, knapsack, and partition matroids.
Our method integrates three components: continuous relaxation, Accelerated Approximate Projected Gradient Ascent (AAPGA), and randomized rounding, to efficiently compute near-optimal solutions. We establish a $(1 - \varepsilon - \eta - e^{-\Omega(\eta^2)})$ approximation guarantee for both cardinality and partition matroid constraints, with query complexity $O\left(n^{1/2}\varepsilon^{-1/2} (T_1 + T_2)\right)$. For the knapsack constraint, the approximation ratio degrades by a factor of $1/2$, with query complexity $O\left(n T_1 + n^{1/2}\varepsilon^{-1/2} T_2\right)$, where $T_1$ denotes the computational cost of evaluating the concave extension, and $T_2$ denotes the computational cost of backpropagation. By leveraging efficient convex optimization techniques, our approach substantially accelerates convergence toward high-quality solutions.
In empirical evaluations, we demonstrate that AAPGA consistently outperforms standard PGA. On small-scale experiments, AAPGA achieves superior results in significantly less time, being up to $32.3\times$ faster than traditional methods. On large-scale experiments, our parallel multi-GPU implementation further enhances performance, demonstrating the scalability of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4032. An Information-Theoretic Parameter-Free Bayesian Framework for Probing Labeled Dependency Trees from Attention Score
๐ฌ 16
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ probing attention score dependency syntax mutual information
Figuring out how neural language models comprehend syntax is key to revealing how they understand languages.
We systematically analyzed methods of extracting syntax from models, namely _probing_, and found that limitations still widely exist in previous probing practice.
We proposed a method capable of estimating mutual information (MI) and directly extracting dependency trees from attention scores in a mathematically rigorous way, requiring no additional network training effort.
Compared with previous approaches, it has a much simpler model, while being able to probe more complex dependency trees and remaining transparent for fine-grained explanation.
We tested our method on several open-source LLMs and demonstrated its effectiveness by systematically comparing it with a wide range of competitive baselines. Several informative conclusions can be drawn from further analysis of the results, shedding light on our method's explanatory potential.
An anonymous and off-the-shelf version of our code is released at https://anonymous.4open.science/r/IPBP-99F1.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4033. WALT: Web Agents that Learn Tools
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ web agents tool use LLMs agentic reasoning
Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into deterministic, callable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites, spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves significantly higher success rates with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4034. Mix-Ecom: Towards Mixed-Type E-Commerce Dialogues with Complex Domain Rules
๐ฌ 12
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Agent LLM E-commerce
E-commerce agents contribute greatly to helping users complete their e-commerce needs. To promote further research and application of e-commerce agents, benchmarking frameworks are introduced for evaluating LLM agents in the e-commerce domain.
Despite the progress, current benchmarks lack evaluating agents' capability to handle mixed-type e-commerce dialogue and complex domain rules. To address the issue, this work first introduces a novel corpus, termed Mix-ECom,
which is constructed from real-world customer-service dialogues with post-processing to remove private user information and add CoT processes.
Specifically, Mix-ECom contains 4,799 samples with multiple dialogue types per e-commerce dialogue, covering four dialogue types (QA, recommendation, task-oriented dialogue, and chit-chat),
three e-commerce task types (pre-sales, logistics, after-sales), and 82 e-commerce rules.
Furthermore, this work builds baselines on Mix-ECom and proposes a dynamic framework to further improve performance.
Results show that current e-commerce agents lack sufficient capabilities to handle e-commerce dialogues, due to hallucinations caused by complex domain rules. The dataset will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4035. Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
๐ฌ 13
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Reversal Curse The Binding Problem Transformers Reasoning
Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the *Reversal Curse*, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing *binding problem* in cognitive science, neuroscience and AI. Specifically, we hypothesize two primary causes of the Reversal Curse stemming from transformers' limitations in conceptual binding: the *inconsistency* and *entanglements* of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. Our research opens up the broader fundamental challenge of designing models capable of learning systematic conceptual binding with less human scaffolding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4036. Robust Training of Neural Networks at Arbitrary Precision and Sparsity
๐ฌ 11
โญ 5.00
๐ optimization
๐ท๏ธ Quantization Sparsity
The discontinuous operations inherent in quantization and sparsification introduce a long-standing obstacle to backpropagation, particularly in ultra-low precision and sparse regimes. The standard Straight-Through Estimator (STE) is widely used to address this, but the well-understood mismatch between its quantization-aware forward pass and quantization-oblivious backward pass leads to unmanaged error that can corrupt the learning process. We solve this by introducing a denoising dequantization transform derived from a principled ridge regression objective. This transform makes the entire learning process aware of and robust to the quantization error that STE's surrogate gradient bypasses, by creating an explicit, corrective gradient path. We extend this principle to sparsification by viewing it as a special form of quantization that maps insignificant values to zero. Our unified framework allows existing models to be trained at a wide spectrum of precisions and sparsity levels with off-the-shelf recipes, achieving stable training of fully binary (A1W1) and sparse sub-1-bit networks where other methods falter. This approach yields state-of-the-art results and provides a theoretically-grounded path to hyper-efficient neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
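A toy rendering of the stated idea, assuming a per-tensor binary quantizer and a ridge-regression rescale; this sketches the principle only, not the paper's exact transform:

```python
import torch

def ridge_dequant(w: torch.Tensor, lam: float = 1e-2) -> torch.Tensor:
    """Binarize w, rescale with the ridge-regression solution
    s* = argmin_s ||w - s*q||^2 + lam*s^2 = (q.w) / (q.q + lam),
    and pass gradients straight through to w."""
    q = torch.sign(w)
    s = (q * w).sum() / ((q * q).sum() + lam)
    w_hat = s * q
    return w + (w_hat - w).detach()        # forward: dequantized weights; backward: identity to w

w = torch.randn(1024, requires_grad=True)
loss = (ridge_dequant(w) ** 2).sum()
loss.backward()
print(w.grad is not None)
```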
4037. CONCUR: A Framework for Continual Constrained and Unconstrained Routing
๐ฌ 18
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ model routing continual routing constrained routing unconstrained routing
AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial.
Most prior methods build the routing framework by training a *single* model across *all* strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization.
Prior models also typically use a *single* input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions.
To address these gaps, we propose CONCUR, a **con**tinual routing framework that supports both **c**onstrained and **u**nconstrained **r**outing (i.e., routing with or without a budget).
Our *modular* design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost.
Our predictors also leverage *multiple* representations of both tasks and computation strategies to better capture overall problem complexity.
Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
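The selection step of the routing problem described above can be pictured as follows (a sketch under assumed per-strategy predictors; the predictors themselves, and the multiple task and strategy representations they use, are not shown):

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    predicted_score: float    # output of this strategy's own predictor for the current task
    cost: float               # e.g. expected tokens or latency

def route(strategies, budget=None):
    """Unconstrained routing: highest predicted score.
    Constrained routing: highest predicted score among strategies within the budget."""
    candidates = strategies if budget is None else [s for s in strategies if s.cost <= budget]
    if not candidates:                       # nothing fits: fall back to the cheapest strategy
        candidates = [min(strategies, key=lambda s: s.cost)]
    return max(candidates, key=lambda s: s.predicted_score)

pool = [Strategy("small-model+greedy", 0.61, 1.0),
        Strategy("large-model+CoT",    0.78, 6.0),
        Strategy("large-model+BoN-8",  0.83, 40.0)]
print(route(pool).name)                  # unconstrained
print(route(pool, budget=10.0).name)     # budget-constrained
```

Under this framing, adding a new computation strategy amounts to appending one more entry with its own predictor, which reflects the modularity the entry emphasizes.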
4038. GTool: Graph Enhanced Tool Planning with Large Language Model
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Tool Learning Large Language Model Graph Data Mining
Tool planning with large language models (LLMs), referring to selecting, organizing, and preparing the tools necessary to complete a user request, bridges the gap between natural language understanding and task execution. However, current works treat different tools as isolated components and fail to leverage the inherent dependencies of tools, leading to invalid planning results. Since tool dependencies are often incomplete, it becomes challenging for LLMs to accurately identify the appropriate tools required by a user request, especially when confronted with a large toolset. To solve this challenge, we propose GTool, which is the first work aiming to enhance the tool planning ability of LLMs under incomplete dependencies. GTool constructs a request-specific tool graph to select tools efficiently and to provide sufficient dependency information in a form understandable by LLMs. Moreover, a missing dependency prediction task is designed to improve the reliability of GTool with incomplete dependencies. Without trimming LLMs, GTool can be seamlessly integrated with various LLM backbones without extensive retraining. Extensive experiments show that GTool achieves performance improvements of more than 29.6% over state-of-the-art (SOTA) baselines with a lightweight (7B) LLM backbone.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4039. A Generalized Geometric Theoretical Framework of Centroid Discriminant Analysis for Linear Classification of Multi-dimensional Data
๐ฌ 26
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Linear classification; Efficient and scalable algorithm; Geometric discriminant analysis; Centroid discriminant analysis; Nonlinear classification via kernel method
With the advent of the neural network era, traditional machine learning methods have increasingly been overshadowed. Nevertheless, continuing to research the role of geometry for learning in data science is crucial to envision and understand new principles behind the design of efficient machine learning. Linear classifiers are favored in certain tasks due to their reduced susceptibility to overfitting and their ability to provide interpretable decision boundaries. However, achieving both scalability and high predictive performance in linear classification remains a persistent challenge. Here, we propose a theoretical framework named geometric discriminant analysis (GDA). GDA includes the family of linear classifiers that can be expressed as a function of a centroid discriminant basis (CDB0), the connection line between two centroids, adjusted by geometric corrections under different constraints. We demonstrate that linear discriminant analysis (LDA) is a subcase of the GDA theoretical framework, and we show its convergence to CDB0 under certain conditions. Then, based on the GDA framework, we propose an efficient linear classifier named centroid discriminant analysis (CDA), which is defined as a special case of GDA under a 2D-plane geometric constraint. CDA training is initialized from CDB0 and involves the iterative calculation of new adjusted centroid discriminant lines whose optimal rotations on the associated 2D planes are searched via Bayesian optimization. CDA has good scalability (quadratic time complexity), which is lower than that of LDA and support vector machines (SVM) (cubic complexity). Results on 27 real datasets across classification tasks of standard images, medical images and chemical properties offer empirical evidence that CDA outperforms other linear methods such as LDA, SVM and logistic regression (LR) in terms of scalability, performance and stability. Furthermore, we show that linear CDA can be generalized to nonlinear CDA via the kernel method, demonstrating improvements over both linear CDA and kernel SVM in tests on three challenging datasets of images and chemical data. GDA's general validity as a new theoretical framework may inspire the design of new classifiers under different geometric constraints, paving the way towards a deeper understanding of the role of geometry in learning from data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
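Only the starting point of the method above, the centroid discriminant basis CDB0, is easy to show compactly; the iterative 2D-plane rotations searched with Bayesian optimization (CDA proper) are omitted. A minimal sketch with synthetic data:

```python
import numpy as np

def cdb0(X: np.ndarray, y: np.ndarray):
    """Centroid discriminant basis: the line connecting the two class centroids,
    with the decision threshold placed at their midpoint."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w = c1 - c0
    b = -w @ (c0 + c1) / 2.0
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(200, 20)), rng.normal(+1.0, 1.0, size=(200, 20))])
y = np.array([0] * 200 + [1] * 200)
w, b = cdb0(X, y)
print("training accuracy:", (predict(X, w, b) == y).mean())
```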
4040. Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
๐ฌ 19
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large vision-language models multimodality language prior
Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their *language prior* (LP)---memorized textual patterns from pre-training while under-utilizing visual evidence.
Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs.
Our analysis reveals a universal phenomenon: each model exhibits a *Visual Integration Point* (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding.
Building on this observation, we introduce the *Total Visual Integration* (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
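A toy version of the two quantities named above, assuming one has per-layer hidden states for the same query with and without the image; the relative-distance metric and threshold are assumptions, since the abstract does not specify them:

```python
import numpy as np

def vip_and_tvi(h_with_image: np.ndarray, h_text_only: np.ndarray, eps: float = 0.05):
    """h_*: [n_layers, hidden_dim]. VIP = first layer where the two runs diverge;
    TVI = accumulated divergence from the VIP onward."""
    d = np.linalg.norm(h_with_image - h_text_only, axis=-1)
    d = d / (np.linalg.norm(h_text_only, axis=-1) + 1e-8)        # scale-free per-layer distance
    above = np.nonzero(d > eps)[0]
    vip = int(above[0]) if len(above) else len(d)
    return vip, float(d[vip:].sum())

rng = np.random.default_rng(0)
layers, dim = 32, 1024
text_only = rng.normal(size=(layers, dim))
with_image = text_only.copy()
with_image[12:] += 0.3 * rng.normal(size=(layers - 12, dim))     # vision starts to matter at layer 12
print(vip_and_tvi(with_image, text_only))                        # -> (12, ...)
```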
4041. SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
๐ฌ 12
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ unified image tokenizer multimodal learning
In this paper, we introduce SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook (SGHC) that provides consistent discrete representations for multimodal understanding and generation. Recently, unified image tokenizers have sparked exploration within the research community, which is designed to capture high-level semantic features for understanding and retaining low-level pixel features for generation. Previous works attempt to train a unified image tokenizer by combining loss for semantic distillation and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a novel semantic-guided hierarchical codebook, which builds pixel sub-codebooks on a pretrained semantic codebook. This design decouples semantic and pixel in both terms of structure and training strategy, enabling the tokenizer to capture pixel features while retaining its ability to comprehend high-level semantic information. Our experiments demonstrate that SemHiTok achieves SOTA performance in image reconstruction and multimodal understanding under the LLaVA-v1.5 setting. Further, we develop a unified MLLM with SemHiTok, which exhibits superior performance across multimodal understanding and generation tasks. For understanding, SemHiTok achieves impressive performance on most benchmarks. For generation, our model achieves SOTA performance on MJHQ30K in unified MLLMs. Our code and models will be open source.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4042. DiVE-k: DIFFERENTIAL VISUAL REASONING FOR FINE-GRAINED IMAGE RECOGNITION
๐ฌ 12
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ zero-shot image classification visual reasoning vision-language model
Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose $\textbf{DiVE-k}$, $\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning using top-$\textbf{k}$ generations, a framework that leverages the model's own top-k predictions as a training signal.
For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization.
Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches.
In the standard base-to-novel generalization setting, DiVE-k surpasses the QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
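The core data transformation described above (turning a model's own top-k predictions into a verifiable multiple-choice reward) is small enough to sketch; the category names and image id below are made up:

```python
import random

def build_mcq(image_id, topk_labels, gold_label, rng=random.Random(0)):
    """Build a multiple-choice question from the model's top-k predictions,
    returning the prompt and the index of the correct option."""
    options = list(dict.fromkeys(topk_labels + [gold_label]))    # dedupe, keep the gold label present
    rng.shuffle(options)
    prompt = (f"[image {image_id}] Which fine-grained category is shown?\n"
              + "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options)))
    return prompt, options.index(gold_label)

def reward(chosen_index, correct_index):
    return 1.0 if chosen_index == correct_index else 0.0         # simple verifiable RL signal

prompt, answer = build_mcq("cub_0423",
                           ["Laysan Albatross", "Sooty Albatross", "Black-footed Albatross"],
                           gold_label="Black-footed Albatross")
print(prompt)
print("reward for the correct pick:", reward(answer, answer))
```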
4043. MLE-Smith: Scaling MLE Tasks with Automated Multi-agent Pipeline
๐ฌ 19
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Machine Learning LLM Agents
While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data is significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce.
We introduce MLE-Smith, a fully automated multi-agent pipeline, to transform raw datasets into competition-style MLE challenges through an efficient generate--verify--execute paradigm for scaling MLE tasks with verifiable quality, real-world usability and rich diversity.
The proposed multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution.
We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith works effectively across a wide range of real-world datasets.
Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4044. Continual Low-Rank Adapters for LLM-based Generative Recommender Systems
๐ฌ 17
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LLM-based generative recommender; continual learning; low-rank adapters
While large language models (LLMs) achieve strong performance in recommendation, they face challenges in continual learning as users, items, and user preferences evolve over time. Existing LoRA-based continual methods primarily focus on preserving performance on previous tasks, but this overlooks the unique nature of recommendation: the goal is not to predict past preferences, and outdated preferences can even harm performance when current interests shift significantly. To address this, we propose PESO (Proximally rEgularized Single evolving lOra), a continual adaptation method for LoRA in recommendation. PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling the model to flexibly balance adaptation and preservation, and to better capture recent user behaviors. Theoretically, we show that this proximal design provides data-aware, direction-wise guidance in the LoRA subspace. Empirically, PESO consistently outperforms existing LoRA-based continual learning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
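A minimal rendering of the stated objective, assuming a single LoRA pair (A, B) and a frozen snapshot from the previous adaptation period; the task loss is a placeholder:

```python
import torch

def peso_loss(task_loss, A, B, A_prev, B_prev, lam=0.1):
    """Task loss plus a proximal term anchoring the current low-rank update B @ A
    to its most recent frozen state B_prev @ A_prev."""
    prox = ((B @ A) - (B_prev @ A_prev)).pow(2).sum()
    return task_loss + lam * prox

d, r = 512, 8
A = torch.randn(r, d, requires_grad=True)
B = torch.randn(d, r, requires_grad=True)
A_prev, B_prev = A.detach().clone(), B.detach().clone()   # frozen snapshot of the last period
task_loss = torch.tensor(0.0)                              # stand-in for the recommendation loss
peso_loss(task_loss, A, B, A_prev, B_prev).backward()
```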
4045. PICS: Pairwise Image Compositing with Spatial Interactions
๐ฌ 11
โญ 5.00
๐ generative models
๐ท๏ธ image compositing diffusion model spatial relations
Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive $\alpha$-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4046. ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization
๐ฌ 5
โญ 5.00
๐ generative models
๐ท๏ธ dependence regularization concept decoupling text-to-image diffusion model
Image personalization enables customizing Text-to-Image models with a few reference images but is plagued by "concept coupling": the model creating spurious associations between a subject and its context. Existing methods tackle this indirectly, forcing a trade-off between personalization fidelity and text control. This paper is the first to formalize concept coupling as a statistical dependency problem, identifying two root causes: a Denoising Dependence Discrepancy that arises during the generative process, and a Prior Dependence Discrepancy within the learned concept itself. To address this, we introduce ACCORD, a framework with two targeted, plug-and-play regularization losses. The Denoising Decouple Loss minimizes dependency changes across denoising steps, while the Prior Decouple Loss aligns the concept's relational priors with those of its superclass. Extensive experiments across subject, style, and face personalization demonstrate that ACCORD achieves a superior balance between fidelity and text control, consistently improving upon existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4047. From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Large Language Model Reasoning Exploration
Does reinforcement learning (RL) teach large language models (LLMs) genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL alone even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills \citep{Anderson1982Acquisition}. To mitigate data contamination and other confounding factors and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function $f(x)$ given $x$. Once an LLM has already learned $f$ and $g$ prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them $h(x)=g(f(x))$. Further, this compositional ability generalizes to more difficult problems such as compositions of $>2$ functions unseen during training. Our experiments provide surprising evidence that this compositional ability, acquired on the source task, transfers to a different target task. This transfer occurs even though the model has never trained with RL on any compositional problems in the target task, as long as it has acquired the target task's atomic skills prior to RL on the source task. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, neither of the findings is observed in next-token prediction training with the same data. Our systematic experiments provide fresh insights into the learning behaviors of widely-used post-training approaches for LLMs. They suggest the value of building base models with the necessary basic skills, followed by RL with appropriate incentivization to acquire more advanced skills that generalize better to complex and out-of-domain problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
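The synthetic setup described above is easy to picture: atomic string transformations play the role of skills, and compositions of them define the new tasks. The particular atoms below are illustrative, not the paper's function set:

```python
import random

ATOMS = {                                   # atomic "skills": string transformations f(x)
    "rev":  lambda s: s[::-1],
    "dup":  lambda s: s + s,
    "rot1": lambda s: s[1:] + s[:1],
    "caps": lambda s: s.upper(),
}

def make_example(rng, depth=2, length=5):
    """Sample a composition h = f_depth(...f_1(x)) and emit (skill names, input, target)."""
    names = [rng.choice(list(ATOMS)) for _ in range(depth)]
    x = "".join(rng.choice("abcdefgh") for _ in range(length))
    y = x
    for name in names:
        y = ATOMS[name](y)
    return names, x, y

rng = random.Random(0)
for _ in range(3):
    names, x, y = make_example(rng)
    print(f"apply {' then '.join(names)} to '{x}' -> '{y}'")
```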
4048. DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning
๐ฌ 20
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language models preference learning user feedback post-training self-improvement
Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce \textbf{DRIFT} (\textbf{D}issatisfaction-\textbf{R}efined \textbf{I}terative pre\textbf{F}erence \textbf{T}raining), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world \textit{WildFeedback} datasets and synthetic \textit{UltraFeedback} datasets achieve up to +6.23\% (7B) / +7.61\% (14B) on WildBench Task Score and up to +8.95\% (7B) / +12.29\% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids the gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
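One plausible instantiation of the pairing described above is a DPO-style loss in which the rejected response is a real dissatisfaction turn and the preferred response is sampled from the current policy; the paper's exact objective and sampling schedule may differ, and the log-probabilities below are placeholders:

```python
import torch
import torch.nn.functional as F

def drift_pair_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Pairwise preference loss: policy-sampled positive vs. DSAT-anchored negative."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()

# Sequence log-probabilities under the policy and a frozen reference model (placeholders).
logp_pos, logp_neg = torch.tensor([-12.3]), torch.tensor([-15.8])
ref_pos, ref_neg = torch.tensor([-13.0]), torch.tensor([-14.9])
print(drift_pair_loss(logp_pos, logp_neg, ref_pos, ref_neg))
```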
4049. SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Foundation models LoRA Homomorphic Encryption
Federated fine-tuning is critical for improving the performance of large language models (LLMs) in handling domain-specific tasks while keeping training data decentralized and private.
However, prior work has shown that clients' private data can actually be recovered via gradient inversion attacks.
Existing privacy preservation techniques against such attacks typically entail performance degradation and high costs, making them ill-suited for clients with heterogeneous data distributions and device capabilities.
In this paper, we propose SHE-LoRA, which integrates selective homomorphic encryption (SHE) and low-rank adaptation (LoRA) to enable efficient and privacy-preserving federated tuning of LLMs in cross-device environments.
Based on model parameter sensitivity assessment, heterogeneous clients adaptively negotiate and select a subset of model parameters for homomorphic encryption.
To ensure accurate model aggregation, we design a column-aware secure aggregation method and customized reparameterization techniques to align the aggregation results with the heterogeneous device capabilities of clients.
Extensive experiments demonstrate that SHE-LoRA maintains performance comparable to non-private baselines, achieves strong resistance to state-of-the-art attacks, and significantly reduces communication overhead by 99.71\% and encryption time by 99.87\%, compared to HE baselines.
Our code is accessible at https://anonymous.4open.science/r/SHE-LoRA.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4050. Forget Forgetting: Continual Learning in a World of Abundant Memory
๐ฌ 25
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning model merging machine learning large language models
Continual learning (CL) has traditionally focused on minimizing exemplar memory, a constraint often misaligned with modern systems where GPU time, not storage, is the primary bottleneck. This paper challenges this paradigm by investigating a more realistic regime: one where memory is abundant enough to mitigate forgetting, but full retraining from scratch remains prohibitively expensive. In this practical "middle ground", we find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones. Conversely, improved stability allows simple replay baselines to outperform the state-of-the-art methods at a fraction of the GPU cost. To address this newly surfaced trade-off, we propose Weight Space Consolidation, a lightweight method that combines (1) rank-based parameter resets to restore plasticity with (2) weight averaging to enhance stability. Validated on both class-incremental learning with image classifiers and continual instruction tuning with large language models, our approach outperforms strong baselines while matching the low computational cost of replay, offering a scalable alternative to expensive full-retraining. These findings challenge long-standing CL assumptions and establish a new, cost-efficient baseline for real-world CL systems where exemplar memory is no longer the limiting factor.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
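A rough sketch of the two ingredients named above; the reset criterion here (smallest-magnitude parameters) merely stands in for the paper's rank-based scheme, and resetting to a stored snapshot is an assumption:

```python
import torch

def weight_space_consolidation(current, snapshot, reset_fraction=0.1, avg_weight=0.5):
    """(1) Reset a fraction of each tensor's parameters back to the snapshot to restore
    plasticity, then (2) average with the snapshot to keep stability."""
    merged = {}
    for name, w in current.items():
        prev = snapshot[name]
        k = max(1, int(reset_fraction * w.numel()))
        reset_idx = w.abs().flatten().topk(k, largest=False).indices
        w = w.clone()
        w.view(-1)[reset_idx] = prev.view(-1)[reset_idx]
        merged[name] = avg_weight * w + (1 - avg_weight) * prev
    return merged

current = {"linear.weight": torch.randn(64, 64)}
snapshot = {"linear.weight": torch.randn(64, 64)}
merged = weight_space_consolidation(current, snapshot)
print(merged["linear.weight"].shape)
```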
4051. TetraGT: Tetrahedral Geometry-Driven Explicit Token Interactions with Graph Transformer for Molecular Representation Learning
๐ฌ 16
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Molecular Representation Learning Graph Transformer Molecular Geometry Pretraining
Molecular representations that fully capture geometric parameters such as bond angles and torsion angles are crucial for accurately predicting important molecular properties including enzyme catalytic activity, drug bioactivity, and molecular spectral characteristics, as demonstrated by extensive studies.
However, current molecular graph representation learning approaches represent molecular geometric parameters only indirectly through combinations of atoms and bonds, neglecting the spatial relationships and interactions between these higher-order geometric structures.
In this paper, we propose \textbf{TetraGT} (\textbf{Tetra}hedral \textbf{G}eometry-Driven Explicit \textbf{T}oken Interactions with Graph Transformer), a novel architecture that directly models molecular geometric parameters.
Based on the solid-geometry theory of face-angle and dihedral-angle inequalities, TetraGT explicitly represents bond angles and torsion angles as structured tokens for the first time, directly reflecting their intrinsic role in determining molecular conformational stability and properties.
Through our designed spatial tetrahedral attention mechanism, TetraGT achieves highly selective direct communication between structural tokens.
Experimental results demonstrate that TetraGT achieves superior performance on the PCQM4Mv2 and OC20 IS2RE benchmarks.
We also apply our pre-trained TetraGT model to downstream tasks including QM9, PDBBind, Peptides and LIT-PCBA, demonstrating that TetraGT delivers excellent results in transfer learning scenarios and shows scalability with increasing molecular size.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4052. Revisiting Multimodal Positional Encoding in Vision-Language Models
๐ฌ 11
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision-Language Models Multimodal Position Encoding
Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4053. Multi-Feature Quantized Self-Attention for Fair Large Language Models
๐ฌ 6
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large language models multi-attribute social bias quantized adversarial autoencoder
Large Language Models (LLMs) often encode social biases tied to sensitive features such as race and gender, undermining fairness in downstream tasks even after instruction tuning. Existing debiasing methods either require expensive fine-tuning, are tied to specific architectures, or operate only at the input or decoding stage while neglecting attention-level representations, which can lead to compromised task performance. Moreover, most approaches are tailored to single-attribute settings and do not explicitly address scenarios with multiple, overlapping protected attributes and their intersections. This paper introduces Multi-feature Quantized Attention Regularization (MQAR), a novel method for mitigating multi-feature bias by injecting a structured quantization into frozen self-attention layers. MQAR disentangles attribute-specific activations through vector-quantized regularization and uses a discriminator-guided autoencoding regularizer to adversarially suppress protected-attribute information while preserving task-relevant semantics. Crucially, the proposed method operates without modifying the backbone parameters or accessing pre-training data, ensuring architecture-agnostic applicability and minimizing representation distortion. This paper evaluates MQAR on five diverse LLMs (BERT, T5, GPT-Neo, Mixtral, and LLaMA 3.2) using three standard bias benchmarks (WinoBias, StereoSet, CrowS-Pairs). Across these models, MQAR consistently reduces bias for multiple protected attributes and their intersections while maintaining downstream accuracy within at most 0.4 \%, on average, of non-debiased baselines on sentiment analysis, abusive language detection, and text generation tasks. These findings highlight quantized attention regularization as a scalable and effective method for mitigating social bias in modern language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4054. Controllable Video Generation with Provable Disentanglement
๐ฌ 10
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Video Generation GAN Disentanglement nonlinear ICA
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose \textbf{Co}ntrollable \textbf{V}ide\textbf{o} \textbf{G}enerative \textbf{A}dversarial \textbf{N}etworks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the \textbf{minimal change principle}, we first disentangle static and dynamic latent variables. We then leverage the \textbf{sufficient change property} to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a \textbf{Temporal Transition Module} to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4055. DiMeR: Disentangled Mesh Reconstruction Model with Normal-only Geometry Training
๐ฌ 16
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LRM 3D Reconstruction 3D Generation Image-to-3D
We propose DiMeR, a novel geometry-texture disentangled feed-forward model with 3D supervision for sparse-view mesh reconstruction. Existing methods confront two persistent obstacles: (i) textures can conceal geometric errors, i.e., visually plausible images can be rendered even with wrong geometry, producing multiple ambiguous optimization objectives in geometry-texture mixed solution space for similar objects; and (ii) prevailing mesh extraction methods are redundant, unstable, and lack 3D supervision. To solve these challenges, we rethink the inductive bias for mesh reconstruction. First, we disentangle the unified geometry-texture solution space, where a single input admits multiple feasible solutions, into geometry and texture spaces individually. Specifically, given that normal maps are strictly consistent with geometry and accurately capture surface variations, the normal maps serve as the only input for geometry prediction in DiMeR, while the texture is estimated from RGB images. Second, we streamline the algorithm of mesh extraction by eliminating modules with low performance/cost ratios and redesigning regularization losses with 3D supervision. Notably, DiMeR still accepts raw RGB images as input by leveraging foundation models for normal prediction. Extensive experiments demonstrate that DiMeR generalises across sparse-views-3D, single-image-3D, and text-to-3D tasks, consistently outperforming baselines. On the GSO and OmniObject3D datasets, DiMeR significantly reduces Chamfer Distance by more than 30%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4056. SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
๐ฌ 5
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning Reinforcement Learning Supervised Reinforcement Fine-Tuning
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet optimally integrating Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through a comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from an entropy-based perspective, we reveal key differences between these paradigms: SFT induces coarse-grained, global shifts to policy distributions, while RL performs fine-grained, selective optimizations.
Our analysis further establishes entropy as a critical indicator of training efficacy.
Building on these observations, we introduce **S**upervised **R**einforcement **F**ine-**T**uning (**SRFT**), a single-stage framework that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms.
SRFT simultaneously applies SFT and RL to directly optimize LLMs using demonstrations and self-exploration rollouts rather than through two-stage sequential methods.
Extensive experiments show that SRFT outperforms zero-RL baselines by **9.0%** on five mathematical reasoning benchmarks and by **10.9%** on three out-of-distribution benchmarks.
Moreover, by leveraging demonstration data, SRFT maintains a more stable policy entropy, facilitating sustained policy improvement.
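A minimal sketch of what a single-stage objective mixing an SFT term and an RL term under entropy-aware weights could look like; the specific weighting functions and the REINFORCE-style policy-gradient term are illustrative assumptions, since the abstract does not spell them out.

```python
import torch
import torch.nn.functional as F

def srft_style_loss(logits_demo, demo_targets, logprobs_rollout, advantages,
                    entropy, alpha=1.0):
    """Hedged sketch of a single-stage SFT+RL objective with entropy-aware
    weights (weighting functions are assumptions, not SRFT's exact rule)."""
    # Supervised term on demonstration tokens (standard cross-entropy).
    sft_loss = F.cross_entropy(logits_demo.view(-1, logits_demo.size(-1)),
                               demo_targets.view(-1))
    # Policy-gradient term on self-exploration rollouts (REINFORCE-style).
    rl_loss = -(logprobs_rollout * advantages).mean()
    # Illustrative entropy-aware weights: lean more on imitation when the
    # policy is uncertain (high entropy), more on RL when it is confident.
    w_sft = torch.sigmoid(entropy)
    w_rl = 1.0 - 0.5 * torch.sigmoid(entropy)
    return alpha * w_sft * sft_loss + w_rl * rl_loss
```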
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4057. Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems
๐ฌ 15
โญ 5.00
๐ optimization
๐ท๏ธ Inverse problems data fidelity denoising image reconstruction regularization overfitting
Recovering true signals from noisy measurements is a central challenge in inverse problems spanning medical imaging, geophysics, and signal processing. Current solutions nearly always balance prior assumptions regarding the true signal (regularization) with agreement to noisy measured data (data fidelity). Conventional data fidelity loss functions, such as mean-squared error (MSE) or negative log-likelihood, seek pointwise agreement with noisy measurements, often leading to overfitting to noise. In this work, we instead evaluate data fidelity collectively by testing whether the observed measurements are statistically consistent with the noise distributions implied by the current estimate. We adopt this aggregated perspective and introduce $\textit{distributional consistency (DC) loss}$, a data-fidelity objective that replaces pointwise matching with distribution-level calibration. DC loss acts as a direct and practical plug-in replacement for standard data consistency terms: i) it is compatible with modern unsupervised regularizers that operate without paired measurement/ground-truth data, ii) it is optimized in the same way as traditional losses, and iii) it avoids overfitting to measurement noise even without the use of priors. Its scope naturally fits many practical inverse problems where the measurement-noise distribution is known and where the measured dataset consists of many independent noisy values. We demonstrate efficacy in two key example application areas: i) in image denoising with deep image prior, using DC instead of MSE loss removes the need for early stopping and achieves higher PSNR; ii) in medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances the efficacy of hand-crafted regularization. These results position DC loss as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for an important class of unsupervised noise-dominated inverse problems.
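As a rough illustration of "distribution-level calibration" for the common case of known Gaussian noise, one can map residuals through the noise CDF and penalize the deviation of their empirical distribution from uniform. This is a hedged stand-in sketch, not the paper's DC loss.

```python
import torch

def distributional_consistency_sketch(pred, measured, sigma=1.0):
    """Hedged sketch: under a correct estimate, standardized residuals are
    N(0, 1), so their probability-integral transform should be uniform.
    A Cramer-von-Mises-style penalty measures the mismatch (illustrative)."""
    resid = (measured - pred).flatten() / sigma
    u = 0.5 * (1.0 + torch.erf(resid / 2.0 ** 0.5))   # Gaussian CDF
    u_sorted, _ = torch.sort(u)
    n = u_sorted.numel()
    grid = (torch.arange(n, device=u.device, dtype=u.dtype) + 0.5) / n
    return ((u_sorted - grid) ** 2).mean()
```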
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4058. Prompt-Robust Vision-Language Models via Meta-Finetuning
๐ฌ 16
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language Models Prompt Learning Meta-learning
Vision-language models (VLMs) have demonstrated remarkable generalization across diverse tasks by leveraging large-scale image-text pretraining. However, their performance is notoriously unstable under variations in natural language prompts, posing a considerable challenge for reliable real-world deployment. To address this prompt sensitivity, we propose Promise, a meta-learning framework for prompt-Robust vision-language models via meta-finetuning, which explicitly learns to generalize across diverse prompt formulations. Our method operates in a dual-loop meta-finetuning setting: the inner loop adapts token embeddings based on a set of varied prompts, while the outer loop optimizes for generalization on unseen prompt variants. To further improve robustness, we introduce an adaptive prompt weighting mechanism that dynamically emphasizes more generalizable prompts and a token-specific learning rate module that fine-tunes individual prompt tokens based on contextual importance. We further establish that Promise's weighted and preconditioned inner update provably (i) yields a one-step decrease of the outer empirical risk together with a contraction of across-prompt sensitivity, and (ii) tightens a data-dependent generalization bound evaluated at the post-inner initialization. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and domain shift, our approach consistently reduces prompt sensitivity and improves performance stability over existing prompt learning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4059. Fair Conformal Classification via Learning Representation-Based Groups
๐ฌ 9
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Classification; Conformal Prediction; Equalized Coverage; Fairness
Conformal prediction methods provide statistically rigorous marginal coverage guarantees for machine learning models, but such guarantees fail to account for algorithmic biases, thereby undermining fairness and trust. This paper introduces a fair conformal inference framework for classification tasks. The proposed method constructs prediction sets that guarantee conditional coverage on adaptively identified subgroups, which can be implicitly defined through nonlinear feature combinations. By balancing effectiveness and efficiency in producing compact, informative prediction sets and ensuring adaptive equalized coverage across unfairly treated subgroups, our approach paves a practical pathway toward trustworthy machine learning. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4060. When Foundation Models are One-Liners: Limitations and Future Directions for Time Series Anomaly Detection
๐ฌ 15
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Time Series Foundation Model Anomaly Detection
Recent efforts have extended the foundation model paradigm from natural language to time series, raising expectations that pre-trained time-series foundation models generalize well across downstream tasks. In this work, we focus on time-series anomaly detection, in which time-series foundation models detect anomalies based on the reconstruction or forecasting error. Specifically, we critically examine the performance of five popular families of time-series foundation models: MOMENT, Chronos, TimesFM, Time-MoE, and TSPulse. We find that for each model family, across varying model sizes and context window lengths, anomaly detection performance does not differ significantly from simple one-liner baselines: moving-window variance and squared-difference. These findings suggest that the key assumptions underlying reconstruction-based and forecasting-based methodologies for time-series anomaly detection are not satisfied for time-series foundation models: anomalies are not consistently harder to reconstruct or forecast. The results suggest that current approaches for leveraging foundation models in anomaly detection are insufficient. Building upon our insights, we propose alternative directions to effectively detect anomalies using foundation models, thereby unlocking their full potential for time-series anomaly detection.
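The two "one-liner" baselines named in the abstract are straightforward to reproduce; a small NumPy sketch follows (the window length is an arbitrary choice, not taken from the paper).

```python
import numpy as np

def moving_window_variance(x, w=16):
    """Anomaly score: variance over a trailing window of length w."""
    pad = np.concatenate([np.full(w - 1, x[0]), x])
    windows = np.lib.stride_tricks.sliding_window_view(pad, w)
    return windows.var(axis=-1)

def squared_difference(x):
    """Anomaly score: squared first difference, padded to the input length."""
    return np.concatenate([[0.0], np.diff(x) ** 2])
```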
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4061. The Price of Amortized inference in Sparse Autoencoders
๐ฌ 4
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Polysemanticity Sparse Autoencoders Amortization Inference
Polysemy has long been a major challenge in Mechanistic Interpretability (MI), with Sparse Autoencoders (SAEs) emerging as a promising solution. SAEs employ a shared encoder to map inputs to sparse codes, thereby amortizing inference costs across all instances. However, this parameter-sharing paradigm inherently conflicts with the MI community's emphasis on instance-level optimality, including the consistency and stitchability of monosemantic features. We first reveal the trade-off relationships among various pathological phenomena, including feature absorption, feature splitting, dead latents, and dense latents under global reconstruction-sparsity constraints from the perspective of training dynamics, finding that increased sparsity typically exacerbates multiple pathological phenomena, and attribute this trade-off relationship to amortized inference. By reducing reliance on amortized inference through the introduction of semi-amortized and non-amortized approaches, we observed that various pathological indicators were significantly mitigated, thereby validating our hypothesis. As the first step in this direction, we propose Local Amortized SAE (LocA-SAE), a method that groups polysemantically close latents based on the angular variance. This method is designed to balance the computational cost of per-sample optimization with the limitations of amortized inference. Our work provides insights for understanding SAEs and advocates for a paradigm shift in future research on polysemy disentanglement. The code is available at \url{https://anonymous.4open.science/r/sae-amortization-5335}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4062. Bilevel Optimization with Lower-Level Uniform Convexity: Theory and Algorithm
๐ฌ 17
โญ 5.00
๐ optimization
๐ท๏ธ Bilevel Optimization Lower-level Uniform Convexity Theory Algorithm
Bilevel optimization is a hierarchical framework where an upper-level optimization problem is constrained by a lower-level problem, commonly used in machine learning applications such as hyperparameter optimization. Existing bilevel optimization methods typically assume strong convexity or Polyak-Łojasiewicz (PL) conditions for the lower-level function to establish non-asymptotic convergence to a solution with a small hypergradient. However, these assumptions may not hold in practice, and recent work (Chen et al. 2024) has shown that bilevel optimization is inherently intractable for general convex lower-level functions with the goal of finding small hypergradients.
In this paper, we identify a tractable class of bilevel optimization problems that interpolates between lower-level strong convexity and general convexity via lower-level uniform convexity. For uniformly convex lower-level functions with exponent $p\geq 2$, we establish a novel implicit differentiation theorem characterizing the hyperobjective's smoothness property. Building on this, we design a new stochastic algorithm, termed UniBiO, with provable convergence guarantees, based on an oracle that provides stochastic gradient and Hessian-vector product information for the bilevel problems. Our algorithm achieves $\widetilde{O}(\epsilon^{-5p+6})$ oracle complexity bound for finding $\epsilon$-stationary points. Notably, our complexity bounds match the optimal rates in terms of the $\epsilon$ dependency for strongly convex lower-level functions ($p=2$), up to logarithmic factors. Our theoretical findings are validated through experiments on synthetic tasks and data hyper-cleaning, demonstrating the effectiveness of our proposed algorithm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4063. Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LRM reasoning finetuning
Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic thinking tokens and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain---and often ignore---the \emph{spontaneous} repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the \emph{Echo of Prompt (EOP)}, as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the \emph{Echo Likelihood Gap} $\Delta\mathcal{L}$ as a computable proxy. This provides the missing theoretical link between early repetition, likelihood gains, and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop \emph{Echo-Distilled SFT (ED-SFT)} to instill an ``echo-then-reason'' pattern through supervised finetuning, and \emph{Echoic Prompting (EP)} to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length- and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer-to-answer-prefix attention in middle layers, consistent with an \emph{attention refocusing} mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and compute budgets, and find consistent gains over baselines.
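A hedged sketch of how one might estimate an echo likelihood gap with a Hugging Face causal LM: compare the answer's average token log-likelihood when the trace restates the question versus when it does not. The prompt templates and the token alignment are illustrative simplifications, not the paper's exact proxy.

```python
import torch

def echo_likelihood_gap(model, tokenizer, question, reasoning, answer, device="cpu"):
    """Hedged sketch of an Echo Likelihood Gap estimate (templates assumed)."""
    def answer_loglik(prefix):
        ids = tokenizer(prefix + answer, return_tensors="pt").input_ids.to(device)
        # Approximate answer-token count; ignores BOS/boundary tokenization effects.
        n_ans = len(tokenizer(answer).input_ids)
        with torch.no_grad():
            logits = model(ids).logits
        logp = torch.log_softmax(logits[0, :-1], dim=-1)
        tgt = ids[0, 1:]
        tok_lp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
        return tok_lp[-n_ans:].mean().item()

    with_echo = f"{question}\nRestating the question: {question}\n{reasoning}\n"
    without_echo = f"{question}\n{reasoning}\n"
    return answer_loglik(with_echo) - answer_loglik(without_echo)
```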
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4064. GraphUniverse: Enabling Systematic Evaluation of Inductive Generalization
๐ฌ 21
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Graph Neural Networks Synthetic Dataset Generation Graph Benchmarking
A fundamental challenge in graph learning is understanding how models generalize to new, unseen graphs. While synthetic benchmarks offer controlled settings for analysis, existing approaches are confined to single-graph, transductive settings where models train and test on the same graph structure. Addressing this gap, we introduce GraphUniverse, a framework for generating entire families of graphs to enable the first systematic evaluation of inductive generalization at scale. Our core innovation is the generation of graphs with persistent semantic communities, ensuring conceptual consistency while allowing fine-grained control over structural properties like homophily and degree distributions. This enables crucial but underexplored robustness tests, such as performance under controlled distribution shifts. Benchmarking a wide range of architectures, from GNNs to graph transformers and topological architectures, reveals that strong transductive performance is a poor predictor of inductive generalization. Furthermore, we find that robustness to distribution shift is highly sensitive not only to model architecture choice but also to the initial graph regime (e.g., high vs. low homophily). Beyond benchmarking, GraphUniverse's flexibility and scalability can facilitate the development of robust and truly generalizable architectures. An interactive demo is available at https://graphuniverse.streamlit.app.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4065. Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
๐ฌ 28
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Language Model Multi-modal QA Mechanistic Interpretability
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term "seeing but not believing," which is widespread across major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, and that making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4066. IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
๐ฌ 22
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ benchmark mllm web
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code.
However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications.
To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video.
IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains.
Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos).
This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code.
An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages.
Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35\%, as functional correctness (24.39\% IFS) lags significantly behind visual fidelity (64.25\% VFS).
These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research.
The benchmark and evaluation code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4067. RM-R1: Reward Modeling as Reasoning
๐ฌ 27
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Reward Model Reasoning Reinforcement Learning
Reward modeling is essential for aligning large language models with human preferences through reinforcement learning. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning into reward modeling significantly enhances RM's interpretability and performance. We introduce a new class of generative reward models, Reasoning Reward Models (ReasRMs), which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism -- self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve superior performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough analyses to understand the key ingredients of successful ReasRM training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4068. Learning Global Hypothesis Space for Enhancing Synergistic Reasoning Chain
๐ฌ 29
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Complex Reasoning
Chain-of-Thought (CoT) has emerged as an effective paradigm to enhance the reasoning ability of large language models (LLMs) in complex tasks. However, existing approaches still face two major challenges: (1) the lack of a global mechanism to integrate and interact across diverse reasoning hypotheses, and (2) the absence of structured analysis techniques to filter redundancy and extract salient reasoning features. To address these challenges, we propose GHS-TDA (Global Hypothesis Space with Topological Data Analysis), a two-stage reasoning framework that achieves synergistic enhancement through global information integration and topological feature analysis. Specifically, (i) Global Hypothesis Space (GHS) constructs a semantically enriched global hypothesis graph via agenda-driven multi-agent interactions, integrating diverse hypotheses and their semantic relations; (ii) Topological Data Analysis (TDA) applies persistent homology to extract multi-scale topological features, identify stable connected components and self-consistent loops, and derive a redundancy-free reasoning skeleton chain. GHS-TDA preserves reasoning diversity while leveraging topological stability to achieve self-aware convergence, ultimately producing high-confidence and interpretable reasoning paths. Experimental results show that GHS-TDA consistently outperforms existing methods across multiple benchmark datasets, demonstrating its effectiveness and competitiveness in complex reasoning scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4069. PoseX: AI Defeats Physics-based Methods on Protein Ligand Cross-Docking
๐ฌ 22
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ AI docking AI co-folding protein-ligand interaction cross docking
Recently, significant progress has been made in protein-ligand docking, especially in deep learning methods, and some benchmarks were proposed, such as PoseBench and PLINDER. However, these studies typically focus on the self-docking scenario, which is less practical in real-world applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges to convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open-source benchmark to evaluate both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, we curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; secondly, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); thirdly, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourthly, we established a public leaderboard to rank submitted models in real-time. We derived some key insights and conclusions through extensive experiments: (1) AI-based approaches consistently outperform physics-based methods in overall docking success rate. (2) Most intra- and intermolecular clashes of AI-based approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. (3) AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations, suggesting that stereochemical modeling greatly improves the structural plausibility of the predicted protein-ligand complexes. (4) Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co-folding methods, in future modeling efforts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4070. Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!
๐ฌ 23
โญ 5.00
๐ generative models
๐ท๏ธ Streaming Video Manipulation Drag-Style Manipulation
Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose \textbf{stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL)}, a new task that enables users to modify generated videos \emph{anytime} on \emph{anything} via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as both editing and animating video frames, each supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: \emph{i}) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; \emph{ii}) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, \textbf{DragStream}, comprising: \emph{i}) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; \emph{ii}) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference by selectively propagating visual cues during generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of our DragStream.
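A minimal sketch of the "distribution self-rectification" idea, assuming an AdaIN-style re-standardization of the current frame's latent toward neighboring-frame statistics; the exact rule used by DragStream is not specified in the abstract.

```python
import torch

def rectify_latent(current, neighbors, momentum=0.9, eps=1e-6):
    """Hedged sketch: re-standardize the dragged frame's latent so its
    per-channel statistics track the neighboring frames, limiting drift.

    current:   (C, H, W) latent of the frame being dragged
    neighbors: (N, C, H, W) latents of nearby context frames
    The AdaIN-style form and the momentum blend are illustrative assumptions.
    """
    ref_mean = neighbors.mean(dim=(0, 2, 3), keepdim=True).squeeze(0)
    ref_std = neighbors.std(dim=(0, 2, 3), keepdim=True).squeeze(0)
    cur_mean = current.mean(dim=(1, 2), keepdim=True)
    cur_std = current.std(dim=(1, 2), keepdim=True)
    rectified = (current - cur_mean) / (cur_std + eps) * ref_std + ref_mean
    return momentum * rectified + (1 - momentum) * current
```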
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4071. Incentivizing LLM Reasoning via Reinforcement Learning with Functional Monte Carlo Tree Search
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Reinforcement Learning Reasoning
In this work, we propose ***R**einforced **F**unctional **T**oken **T**uning* (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for initial reasoning capability; and (2) online reinforcement learning further allows the model to explore diverse reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks and highlight its strong generalization capability to other general domains. Moreover, the performance of RFTT exhibits consistent gains with increased test-time computation through additional search rollouts. Our code and dataset are available at anonymous url.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4072. A Noise is Worth Diffusion Guidance
๐ฌ 23
โญ 5.00
๐ generative models
๐ท๏ธ Classifier Free Guidance Diffusion Guidance Guidance Distillation Text to Image Synthesis
Diffusion models have demonstrated remarkable image generation capabilities, but their performance heavily relies on classifier-free guidance (CFG). While CFG significantly enhances image quality, evaluating both conditional and unconditional models at every denoising step leads to substantial computational overhead. Existing approaches mitigate this cost through distillation, training a student network to learn the guided predictions. In contrast, we take an orthogonal approach by refining the \textit{initial Gaussian noise}, a critical yet under-explored factor in the diffusion-based generation pipelines. Recent studies have explored noise optimization for specific tasks such as layout-conditioned generation and human preference alignment. However, whether refined noise alone can enable guidance-free high-quality image generation remains an open question. We introduce a noise refinement framework where a refining network is trained to minimize the difference between images generated by unguided sampling from the refined noise and those produced by guided sampling from the input Gaussian noise. Our method achieves CFG-like quality without modifying the diffusion model, preserving its prior knowledge and compatibility with techniques like DreamBooth LoRA. Additionally, the learned refining network generalizes across domains without retraining and seamlessly integrates with existing distilled models, further improving sample quality. Beyond its practical benefits, we provide an in-depth analysis of refined noise, offering insights into its role in the denoising process and its interaction with guidance. Our findings suggest that structured noise initialization is key to efficient and high-fidelity image synthesis. Code and weights will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4073. SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Language Models Reinforcement Learning Reasoning Cold-Start Preference Optimization Direct Preference Optimization (DPO) Self-Distillation
Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts a reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in cold start. Motivated by this, we propose $\textbf{SPECS}$, a $\textbf{S}$elf-distilled, $\textbf{P}$r$\textbf{e}$ference-based $\textbf{C}$old $\textbf{S}$tart framework that decouples multimodal learning: (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) performs preference-based training focused on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupling learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1\% and MathVista by 12.2\%. Additional experiments indicate that SPECS contributes to reducing in-distribution "stuckness," improving exploration, stabilizing training, and raising the performance ceiling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4074. OpenFly: A COMPREHENSIVE PLATFORM FOR AERIAL VISION-LANGUAGE NAVIGATION
๐ฌ 14
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-and-Language Navigation multimodal learning vision-language model
Aerial Vision-Language Navigation (VLN) seeks to guide UAVs by leveraging language instructions and visual cues, establishing a new paradigm for human-UAV interaction. However, the collection of VLN data demands extensive human
effort to construct trajectories and corresponding instructions, hindering the development of large-scale datasets and capable models. To address this problem, we propose OpenFly, a comprehensive platform for aerial VLN. Firstly, OpenFly integrates 4 rendering engines and advanced techniques for diverse environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports real-to-sim rendering, further enhancing the realism of our environments. Secondly, we develop a highly automated toolchain for aerial VLN data collection, streamlining point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Thirdly, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering samples of diverse scenarios and assets across 18 scenes. Moreover, we propose OpenFly-Agent, a keyframe-aware VLN
model emphasizing key observations to promote performance and reduce computations. For benchmarking, extensive experiments and analyses are conducted, where our navigation success rate outperforms others by 14.0% and 7.9% on the seen and unseen scenarios, respectively. The toolchain, dataset, and codes will be open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4075. Unified Multi-Modal Interactive and Reactive 3D Motion Generation via Rectified Flow
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multi-modal Generative Modeling Interactive Motion Reactive Motion Rectified Flow
Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a fundamental challenge for graphics, animation and embodied AI systems. Real-world applications such as VR/AR companions, social robotics and game agents require models capable of producing coordinated interpersonal behavior while flexibly switching between interactive and reactive generation. We introduce DualFlow, the first unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion generation on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a novel Retrieval-Augmented Generation (RAG) module for two-person motion that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use contrastive rectified flow objective to further sharpen alignment with conditioning signals and add synchronization loss to improve inter-person temporal coordination. Extensive evaluations across interactive, reactive, and multi-modal benchmarks demonstrate that DualFlow consistently improves motion quality, responsiveness, and semantic fidelity. DualFlow achieves state-of-the-art performance in two-person multi-modal motion generation, producing coherent, expressive, and rhythmically synchronized motion. Code will be released upon acceptance.
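For reference, the straight-path rectified-flow training objective that the abstract builds on can be written in a few lines; the model's conditioning interface here is an assumption, not DualFlow's actual API.

```python
import torch

def rectified_flow_loss(model, x_data, cond):
    """Generic rectified-flow step: regress the constant velocity along the
    straight line from noise to data (conditioning signature is assumed)."""
    noise = torch.randn_like(x_data)
    t = torch.rand(x_data.size(0), device=x_data.device)
    t_b = t.view(-1, *[1] * (x_data.dim() - 1))
    x_t = (1 - t_b) * noise + t_b * x_data     # straight path from noise to data
    target_velocity = x_data - noise           # constant along the path
    pred = model(x_t, t, cond)                 # model predicts the velocity field
    return ((pred - target_velocity) ** 2).mean()
```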
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4076. Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
๐ฌ 28
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reinforcement Learning Natural Language Reasoning
Reinforcement Learning (RL) has become a compelling way to strengthen the multi-step reasoning ability of Large Language Models (LLMs). However, prevalent RL paradigms still lean on sparse outcome-based rewards and limited exploration, which often drive LLMs toward repetitive and suboptimal reasoning patterns. In this paper, we study the central question of how to design exploration for LLM reasoning and introduce MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. Building on the idea of count-based exploration, MERCI leverages a lightweight Coin Flipping Network (CFN) to estimate pseudo-counts, and hence epistemic uncertainty, over reasoning trajectories, and converts them into an intrinsic reward that values novelty while preserving the learning signal from task rewards. We integrate MERCI into advanced RL frameworks such as Group Relative Policy Optimization (GRPO). Experiments on complex reasoning benchmarks demonstrate that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions. These results indicate that our targeted intrinsic motivation can make exploration reliable for language model reasoning.
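A hedged sketch of a Coin Flipping Network as used in count-based exploration: each output dimension regresses an independent random ±1 "coin" drawn per visit, so the squared norm of the prediction shrinks roughly like the inverse visit count. Sizes, the trajectory featurization, and the reward scaling are illustrative assumptions, not MERCI's exact design.

```python
import torch
import torch.nn as nn

class CoinFlipNetwork(nn.Module):
    """Hedged sketch of a CFN for pseudo-count-style intrinsic rewards."""
    def __init__(self, in_dim, d=32, hidden=256):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d))

    def update_loss(self, feats):
        # Fresh random +/-1 targets per visit; MSE makes the output track the
        # running coin average, whose magnitude decays with the visit count.
        coins = torch.randint(0, 2, (feats.size(0), self.d),
                              device=feats.device) * 2.0 - 1.0
        return ((self.net(feats) - coins) ** 2).mean()

    def intrinsic_reward(self, feats):
        with torch.no_grad():
            pred = self.net(feats)
        # ||f(x)||^2 / d roughly approximates 1/N(x): larger for rare traces.
        return pred.pow(2).mean(dim=-1).sqrt()
```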
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4077. Calibrating Verbalized Confidence with Self-Generated Distractors
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM uncertainty confidence calibration verbalized confidence
Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM's heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM's suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated, and therefore more usable, confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 runs outperforming self-consistency at 100. We include our code in the supplementary.
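The core normalization step is simple to state; a hedged sketch follows (the blend with a generator-consistency estimate and its weight are assumptions about how the two signals might be combined).

```python
def dinco_style_confidence(conf_answer, conf_distractors,
                           gen_consistency=None, beta=0.5):
    """Hedged sketch of distractor-normalized confidence.

    conf_answer:      verbalized confidence in [0, 1] for the candidate claim
    conf_distractors: confidences for self-generated incompatible alternatives
    gen_consistency:  optional fraction of sampled generations that agree
    """
    normalized = conf_answer / (conf_answer + sum(conf_distractors) + 1e-9)
    if gen_consistency is None:
        return normalized
    # Illustrative blend of validator-side and generator-side coherence.
    return beta * normalized + (1 - beta) * gen_consistency
```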
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4078. SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
๐ฌ 23
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Generalized Category Discovery Spectral Filtering Semi-Supervised Representation Learning
Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4079. WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Image Reconstruction Image Generation
The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior over an extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. On the ImageNet 50k validation set, at a high-fidelity setting, WeTok achieves a record-low zero-shot rFID of 0.12, outperforming leading continuous tokenizers like FLUX-VAE (0.18) and SD-VAE 3.5 (0.19) with 400% compression ratio. Furthermore, in a high-compression regime, WeTok achieves a zero-shot rFID of 3.49 at a 768× compression ratio, substantially surpassing Cosmos, which scores 4.57 at only 50% of our compression ratio.
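A minimal sketch of group-wise lookup-free quantization in the spirit of LFQ-style sign quantization; the group count and the sign-based codes are assumptions for illustration, not WeTok's exact design.

```python
import torch

def group_lookup_free_quantize(z, num_groups=8):
    """Hedged sketch: split channels into groups and quantize each channel to
    +/-1 by its sign (straight-through gradient), so each group indexes an
    implicit codebook of size 2^(C/num_groups) with no lookup table.

    z: (B, C, H, W) continuous latent, C divisible by num_groups
    """
    b, c, h, w = z.shape
    zg = z.view(b, num_groups, c // num_groups, h, w)
    hard = torch.where(zg >= 0, torch.ones_like(zg), -torch.ones_like(zg))
    quantized = zg + (hard - zg).detach()          # straight-through estimator
    # Integer token id per group and spatial position (e.g., for an AR prior).
    bits = (hard > 0).long()
    powers = 2 ** torch.arange(c // num_groups, device=z.device).view(1, 1, -1, 1, 1)
    token_ids = (bits * powers).sum(dim=2)
    return quantized.view(b, c, h, w), token_ids
```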
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4080. OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation
๐ฌ 30
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs data synthetic instruction tuning
Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data that imparts problem-solving capabilities. However, as applications expand, high-quality SFT data in knowledge-intensive verticals (e.g., humanities and social sciences, medicine, law, finance) is exceedingly scarce: expert curation is costly, privacy constraints are strict, and label consistency is hard to guarantee. Recent work turns to synthetic data, typically prompting a teacher model over domain documents and filtering with handcrafted rubrics. Yet, rubric design is expert-dependent and rarely transfers across domains; moreover, prevalent heuristic optimization follows a brittle loop (write rubric $\rightarrow$ synthesize $\rightarrow$ train $\rightarrow$ inspect $\rightarrow$ guess tweaks) that lacks reliable, quantitative feedback about a rubric's true contribution to downstream performance.
We argue for assessing synthetic data quality through its causal impact on the target model, using this feedback to guide data generation. Inspired by classic influence functions, we repurpose an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to the objective of a given target model on specific tasks. Our analysis reveals a gap: although synthetic and real samples may be close in embedding space, their influence on learning can differ substantially. Building on this insight, we propose an optimization-based synthetic data framework that adapts rubrics with target-model feedback. Instead of manually engineering domain rubrics, we supply lightweight guiding text and delegate rubric generation to a rubric-specialized model conditioned on the task; crucially, rubric (and data) selection is supervised by estimated downstream impact rather than proxy formality. Empirically, the framework yields consistent gains across domains (HSS and health), target models (e.g., Qwen and Llama families), and data generators, demonstrating broad generalization and engineering portability without task-specific tuning.
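A hedged sketch of a first-order, gradient-based influence estimate of the kind the abstract alludes to: the predicted change in validation loss if one synthetic sample were used for a single gradient step. Plain SGD is assumed here purely for illustration; the paper's estimator is optimizer-aware.

```python
import torch

def first_order_influence(model, loss_fn, sample_batch, val_batch, lr=1e-5):
    """Hedged sketch: influence ~ -lr * <grad(sample loss), grad(val loss)>."""
    def flat_grad(batch):
        model.zero_grad()
        loss_fn(model, batch).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()
                          if p.grad is not None])

    g_val = flat_grad(val_batch)       # gradient of the target-task objective
    g_sample = flat_grad(sample_batch) # gradient contributed by one synthetic sample
    return -lr * torch.dot(g_sample, g_val).item()
```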
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4081. JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
๐ฌ 15
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-Language Navigation Spatial Understanding Dual Implicit Memory
Vision-and-Language Navigation (VLN) requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models (MLLMs). However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain's semantic understanding and the right brain's spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value (KV) caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4082. Samples Are Not Equal: A Sample Selection Approach for Deep Clustering
๐ฌ 28
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Deep Clustering Clustering Sample Selection
Deep clustering has recently achieved remarkable progress across various domains. However, existing clustering methods typically treat all samples equally, neglecting the inherent differences in their feature patterns and learning states. Such redundant learning often drives models to overemphasize simple feature patterns in high-density regions, weakening their ability to capture complex yet diverse ones in low-density regions. To address this issue, we propose a novel plug-in designed to mitigate overfitting to simple and redundant feature patterns while encouraging the learning of more complex yet diverse ones. Specifically, we introduce a density-aware clustering head initialization strategy that adaptively adjusts each sample's contribution to cluster prototypes according to its local density in the feature space. This strategy mitigates the bias towards high-density regions and encourages a more comprehensive attention on medium- and low-density ones. Furthermore, we design a dynamic sample selection strategy that evaluates the learning state of samples based on the feature consistency and pseudo-label stability. By removing sufficiently learned samples and prioritizing unstable ones, this strategy adaptively reallocates training resources, enabling the model to consistently focus on samples that remain under-learned throughout training. Our method can be integrated as a plug-in into a wide range of deep clustering architectures. Extensive experiments on multiple benchmark datasets demonstrate that our method improves clustering accuracy by up to $\textbf{6.1}$\% and enhances training efficiency by up to $\textbf{1.3$\times$}$. $\textbf{Code is available in the supplementary material.}$
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4083. FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed
๐ฌ 33
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Zeroth-order optimization Large language models Fine-tuning Adaptive step size Batch gradient estimation Memory efficiency
Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633~GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually need tens of times more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD, for instance, demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer towards Adam-Scale Speed. On the one hand, FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step-sizes based on the standard deviation of batch losses. On the other hand, it accelerates per-batch computation through the use of Rademacher random vector (±1) perturbations, which also enables further speedups through batched evaluation. Extensive experiments on diverse models (including RoBERTa-large, the OPT family (350M-66B), Phi-2, and Llama3) across 11 varied downstream tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by +3% in accuracy while requiring 3$\times$ fewer forward passes. Notably, for the RoBERTa-large model, FZOO achieves average improvements of +5.6% in accuracy and an 18$\times$ reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO's formal equivalence to a normalized-SGD update rule and establishing its convergence guarantees. Beyond full-parameter tuning, FZOO plugs smoothly into PEFT techniques, unlocking even larger memory savings. Taken together, our results make single-GPU, high-speed, full-parameter fine-tuning realistic today and point toward future work on memory-efficient pre-training. Code: https://anonymous.4open.science/r/FZOO-5927
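The core update is concrete enough to sketch. Below is a minimal, illustrative zeroth-order step in the spirit of the abstract: batched one-sided finite-difference estimates with Rademacher (±1) perturbations, and a step size driven by the spread of the perturbed losses. The function names and the exact step-size rule are assumptions for illustration, not the paper's code.

```python
import numpy as np

def fzoo_like_step(params, loss_fn, n_dirs=8, eps=1e-3, base_lr=5e-2, rng=None):
    """One zeroth-order update in the spirit of FZOO (illustrative sketch only).

    Batched one-sided estimates: each direction costs a single extra forward
    pass, and Rademacher (+/-1) perturbations keep the directions cheap.
    """
    rng = rng or np.random.default_rng()
    f0 = loss_fn(params)                                   # one clean forward pass
    grad_est = np.zeros_like(params)
    perturbed_losses = []
    for _ in range(n_dirs):
        z = rng.choice([-1.0, 1.0], size=params.shape)     # Rademacher direction
        fz = loss_fn(params + eps * z)                     # one-sided finite difference
        perturbed_losses.append(fz)
        grad_est += (fz - f0) / eps * z
    grad_est /= n_dirs
    # The paper adapts the step size from the spread of the batch losses; the exact
    # rule is not reproduced here. As a stand-in, scale a normalized update by the
    # loss standard deviation.
    lr = base_lr * np.std(perturbed_losses) / (eps * np.linalg.norm(grad_est) + 1e-12)
    return params - lr * grad_est

# Toy usage on a 5-d quadratic.
loss = lambda w: float(np.sum((w - 1.0) ** 2))
w = np.zeros(5)
for _ in range(300):
    w = fzoo_like_step(w, loss)
print(np.round(w, 2))   # should be close to the optimum at 1.0
```

On this toy quadratic the iterate drifts toward the optimum using forward passes only, which is exactly the property that keeps ZO methods at inference-level memory.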
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4084. CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation
๐ฌ 13
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Emotional Image Content Generation Semantically-Coherent Sentence-level Guidance Hierarchical LoRA
Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. Methods specifically designed for EICG have also emerged, but they rely excessively on word-level attribute labels for guidance, suffering from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a Hierarchical Low-Rank Adaptation (HiLoRA) module to cohesively model both polarity-shared low-level features and emotion-specific high-level semantics. Extensive experiments demonstrate CoEmoGen's superiority in emotional faithfulness and semantic coherence from quantitative, qualitative, and user study perspectives. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images, providing endless inspiration for emotion-driven artistic creation. The dataset and code will be available on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4085. Expert Heads: Robust Evidence Identification for Large Language Models
๐ฌ 5
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Large language model Knowledge Integration Attention Mechanisms
Large language models (LLMs) exhibit strong abilities in multi-document reasoning, yet their evidence identification is highly sensitive to input order. We trace this limitation to attention mechanisms, where many heads overemphasize sequence boundaries and neglect central content.
We systematically analyze attention distributions under document permutations and discover a small subset of heads that consistently prioritize task-relevant documents regardless of position. We formalize these as Expert Heads, identified via activation frequency and stability across permutations.
Experiments on LLaMA, Mistral, and Qwen reveal architecture-specific patterns: mid-layer heads in LLaMA and Mistral dominate semantic integration, while deeper-layer heads in Qwen specialize in evidence selection. Moreover, Expert Heads exhibit concentrated focus during understanding and more distributed engagement during generation. Their activation strongly correlates with answer correctness, providing diagnostic signals for hallucination detection.
Leveraging Expert Heads for document voting significantly improves retrieval and ranking on HotpotQA, 2WikiMultiHopQA, and MuSiQue, outperforming dense retrievers and LLM-based ranking with minimal overhead. Ablations confirm that even a small subset achieves robust gains.
Our findings establish Expert Heads as a stable and interpretable mechanism for evidence integration, offering new directions for context pruning, hallucination mitigation, and head-guided training of LLMs.
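As a rough illustration of how such heads could be scored, the sketch below ranks attention heads by how often and how stably they attend to the gold document across input permutations. The statistics, threshold, and scoring rule are hypothetical stand-ins, not the paper's procedure.

```python
import numpy as np

def find_expert_heads(attn_to_gold, top_attn_flags, freq_thresh=0.8, k=8):
    """Score attention heads across document permutations (illustrative sketch).

    attn_to_gold:   array [n_perms, n_heads], attention mass each head places on
                    the gold (task-relevant) document under each permutation.
    top_attn_flags: boolean [n_perms, n_heads], whether the gold document was the
                    head's top-attended document in that permutation.
    The score combines activation frequency with cross-permutation stability.
    """
    freq = top_attn_flags.mean(axis=0)                    # how often the head fires on gold
    stability = 1.0 / (attn_to_gold.std(axis=0) + 1e-8)   # low variance across permutations
    score = np.where(freq >= freq_thresh, freq * stability, 0.0)
    return np.argsort(-score)[:k]                         # indices of candidate expert heads

# Toy usage with synthetic statistics for 32 heads over 20 permutations.
rng = np.random.default_rng(0)
attn = rng.uniform(0.0, 0.3, size=(20, 32))
attn[:, [3, 17]] = rng.uniform(0.6, 0.7, size=(20, 2))    # two position-robust heads
flags = attn > 0.5
print(find_expert_heads(attn, flags, k=2))                # -> heads 3 and 17 (order may vary)
```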
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4086. Multi-Marginal Flow Matching with Adversarially Learnt Interpolants
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ flow matching stochastic interpolants adversarial learning scRNA-seq trajectory inference
Learning the dynamics of a process given sampled observations at several time points is an important but difficult task in many scientific applications.
When no ground-truth trajectories are available, but one has only snapshots of data taken at discrete time steps, the problem of modelling the dynamics, and thus inferring the underlying trajectories, can be solved by multi-marginal generalisations of flow matching algorithms.
This paper proposes a novel flow matching method that overcomes the limitations of existing multi-marginal trajectory inference algorithms.
Our proposed method, ALI-CFM, uses a GAN-inspired adversarial loss to fit neurally parametrised interpolant curves between source and target points such that the marginal distributions at intermediate time points are close to the observed distributions.
The resulting interpolants are smooth trajectories that, as we show, are unique under mild assumptions.
These interpolants are subsequently marginalised by a flow matching algorithm, yielding a trained vector field for the underlying dynamics.
We showcase the versatility and scalability of our method by outperforming the existing baselines on spatial transcriptomics and cell tracking datasets, while performing on par with them on single-cell trajectory prediction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4087. Evolution of Concepts in Language Model Pre-Training
๐ฌ 33
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Large Language Model; Pre-Training; Mechanistic Interpretability; Training Dynamics; Crosscoder
Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this paper, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4088. Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Understanding
๐ฌ 13
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ RNN Recurrent LLM Long Context Modeling Large Language Model
Recurrent large language models (Recurrent LLMs) offer linear computational complexity as efficient alternatives to quadratic self-attention-based LLMs (Self-Attention LLMs). However, Recurrent LLMs underperform on long-context tasks due to limited fixed-size memory. Previous research focused on architectural innovations to enhance memory capacity, but failed to match Self-Attention LLM performance. We argue this limitation stems from the fact that processing entire contexts at once is ill-suited to Recurrent LLMs. We propose Smooth Reading, a co-design of recurrent architecture and inference method. It introduces an end-to-end multi-round inference method that processes context incrementally and iteratively summarizes information, reducing memory demands.
Methodologically, we reveal architecture-inference interactions play an important role for performance, efficiency and scalability, shedding light on future Recurrent LLM design.
Besides, our method substantially bridges the performance gap between Recurrent and Self-Attention LLMs on long-context tasks while preserving efficiency advantages.
Smooth Reading boosts SWA-3B-4k from 5.68% lower to 3.61% higher performance than Self-Attention LLMs on LongBench, while maintaining 2.5× faster training and 2× faster inference at 64k context.
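A minimal sketch of what incremental multi-round inference can look like is given below; `llm_generate` is a hypothetical prompt-to-text callable and the prompting scheme only illustrates the chunk-summarize-answer pattern, not the paper's trained procedure.

```python
def smooth_read(llm_generate, document, question, chunk_size=4096):
    """Incremental multi-round reading loop (sketch; `llm_generate` is a stand-in
    for any prompt-to-text callable, not the paper's actual interface).

    Instead of feeding the whole context at once, the context is consumed chunk
    by chunk and compressed into a running summary, keeping the recurrent state small.
    """
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    summary = ""
    for chunk in chunks:
        prompt = (
            f"Question: {question}\n"
            f"Notes so far: {summary}\n"
            f"New passage: {chunk}\n"
            "Update the notes, keeping only information relevant to the question."
        )
        summary = llm_generate(prompt)
    return llm_generate(f"Question: {question}\nNotes: {summary}\nAnswer the question.")

# Toy usage with a trivial stand-in "model" that just echoes the last prompt line.
fake_llm = lambda prompt: prompt.splitlines()[-1][:64]
print(smooth_read(fake_llm, "some very long document " * 500, "What is discussed?"))
```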
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4089. Image is All You Need: Towards Efficient and Effective Large Language Model-Based Recommender Systems
๐ฌ 18
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Recommender Systems Large Language Models Sequence Modeling
Large Language Models (LLMs) have recently emerged as a powerful backbone for recommender systems. Existing LLM-based recommender systems take two different approaches for representing items in natural language, i.e., Attribute-based Representation and Description-based Representation. In this work, we aim to address the trade-off between efficiency and effectiveness that these two approaches encounter, when representing items consumed by users. Based on our observation that there is a significant information overlap between images and descriptions associated with items, we propose a novel method, **I**mage is all you need for **LLM**-based **Rec**ommender system (I-LLMRec). Our main idea is to leverage images as an alternative to lengthy textual descriptions for representing items, aiming at reducing token usage while preserving the rich semantic information of item descriptions.
Through extensive experiments on real-world Amazon datasets, we demonstrate that, by leveraging images, I-LLMRec outperforms existing methods that rely on textual descriptions for representing items in both efficiency and effectiveness. Moreover, a further appeal of I-LLMRec is its ability to reduce sensitivity to noise in descriptions, leading to more robust recommendations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4090. Gelato: Graph Edit Distance via Autoregressive Neural Combinatorial Optimization
๐ฌ 15
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph edit distance neural combinatorial optimization graph matching graph neural networks
The graph edit distance (GED) is a widely used graph dissimilarity measure that quantifies the minimum cost of the edit operations required to transform one graph into another. Computing it, however, involves solving the associated NP-hard graph matching problem. Indeed, exact solvers already struggle to handle graphs with more than 20 nodes and classical heuristics frequently produce suboptimal solutions. This motivates the development of machine-learning methods that exploit recurring patterns in problem instances to produce high-quality approximate solutions. In this work, we introduce Gelato, a graph neural network model that constructs GED solutions incrementally by predicting a pair of nodes to be matched at each step. By conditioning each prediction autoregressively on the previous choices, it is able to capture complex structural dependencies. Empirically, Gelato achieves state-of-the-art results, even when generalizing to graphs larger than the ones seen during training, and runs orders of magnitude faster than competing ML-based methods. Moreover, it remains effective even under limited or noisy supervision, alleviating the demand for costly ground-truth generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4091. MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
๐ฌ 29
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Models Affective Computing
Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework.
Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only $39.3\%$ recognition score and $56.0\%$ Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (\emph{e.g.}, Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (\emph{e.g.}, R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs' emotional intelligence in the future.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4092. Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction
๐ฌ 20
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Active viewpoint selection; efficient 3D reconstruction; viewpoint sampling; uncertainty-driven learning
Imagine trying to understand the shape of a teapot by viewing it from the front: you might see the spout, but completely miss the handle. Some perspectives naturally provide more information than others. How can an AI system determine which viewpoint offers the most valuable insight for accurate and efficient 3D object reconstruction? Active view selection (AVS) for 3D reconstruction remains a fundamental challenge in computer vision. The aim is to identify the minimal set of views that yields the most accurate 3D reconstruction.
Instead of learning radiance fields, like NeRF or 3D Gaussian Splatting, from a current observation and computing uncertainty for each candidate viewpoint, we introduce a novel AVS approach guided by neural uncertainty maps predicted by a lightweight feedforward deep neural network, named UPNet.
UPNet takes a single input image of a 3D object and outputs a predicted uncertainty map, representing uncertainty values across all possible candidate viewpoints. By leveraging heuristics derived from observing many natural objects and their associated uncertainty patterns, we train UPNet to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representations.
Next, our approach aggregates all previously predicted neural uncertainty maps to suppress redundant candidate viewpoints and effectively select the most informative one. Using these selected viewpoints, we train 3D neural rendering models and evaluate the quality of novel view synthesis against other competitive AVS methods. Remarkably, despite using half as many viewpoints as the upper bound, our method achieves comparable reconstruction accuracy. In addition, it significantly reduces computational overhead during AVS, achieving up to a 400 times speedup along with over 50\% reductions in CPU, RAM, and GPU usage compared to baseline methods. Notably, our approach generalizes effectively to AVS tasks involving novel object categories, without requiring any additional training. We will release all code, models, and datasets.
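The selection loop itself is simple once uncertainty maps are available. The sketch below greedily picks the next viewpoint from aggregated maps; the aggregation rule (element-wise minimum) and the synthetic maps are assumptions for illustration, since in the paper the maps come from UPNet.

```python
import numpy as np

def select_views(uncertainty_maps, n_views=5):
    """Greedy view selection from predicted uncertainty maps (illustrative sketch).

    uncertainty_maps: array [n_candidates, n_candidates]; row i plays the role of
    the uncertainty predicted over all candidate viewpoints once viewpoint i has
    been observed (synthetic here; the real maps come from the network).
    Previously predicted maps are aggregated by an element-wise minimum, which
    suppresses viewpoints already explained by earlier observations.
    """
    selected = [0]                                      # start from an arbitrary first view
    aggregate = uncertainty_maps[0].copy()
    while len(selected) < n_views:
        aggregate = np.minimum(aggregate, uncertainty_maps[selected[-1]])
        aggregate[selected] = -np.inf                   # never re-pick an observed view
        selected.append(int(np.argmax(aggregate)))      # most uncertain remaining viewpoint
    return selected

# Toy usage with random maps over 20 candidate viewpoints.
rng = np.random.default_rng(1)
maps = rng.uniform(size=(20, 20))
print(select_views(maps, n_views=5))
```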
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4093. CroCoDiLight: Repurposing Cross-View Completion Encoders for Relighting
๐ฌ 15
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ cross-view completion relighting intrinsic image estimation albedo estimation shadow removal
Cross-view completion (CroCo) has proven effective as pre-training for geometric downstream tasks such as stereo depth, optical flow, and point cloud prediction. In this paper we show that it also learns photometric understanding due to training pairs with differing illumination. We propose a method to disentangle CroCo latent representations into a single latent vector representing illumination and patch-wise latent vectors representing intrinsic properties of the scene. To do so, we use self-supervised cross-lighting and intrinsic consistency losses on a dataset two orders of magnitude smaller than that used to train CroCo. This comprises pixel-wise aligned, paired images under different illumination. We further show that the lighting latent can be used and manipulated for tasks such as interpolation between lighting conditions, shadow removal, and albedo estimation. This clearly demonstrates the feasibility of using cross-view completion as pre-training for photometric downstream tasks where training data is more limited.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4094. Almost Bayesian: Dynamics of SGD Through Singular Learning Theory
๐ฌ 10
โญ 5.00
๐ learning theory
๐ท๏ธ singular learning theory SGD gradient noise gradient descent Fokker-Planck training dynamics Bayes Bayesian
The nature of the relationship between Bayesian sampling and stochastic gradient descent in neural networks has been a long-standing open question in the theory of deep learning. We shed light on this question by modeling the long runtime behaviour of SGD as diffusion on porous media. Using singular learning theory, we show that the late stage dynamics are strongly impacted by the degeneracies of the loss surface. From this we are able to show that under reasonable choices of hyperparameters for vanilla SGD, the local steady state distribution of SGD (if it exists) is effectively a tempered version of the Bayesian posterior over the weights which accounts for local accessibility constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4095. Tackling the XAI Disagreement Problem with Adaptive Feature Grouping
๐ฌ 10
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Explainability Disagreements Functional Decomposition Feature Groups
Post-hoc explanations aim at understanding which input features (or groups thereof) are the most impactful toward certain model decisions. Many such methods have been proposed (ArchAttribute, Occlusion, SHAP, RISE, LIME, Integrated Gradient) and it is hard for practitioners to understand the differences between them. Even worse, faithfulness metrics, often used to quantitatively compare explanation methods, also exhibit inconsistencies. To address these issues, recent work has unified explanation methods through the lens of Functional Decomposition. We extend such work to scenarios where input features are partitioned into groups (e.g. pixel patches) and prove that disagreements between explanation methods and faithfulness metrics are caused by between-group interactions. Crucially, getting rid of between-group interactions would lead to a single explanation that is optimal according to all faithfulness metrics. We finally show how to reduce the disagreements by adaptively grouping features/pixels on tabular/image data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4096. Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Method
๐ฌ 14
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ visual reasoning benchmark thinking with images MLLM
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reach 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code and data will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4097. How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data
๐ฌ 16
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Muon Shampoo Spectral Gradient Descent Generalization
The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD): each update step is $\mathbf{U}\mathbf{V}^T$, where $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all principal components of the data at equal rates. We demonstrate how this translates to a growing gap in balanced accuracy favoring SpecGD early in training and further show that the gap remains consistent even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, like Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our findings that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data's underlying components.
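The canonical SpecGD step described above is easy to state in code: take the (optionally truncated) SVD of the gradient and step along $\mathbf{U}\mathbf{V}^T$, so every singular direction moves at the same rate. The following is a minimal numpy sketch on a toy matrix least-squares problem; names and hyperparameters are illustrative.

```python
import numpy as np

def specgd_step(W, grad, lr=0.05, rank=None):
    """Spectral gradient descent update (sketch of the canonical form above).

    The step direction is U V^T, where U S V^T is the SVD of the gradient, so
    all singular directions are followed at an equal rate.
    """
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is not None:                      # optional truncation to the top-r directions
        U, Vt = U[:, :rank], Vt[:rank, :]
    return W - lr * (U @ Vt)

# Toy usage: matrix-valued least squares, gradient of 0.5 * ||W - W_true||_F^2.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((6, 4))
W = np.zeros((6, 4))
for _ in range(100):
    W = specgd_step(W, W - W_true, lr=0.05)
print(np.linalg.norm(W - W_true))             # small, on the order of the step size
```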
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4098. RD-HRL: Generating Reliable Sub-Goals for Long-Horizon Sparse-Reward Tasks
๐ฌ 22
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Hierarchical Reinforcement Learning Sub-goal Key States Choice Learning Reinforcement Learning Goal-conditioned Reinforcement Learning
Long-horizon sparse-reward tasks, such as goal-conditioned or robot manipulation tasks, remain challenging in offline reinforcement learning due to the credit assignment problem. Hierarchical methods have been proposed to tackle this problem by introducing sub-goal planning guided by value functions, which in principle can shorten the effective planning horizon for both high-level and low-level planners, thereby avoiding the credit assignment problem. However, we demonstrate that the sub-goal selection mechanism is unreliable, as it relies on value functions suffering from generalization noise, which misguides value estimation and thus leads to sub-optimal sub-goals. In this work, to provide more reliable sub-goals, we introduce a novel reliability-driven decision mechanism and propose Reliability-Driven HRL (RD-HRL) as the solution. The reliability-driven decision mechanism provides decision-level targets from transition regions, thereby offering noise-immune decision spaces for the high-level policy and ensuring the reliability of sub-goals (termed action-level targets in this paper). Comprehensive experimental results demonstrate that our approach RD-HRL outperforms baseline methods across multiple benchmarks, highlighting the competitive advantages of RD-HRL. Our code is anonymously available at \url{https://anonymous.4open.science/r/RD-HRL-243D}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4099. Dichotomous Diffusion Policy Optimization
๐ฌ 17
โญ 5.00
๐ reinforcement learning
๐ท๏ธ reinforcement learning diffusion model autonomous driving robotics
Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large number of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of the dichotomous policies during inference, thereby enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
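The inference-time idea, combining the scores of the two dichotomous policies linearly, can be sketched as below; the specific combination rule and the toy Gaussian scores are assumptions chosen to illustrate the guidance-like control over greediness, not the paper's exact formula.

```python
import numpy as np

def dichotomous_score(score_max, score_min, x_t, t, greediness=1.0):
    """Combine the scores of the two dichotomous policies at inference time
    (illustrative sketch; the paper's exact combination rule is not reproduced).

    score_max / score_min: callables returning the score (gradient of log-density)
    of the reward-maximizing and reward-minimizing diffusion policies.
    Larger `greediness` pushes sampled actions further toward high reward.
    """
    s_max = score_max(x_t, t)
    s_min = score_min(x_t, t)
    return s_max + greediness * (s_max - s_min)

# Toy usage with two Gaussian "policies" centred at +1 and -1.
s_hi = lambda x, t: -(x - 1.0)        # score of N(+1, 1)
s_lo = lambda x, t: -(x + 1.0)        # score of N(-1, 1)
x = np.zeros(3)
print(dichotomous_score(s_hi, s_lo, x, t=0, greediness=2.0))  # points toward the +1 mode
```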
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4100. AttTok: Marrying Attribute Tokens with Generative Pre-trained Vision-Language Models towards Medical Image Understanding
๐ฌ 17
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Medical generative pre-trained models medical Multi-Modal alignment medical VQA instruction tuning
Recent generative pre-trained vision-language (GPTv) models have achieved remarkable success in multi-modal understanding, inspiring their adaptation to medical imaging tasks such as disease diagnosis and visual question answering (VQA). However, current instruction-tuned GPTv models suffer from two key challenges: (1) medical attributes (e.g., disease names, severity grades) are encoded as plain text tokens, collapsing semantically distinct concepts into nearly identical textual sequences; and (2) inadequate textual supervision weakens visual representation learning, leading to severe inter-attribute confusion and misaligned vision-language embeddings. To address these limitations, we introduce attribute tokens (AttTok), a set of pre-defined special tokens that uniquely encode clinical attributes (e.g., imaging modality, diagnosis, severity) within a structured token space. Complemented by attribute-centric embedding books, AttTok serves as anchor points for aligning both visual and textual modalities into a shared, discriminative representation space. Building on this foundation, we design two key components: an attribute-centric cross attention (ACC) adapter, which breaks the vision-to-text information-flow bottleneck and enriches the visual encoder with discriminative attribute knowledge, and an attribute-centric matching (ACM) loss, which enforces robust multi-modal alignment centered on the attribute tokens. Extensive experiments on five medical classification benchmarks and three VQA datasets demonstrate that AttTok substantially improves both discriminative accuracy and medical knowledge reasoning, establishing a new paradigm for medical GPTv models with clinically discriminative understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4101. Grasp Any Region: Prompting MLLM to Understand the Dense World
๐ฌ 14
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ image caption benchmark region understanding
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle with the dense world, i.e., complex scenes requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GARBench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Empirically, GAR-1B not only maintains state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GARBench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong comprehension capabilities can be easily transferred to videos. Code and data will be released to the community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4102. Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts
๐ฌ 22
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Models Estimation Error Convergence Analysis Mixture of Experts
Recently, diffusion models have achieved strong performance with a small dataset of size $n$ and a fast optimization process. Despite the impressive performance, the estimation error suffers from the curse of dimensionality $n^{-1/D}$, where $D$ is the data dimension. Since images are usually a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with Gaussian latents and achieve a $1/\sqrt{n}$ bound. Though this modeling reflects the multi-manifold property of data, the Gaussian latent cannot capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussians (MoLR-MoG) modeling, which models the target data as a union of $K$ linear subspaces, where each subspace admits a mixture of Gaussian latent ($n_k$ modes with dimension $d_k$). With this modeling, the corresponding score function naturally has a mixture-of-experts (MoE) structure, captures the multi-modal information, and contains nonlinear properties since each expert is a nonlinear latent MoG score. We first conduct real-world experiments to show that the generation results of the MoE-latent MoG NN are much better than those of the MoE-latent Gaussian score. Furthermore, the MoE-latent MoG NN achieves performance comparable to an MoE-latent Unet with $10\times$ the parameters. These results indicate that the MoLR-MoG modeling is reasonable and suitable for real-world data. After that, based on such an MoE-latent MoG score, we provide a $R^4\sqrt{\sum_{k=1}^K n_k}\sqrt{\sum_{k=1}^K n_k d_k}/\sqrt{n}$ estimation error, which escapes the curse of dimensionality by exploiting the data structure. Finally, we study the optimization process and prove a convergence guarantee under the MoLR-MoG modeling. Combined, under a setting close to real-world data, these results explain why diffusion models require only a small training sample and enjoy a fast optimization process to achieve strong performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4103. Closing the Modality Gap Aligns Group-Wise Semantics
๐ฌ 22
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multimodal Learning Representation Alignment Modality Gap
In multimodal learning, CLIP has been recognized as the \textit{de facto} method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is more pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general $n$-modal case. Through our extensive evaluation, we prove our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
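For intuition, a very simple way to shrink the modality gap is to recentre both embedding clouds onto a common centroid before re-normalising; this baseline is shown below purely for illustration and is not the method proposed in the paper.

```python
import numpy as np

def close_gap_by_recentring(img_emb, txt_emb):
    """A simple modality-gap baseline (not the paper's method): translate each
    modality so both embedding clouds share a common centroid, then re-normalise
    to the unit sphere as CLIP embeddings usually are.
    """
    mid = 0.5 * (img_emb.mean(axis=0) + txt_emb.mean(axis=0))
    img_c = img_emb - img_emb.mean(axis=0) + mid
    txt_c = txt_emb - txt_emb.mean(axis=0) + mid

    def unit(e):
        return e / np.linalg.norm(e, axis=1, keepdims=True)

    return unit(img_c), unit(txt_c)

# Toy usage: two clouds separated by a constant offset (the "gap").
rng = np.random.default_rng(0)
base = rng.standard_normal((100, 32))
img, txt = base + 4.0, base - 4.0           # image embeddings shifted away from text
img2, txt2 = close_gap_by_recentring(img, txt)
gap_before = np.linalg.norm(img.mean(0) - txt.mean(0))
gap_after = np.linalg.norm(img2.mean(0) - txt2.mean(0))
print(round(float(gap_before), 3), round(float(gap_after), 3))  # centroid distance collapses
```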
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4104. GenCompositor: Generative Video Compositing with Diffusion Transformer
๐ฌ 17
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion Models Video Editing Video Compositing
Video compositing combines live-action footage to create video production, serving as a crucial technique in video creation and film production. Traditional pipelines require intensive labor efforts and expert collaboration, resulting in lengthy production cycles and high manpower costs. To address this issue, we automate this process with generative models, called generative video compositing. This new task strives to adaptively inject identity and motion information of foreground video to the target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added in final video. Specifically, we designed a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To maintain consistency of the target video before and after editing, we revised a light-weight DiT-based background preservation branch with masked token injection. As to inherit dynamic elements from other sources, a DiT fusion block is proposed using full self-attention, along with a simple yet effective foreground augmentation for training. Besides, for fusing background and foreground videos with different layouts based on user control, we developed a novel position embedding, named Extended Rotary Position Embedding (ERoPE). Finally, we curated a dataset comprising 61K sets of videos for our new task, called VideoComp. This data includes complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4105. Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation
๐ฌ 22
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-Language-Actions Efficient Robotic Manipulations
Robotic manipulation with Vision-Language-Action models requires efficient inference over long-horizon multi-modal context, where attention to dense visual tokens dominates computational cost. Existing methods optimize inference speed by reducing visual redundancy within VLA models, but they overlook the varying redundancy across robotic manipulation stages. We observe that visual token redundancy is higher in the coarse manipulation phase than in fine-grained operations, and is strongly correlated with the action dynamics.
Motivated by this observation, we propose Action-aware Dynamic Pruning (ADP), a multi-modal pruning framework that integrates text-driven token selection with action-aware trajectory gating. ADP introduces a gating mechanism that conditions the pruning signal on recent action trajectories, using past motion windows to adaptively adjust token retention ratios in accordance with dynamics, thereby balancing computational efficiency and perceptual precision across different manipulation stages.
Extensive experiments on the LIBERO suites and diverse real-world scenarios demonstrate that our method significantly reduces FLOPs and action inference latency (e.g. 1.35× speedup on OpenVLA-OFT) while maintaining competitive success rates (e.g. 25.8% improvement with OpenVLA) compared to baselines, thereby providing a simple plug-in path to efficient robot policies that advances the efficiency and performance frontier of robotic manipulation.
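A toy version of the action-aware gate can make the idea concrete: map the magnitude of recent motion to a visual-token retention ratio and prune by text relevance. The gating function, constants, and scoring below are hypothetical; in ADP the gate is learned and conditioned on the action trajectory.

```python
import numpy as np

def retention_ratio(action_window, r_min=0.3, r_max=0.9):
    """Map recent action dynamics to a visual-token retention ratio (illustrative gate).

    Large recent motion -> coarse phase -> keep few tokens (high redundancy);
    small, precise motion -> fine-grained phase -> keep most tokens.
    """
    speed = np.linalg.norm(np.diff(action_window, axis=0), axis=1).mean()
    gate = np.exp(-4.0 * speed)                 # squashes speed into (0, 1]
    return r_min + (r_max - r_min) * gate

def prune_tokens(tokens, text_scores, ratio):
    """Keep the top-`ratio` fraction of visual tokens ranked by text relevance."""
    k = max(1, int(round(ratio * len(tokens))))
    keep = np.argsort(-text_scores)[:k]
    return tokens[np.sort(keep)]

# Toy usage: 64 visual tokens, a coarse (fast) motion window vs. a fine (slow) one.
rng = np.random.default_rng(0)
tokens, scores = rng.standard_normal((64, 16)), rng.uniform(size=64)
fast = np.cumsum(rng.standard_normal((8, 7)) * 0.5, axis=0)    # large displacements
slow = np.cumsum(rng.standard_normal((8, 7)) * 0.01, axis=0)   # small displacements
print(prune_tokens(tokens, scores, retention_ratio(fast)).shape)  # few tokens kept
print(prune_tokens(tokens, scores, retention_ratio(slow)).shape)  # most tokens kept
```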
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4106. Grounding and Enhancing Informativeness and Utility in Dataset Distillation
๐ฌ 33
โญ 5.00
๐ generative models
๐ท๏ธ Explainable AI Generative AI
Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define \textit{optimal dataset distillation} mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1\% performance improvement over the previous state-of-the-art approach on ImageNet-1K dataset using ResNet-18.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4107. Mode-conditioning unlocks superior test-time compute scaling
๐ฌ 10
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ test-time compute reasoning diversity
Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples reproduce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves an 8× efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.
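Mechanically, mode-conditioning with prefixes amounts to splitting the sampling budget across a set of mode prompts, as in the sketch below; `generate` and the mode prefixes are placeholders, and the paper additionally covers specialist models and label-free mode discovery via gradient clustering.

```python
def mode_conditioned_sample(generate, prefixes, n_samples, prompt):
    """Allocate a parallel-sampling budget across reasoning-mode prefixes
    (sketch; `generate` and the prefixes are placeholders, not the paper's setup).

    Each mode gets an equal slice of the budget, so repeated samples no longer
    collapse onto the model's single dominant reasoning mode.
    """
    per_mode = max(1, n_samples // len(prefixes))
    outputs = []
    for prefix in prefixes:
        for _ in range(per_mode):
            outputs.append(generate(prefix + prompt))
    return outputs

# Toy usage with a stand-in "model" that just reports which mode prefix it saw.
fake_generate = lambda p: p.split(":")[0]
modes = ["algebraic:", "case-analysis:", "geometric:"]
samples = mode_conditioned_sample(fake_generate, modes, n_samples=9, prompt=" solve x^2=4")
print(samples)   # three samples per reasoning mode
```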
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4108. Learning Efficient and Interpretable Multi-Agent Communication
๐ฌ 12
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Multi-agent communication reinforcement learning contrastive learning language grounding
Effective communication is crucial for multi-agent cooperation in partially observable environments. However, a fundamental trilemma exists among task performance, communication efficiency, and human interpretability. To resolve this, we propose a multi-agent communication framework via $\textbf{G}$rounding $\textbf{L}$anguage and $\textbf{C}$ontrastive learning (GLC) to learn efficient and interpretable communication protocols. Specifically, GLC employs an autoencoder to learn discretized and compressed communication symbols, ensuring high communication efficiency. These symbols are then semantically aligned with human concepts using data generated by a Large Language Model (LLM), making them human-interpretable. Furthermore, a contrastive learning objective is introduced to ensure consistency and mutual intelligibility among all agents, thereby securing high task utility. GLC dynamically balances these objectives by the Information Bottleneck principle. Extensive experiments show that GLC outperforms state-of-the-art methods across multiple benchmarks, delivering superior task performance, higher communication efficiency, and enhanced human interpretability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4109. Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions
๐ฌ 19
โญ 5.00
๐ learning theory
๐ท๏ธ contextual bandit online learning
Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4110. When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
๐ฌ 20
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Alignment Jailbreak Large Language Model
Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating $32$ LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs, six fine-tuning style settings, and two real-world instruction-tuning datasets, SafeStyle consistently outperforms baselines in maintaining LLM safety.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4111. Exposing Mixture and Annotating Confusion for Active Universal Test-Time Adaptation
๐ฌ 17
โญ 5.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Test-Time Adaptation Open-set
Universal Test-Time Adaptation (UTTA) tackles the challenge of handling both class and domain shifts in unsupervised settings with streaming test data. Currently, most UTTA methods can only deal with minor shifts and heavily rely on heuristic approaches. To advance UTTA under dual shifts, we propose a novel Active Universal Test-Time Adaptation (AUTTA) framework, Exposing Mixture and Annotating Confusion (EMAC), which incorporates active human annotation into the UTTA setting. To select appropriate samples for annotation in AUTTA, we first identify the mixed regions of target domain samples under dual shifts, highlighting potential candidate samples. We then design a reward-guided active selection strategy to prioritize annotating the most representative samples within this set, maximizing annotation effectiveness. Additionally, to balance the use of pseudo-labels with the limited number of annotations, we propose an adaptation objective designed to address the adaptation imbalance caused by annotation scarcity. Extensive experiments show that the proposed AUTTA approach significantly improves performance and achieves state-of-the-art results.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4112. Learning To Draft: Adaptive Speculative Decoding with Reinforcement Learning
๐ฌ 4
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ speculative decoding reinforcement learning
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency.
We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4\%.
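For context, a single draft-and-verify cycle of vanilla speculative decoding is sketched below (with per-token verification for brevity rather than one batched target pass); LTD's contribution, learning to set quantities such as the draft length adaptively with RL, is not reproduced here, and both model callables are hypothetical stand-ins.

```python
import numpy as np

def draft_and_verify(draft_next, target_probs, prefix, gamma=4, rng=None):
    """One speculative-decoding cycle (generic sketch, not LTD's learned policy).

    draft_next(ctx)   -> (token, q): the draft model's proposal and its probability.
    target_probs(ctx) -> dict token -> p: the target model's next-token distribution.
    `gamma` is the number of draft tokens per verification call; LTD chooses such
    quantities adaptively, which is omitted here.
    """
    rng = rng or np.random.default_rng()
    drafted, qs = [], []
    for _ in range(gamma):                         # cheap draft phase
        tok, q = draft_next(prefix + drafted)
        drafted.append(tok)
        qs.append(q)
    accepted = []
    for tok, q in zip(drafted, qs):                # verification phase
        p = target_probs(prefix + accepted).get(tok, 0.0)
        if rng.uniform() < min(1.0, p / max(q, 1e-9)):
            accepted.append(tok)
        else:
            break                                  # reject; target model resamples here
    return accepted

# Toy usage with hand-crafted distributions over a 3-token vocabulary.
draft = lambda ctx: ("a", 0.9)
target = lambda ctx: {"a": 0.8, "b": 0.1, "c": 0.1}
print(draft_and_verify(draft, target, prefix=[], gamma=4))
```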
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4113. Geometric Autoencoder Priors for Bayesian Inversion: Learn First Observe Later
๐ฌ 14
โญ 5.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian Inverse Problems Physics Graph Machine Learning Generative Modelling
Uncertainty Quantification (UQ) is paramount for inference in engineering applications. A common inference task is to recover full-field information of physical systems from a small number of noisy observations, a usually highly ill-posed problem. Critically, engineering systems often have complicated and variable geometries prohibiting the use of standard Bayesian UQ. In this work, we introduce Geometric Autoencoders for Bayesian Inversion (GABI), a framework for learning geometry-aware generative models of physical responses that serve as highly informative geometry-conditioned priors for Bayesian inversion. Following a "learn first, observe later" paradigm, GABI distills information from large datasets of systems with varying geometries, without requiring knowledge of governing PDEs, boundary conditions, or observation processes, into a rich latent prior. At inference time, this prior is seamlessly combined with the likelihood of the specific observation process, yielding a geometry-adapted posterior distribution. Our proposed framework is architecture agnostic. A creative use of Approximate Bayesian Computation (ABC) sampling yields an efficient implementation that utilizes modern GPU hardware. We test our method on: steady-state heat over rectangular domains; Reynolds-Averaged Navier-Stokes (RANS) flow around airfoils; Helmholtz resonance and source localization on 3D car bodies; RANS airflow over terrain. We find the predictive accuracy to be comparable to deterministic supervised learning approaches in the restricted setting where supervised learning is applicable, and UQ to be well calibrated and robust on challenging problems with complex geometries. The method provides a flexible geometry-aware train-once-use-anywhere foundation model which is independent of any particular observation process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4114. PICABench: How Far are We from Physical Realistic Image Editing?
๐ฌ 17
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ image edit; benchmark; dataset
Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are key to generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4115. CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detections
๐ฌ 33
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Source-Free Domain Adaptation Object Detection Object-Centric Learning
Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into the DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and promoting domain-invariant adaptation. Experiments on five cross-domain object detection datasets demonstrate that our approach outperforms previous SF-DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. All code will be released later.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4116. ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
๐ฌ 20
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Models Controllable Generation Multi-Instance Generation Identity Preservation Attention Mechanisms
Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce **ContextGen**, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a **Contextual Layout Anchoring (CLA)** mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and **Identity Consistency Attention (ICA)**, a novel attention mechanism which leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce **IMIG-100K**, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4117. STEDiff: Revealing the Spatial and Temporal Redundancy of Backdoor Attacks in Text-to-Image Diffusion Models
๐ฌ 16
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Diffusion Models; Backdoor Attacks; Backdoor Defense; AI Security
Recently, diffusion models have been recognized as state-of-the-art models for image generation due to their ability to produce high-quality images. However, recent studies have shown that diffusion models are susceptible to backdoor attacks, where an attacker can activate hidden biases using a specific trigger pattern, causing the model to generate a predefined target. Fortunately, executing backdoor attacks is still challenging, as they typically require substantial time and memory to perform parameter-based fine-tuning. In this paper, we are the first to reveal the **spatio-temporal redundancy** in backdoor attacks on diffusion models. **Regarding spatial redundancy**, we observed the *enrichment phenomenon*, which reflects the abnormal gradient accumulation induced by backdoor injection. **Regarding temporal redundancy**, we observed a marginal effect associated with specific time steps, indicating that only a limited subset of time steps plays a critical role in backdoor injection. Building on these findings, we present a novel framework, *STEDiff*, comprising two key components: *STEBA* and *STEDF*. *STEBA* is a spatio-temporally efficient accelerated attack strategy that achieves up to **15.07×** speedup in backdoor injection while reducing GPU memory usage by **82%**. *STEDF* is a detection framework leveraging spatio-temporal features, by modeling the enrichment phenomenon in weights and anisotropy across time steps, which achieves a backdoor detection rate of up to **99.8%**. Our code is available at: [https://anonymous.4open.science/r/STEDiff-9E9F/](https://anonymous.4open.science/r/STEDiff-9E9F/).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4118. Real-Time Motion-Controllable Autoregressive Video Diffusion
๐ฌ 16
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Autoregressive Diffusion Controllable Video Generation Reinforcement Learning
Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Codes are available in Supplementary Materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4119. Instance-wise Adaptive Scheduling via Derivative-Free Meta-Learning
๐ฌ 21
โญ 5.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Scheduling Neural Combinatorial Optimization Meta-learning
Deep Reinforcement Learning has achieved remarkable progress in solving NP-hard scheduling problems. However, existing methods primarily focus on optimizing average performance over training instances, overlooking the core objective of solving each individual instance with high quality. While several instance-wise adaptation mechanisms have been proposed, they are test-time approaches only and cannot share knowledge across different adaptation tasks. Moreover, they largely rely on gradient-based optimization, which can be ineffective for combinatorial optimization problems. We address these issues by proposing an instance-wise meta-learning framework. It trains a meta model to acquire a generalizable initialization that effectively guides per-instance adaptation during inference, and overcomes the limitations of gradient-based methods by leveraging a derivative-free optimization scheme that is fully GPU parallelizable. Experimental results on representative scheduling problems demonstrate that our method consistently outperforms existing learning-based scheduling methods and instance-wise adaptation mechanisms under various task sizes and distributions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
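The per-instance adaptation idea above can be made concrete with a generic derivative-free search from a meta-learned initialization. The population-based update below is a plain evolution-strategy-style step under assumed `evaluate` and parameter-vector interfaces; it illustrates the setting, not the paper's exact optimizer.

```python
# Illustrative derivative-free adaptation of a meta-learned initialization to one
# scheduling instance. `evaluate(theta, instance)` returns a scalar cost (e.g. makespan,
# lower is better) and is a stand-in, not the paper's interface.
import numpy as np

def adapt_to_instance(theta_meta, instance, evaluate,
                      pop_size: int = 64, sigma: float = 0.05, steps: int = 20):
    theta = np.asarray(theta_meta, dtype=float).copy()
    best_theta, best_cost = theta.copy(), evaluate(theta, instance)
    for _ in range(steps):
        # Sample a population of perturbed candidates; evaluations are independent,
        # so in practice they can be rolled out in parallel on a GPU.
        candidates = theta + np.random.randn(pop_size, theta.size) * sigma
        costs = np.array([evaluate(c, instance) for c in candidates])
        if costs.min() < best_cost:
            best_cost, best_theta = costs.min(), candidates[costs.argmin()].copy()
        # Move the search center toward the elite fraction of the population.
        elite = candidates[np.argsort(costs)[: max(pop_size // 8, 1)]]
        theta = elite.mean(axis=0)
    return best_theta, best_cost
```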
4120. Weight Space Representation Learning on Diverse NeRF Architectures
๐ฌ 16
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ weight space learning representation learning metanetworks graph metanetworks neural fields neural radiance fields NeRF implicit neural representations INR
Neural Radiance Fields (NeRFs) have emerged as a groundbreaking paradigm for representing 3D objects and scenes by encoding shape and appearance information into the weights of a neural network. Recent studies have demonstrated that these weights can be used as input for frameworks designed to address deep learning tasks; however, such frameworks require NeRFs to adhere to a specific, predefined architecture. In this paper, we introduce the first framework capable of processing NeRFs with diverse architectures and performing inference on architectures unseen at training time. We achieve this by training a Graph Meta-Network within an unsupervised representation learning framework, and show that a contrastive objective is conducive to obtaining an architecture-agnostic latent space. In experiments conducted across 13 NeRF architectures belonging to three families (MLPs, tri-planes, and, for the first time, hash tables), our approach demonstrates robust performance in classification, retrieval, and language tasks involving multiple architectures, even unseen at training time, while also matching or exceeding the results of existing frameworks limited to single architectures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4121. There and Back Again: On the relation between Noise and Image Inversions in Diffusion Models
๐ฌ 19
โญ 5.00
๐ generative models
๐ท๏ธ diffusion models ddim inversion image interpolation
Diffusion Models achieve state-of-the-art performance in generating new samples but lack a low-dimensional latent space that encodes the data into editable features. Inversion-based methods address this by reversing the denoising trajectory, transferring images to their approximated starting noise. In this work, we thoroughly analyze this procedure and focus on the relation between the initial noise, the generated samples, and their corresponding latent encodings obtained through DDIM inversion. First, we show that latents exhibit structural patterns in the form of less diverse noise predicted for smooth image areas (e.g., plain sky). Through a series of analyses, we trace this issue to the first inversion steps, which fail to provide accurate and diverse noise. Consequently, the DDIM inversion space is notably less amenable to manipulation than the original noise space. We show that prior inversion methods do not fully resolve this issue, but our simple fix, where we replace the first DDIM inversion steps with a forward diffusion process, successfully decorrelates latent encodings and enables higher-quality edits and interpolations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
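The proposed fix lends itself to a short sketch: run the forward (noising) process for the first few steps, then continue with standard DDIM inversion. The `eps_model` interface and `alpha_bar` schedule below are assumptions, and step indexing is deliberately simplified.

```python
# Schematic hybrid inversion: forward noising for the first `switch_step` steps, then
# standard DDIM inversion. `eps_model(x, t)` and the `alpha_bar` schedule are assumed.
import torch

@torch.no_grad()
def hybrid_inversion(x0, eps_model, alpha_bar, switch_step: int = 5):
    """alpha_bar: 1-D tensor of cumulative noise-schedule products, index 0..T-1."""
    T = alpha_bar.shape[0]
    # 1) Forward diffusion up to `switch_step`: sample q(x_t | x_0) directly.
    ab = alpha_bar[switch_step - 1]
    x = ab.sqrt() * x0 + (1 - ab).sqrt() * torch.randn_like(x0)
    # 2) Standard DDIM inversion for the remaining steps (t -> t+1).
    for t in range(switch_step - 1, T - 1):
        eps = eps_model(x, t)
        x0_pred = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = alpha_bar[t + 1].sqrt() * x0_pred + (1 - alpha_bar[t + 1]).sqrt() * eps
    return x  # approximate starting noise x_T
```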
4122. DeepSADR: Deep Transfer Learning with Subsequence Interaction and Adaptive Readout for Cancer Drug Response Prediction
๐ฌ 19
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ adaptive readout; subsequence interaction; drug response; cancer patient
Cancer treatment efficacy exhibits high inter-patient heterogeneity due to genomic variations. While large-scale in vitro drug response data from cancer cell lines exist, predicting patient drug responses remains challenging due to genomic distribution shifts and the scarcity of clinical response data. Existing transfer learning methods primarily align global genomic features between cell lines and patients. However, they often ignore two critical aspects. First, drug response depends on specific drug substructures and genomic pathways. Second, drug response mechanisms differ between in vitro and in vivo settings due to factors such as the immune system and tumor microenvironment. To address these limitations, we propose DeepSADR, a novel deep transfer learning framework for enhanced drug response prediction based on subsequence interaction and adaptive readout. In particular, DeepSADR models drug responses as interpretable bipartite interaction graphs between drug substructures and enriched genomic pathways. Subsequently, a supervised graph autoencoder is designed to capture latent interactions between drugs and gene subsequences within these interaction graphs. In addition, DeepSADR treats the drug response process as a transferable domain. A Set Transformer-based adaptive readout (AR) function learns domain-invariant response representations, enabling effective knowledge transfer from abundant cell line data to scarce patient data. Extensive experiments on clinical patient cohorts demonstrate that DeepSADR significantly outperforms state-of-the-art methods, and ablation studies validate the effectiveness of each module.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4123. Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind
๐ฌ 35
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ AI for Research; Rebuttal Agent
Although AI has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and under-explored challenge. Rebuttal is a complex process of strategic communication under severe information asymmetry, not a simple technical debate. Current models fail because they only imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM). Specifically, the agent implements ToM through a Theory-of-Mind-Strategy-Response (TSR) pipeline, which models a reviewer's mental state, formulates a persuasion strategy, and then generates a strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale synthetic dataset created via a novel critique-and-refine pipeline. Our twofold training process begins with a Supervised Fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a Reinforcement Learning phase using a novel self-reward mechanism for scalable self-improvement without an external reward model. For a reliable and scalable automated evaluation, we develop Rebuttal-RM, a specialized evaluator trained on multi-source data of over 100K samples, whose scoring consistency with human preferences surpasses GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by 18.3\% and is competitive with advanced models such as o3 across both automated and human evaluations. Our code will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4124. VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian Splatting Physics-Driven 4D Interaction Scientific Discovery
The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to align with actual intrinsic dynamics; the other models intrinsic dynamics using neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLM-driven decoupled constitutive evolution strategy, where an LLM is prompted to act as a physics expert that generates and revises constitutive laws, with a built-in decoupling mechanism that substantially reduces the search complexity of the LLM. At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which utilizes visual simulation to evaluate the consistency between the generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4125. HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs
๐ฌ 23
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Knowledge Hypergraph Link Prediction Graph Neural Networks Foundation Models
Inductive link prediction with knowledge hypergraphs is the task of predicting missing hyperedges involving completely *novel entities* (i.e., nodes unseen during training). Existing methods for inductive link prediction with knowledge hypergraphs assume a fixed relational vocabulary and, as a result, cannot generalize to knowledge hypergraphs with *novel relation types* (i.e., relations unseen during training). Inspired by knowledge graph foundation models, we propose HYPER as a foundation model for link prediction, which can generalize to *any knowledge hypergraph*, including novel entities and novel relations. Importantly, HYPER can learn and transfer across different relation types of *varying arities*, by encoding the entities of each hyperedge along with their respective positions in the hyperedge. To evaluate HYPER, we construct 16 new inductive datasets from existing knowledge hypergraphs, covering a diverse range of relation types of varying arities. Empirically, HYPER consistently outperforms all existing methods in both node-only and node-and-relation inductive settings, showing strong generalization to unseen, higher-arity relational structures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4126. Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning
๐ฌ 11
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Reasoning
Majority voting has proven effective for closed-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, where "majority" is undefined. We introduce THINKMERGE, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. THINKMERGE integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
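A minimal sketch of the logit-averaging idea follows: K traces are decoded in lockstep, their next-token logits are averaged at every step, and the same token is committed to all traces. It assumes a Hugging Face-style causal LM and omits KV caching, stop handling, and the vLLM/SGLang integration mentioned above.

```python
# Minimal sketch of averaging next-token logits across K parallel traces at each
# synchronization step. `model`/`tokenizer` are assumed Hugging Face-style objects;
# no KV cache is used, so every step reruns the full forward pass (slow but simple).
import torch

@torch.no_grad()
def decode_with_logit_averaging(model, tokenizer, prompts, max_new_tokens=256):
    # K traces = K differently phrased prompts (or sampled prefixes) for one problem.
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    tokenizer.padding_side = "left"                        # keep the last position meaningful
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    ids, mask = batch["input_ids"], batch["attention_mask"]
    out_tokens = []
    for _ in range(max_new_tokens):
        logits = model(input_ids=ids, attention_mask=mask).logits[:, -1, :]  # (K, vocab)
        merged = logits.mean(dim=0)                        # average across the K traces
        next_id = int(merged.argmax())                     # greedy; top-p/top-k also apply
        if next_id == tokenizer.eos_token_id:
            break
        out_tokens.append(next_id)
        # Commit the same token to every trace so they remain synchronized.
        step = torch.full((ids.shape[0], 1), next_id, dtype=ids.dtype)
        ids = torch.cat([ids, step], dim=1)
        mask = torch.cat([mask, torch.ones_like(step)], dim=1)
    return tokenizer.decode(out_tokens)
```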
4127. Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations
๐ฌ 9
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ quantization large language models LLMs
Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the quantization error and empirically achieving less degradation in language modeling performance. Thirdly, we explore additional variations of block-wise quantization methods applied to LLMs through an experimental study on the importance of accurately representing zero and large-amplitude weights on the one hand, and optimization towards various error metrics on the other hand. Lastly, we introduce a mixed-precision quantization strategy dubbed outlier-preserving quantization (OPQ) to address the distributional mismatch induced by outlier weights in block-wise quantization. By storing outlier weights in 16-bit precision (OPQ) while applying BOF4-S, we achieve top performance among 4-bit block-wise quantization techniques w.r.t. perplexity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
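To make the setting concrete, the snippet below shows generic block-wise 4-bit quantization with absmax normalization and a 16-entry codebook; the uniform grid stands in for the optimized BOF4 levels, which are not reproduced here.

```python
# Generic block-wise 4-bit quantization with per-block absmax scaling and a 16-entry
# codebook. The uniform codebook is a stand-in for optimized (e.g. BOF4) levels.
import numpy as np

def quantize_blockwise(weights: np.ndarray, block_size: int = 64):
    """Assumes weights.size is a multiple of block_size."""
    codebook = np.linspace(-1.0, 1.0, 16)                  # stand-in for optimized levels
    flat = weights.reshape(-1, block_size)
    scales = np.abs(flat).max(axis=1, keepdims=True)       # per-block absolute maximum
    normed = flat / np.where(scales == 0, 1.0, scales)     # values now in [-1, 1]
    idx = np.abs(normed[..., None] - codebook).argmin(-1)  # nearest codebook entry
    return idx.astype(np.uint8), scales, codebook

def dequantize_blockwise(idx, scales, codebook, shape):
    return (codebook[idx] * scales).reshape(shape)
```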
4128. Overshoot and Shrinkage in Classifier-Free Guidance: From Theory to Practice
๐ฌ 23
โญ 5.00
๐ generative models
๐ท๏ธ Diffusion flow-matching classifier-free guidance
Classifier-Free Guidance (CFG) is widely used in diffusion and flow-based generative models for high-quality conditional generation, yet its theoretical properties remain incompletely understood. By connecting CFG to the high-dimensional framework of diffusion regimes, we show that in sufficiently high dimensions it reproduces the correct target distribution, a "blessing-of-dimensionality" result. Leveraging this theoretical framework, we analyze how the well-known artifacts of mean overshoot and variance shrinkage emerge in lower dimensions, characterizing how they become more pronounced as dimensionality decreases. Building on these insights, we propose a simple nonlinear extension of CFG, proving that it mitigates both effects while preserving CFG's practical benefits. Finally, we validate our approach through numerical simulations on Gaussian mixtures and real-world experiments on diffusion and flow-matching state-of-the-art class-conditional and text-to-image models, demonstrating continuous improvements in sample quality, diversity, and consistency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
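For reference, the standard linear CFG combination that the entry analyzes looks like the following; the paper's nonlinear extension is not reproduced.

```python
# Standard (linear) classifier-free guidance step, shown for reference only.
import torch

def cfg_prediction(model, x_t, t, cond, guidance_scale: float = 5.0):
    """Combine conditional and unconditional predictions (noise eps or flow velocity)."""
    pred_cond = model(x_t, t, cond)
    pred_uncond = model(x_t, t, None)        # null / dropped condition
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```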
4129. AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning
๐ฌ 21
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Alignment Reasoning Reinforcement Learning with Verifiable Reward
Large language models (LLMs), despite possessing latent safety understanding from their vast pretraining data, remain vulnerable to generating harmful content and exhibit issues such as over-refusal and utility degradation after safety alignment. Current safety alignment methods often result in superficial refusal shortcuts or rely on intensive supervision for reasoning-based approaches, failing to fully leverage the model's intrinsic safety self-awareness. We propose \textbf{AlphaAlign}, a simple yet effective pure reinforcement learning (RL) framework with verifiable safety reward designed to incentivize this latent safety awareness through proactive safety reasoning. AlphaAlign employs a dual-reward system: a verifiable safety reward encourages correctly formatted and explicitly justified refusals for harmful queries while penalizing over-refusals, and a normalized helpfulness reward guides high-quality responses to benign inputs. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data. AlphaAlign demonstrates three key advantages: (1) Simplicity and efficiency, requiring only binary prompt safety labels and minimal RL steps for substantial improvements. (2) Breaking the safety-utility trade-off, by enhancing refusal of harmful content and reducing over-refusals, while simultaneously maintaining or even improving general task performance and robustness to unseen jailbreaks. (3) Deep alignment, fostering proactive safety reasoning that generates explicit safety rationales rather than relying on shallow refusal patterns. Our codes are available at \url{https://anonymous.4open.science/r/AlphaAlign-D5B0}
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4130. CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning
๐ฌ 4
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Unsupervised 3D Representation Learning; Fusion Perception; Autonomous Driving
Unsupervised 3D representation learning reduces the burden of labeling multimodal 3D data for fusion perception tasks. Among different pre-training paradigms, differentiable-rendering-based methods have shown the most promise. However, existing works conduct pre-training separately for each modality due to the computational cost of processing large point clouds together with images. As such, the mutual benefit of high-level semantics (from images) and 3D structure (from point clouds) has not been exploited. To address this gap, we propose a joint unsupervised differentiable-rendering-based pre-training method for images and point clouds, termed CLAP, short for Curvature sampLing and leArnable Prototype. Specifically, our method overcomes the computational hurdle by using Curvature Sampling to select the more informative points/pixels for pre-training. To uncover the performance benefits brought by their complementarity, we propose to use learnable prototypes to represent parts of the 3D scenes in a common feature space and an Expectation-Maximization training scheme to associate embeddings of each modality to prototypes. We further propose a swapping prediction loss that explores their interplay through prototypes along with a Gram Matrix Regularization term to maintain training stability. Experiments on the NuScenes and Waymo datasets show that CLAP achieves up to 100% larger performance gains compared to previous SOTA pre-training methods. Codes and models will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4131. The Seismic Wavefield Common Task Framework
๐ฌ 15
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Seismology Scientific Machine Learning Common Task Framework Seismic Wavefields Geophysics
Seismology faces fundamental challenges in state forecasting and reconstruction (e.g., earthquake early warning and ground motion prediction) and managing the parametric variability of source locations, mechanisms, and Earth models (e.g., subsurface structure and topography effects). Addressing these with simulations is hindered by their massive scale, both in synthetic data volumes and numerical complexity, while real-data efforts are constrained by models that inadequately reflect the Earth's complexity and by sparse sensor measurements from the field. Recent machine learning (ML) efforts offer promise, but progress is obscured by a lack of proper characterization, fair reporting, and rigorous comparisons. To address this, we introduce a Common Task Framework (CTF) for ML for seismic wavefields, demonstrated here on three distinct wavefield datasets. Our CTF features a curated set of datasets at various scales (global, crustal, and local) and task-specific metrics spanning forecasting, reconstruction, and generalization under realistic constraints such as noise and limited data. Inspired by CTFs in fields like natural language processing, this framework provides a structured and rigorous foundation for head-to-head algorithm evaluation. We evaluate various methods for reconstructing seismic wavefields from sparse sensor measurements, with results illustrating the CTF's utility in revealing strengths, limitations, and suitability for specific problem classes. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigor and reproducibility in scientific ML.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4132. H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows
๐ฌ 24
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ affordance 3d vision generative model
Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce **H2OFlow**, a novel framework that comprehensively learns 3D HOI affordances ---encompassing contact, orientation, and spatial occupancy--- using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordances. Interactive 3D visualization, code, and data will be available at this [website](https://sites.google.com/view/h2oflow/home).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4133. ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation
๐ฌ 15
โญ 5.00
๐ interpretability and explainable AI
๐ท๏ธ Interpretability AI Safety Prompt Optimisation Sparse Autoencoders Elicitation Feature Visualisation
Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases.
We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as *context modification* and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives.
We develop two novel enhancements to Evolutionary Prompt Optimisation (EPO): LLM-assistance and diffusion model inpainting, achieving state-of-the-art performance in balancing elicitation and fluency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4134. MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs
๐ฌ 15
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Knowledge Editing; Mixture-of-Experts; Large Language Models
Knowledge editing (KE) is crucial for making precise modifications to factual knowledge within large language models (LLMs). Existing KE methods, however, are primarily designed for dense architectures, limiting their applicability to the increasingly popular sparse Mixture-of-Experts (MoE) models that power modern scalable LLMs. While MoEs offer remarkable efficiency and capacity scaling, their unique structure introduces new challenges for KE. Naively adapting dense-model editors to MoEs is not only computationally expensive but also induces routing distribution shifts that degrade model stability and consistency. To address these challenges, we introduce MoEEdit, the first systematic framework for routing-stable knowledge editing in MoE LLMs. Our approach reparameterizes expert updates through per-expert null-space projections, ensuring router inputs remain invariant to suppress these shifts, and solves the resulting block-structured optimization with an efficient block coordinate descent (BCD) solver. Experiments demonstrate that MoEEdit achieves state-of-the-art efficacy and generalization, while maintaining high specificity, routing stability, and superior computational and memory efficiency. Our work establishes a robust foundation for scalable and precise knowledge editing in modern sparse LLMs by highlighting the necessity of routing-stable interventions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
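The routing-stability idea can be illustrated with a generic null-space projection: project a raw weight update so the layer's outputs are unchanged on a set of preserved inputs (stand-ins for the router inputs MoEEdit keeps invariant). This is a textbook construction under assumed shapes, not the paper's block coordinate descent solver.

```python
# Illustrative null-space projection of a weight update: the projected update leaves the
# layer's outputs unchanged on the preserved inputs. Generic construction, not MoEEdit's
# full BCD solver.
import torch

def project_update_to_null_space(delta_w: torch.Tensor, preserved_x: torch.Tensor,
                                 rank_tol: float = 1e-6) -> torch.Tensor:
    """delta_w: (out, in) raw update; preserved_x: (n, in) inputs to keep invariant."""
    # Orthonormal basis of the span of the preserved inputs.
    u, s, _ = torch.linalg.svd(preserved_x.T, full_matrices=False)    # u: (in, r)
    basis = u[:, s > rank_tol * s.max()]
    # Projector onto the orthogonal complement of that span.
    p_null = torch.eye(delta_w.shape[1]) - basis @ basis.T
    return delta_w @ p_null     # (W + projected delta) @ x == W @ x for preserved x

# Sanity check (hypothetical shapes): outputs on preserved inputs are unchanged.
# x = torch.randn(8, 32); dw = torch.randn(16, 32)
# assert (project_update_to_null_space(dw, x) @ x.T).abs().max() < 1e-4
```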
4135. Inlier-Centric Post-Training Quantization for Object Detection Models
๐ฌ 21
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Quantization Object Detection Efficiency
Object detection is pivotal in robotics, but its immense computational demands make the models slow and power-hungry, underscoring the need for quantization. However, when quantization is applied in practice, cluttered backgrounds and irregular object morphologies cause redundant activations (or anomalies) that inflate precision requirements and waste bit capacity, hindering the preservation of informative features. Moreover, without a clear criterion for defining such anomalies, attempts to exclude or mitigate them often distort useful features. To address this problem, we present InlierQ, an inlier-centric post-training quantization approach that establishes a general criterion to differentiate anomalies from informative inliers. Specifically, InlierQ computes gradient-aware volume saliency scores, classifies each volume as an inlier or outlier, and fits a posterior distribution over these scores using the Expectation-Maximization (EM) algorithm. This design effectively suppresses the influence of outliers while preserving informative inlier features. InlierQ is a label-free, drop-in method and uses only 64 samples for calibration. Experiments on the COCO and nuScenes benchmarks demonstrate consistent reductions in quantization errors across camera-based (2D and 3D) and LiDAR-based (3D) object detection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
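The inlier/outlier separation can be sketched as a two-component Gaussian mixture fitted by EM over the saliency scores; the gradient-aware scores themselves are assumed to be given, and the mixture form is an illustrative choice.

```python
# Minimal EM fit of a two-component 1-D Gaussian mixture over saliency scores, as an
# illustration of posterior-based inlier/outlier separation. Scores are assumed given.
import numpy as np

def fit_two_gaussians(scores: np.ndarray, iters: int = 50):
    mu = np.array([scores.min(), scores.max()], dtype=float)
    var = np.full(2, scores.var() + 1e-8)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for every score.
        lik = pi * np.exp(-0.5 * (scores[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and variances.
        nk = resp.sum(axis=0)
        pi = nk / len(scores)
        mu = (resp * scores[:, None]).sum(axis=0) / nk
        var = (resp * (scores[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-8
    return pi, mu, var, resp

# Usage: treat a volume as an inlier if the low-mean component is more responsible for it.
# pi, mu, var, resp = fit_two_gaussians(scores); inlier_mask = resp[:, np.argmin(mu)] > 0.5
```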
4136. Temporal Graph Thumbnail: Robust Representation Learning with Global Evolutionary Skeleton
๐ฌ 33
โญ 5.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Dynamic Graph Neural Network Global Evolution Von Neumann Entropy Robust Representation Learning
Temporal graphs are commonly employed as conceptual models for capturing time-evolving interactions in real-world systems. Representation learning on such non-Euclidean data typically depends on aggregating information from neighbors, and the presence of temporal dynamics further complicates this process. However, neighbors often contain noisy information in practice, making knowledge propagation unreliable and potentially leading to model failure. Although existing methods employ adaptive spatiotemporal neighbor sampling strategies or temporal dependency modeling frameworks to enhance model robustness, their constrained sampling scope limits handling of severe noise and long-term dependencies. This limitation can be attributed to a fundamental cause: neglecting global evolution inherently overlooks the temporal regularities encoded in continuous dynamics. To address this, we propose the **T**emporal **G**raph **T**humbnail (**TGT**), encapsulating a temporal graph's global evolutionary skeleton as a thumbnail to characterize temporal regularities and enhance model robustness. Specifically, we model the thumbnail by leveraging von Neumann graph entropy and node mutual information to extract the essential evolutionary skeleton from the raw temporal graph, and subsequently use it to guide optimization for model learning. In addition to rigorous theoretical derivation, extensive experiments demonstrate that TGT achieves superior capability and robustness compared to baselines, particularly in rapidly evolving and noisy environments. The code is available at https://anonymous.4open.science/r/TGT-BDF2.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
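One ingredient above, von Neumann graph entropy, has a compact form: treat the trace-normalized graph Laplacian as a density matrix and take the Shannon entropy of its eigenvalues. A minimal version (assuming a symmetric adjacency matrix) is sketched below.

```python
# Von Neumann graph entropy of a symmetric adjacency matrix: the trace-normalized
# Laplacian acts as a density matrix, and the entropy is taken over its eigenvalues.
import numpy as np

def von_neumann_entropy(adj: np.ndarray) -> float:
    deg = adj.sum(axis=1)
    laplacian = np.diag(deg) - adj
    rho = laplacian / max(laplacian.trace(), 1e-12)   # density matrix with unit trace
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]                # drop numerically zero eigenvalues
    return float(-(eigvals * np.log(eigvals)).sum())
```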
4137. PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
๐ฌ 25
โญ 5.00
๐ generative models
๐ท๏ธ Aesthetic Poster Generation Unified Framework Specific Large-scale Data
Generating aesthetic posters is more challenging than generating simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised finetuning on HQ-Poster-100K; (iii) aesthetic-text reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Across extensive experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4138. FlexProtein: Joint Sequence and Structure Pretraining for Protein Modeling
๐ฌ 19
โญ 5.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein Design Protein Foundation Model Diffusion
Protein foundation models have advanced rapidly, with most approaches falling into two dominant paradigms. Sequence-only language models (e.g., ESM-2) capture sequence semantics at scale but lack structural grounding. MSA-based predictors (e.g., AlphaFold 2/3) achieve accurate folding by exploiting evolutionary couplings, but their reliance on homologous sequences makes them less reliable in highly mutated or alignment-sparse regimes. We present FlexProtein, a pretrained protein model that jointly learns from amino acid sequences and three-dimensional structures. Our pretraining strategy combines masked language modeling with diffusion-based denoising, enabling bidirectional sequence-structure learning without requiring MSAs. Trained on both experimentally resolved structures and AlphaFold 2 predictions, FlexProtein captures global folds as well as flexible conformations critical for biological function. Evaluated across diverse tasks spanning interface design, intermolecular interaction prediction, and protein function prediction, FlexProtein establishes new state-of-the-art performance on 12 different tasks, with particularly strong gains in mutation-rich settings where MSA-based methods often struggle.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4139. Language Models are Injective and Hence Invertible
๐ฌ 22
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ transformers language models invertibility injectivity inversion privacy
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a modelโs representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4140. Smarter Not Harder: Generative Process Evaluation with Intrinsic-Signal Driving and Ability-Adaptive Reward Shaping
๐ฌ 35
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Generative Process Reward Model Math Reasoning Large Reasoning Model
Large reasoning models (LRMs) have shown strong performance in complex mathematical reasoning when optimized via reinforcement learning (RL). However, conventional outcome-only reward provides sparse feedback, leading to inefficient optimization. In this work, we investigate whether generative process reward models (GenPRMs) can accelerate RL training of LRMs by improving the utilization of reasoning trajectories. We first analyze critical limitations in existing GenPRMs, including their heavy reliance on reasoning ability during correctness judgment, and suppression of exploration as well as vulnerability to reward hacking during reward assignment. To address these limitations, we propose a novel \textbf{intrinsic-signal-driven evaluation} mechanism, which judges reasoning steps using semantic cues from the solution, thus mitigating the heavy dependence on the GenPRM's own reasoning ability. Furthermore, we (i) adopt \textbf{thought-level rewarding granularity} to alleviate over-dense step rewards, and (ii) design an \textbf{ability-adaptive reward formulation} that dynamically balances exploration and exploitation while keeping the optimization targeted on key tokens to mitigate reward hacking. We integrate these innovations into the process reward-based GRPO, resulting in the proposed \textbf{TP-GRPO} algorithm. Experiments on LRMs with 1.5B and 7B parameters show that TP-GRPO achieves larger improvements while using significantly fewer training samples, and more analyses further confirm the effectiveness of our proposed process evaluation mechanism.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4141. Not All Bits Are Equal: How Model Scale Changes Memory-Optimal Reasoning
๐ฌ 23
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models reasoning efficiency model compression
While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where KV cache rather than model size can dominate memory.
Through systematic experiments on mathematical, code generation, and knowledge-intensive reasoning tasks, we find a scale-dependent trade-off: models whose effective size is below that of a 4B-parameter model at 8 bits achieve better accuracy by allocating memory to larger weights rather than to longer generation, while larger models benefit from the opposite strategy.
This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization.
Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for large ones, maximize test-time compute.
Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies than those established for non-reasoning ones.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4142. Scaling Agent Learning via Experience Synthesis
๐ฌ 29
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Agent Reinforcement Learning Data Synthesis
While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 150%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym achieves an additional 64.5% performance gain while using no more than 10% of real-world interactions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4143. HiddenEcho: Mitigating Noise Amplification in Differentially Private LLMs with Hidden-State Correction
๐ฌ 30
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Privacy Preservation Denoise
The rise of large language models (LLMs) has driven the adoption of Model-as-a-Service (MaaS). However, transmitting raw text to servers raises critical privacy concerns. Existing approaches employ deep neural networks (DNNs) or differential privacy (DP) to perturb inputs. Yet, these approaches suffer notable limitations: DNN-based methods often require task-specific pre-training, and conventional DP techniques, though privacy-preserving, suffer from noise amplification as perturbed inputs propagate through the deep transformer layers, leading to significant degradation in downstream task performance. To alleviate this, we propose HIDDENECHO, an end-to-end framework with client noise correction, where hidden states are sent from the server to the client and refined by a lightweight module using both embeddings and intermediate representations. HIDDENECHO suppresses inter-layer noise amplification without pretraining, effectively preserving task-relevant signals under DP constraints. To further reduce communication, HIDDENECHO incorporates gradient-based hidden layer selection and information bottleneck compression, reducing communication cost while preserving essential task information. Experiments across text classification and generation tasks demonstrate that HIDDENECHO achieves up to 46.89\% performance improvement over DP baselines, over 85\% communication reduction, and up to 72.52\% faster training compared to existing denoising approaches, establishing a new privacy-utility trade-off for privatized LLMs. Codes are available at https://anonymous.4open.science/r/hidden-echo.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4144. Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs
๐ฌ 17
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models (LLMs) Citation Networks Graph Neural Networks (GNNs)
Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers ($\approx$ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a random forest (RF) on graph-level aggregates and Graph Neural Networks with node features. Structure alone barely separates GPT from ground truth (RF accuracy $\approx$ 0.60) despite cleanly rejecting the random baseline ($\approx$ 0.89--0.92). By contrast, embeddings sharply increase separability: RF on aggregated embeddings reaches $\approx$ 0.83, and GNNs with embedding node features achieve 93\% test accuracy on GPT vs. ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs. Claude $\approx 0.77$ and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology, but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4145. Can Vision-Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective.
๐ฌ 14
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Vision Language Models; Design aesthetics
Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision-language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design regions. Together, our work establishes the first systematic framework for aesthetic quality assessment in graphic design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4146. Self-Augmented Visual Contrastive Decoding
๐ฌ 16
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large vision language models large language models contrastive decoding multimodal learning
Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adjusts the next-token candidate set size based on output sparsity, utilizing the full information in the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for effective generation in LVLMs. The source code will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
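A schematic of the contrastive step follows: contrast logits from the original image with logits from a query-aligned augmented view, then restrict sampling to a plausible candidate set. The fixed plausibility threshold below stands in for the paper's sparsity-adaptive rule, and the augmented view is assumed to come from the self-augmentation prompt.

```python
# Schematic visual-contrastive-decoding step. The fixed `beta` plausibility cutoff is a
# simplification of the sparsity-adaptive thresholding described above.
import torch

def contrastive_next_token_logits(logits_orig: torch.Tensor, logits_aug: torch.Tensor,
                                  alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    # Amplify what the original view supports relative to the augmented view.
    contrasted = (1 + alpha) * logits_orig - alpha * logits_aug
    # Plausibility constraint: keep tokens whose original probability is at least
    # `beta` times the maximum probability; mask out the rest.
    probs = logits_orig.softmax(dim=-1)
    keep = probs >= beta * probs.max()
    return contrasted.masked_fill(~keep, float("-inf"))
```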
4147. On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime
๐ฌ 9
โญ 5.00
๐ learning theory
๐ท๏ธ spectral bias preconditioned gradient descent grokking optimization dynamics neural tangent kernel higher-order methods
Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD) methods, such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning grokking hypothesis, we study how PGD can be used to reduce delays associated with grokking. Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
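As one concrete instance of the preconditioning discussed above, a damped Gauss-Newton (Levenberg-Marquardt) step for a least-squares model looks like the following; the residual and Jacobian callables are assumptions.

```python
# Illustrative damped Gauss-Newton / Levenberg-Marquardt step for a least-squares model,
# one instance of preconditioned gradient descent. Interfaces are assumed.
import numpy as np

def gauss_newton_step(params, residual_fn, jacobian_fn, damping: float = 1e-3):
    """residual_fn(params) -> r of shape (n,);  jacobian_fn(params) -> J of shape (n, p)."""
    r = residual_fn(params)
    J = jacobian_fn(params)
    # Precondition the gradient J^T r with the damped Gauss-Newton matrix J^T J + lambda*I.
    H = J.T @ J + damping * np.eye(J.shape[1])
    return params - np.linalg.solve(H, J.T @ r)
```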
4148. SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation
๐ฌ 34
โญ 5.00
๐ generative models
๐ท๏ธ watermarks diffusion models marking computer vision image generation
We propose SERUM: an intriguingly simple yet highly effective method for marking images generated by diffusion models (DMs). We only add a unique watermark noise to the initial diffusion generation noise and train a lightweight detector to identify watermarked images, simplifying and unifying the strengths of prior approaches. SERUM provides robustness against any image augmentations or watermark removal attacks and is extremely efficient, all while maintaining negligible impact on image quality. In contrast to prior approaches, which are often only resilient to limited perturbations and incur significant training, injection, and detection costs, our SERUM achieves remarkable performance, with the highest true positive rate (TPR) at a 1% false positive rate (FPR) in most scenarios, along with fast injection and detection and low detector training overhead. Its decoupled architecture also seamlessly supports multiple users by embedding individualized watermarks with little interference between the marks. Overall, our method provides a practical solution to mark outputs from DMs and to reliably distinguish generated from natural images.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4149. KLAS: Using Similarity to Stitch Neural Networks for an Improved Accuracy-Efficiency Tradeoff
๐ฌ 37
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ representation learning neural networks deep learning
Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget.
Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space.
Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network.
Such networks provide a pool of deployment options along the accuracy-efficiency spectrum.
However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations.
We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the _similarity_ between pretrained models being stitched.
To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations.
KLAS identifies the most promising stitches from the $\mathcal{O}(n^k)$ possibilities for $k$ pretrained models of depth $n$.
Through comprehensive experiments, we demonstrate that KLAS produces an improved accuracy-efficiency curve of stitched models at the same cost as baselines.
KLAS achieves up to $1.21\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\times$ reduction in FLOPs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
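The similarity-driven selection can be sketched as scoring candidate stitch points by KL divergence between intermediate representations of the two models. Turning activations into distributions with a per-sample softmax, as below, is one simple choice; the exact statistic KLAS uses may differ.

```python
# Sketch of ranking candidate stitch points by KL divergence between intermediate
# representations. The per-sample softmax over feature dimensions is an illustrative
# way to obtain distributions from activations.
import torch
import torch.nn.functional as F

def stitch_score(acts_a: torch.Tensor, acts_b: torch.Tensor) -> float:
    """acts_*: (batch, dim) activations from a candidate layer of each pretrained model."""
    p = F.log_softmax(acts_a, dim=-1)
    q = F.log_softmax(acts_b, dim=-1)
    # Lower KL(p || q) -> more similar representations -> more promising stitch point.
    return F.kl_div(q, p, log_target=True, reduction="batchmean").item()

def best_stitch(pairs):
    """pairs: list of ((layer_a, layer_b), acts_a, acts_b); return the most similar pair."""
    return min(pairs, key=lambda t: stitch_score(t[1], t[2]))[0]
```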
4150. Measuring and Mitigating Rapport Bias of Large Language Models under Multi-Agent Social Interactions
๐ฌ 28
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multi-Agent Systems (MAS) Social Influence & Trust Formation
Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. While prior work has largely focused on conformity bias, we broaden the scope to examine how LLMs build rapport from previous interactions, resist misinformation, and integrate peer input during collaboration, which are key factors for achieving collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how rapport, peer action, and self-confidence influence decisions. To mitigate susceptibility to such social influence, we evaluate prompting, supervised fine-tuning, and reinforcement learning using Group Relative Policy Optimization (GRPO) across multiple models. Our results show that model size plays a central role in moderating susceptibility to social influence: larger models exhibit stronger resilience and benefit from prompting-based mitigation, whereas smaller models are more vulnerable. For the latter, carefully configured GRPO training improves both robustness and overall performance. Our code and datasets are available at: https://anonymous.4open.science/r/KAIROS-4F71
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4151. HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities
๐ฌ 10
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ computer use agents llms evaluation
Web applications are prime targets for cyberattacks due to their role as entry points to vital services and sensitive data repositories. Traditional penetration testing is expensive and requires specialized expertise, creating scalability challenges for securing the expanding web ecosystem. While language model agents have shown promise in certain cybersecurity tasks, modern web applications require visual understanding of complex user interfaces, dynamic content rendering, and multi-step interactive workflows that only computer-use agents (CUAs) can handle. Despite CUAs' demonstrated capabilities in web browsing and visual task automation, their potential to discover and exploit web application vulnerabilities through graphical interfaces remains unknown.
We introduce HackWorld, the first evaluation framework for systematically assessing CUAs' capabilities in exploiting web application vulnerabilities through visual interaction. Unlike existing benchmarks using sanitized environments, HackWorld exposes CUAs to 36 curated applications spanning 11 frameworks and 7 languages, containing realistic vulnerabilities including injection flaws, authentication bypasses, and unsafe input handling. Our framework directly evaluates CUAs' ability to discover and exploit these vulnerabilities using Capture-the-Flag (CTF) methodology while navigating complex web interfaces.
Evaluation of state-of-the-art CUAs reveals exploitation rates below 12%, with agents struggling to plan multi-step attacks and to use security tools effectively.
Our results expose CUAs' limited cybersecurity skills when operating on vulnerable web applications, opening future research directions on developing security-aware CUAs for vulnerability detection and exploitation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4152. Emergent Coordination in Multi-Agent Language Models
๐ฌ 21
โญ 5.00
๐ generative models
๐ท๏ธ multi-agent systems LLMs information decomposition emergence collective intelligence
When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test---in a purely data-driven way---whether multi-agent systems show signs of higher-order structure. This information decomposition lets us measure whether dynamical emergence is present in multi-agent LLM systems, localize it, and distinguish spurious temporal coupling from performance-relevant cross-agent synergy. We implement both a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time-delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and only minimal group-level feedback with three randomized interventions. Groups in the control condition exhibit strong temporal synergy but little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity-linked differentiation. Combining personas with an instruction to ``think about what other agents might do'' shows identity-linked differentiation and goal-directed complementarity across agents. Taken together, our framework establishes that multi-agent LLM systems can be steered with prompt design from mere aggregates to higher-order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination-free baselines or temporal dynamics alone. Without attributing human-like cognition to the agents, the patterns of interaction we observe mirror well-established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.
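The framework's core quantity is time-delayed mutual information between system states. As a rough illustration of the kind of statistic involved (not the paper's full partial information decomposition), the sketch below estimates TDMI for discretized agent guesses with a plug-in joint-histogram estimator; the toy dynamics, the delay, and the choice to summarize the group state by the mean guess are illustrative assumptions.

```python
import numpy as np

def plugin_mutual_information(x, y, n_bins):
    """Plug-in MI estimate (in nats) from a joint histogram of two series."""
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def time_delayed_mi(series, delay=1, n_bins=10):
    """MI between the system state at time t and at time t + delay.

    `series` is a (T, n_agents) array of discretized guesses; the joint state
    is summarized by the agents' mean guess, a simplifying assumption.
    """
    state = series.mean(axis=1)
    return plugin_mutual_information(state[:-delay], state[delay:], n_bins)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, n_agents = 500, 4
    guesses = np.cumsum(rng.integers(-1, 2, size=(T, n_agents)), axis=0)  # toy dynamics
    print("TDMI (nats):", time_delayed_mi(guesses, delay=1, n_bins=12))
```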
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4153. Toward Efficient Exploration by Large Language Model Agents
๐ฌ 15
โญ 5.00
๐ reinforcement learning
๐ท๏ธ Exploration Large Language Models Bayesian RL
A burgeoning area within reinforcement learning (RL) is the design of sequential decision-making agents centered around large language models (LLMs). While autonomous decision-making agents powered by modern LLMs could facilitate numerous real-world applications, such successes demand agents that are capable of data-efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in-context learning to coax LLMs into implicitly imitating an RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically-efficient exploration is already well-studied. We offer empirical results demonstrating how our LLM-based implementation of a known, data-efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.
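For readers unfamiliar with the underlying algorithm, here is a minimal tabular Posterior Sampling for Reinforcement Learning (PSRL) loop on a toy MDP with known rewards and a Dirichlet posterior over transitions. This is the classical loop the paper builds on; the LLM-based, natural-language implementation is not attempted here, and the toy MDP is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, episodes = 5, 2, 10, 200

# Toy ground-truth MDP: random transitions, reward 1 only in the last state.
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = np.zeros((n_states, n_actions)); R[n_states - 1, :] = 1.0

# Dirichlet posterior over transition probabilities (one count vector per (s, a)).
counts = np.ones((n_states, n_actions, n_states))

def solve_finite_horizon(P, R, H):
    """Backward induction; returns the greedy policy for each step."""
    V = np.zeros(n_states)
    policy = np.zeros((H, n_states), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                      # (S, A) action values
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for ep in range(episodes):
    # 1) Sample a plausible MDP from the posterior.
    P_sample = np.array([[rng.dirichlet(counts[s, a]) for a in range(n_actions)]
                         for s in range(n_states)])
    # 2) Act greedily w.r.t. the sampled MDP for one episode, updating counts.
    policy = solve_finite_horizon(P_sample, R, horizon)
    s = 0
    for h in range(horizon):
        a = policy[h, s]
        s_next = rng.choice(n_states, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next

print("visits to the rewarding state:", counts[:, :, -1].sum())
```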
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4154. MoGen: Detailed Neuronal Morphology Generation via Point Cloud Flow Matching
๐ฌ 12
โญ 5.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ neuroscience connectomics neuron reconstruction generative modelling point clouds flow matching
Biological neurons come in many shapes. High-fidelity generative modeling of their varied morphologies is challenging yet underexplored in neuroscience, and crucial for the subfield of connectomics. We introduce MoGen (Neuronal Morphology Generation), a flow matching model to generate high-resolution 3D point clouds of mouse cortex axon and dendrite fragments. This is enabled by an adaptation that injects local geometric context into a scalable latent transformer backbone, allowing for the generation of high-fidelity, realistic samples. To assess MoGen's generation quality, we propose a dedicated evaluation suite with interpretable geometric and topological features tailored to neuronal structures that we validate in a user study. MoGen's practical utility is showcased through controllable generation for visualization via smooth interpolation and a direct downstream application: we augment the training set of a shape plausibility classifier from a production connectomics neuron reconstruction pipeline with millions of generated samples, thereby improving classifier accuracy and reducing the number of remaining split and merge errors by 4.4%. We estimate this can reduce manual proofreading labor by over 157 person-years for reconstruction of a full mouse brain.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4155. Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in LLMs
๐ฌ 17
โญ 5.00
๐ datasets and benchmarks
๐ท๏ธ Large Language Models Benchmarking and Evaluation Time-Sensitive Question-Answering
Facts change over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. Although factual Time-Sensitive Question-Answering (TSQA) tasks have been widely developed, existing benchmarks often face manual bottlenecks that limit scalable and comprehensive TSQA evaluation. To address this issue, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques, such as temporal functional dependencies, temporal SQL, and temporal joins. We also introduce a new evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy for a more fine-grained TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing current TSQA evaluation approaches that largely center on Wikipedia/Wikidata by enabling LLM evaluation on application-specific data.
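To make the database angle concrete, here is a toy illustration (not TDBench's actual schema or pipeline) of deriving a time-sensitive QA pair from a valid-time table with a temporal point query in SQLite; the table, column names, and question template are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ceo_history (
    company    TEXT,
    ceo        TEXT,
    valid_from TEXT,   -- inclusive, ISO dates
    valid_to   TEXT    -- exclusive
);
INSERT INTO ceo_history VALUES
    ('AcmeCorp', 'A. Smith', '2010-01-01', '2017-06-30'),
    ('AcmeCorp', 'B. Jones', '2017-06-30', '2023-02-01'),
    ('AcmeCorp', 'C. Lee',   '2023-02-01', '9999-12-31');
""")

def tsqa_pair(company, as_of):
    """Builds one time-sensitive QA pair via a valid-time point query."""
    row = conn.execute(
        "SELECT ceo FROM ceo_history "
        "WHERE company = ? AND valid_from <= ? AND ? < valid_to",
        (company, as_of, as_of),
    ).fetchone()
    question = f"Who was the CEO of {company} on {as_of}?"
    return question, row[0]

print(tsqa_pair("AcmeCorp", "2019-05-01"))   # ('Who was the CEO ...', 'B. Jones')
```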
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4156. Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs
๐ฌ 22
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Reasoning Generalization Reinforcement Learning Multilingual
Enhancing the complex reasoning capabilities of Large Language Models (LLMs) has attracted widespread attention. While reinforcement learning (RL) has shown superior performance for improving complex reasoning, its impact on cross-lingual generalization compared to Supervised Fine-Tuning (SFT) remains unexplored. We present the first systematic investigation into cross-lingual reasoning generalization of RL and SFT. Using Qwen2.5-3B-Base as our foundation model, we conduct experiments on diverse multilingual reasoning benchmarks, including math reasoning, commonsense reasoning, and scientific reasoning. Our investigation yields two significant findings: (1) Tuning with RL not only achieves higher accuracy but also demonstrates substantially stronger cross-lingual generalization capabilities compared to SFT. (2) RL training on non-English data yields better overall performance and generalization than training on English data, which is not observed with SFT. Furthermore, through comprehensive mechanistic analyses, we explore the underlying factors of RL's superiority and generalization across languages. Our results provide compelling evidence that RL equips the model with more robust reasoning strategies, offering crucial guidance for more equitable and effective multilingual reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4157. Test-Time Poisoned Sample Detection by Exploiting Shallow Malicious Matching in Backdoored CLIP
๐ฌ 24
โญ 5.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ poisoned sample detection backdoor defense CLIP shallow malicious matching
CLIP, known for its strong semantic matching capabilities derived from large-scale pretraining, has been shown to be vulnerable to backdoor attacks in prior work. In this work, we find that such attacks leave a detectable trace. This trace manifests as a divergence in how image features align with CLIP's text manifold where semantically similar texts cluster. Specifically, benign images exhibit *deep benign matching*, where their features are close not only to the predicted text caption but also to the broader manifold of semantically equivalent variants of that caption. In contrast, poisoned images display *shallow malicious matching*, where their features shallowly align with the specific target caption but remain distant from its semantic neighborhood. Leveraging this insight, we propose **Subspace Detection**, a novel test-time poisoned image detection method against backdoored CLIP. First, for a test image, we approximate its corresponding local text manifold by constructing a low-dimensional subspace from semantically equivalent variants of its predicted text. Second, within this broad subspace, we probe a region-of-interest that maximally amplifies the separation between the two types of images: benign images remain close due to deep matching, while poisoned images deviate significantly due to shallow matching. Finally, we identify whether the test image is poisoned by measuring its deviation from this region; a large deviation indicates a poisoned image. Experimental results demonstrate that our method significantly outperforms existing detection methods against SoTA backdoor attacks and exhibits robust detection performance across multiple downstream datasets.
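A rough numerical sketch of the deviation-from-subspace idea, simplified relative to the paper's region-of-interest probing: embed semantically equivalent caption variants, span a low-dimensional subspace with an SVD, and score an image feature by its residual after projection. The embedding dimension and the synthetic variant embeddings below are assumptions standing in for real CLIP features.

```python
import numpy as np

def subspace_deviation(image_feat, variant_feats, rank=4):
    """Distance of an image feature from the subspace spanned by embeddings
    of semantically equivalent caption variants."""
    X = variant_feats - variant_feats.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:rank].T                              # (d, rank) orthonormal columns
    centered = image_feat - variant_feats.mean(axis=0)
    residual = centered - basis @ (basis.T @ centered)
    return float(np.linalg.norm(residual))

rng = np.random.default_rng(0)
d, n_variants = 64, 16
variants = rng.normal(size=(n_variants, d))
variants /= np.linalg.norm(variants, axis=1, keepdims=True)

benign = variants.mean(axis=0) + 0.05 * rng.normal(size=d)            # near the local manifold
poisoned = rng.normal(size=d); poisoned /= np.linalg.norm(poisoned)   # shallow match only

print("benign deviation:  ", subspace_deviation(benign, variants))
print("poisoned deviation:", subspace_deviation(poisoned, variants))
```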
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4158. DPad: Efficient Diffusion Language Models with Suffix Dropout
๐ฌ 26
โญ 5.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion-based Large Language Models Model Optimization and Efficiency Token Pruning Model Explainability
Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose $\textbf{Diffusion Scratchpad} (\textbf{\textit{DPad}})$, a training-free method that restricts attention to a structured subset of suffix tokens, preserving fidelity while eliminating redundancy. $\textit{DPad}$ integrates two strategies: (i) a $\textit{sliding window}$, which maintains a fixed-length suffix window, and (ii) $\textit{distance-decay dropout}$, which deterministically removes distant suffix tokens before attention computation. This concise design is compatible with existing optimizations such as parallel decoding and prefix caching, and lends itself to a lightweight implementation. Comprehensive evaluations across multiple benchmarks on $\texttt{LLaDA}$ and $\texttt{Dream}$ models demonstrate that $\textit{DPad}$ delivers up to $\mathbf{61.4\times}$ speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference.
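A small sketch of the kind of suffix keep-mask such a scheme implies, combining a fixed-length near window with deterministic distance-based thinning. The window size, decay rate, and stride rule are illustrative assumptions, not DPad's exact schedule.

```python
import numpy as np

def suffix_keep_mask(n_suffix, window=32, decay=0.15):
    """Boolean mask over suffix positions: keep a near window fully, then keep
    only positions that survive an exponentially decaying stride rule."""
    keep = np.zeros(n_suffix, dtype=bool)
    keep[:window] = True                          # sliding window: always attend nearby
    for i in range(window, n_suffix):
        # distance-decay dropout: keep-rate shrinks with distance past the window
        keep_rate = np.exp(-decay * (i - window) / window)
        stride = max(1, int(round(1.0 / max(keep_rate, 1e-6))))
        keep[i] = (i - window) % stride == 0
    return keep

mask = suffix_keep_mask(256)
print(f"kept {mask.sum()} of {mask.size} suffix positions")
```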
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4159. Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
๐ฌ 9
โญ 5.00
๐ learning theory
๐ท๏ธ attention softmax attention deep learning theory sparse information retrieval high-dimensional limit high-dimensional statistics
Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4160. CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
๐ฌ 15
โญ 5.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Reasoning Segmentation Reinforcement Learning Positional Prior Multi-Modal Chain-of-Thought
Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above prior state of the art across both validation and test partitions. Extensive experiments reveal that the quality of the heatmap strongly influences the resulting mask quality, supporting a consistent association between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and predicting masks more precisely. Code, checkpoints and logs will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4161. Redirection for Erasing Memory (REM): Towards a universal unlearning method for corrupted data
๐ฌ 16
โญ 5.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ machine unlearning
Machine unlearning is studied for a multitude of tasks, but specialization of unlearning methods to particular tasks has made their systematic comparison challenging. To address this issue, we propose a conceptual space to characterize diverse corrupted data unlearning tasks in vision classifiers. This space is described by two dimensions, the discovery rate (the fraction of the corrupted data that are known at unlearning time) and the statistical regularity of the corrupted data (from random exemplars to shared concepts). Methods proposed previously have been targeted at portions of this space and, as we show, fail predictably outside these regions. We propose Redirection for Erasing Memory (REM), whose key feature is that corrupted data are redirected to dedicated neurons introduced at unlearning time and then discarded or deactivated to suppress the influence of corrupted data. REM performs strongly across the space of tasks, in contrast to prior SOTA methods that fail outside the regions for which they were designed.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4162. STORM: Synergistic Cross-Scale Spatio-Temporal Modeling for Weather Forecasting
๐ฌ 22
โญ 5.00
๐ learning on time series and dynamical systems
๐ท๏ธ spatial-temporal forecasting
Accurate weather forecasting is crucial for climate research, disaster mitigation, and societal planning. Despite recent progress with deep learning, global atmospheric data remain uniquely challenging since weather dynamics evolve across heterogeneous spatial and temporal scales ranging from planetary circulations to localized phenomena. Capturing such cross-scale interactions within a unified framework remains an open problem. To address this gap, we propose \textbf{STORM}, a spatio-temporal model that disentangles atmospheric variations into multiple scales to uncover scale-specific dependencies. In addition, it enables coherent forecasting across multiple resolutions, maintaining consistent temporal evolution. Experiments on benchmark datasets demonstrate that STORM consistently delivers superior performance across both global and regional settings, as well as for short- and long-term forecasts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4163. Adaptive Gaussian Expansion for On-the-fly Category Discovery
๐ฌ 14
โญ 5.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Novel Category Discovery; Open Set Recognition
On-the-Fly Category Discovery (OCD) aims to address the limitations of transductive learning and closed-set prediction in category discovery tasks by enabling real-time classification of potential future categories using prior knowledge. Existing OCD approaches typically rely on hash-based encodings that map features into low-dimensional hash spaces and directly classify test samples using these encodings. Despite efforts to mitigate the sensitivity of hash functions during testing, these methods still suffer from severe overestimation of the number of categories. In this work, we thoroughly analyze the practical limitations of current OCD methods and formally identify a performance lower bound for the task. Based on this insight, we reformulate OCD into two sub-tasks: Open-Set Recognition and a Fully Novel OCD setting. For all samples, we employ a soft class thresholding strategy to directly detect known classes, which significantly enhances the deployment feasibility of OCD to downstream tasks. For outlier samples, we propose Adaptive Gaussian Expansion (AGE), a dynamic category discovery method that models the Probability Density Functions (PDF) of different classes to uncover potential novel categories in real time. Extensive experiments across multiple datasets demonstrate that our method achieves state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4164. Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
๐ฌ 14
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language models Speculative sampling Auto-regressive generation
Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly constrains the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive, as slight variations of the verifier's distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information.
In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose **Cactus** (**c**onstrained **ac**cep**t**ance spec**u**lative **s**ampling), a method that guarantees controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
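For context, the standard (exact) speculative-sampling acceptance rule that Cactus relaxes looks roughly like the sketch below: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual distribution. This is the textbook SpS baseline, not the proposed constrained-acceptance criterion; the toy vocabulary and distributions are assumptions.

```python
import numpy as np

def sps_accept_step(draft_token, q, p, rng):
    """One exact speculative-sampling verification step.

    q: draft model's distribution over the vocabulary (used to sample draft_token)
    p: verifier model's distribution over the vocabulary
    Returns the token to emit (accepted draft, or residual resample on rejection).
    """
    accept_prob = min(1.0, p[draft_token] / max(q[draft_token], 1e-12))
    if rng.random() < accept_prob:
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

rng = np.random.default_rng(0)
vocab = 8
q = rng.dirichlet(np.ones(vocab))
p = rng.dirichlet(np.ones(vocab))
draft = rng.choice(vocab, p=q)
print("emitted token:", sps_accept_step(draft, q, p, rng))
```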
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4165. Beyond Speedup - Utilizing KV Cache for Sampling and Reasoning
๐ฌ 10
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ Machine Learning LLM
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distil-Qwen-14B, reducing token generation by up to $5.7\times$ with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4166. Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?
๐ฌ 24
โญ 4.80
๐ learning theory
๐ท๏ธ scaling laws signSGD SGD compute-optimal curves power-law random feature stable-decay schedule
We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay.
We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features.
We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters.
Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD.
We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant.
Finally, we observe that a stable-decay schedule, a simplified variant of the widely used warmup-stable-decay (WSD) schedule, further reduces the noise term and sharpens the compute-optimal slope when feature decay is fast but target decay is slow.
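As a quick reminder of the two update rules being compared, here is a minimal one-pass linear-regression comparison of SGD and signSGD on synthetic data with power-law feature scales; the decay exponents, learning rates, and noise level are arbitrary illustrative choices, not the compute-optimal settings derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 200, 5000
spectrum = np.arange(1, d + 1) ** -1.0       # power-law feature scales (assumption)
w_star = np.arange(1, d + 1) ** -0.75        # power-law target coefficients (assumption)

def run(update, lr):
    w = np.zeros(d)
    for _ in range(steps):
        x = rng.normal(size=d) * spectrum    # one fresh sample per step (one-pass)
        y = x @ w_star + 0.1 * rng.normal()
        grad = (x @ w - y) * x               # squared-loss gradient for this sample
        w -= lr * update(grad)
    x_test = rng.normal(size=(2000, d)) * spectrum
    return 0.5 * np.mean((x_test @ (w - w_star)) ** 2)

print("SGD     risk:", run(lambda g: g,          lr=0.05))
print("signSGD risk:", run(lambda g: np.sign(g), lr=0.002))
```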
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4167. MoL: Adaptive Mixture-of-Length Reasoning for Efficient Question Answering with Context
๐ฌ 16
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ Question Answering (Large) Language Models
We present Mixture-of-Length (MoL), an approach for Question Answering (QA) with context that aims to improve the balance between reasoning quality and response efficiency. Our method introduces a principled difficulty assessment based on information-theoretic principles and a dual-objective reward mechanism that adaptively modulates response length. In our experiments, MoL exhibits an emergent behavior termed "intelligent brevity": the model tends to produce shorter responses for simpler queries and longer ones for more complex inputs. This property is desirable for human-computer interaction and can reduce inference costs. A post-hoc analysis of internal activations suggests a correlation between this output adaptivity and the effective number of layers that contribute during inference. On multiple QA benchmarks, MoL demonstrates competitive accuracy while substantially reducing tokens compared to baselines, indicating that difficulty-aware length modulation is a promising direction for efficient QA with context.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4168. SciNav: A Principled Agent Framework for Scientific Coding Tasks
๐ฌ 15
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ science agents; test-time scaling
Autonomous science agents, built on large language models (LLMs), are increasingly being investigated to generate hypotheses, design experiments, and produce reports. Prior science agents primarily focus on open-ended scientific problems, where such outputs (hypotheses, experiments, or analyses) are inherently subjective and thus difficult to evaluate rigorously. In contrast, existing scientific coding benchmarks provide tasks with clearly defined, executable outputs that enable objective assessment. However, current agent-based approaches to these benchmarks remain engineering-driven pipelines, lacking principled framework design. This mismatch exposes a gap: the absence of end-to-end, principled science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be conducted rigorously, and introducing SciNav (Scientific Navigator), an agent framework that enables more effective solution exploration. Our framework is designed to operate efficiently under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles. Inspired by findings that comparative judgments often reveal finer-grained quality differences and therefore provide greater discriminative power than absolute scoring, our framework leverages pairwise relative judgments within a tree search process to select top-K promising solution branches, prune low-potential ones, and progressively narrow down the solution candidates on the selected branches guided by relative comparisons. We demonstrate our agent's effectiveness across different types of tasks on two benchmarks. Experiments show that SciNav significantly outperforms direct prompting and prior agents like OpenHands and Self-Debug across different base models, task types, and difficulty levels, and exceeds alternative comparators such as random selection and LLM absolute scoring. These results confirm the strength of our agent design and highlight the effectiveness of relative judgment-guided top-K search for high-quality scientific coding, marking a step toward more practical science agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4169. NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation
๐ฌ 21
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ AI for Research Automated Peer Review Article Quality Estimation Literature Intelligence Systems
The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2\% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems.
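To illustrate the "train pairwise, predict pointwise" pattern in generic form (not NAIPv2's architecture or its RTS target), the sketch below fits a scorer with a Bradley-Terry-style logistic pairwise loss and then uses it for linear-time pointwise ranking at inference; the toy features, latent quality signal, and network sizes are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_papers, n_feats = 512, 32
feats = torch.randn(n_papers, n_feats)
quality = feats[:, 0] + 0.3 * torch.randn(n_papers)   # latent "quality" driving comparisons

scorer = nn.Sequential(nn.Linear(n_feats, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

for step in range(2000):
    i = torch.randint(n_papers, (256,))
    j = torch.randint(n_papers, (256,))
    s_i, s_j = scorer(feats[i]).squeeze(-1), scorer(feats[j]).squeeze(-1)
    label = (quality[i] > quality[j]).float()          # 1 if paper i is judged better than j
    # Bradley-Terry / logistic pairwise loss on the score difference
    loss = nn.functional.binary_cross_entropy_with_logits(s_i - s_j, label)
    opt.zero_grad(); loss.backward(); opt.step()

# Pointwise deployment: one forward pass per paper, linear-time ranking.
with torch.no_grad():
    pred = scorer(feats).squeeze(-1)
corr = torch.corrcoef(torch.stack([pred, quality]))[0, 1]
print("score/quality correlation:", round(corr.item(), 3))
```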
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4170. Action-Free Offline-To-Online RL via Discretised State Policies
๐ฌ 17
โญ 4.80
๐ reinforcement learning
๐ท๏ธ Action free Offline to Online Reinforcement Learning Online Reinforcement Learning Offline Reinforcement Learning
Most existing offline RL methods presume the availability of action labels within the dataset, but in many practical scenarios, actions may be missing due to privacy, storage, or sensor limitations. We formalise the setting of action-free offline-to-online RL, where agents must learn from datasets consisting solely of $(s,r,s')$ tuples and later leverage this knowledge during online interaction. To address this challenge, we propose learning state policies that recommend desirable next-state transitions rather than actions. Our contributions are twofold. First, we introduce a simple yet novel state discretisation transformation and propose Offline State-Only DecQN (OSO-DecQN), a value-based algorithm designed to pre-train state policies from action-free data. OSO-DecQN integrates the transformation to scale efficiently to high-dimensional problems while avoiding instability and overfitting associated with continuous state prediction. Second, we propose a novel mechanism for guided online learning that leverages these pre-trained state policies to accelerate the learning of online agents. Together, these components establish a scalable and practical framework for leveraging action-free datasets to accelerate online RL. Empirical results across diverse benchmarks demonstrate that our approach improves convergence speed and asymptotic performance, while analyses reveal that discretisation and regularisation are critical to its effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4171. Improving Black-Box Generative Attacks via Generator Semantic Consistency
๐ฌ 7
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ adversarial transferability transferable adversarial attacks generative models
Transfer attacks optimize on a surrogate and deploy to a black-box target. Iterative optimization attacks in this paradigm are limited by their per-input cost: multistep gradient updates for each input hurt efficiency and scalability. Generative attacks alleviate this by producing adversarial examples in a single forward pass at test time. However, current generative attacks still adhere to optimizing surrogate losses (e.g., feature divergence) and overlook the generator's internal dynamics, underexploring how the generator's internal representations shape transferable perturbations. To address this, we enforce semantic consistency by aligning the generator's early intermediate features to an exponential moving average (EMA) teacher, stabilizing object-aligned representations and improving black-box transfer without inference-time overhead. To ground the mechanism, we quantify semantic stability as the standard deviation of foreground IoU between cluster-derived activation masks and foreground masks across generator blocks, and observe reduced semantic drift under our method. For more reliable evaluation, we also introduce the Accidental Correction Rate (ACR) to separate inadvertent corrections from intended misclassifications, addressing the inherent blind spots of the traditional Attack Success Rate (ASR), Fooling Rate (FR), and Accuracy metrics. Across architectures, domains, and tasks, our approach can be seamlessly integrated into existing generative attacks with consistent improvements in black-box transfer, while maintaining test-time efficiency.
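A minimal sketch of the EMA-teacher consistency mechanism in isolation (generic feature alignment, not the full generative-attack training loop): the teacher's parameters track an exponential moving average of the student generator's, and an auxiliary loss pulls the student's early intermediate features toward the teacher's. The toy generator, the placeholder surrogate objective, and the choice of which block to align are assumptions.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.early = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.late = nn.Sequential(nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
    def forward(self, x):
        feat = self.early(x)          # intermediate features we want to keep stable
        return self.late(feat), feat

student = TinyGenerator()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
ema_decay, consistency_weight = 0.999, 1.0

for step in range(50):
    x = torch.randn(8, 3, 32, 32)
    out, feat_s = student(x)
    with torch.no_grad():
        _, feat_t = teacher(x)
    attack_loss = -out.abs().mean()                       # placeholder surrogate objective
    consistency = nn.functional.mse_loss(feat_s, feat_t)  # semantic-consistency term
    loss = attack_loss + consistency_weight * consistency
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                 # EMA update of the teacher
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1 - ema_decay)

print("final consistency loss:", round(consistency.item(), 5))
```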
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4172. (Token-Level) \textbf{InfoRMIA}: Stronger Membership Inference and Privacy Assessment for LLMs
๐ฌ 15
โญ 4.80
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ membership inference attack mia privacy llm privacy memorization
Machine learning models are known to leak sensitive information, as they inevitably memorize (parts of) their training data. More alarmingly, large language models (LLMs) are now trained on nearly all available data, which amplifies the magnitude of information leakage and raises serious privacy risks. Hence, it is more crucial than ever to quantify privacy risk before the release of LLMs. The standard method to quantify privacy is via membership inference attacks, where the state-of-the-art approach is the Robust Membership Inference Attack (RMIA). In this paper, we present InfoRMIA, a principled information-theoretic formulation of membership inference. Our method consistently outperforms RMIA across benchmarks while also offering improved computational efficiency.
In the second part of the paper, we identify the limitations of treating sequence-level membership inference as the gold standard for measuring leakage. We propose a new perspective for studying membership and memorization in LLMs: token-level signals and analyses. We show that a simple token-based InfoRMIA can pinpoint which tokens are memorized within generated outputs, thereby localizing leakage from the sequence level down to individual tokens, while achieving stronger sequence-level inference power on LLMs. This new scope rethinks privacy in LLMs and can lead to more targeted mitigation, such as exact unlearning.
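As a toy illustration of the kind of token-level signal involved (per-token log-likelihoods under the target model, which sequence-level attacks aggregate away), the sketch below extracts per-token log-probabilities from logits. The tiny random "model" stands in for an actual LLM, and the highlighting rule is an illustrative stand-in, not the InfoRMIA statistic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 100, 12

# Stand-in for target-model logits over a candidate sequence (a real attack
# would obtain these from the LLM under audit).
logits = torch.randn(seq_len, vocab)
token_ids = torch.randint(vocab, (seq_len,))

log_probs = F.log_softmax(logits, dim=-1)
token_logp = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # (seq_len,)

# A simple per-token membership-style signal: how unusually confident the model is
# on each token relative to the sequence average (illustrative only).
signal = token_logp - token_logp.mean()
suspects = (signal > signal.std()).nonzero(as_tuple=True)[0]
print("tokens with unusually high confidence:", suspects.tolist())
```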
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4173. ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
๐ฌ 29
โญ 4.80
๐ datasets and benchmarks
๐ท๏ธ Embodied Agents Vision Language Models Benchmarking World Modeling
Embodied cognition argues that intelligence arises from continuous sensorimotor interaction with the world. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? To investigate this, we introduce **ENACT**, a benchmark that probes this question through world modeling from egocentric interaction. Grounded in a partially observable Markov decision process (POMDP) framework, **ENACT** comprises two complementary sequence reordering tasks: forward world modeling (predicting an ordered sequence of future states from actions) and inverse world modeling (inferring an ordered sequence of actions from state changes). Correctly solving these tasks indicates that the model has a solid understanding of how the environment will evolve given one's actions. Our scalable dataset contains 8,972 QA pairs derived from diverse, long-horizon household activities in the BEHAVIOR simulator. Experiments reveal a significant performance gap between state-of-the-art VLMs and humans, which widens dramatically as interaction horizons lengthen. We find that models consistently solve the inverse problem better than the forward one and exhibit strong embodied biases, showing a preference for right-handed actions and performance degradation with camera perspectives that deviate from those of human vision. Code and supplementary materials are available in our [anonymous repository](https://github.com/iclrsubmission2026/iclr-2026-submission).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4174. TIPS: Turn-level Information-Potential Reward Shaping for Search-Augmented LLMs
๐ฌ 29
โญ 4.80
๐ reinforcement learning
๐ท๏ธ Agentic LLM Reinforcement Learning Question Answering
Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains a significant challenge: optimization is often unstable due to sparse rewards and difficult credit assignment across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
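A small sketch of the potential-based shaping form the method builds on: each turn's shaped bonus is the discounted change in a potential, here taken to be a teacher's log-likelihood of the gold answer after that turn. The toy potential values are assumptions, and the exact turn segmentation and teacher used by TIPS are not modeled.

```python
def shaped_turn_rewards(potentials, final_reward, gamma=1.0):
    """Potential-based shaping over a multi-turn trajectory.

    potentials: list [phi_0, ..., phi_T], e.g. teacher log P(gold answer | context so far)
    final_reward: sparse outcome reward delivered at the last turn.
    Returns one shaped reward per turn; the shaping terms telescope, so the
    optimal policy is unchanged (policy invariance).
    """
    rewards = []
    for t in range(len(potentials) - 1):
        shaping = gamma * potentials[t + 1] - potentials[t]
        base = final_reward if t == len(potentials) - 2 else 0.0
        rewards.append(base + shaping)
    return rewards

# Toy trajectory: each search/reasoning turn raises the teacher's belief in the answer.
phi = [-3.2, -2.1, -1.5, -0.4]     # assumed teacher log-likelihoods after turns 0..3
print(shaped_turn_rewards(phi, final_reward=1.0))
```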
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4175. Learning to Grasp Anything By Playing with Random Toys
๐ฌ 21
โญ 4.80
๐ applications to robotics, autonomy, planning
๐ท๏ธ Generalizable Grasping Object-centric Representation Zero-shot Robotic Manipulation
Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these "toys" enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4176. Learning multimodal dictionary decompositions with group-sparse autoencoders
๐ฌ 12
โญ 4.80
๐ interpretability and explainable AI
๐ท๏ธ sparse autoencoders dictionary learning multimodal representation learning group sparsity interpretability
The Linear Representation Hypothesis asserts that the embeddings learned by neural networks can be understood as linear combinations of features corresponding to high-level concepts. Based on this ansatz, sparse autoencoders (SAEs) have recently become a popular method for decomposing embeddings into a sparse combination of linear directions, which have been shown empirically to often correspond to human-interpretable semantics. However, recent attempts to apply SAEs to multimodal embedding spaces (such as the popular CLIP embeddings for image/text data) have found that SAEs often learn ``split dictionaries,'' where most of the learned sparse features are essentially unimodal, active only for data of a single modality. In this work, we study how to effectively adapt SAEs for the setting of multimodal embeddings while ensuring multimodal alignment. We first argue that the existence of a split dictionary decomposition on an aligned embedding space implies the existence of a non-split dictionary with improved modality alignment. Then, we propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization. We apply our method to popular embeddings for image/text (CLIP) and audio/text (CLAP) data and show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity. We finally demonstrate how this improvement in alignment of concepts between modalities can enable improvements in the interpretability and control of cross-modal tasks.
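A minimal sketch of the group-sparsity ingredient: an L2,1-style penalty that ties each latent's image-side and text-side activations into one group, so a latent is encouraged to fire for both modalities or for neither. This is applied to a toy shared sparse autoencoder; the dimensions, synthetic paired embeddings, and loss weights are illustrative assumptions rather than the paper's configuration (the cross-modal masking component is omitted).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k = 64, 256          # embedding dim, dictionary size (assumptions)

class SAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d, k)
        self.dec = nn.Linear(k, d, bias=False)
    def forward(self, x):
        z = torch.relu(self.enc(x))
        return self.dec(z), z

sae = SAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(200):
    # Paired embeddings for the same underlying concept (toy stand-in for CLIP pairs).
    concept = torch.randn(32, d)
    img = concept + 0.1 * torch.randn(32, d)
    txt = concept + 0.1 * torch.randn(32, d)

    rec_img, z_img = sae(img)
    rec_txt, z_txt = sae(txt)
    recon = ((rec_img - img) ** 2).mean() + ((rec_txt - txt) ** 2).mean()

    # Group-sparse (L2,1) penalty: each latent's image/text activations form one group.
    groups = torch.stack([z_img, z_txt], dim=-1)          # (batch, k, 2)
    group_penalty = groups.norm(dim=-1).mean()

    loss = recon + 0.05 * group_penalty
    opt.zero_grad(); loss.backward(); opt.step()

print("final loss:", round(loss.item(), 4))
```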
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4177. Efficient Prediction of Large Protein Complexes via Subunit-Guided Hierarchical Refinement
๐ฌ 25
โญ 4.80
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein complex structure prediction AlphaFold3 complex modularity
State-of-the-art protein structure predictors have revolutionized structural biology, yet quadratic memory growth with token length makes end-to-end inference impractical for large complexes beyond a few thousand tokens. We introduce \textsc{HierAFold}, a hierarchical pipeline that exploits the modularity of large complexes via PAE-guided (Predicted Aligned Error) subunit decomposition, targeted interface-aware refinement, and confidence-weighted assembly. PAE maps localize rigid intra-chain segments and sparse inter-chain interfaces, enabling joint refinement of likely interacting subunits to capture multi-body cooperativity without increasing memory. \textsc{HierAFold} matches AlphaFold3 accuracy and raises success rates from 49.9\% (CombFold) to 73.1\% on a recent PDB set. For large complexes, it cuts peak memory by $\sim$25\,GB on a 4{,}000-token target ($\sim$40\%), successfully models complexes with over $5{,}000$ tokens that are out-of-memory for AlphaFold3, and doubles success rates compared with CombFold.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4178. COSMOS: A Hybrid Adaptive Optimizer for Efficient Training of Large Language Models
๐ฌ 21
โญ 4.80
๐ optimization
๐ท๏ธ Efficient Optimizer for LLMs Preconditioning Muon SOAP Adam
Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but these methods lose the gradient information in the residual space, resulting in less effective optimization. In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4179. LLM-Guided Evolutionary Program Synthesis for Quasi-Monte Carlo Design
๐ฌ 5
โญ 4.80
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Evolutionary Computation Large Language Models (LLMs) Quasi-Monte Carlo (QMC) Star Discrepancy Sobol' Sequences Algorithmic Discovery Quantitative Finance
Low-discrepancy point sets and digital sequences underpin quasi-Monte Carlo (QMC) methods for high-dimensional integration. We cast two long-standing QMC design problems as program synthesis and solve them with an LLM-guided evolutionary loop that mutates and selects code under task-specific fitness: (i) constructing finite 2D/3D point sets with low star discrepancy, and (ii) choosing Sobol' direction numbers that minimize randomized quasi-Monte Carlo (rQMC) error on downstream integrands. Our two-phase procedure combines constructive code proposals with iterative numerical refinement. On finite sets, we rediscover known optima in small 2D cases and set new best-known 2D benchmarks for N ≥ 40, while matching known 3D optima up to the proven frontier (N ≤ 8) and reporting improved 3D benchmarks beyond. On digital sequences, evolving Sobol' parameters yields consistent reductions in rQMC mean-squared error for several 32-dimensional option-pricing tasks relative to widely used Joe–Kuo parameters, while preserving extensibility to any sample size and compatibility with standard randomizations. Taken together, the results demonstrate that LLM-driven evolutionary program synthesis can automate the discovery of high-quality QMC constructions, recovering classical designs where they are optimal and improving them where finite-N structure matters. Data and code are available at anonymous.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4180. Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression
๐ฌ 16
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ dataset pruning dataset compression quantization
Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image, particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction.
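To make the color-space angle concrete, here is a plain k-means palette quantization of a single image, the generic operation DCQ builds on, without its model-guided color selection or cross-image palette sharing; the palette size and the synthetic image are assumptions.

```python
import numpy as np

def quantize_palette(image, n_colors=16, iters=20, seed=0):
    """Reduce an HxWx3 uint8 image to `n_colors` colors with plain k-means."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3).astype(np.float32)
    palette = pixels[rng.choice(len(pixels), n_colors, replace=False)]
    for _ in range(iters):
        dists = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(n_colors):
            members = pixels[assign == c]
            if len(members):
                palette[c] = members.mean(0)
    return palette[assign].reshape(image.shape).astype(np.uint8), palette

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)   # synthetic image
quantized, palette = quantize_palette(img, n_colors=8)
print("unique colors before/after:",
      len(np.unique(img.reshape(-1, 3), axis=0)),
      len(np.unique(quantized.reshape(-1, 3), axis=0)))
```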
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4181. DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
๐ฌ 27
โญ 4.80
๐ datasets and benchmarks
๐ท๏ธ Synthetic Data Large Language Models Reasoning
Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (e.g., book corpus and web corpus) to generate multidisciplinary challenging questions. We introduce the concept of "design logic" and instruct LLMs to mimic human educators' question-creation process, enabling the automated synthesis of large-scale, high-difficulty questions. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with source documents, we are able to generate reasoning questions with controllable question types and difficulty levels. Using this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. We validate our synthesized data through supervised fine-tuning (SFT) on the Qwen3 and Llama3 model families. Our data substantially enhances their multidisciplinary reasoning capabilities, outperforming existing datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their official final models that have undergone the full post-training process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4182. MergOPT: A Merge-Aware Optimizer for Robust Model Merging
๐ฌ 24
โญ 4.80
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Model Merging Fine-tuning Multi-task Learning
Model merging aims to integrate multiple independently fine-tuned expert models into a single model while preserving the knowledge of all experts. However, existing approaches mainly address parameter conflicts at the merging stage and overlook the role of the fine-tuning process, which often leads to significant post-merge performance degradation. To address this limitation, we propose a novel merging-aware optimizer (abbreviated as MergOPT) that injects principled merge-induced parameter shifts into the weight update steps so that the fine-tuned model exhibits a more stable loss landscape under subsequent merging operations. Specifically, we first formulate model merging as a distributionally robust optimization problem in the weight space: the parameters of other experts to be merged are viewed as adversarial merge-offsets, and fine-tuning adapts to the worst-case merging scenario. Building on this formulation, we analyze the distribution of parameter updates and the effects of merging hyperparameters, from which we derive a merging-guided feasible region for weight shifts. Finally, extensive experiments across four large language models (LLMs) and one vision model show that our approach consistently outperforms standard fine-tuning, yielding an average relative gain of 3.5\% and a maximum gain of 9.5\% across four merging strategies when merging seven experts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4183. gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity
๐ฌ 16
โญ 4.80
๐ learning on graphs and other geometries & topologies
๐ท๏ธ oversquashing associative memory graph neural network
Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed-size vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node's representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our synthetic capacity task and on a range of real-world graph benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4184. SinkTrack: Attention Sink based Context Anchoring for Large Language Models
๐ฌ 44
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Multimodal Large Language Models Hallucination Context Forgetting
Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs' focus shifts towards newly generated tokens and away from the initial input context. To address this, we make use of a related, intrinsic characteristic of LLMs: the attention sink, the tendency to consistently allocate high attention to the very first token (i.e., ⟨BOS⟩) of a sequence. Concretely, we propose an advanced context anchoring method, SINKTRACK, which treats ⟨BOS⟩ as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, the LLM remains anchored to the initial input context throughout the entire generation process. SINKTRACK is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SINKTRACK mitigates hallucination and context forgetting across both textual (e.g., +18.9% on QuAC with Llama3.1-8B-Instruct) and multi-modal (e.g., +23.0% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore its robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at anonymous GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4185. Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study
๐ฌ 29
โญ 4.80
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Alignment Harmful Fine-Tuning Large Language Models
Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective.
We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is available anonymously at: https://anonymous.4open.science/r/safety-subspaces-anon-E3CB.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4186. Let OOD Feature Exploring Vast Predefined Classifiers
๐ฌ 8
โญ 4.80
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Out of Distribution Representation Learning Neural Collapse Evidential Deep Learning
Real-world out-of-distribution (OOD) data exhibit broad, continually evolving distributions, rendering reliance solely on in-distribution (ID) data insufficient for robust detection. Consequently, methods leveraging auxiliary Outlier Exposure (OE) data have emerged, substantially enhancing generalization by jointly fine-tuning models on ID and large-scale OE data. However, many existing approaches primarily enforce orthogonality between ID and OE features while pushing OE predictions toward near-uniform, low-confidence scores, thus overlooking the controllability of representation geometry. We propose Vast Predefined Classifiers (VPC), which constructs a pre-specified Orthogonal Equiangular Feature Space (OEFS) to explicitly separate ID and OOD representations while capturing the rich variability of OOD features. We employ evidential priors to align ID features with their class-specific Equiangular Basic Vectors (EBVs), thereby preserving ID performance. In parallel, a new VEBV loss encourages OE features to explore the subspace spanned by Vast EBVs (VEBVs), enabling a rich characterization of diverse OOD patterns. This dual optimization, coupled with the prescribed geometric representation space, promotes optimal orthogonality between ID and OOD representations. Furthermore, we introduce the VPC Score, a discriminative metric based on the L2 activation intensity of features over the predefined classifiers. Extensive experiments across diverse OOD settings and training paradigms on benchmarks including CIFAR-10/100 and the ImageNet-1k, demonstrate strong and robust performance, validating VPC's effectiveness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4187. SCUBA: Salesforce Computer Use Benchmark
๐ฌ 26
โญ 4.80
๐ datasets and benchmarks
๐ท๏ธ Computer-Use Agents Enterprise Benchmark CRM Vision Language Model
We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including enterprise software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments with support for parallel execution and fine-grained evaluation metrics that capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings, and observe large performance gaps across agent design paradigms as well as between open-source and closed-source models. In the zero-shot setting, computer-use agents powered by open-source models that perform strongly on related benchmarks such as OSWorld achieve less than a 5\% success rate on SCUBA, while methods built on closed-source models reach up to a 39\% task success rate. In the demonstration-augmented setting, task success rates improve to 50\% while simultaneously reducing time and cost by 13\% and 16\%, respectively. These findings highlight both the challenges of enterprise task automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4188. Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning
๐ฌ 12
โญ 4.80
๐ reinforcement learning
๐ท๏ธ Offline Reinforcement Learning Policy Constraint Behavioral cloning
Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. However, because the scale of the constraints varies across tasks and datasets of differing quality, existing methods must meticulously tune hyperparameters to match each dataset, which is time-consuming and often impractical. To bridge this gap, we propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that automatically adjusts the scale of policy constraints during training. We theoretically analyze its performance improvement guarantee. In experiments on 39 datasets across four D4RL domains, ASPC using a single hyperparameter configuration outperforms other adaptive constraint methods and state-of-the-art offline RL algorithms that require per-dataset tuning, achieving an average 35\% improvement in normalized performance over the baseline. Moreover, ASPC consistently yields additional gains when integrated with a variety of existing offline RL algorithms, demonstrating its broad generality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4189. Identifying and Evaluating Inactive Heads in Pretrained LLMs
๐ฌ 17
โญ 4.80
๐ interpretability and explainable AI
๐ท๏ธ dormant attention multi-head attention attention heads attention sinks
Attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives the most attention despite limited semantic importance, suggest some heads may be inactive, and point to a significant source of computational redundancy. To analyze this phenomenon, we propose a taxonomy of 12 score functions that measure different ways a head can be inactive. Thresholding these scores allows us to analyze different sets of potentially inactive attention heads. We evaluate whether identified heads are inactive through model interventions, finding that more than 12% of attention heads are inactive on average, and can be ablated in specific contexts while maintaining MMLU accuracy to within 1% of the pretrained LLM. Across 3 model families, our score functions that measure the average norm of a head's output consistently identify inactive heads that would not have been found by score functions that rely solely on attention weights. We establish that relying on a score function that measures a first token attention sink would underestimate the prevalence of inactive heads, failing to identify more than 7% of inactive heads on average. We also show how measuring score distributions can provide insights into attention behavior. For instance, we find evidence that finetuning causes little to no change in attention behavior, and that even within the same model family, large model scales present markedly different attention behaviors.
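To make the contrast between attention-weight-based and output-norm-based scores concrete, here is a minimal, hedged sketch of two illustrative score functions; the shapes, thresholds, and function names are assumptions for illustration, not the paper's taxonomy or implementation.

```python
import torch

def first_token_sink_score(attn: torch.Tensor) -> torch.Tensor:
    """Average attention mass placed on the first (sink) token, per head.

    attn: [batch, heads, q_len, k_len] post-softmax attention weights.
    Returns [heads]; values near 1 suggest a head that mostly attends to token 0.
    """
    return attn[..., 0].mean(dim=(0, 2))  # average over batch and query positions

def output_norm_score(head_out: torch.Tensor) -> torch.Tensor:
    """Average L2 norm of each head's output vectors (before the output projection).

    head_out: [batch, heads, q_len, head_dim].
    Returns [heads]; values near 0 suggest a head contributing little to the residual stream.
    """
    return head_out.norm(dim=-1).mean(dim=(0, 2))

# Toy usage: flag heads whose output norm falls below a (hypothetical) relative threshold.
attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
head_out = torch.randn(2, 8, 16, 64)
sink, norm = first_token_sink_score(attn), output_norm_score(head_out)
inactive = (norm < 0.1 * norm.mean()).nonzero().flatten()
print(sink.shape, norm.shape, inactive.tolist())
```

A head can score low on the output-norm criterion without ever attending to the first token, which is why a sink-only score can miss inactive heads.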
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4190. ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
๐ฌ 29
โญ 4.80
๐ generative models
๐ท๏ธ Long context Efficient decoding KV cache compression
Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key-Value (KV) caches, especially for long sequences.
Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial-temporal dynamics in KV selection.
In this paper, we propose ReST-KV, a robust KV eviction method that combines layer-wise output **Re**construction and **S**patial-**T**emporal smoothing to provide a more comprehensive perspective for the KV cache eviction task.
Specifically, ReST-KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer-wise reconstruction. By directly modeling how each token's removal affects the model output, our method naturally captures attention redistribution effects, going beyond simplistic reliance on raw attention weights.
To further enhance robustness, we design exponential moving average smoothing to handle temporal variations and an adaptive window-based mechanism to capture spatial patterns.
Our method, ReST-KV, significantly advances performance on long-context benchmarks. It surpasses state-of-the-art baselines by 2.58\% on LongBench and 15.2\% on RULER. Additionally, ReST-KV consistently outperforms existing methods on Needle-in-a-Haystack and InfiniteBench, all while achieving a remarkable 10.61$\times$ reduction in decoding latency at 128k context length. The code is included in the supplementary material and is designed for easy reproduction.
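The spatial-temporal smoothing component can be illustrated in isolation. The sketch below applies an exponential moving average over decoding steps and a sliding-window average over token positions to a generic per-token eviction score; the score itself, the layer-wise reconstruction objective, and all names and hyperparameters here are assumptions, not the paper's implementation.

```python
from typing import Optional
import torch
import torch.nn.functional as F

def smooth_scores(raw: torch.Tensor, prev_ema: Optional[torch.Tensor],
                  beta: float = 0.9, window: int = 5):
    """Spatial-temporal smoothing of per-token KV-cache eviction scores (illustrative only).

    raw:      [seq_len] raw importance score for each cached token at the current step.
    prev_ema: EMA from the previous step, or None; if shorter (the cache grew), new
              tokens are initialised with their raw score.
    Returns (smoothed, ema): spatially smoothed scores for eviction, and the updated EMA.
    """
    if prev_ema is None:
        ema = raw.clone()
    else:
        if prev_ema.numel() < raw.numel():            # cache grew by some tokens
            prev_ema = torch.cat([prev_ema, raw[prev_ema.numel():]])
        ema = beta * prev_ema + (1 - beta) * raw      # temporal (EMA) smoothing
    kernel = torch.ones(1, 1, window) / window        # spatial (window) smoothing
    smoothed = F.conv1d(ema.view(1, 1, -1), kernel, padding=window // 2).view(-1)
    return smoothed[: raw.numel()], ema

# Toy usage: keep the tokens with the highest smoothed scores, evict the rest.
scores, ema = smooth_scores(torch.rand(32), None)
keep = scores.topk(k=24).indices.sort().values
```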
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4191. Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization
๐ฌ 38
โญ 4.80
๐ generative models
๐ท๏ธ Training-free acceleration Diffusion transformer Error correction
Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration but suffer from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate this error. However, their fixed caching strategies fail to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose CEM, a novel fidelity-optimization plugin for existing error correction methods based on cumulative error minimization. CEM predefines the error to characterize the model's sensitivity to acceleration as jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which minimizes the caching error and yields a substantial improvement in generation fidelity. CEM is model-agnostic, generalizes well, and adapts to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantization methods across three tasks demonstrate that CEM significantly improves the generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-$\alpha$, StableDiffusion1.5, and Hunyuan. The code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4192. Reducing Semantic Mismatch in Brain-to-Text Decoding Through Personalized Multimodal Masking
๐ฌ 17
โญ 4.80
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain-to-text reconstruction Neural decoding Semantic decoding Multimodal learning
The rapid progress of large vision-language models (VLMs), such as CLIP, has spurred the development of a wide range of neural decoding frameworks. Nevertheless, most existing approaches still suffer from semantic mismatches during representational alignment. This challenge may stem from the fact that the human brain does not distribute attention uniformly across a visual scene, but rather selectively encodes salient or relevant regions. Moreover, such selectivity is closely related to individual interests and varies from person to person. To address this challenge, we propose Yo'Mind, a novel optimal transport (OT)-driven personalized multimodal semantic masking framework designed to bridge the semantic gap between brain and machines in interpreting visual scenes. Technically, Yo'Mind introduces a dynamic semantic pruning and allocation mechanism that adaptively masks redundant visual semantic components in stimulus images based on individual neural responses, without requiring extra human supervision or hyperparameter tuning. This strategy can be used to enhance semantic consensus between brain and machine representations during decoding. Furthermore, the inherent flexibility of OT theory enables Yo'Mind to perform brain-visual-linguistic alignment and cross-subject decoding within a unified end-to-end architecture. Extensive experiments demonstrate that our Yo'Mind offers several advantages, including state-of-the-art brain-to-text reconstruction performance and improved interpretability of the decoding process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4193. TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation
๐ฌ 18
โญ 4.80
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Quantization LLMs Hessian Attention
The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs.
Among PTQ methods, GPTQ has gained considerable attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours.
However, GPTQ assumes layer-wise independence, suffering severe accuracy drop in low-bit regimes.
Recently, BoA improves upon GPTQ by incorporating inter-layer dependencies within the attention module, but it requires sequential quantization across all out-channels, making it substantially less efficient than GPTQ.
In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process.
The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, reducing sequential operations and yielding a 4-6 times speedup; (ii) correction for distortions propagated from preceding quantized Transformer blocks; and (iii) adaptive grid selection with attention-wise refinement to prevent misalignment during iterative updates.
Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy, and when combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4194. On the Convergence Direction of Gradient Descent
๐ฌ 5
โญ 4.80
๐ optimization
๐ท๏ธ Gradient Descent Edge of Stability Convergence Direction
Gradient descent (GD) is a fundamental optimization method in deep learning, yet its dynamics near the Edge of Stability (EoS) remain unclear despite empirical evidence showing GD often operates in this regime. In this paper, we prove that if GD converges, its trajectory either aligns toward a fixed direction or oscillates along a specific line. The fixed-direction convergence occurs under small learning rates, while the oscillatory convergence behavior emerges for large learning rates. This result offers a new lens for understanding long-term GD dynamics. In particular, our result sheds light on the phenomenon of EoS, explaining why sharpness oscillates even as the loss converges. Experimentally, we find that this directional convergence behavior also appears in stochastic gradient descent (SGD) and Adam. These findings suggest a broader underlying principle governing the directional structure of optimization trajectories. Our work provides both theoretical clarity and practical insight into the dynamics of multiple optimization methods.
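A standard quadratic example (an illustration of the dichotomy, not the paper's argument): for $L(x)=\tfrac{1}{2}x^\top H x$ with eigenpairs $(\lambda_i, v_i)$, GD gives $x_{t+1}=(I-\eta H)x_t$, so $x_t=\sum_i (1-\eta\lambda_i)^t \langle x_0, v_i\rangle v_i$. The component whose factor $|1-\eta\lambda_i|$ is largest dominates as $t$ grows. If that factor lies in $(0,1)$ (small learning rate), the normalized iterate $x_t/\lVert x_t\rVert$ settles onto a fixed direction $\pm v_i$; if it lies in $(-1,0)$ (large learning rate), the dominant coefficient flips sign every step, so the iterates oscillate back and forth along the line spanned by $v_i$ even though the loss still converges.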
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4195. Proper Velocity Neural Networks
๐ฌ 15
โญ 4.80
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Hyperbolic geometry Geometric deep learning Manifold learning Proper velocity Representation learning Riemannian geometry
Hyperbolic neural networks (HNNs) have shown remarkable success in representing hierarchical and tree-like structures, yet most existing work relies on the Poincaré ball and hyperboloid models. While these models admit closed-form Riemannian operators, their constrained nature potentially leads to numerical instabilities, especially near model boundaries. In this work, we explore the Proper Velocity (PV) manifold, an unconstrained representation of hyperbolic space rooted in Einstein's special relativity, as a stable alternative. We first establish the complete Riemannian toolkit of the PV space. Building on this foundation, we introduce Proper Velocity Neural Networks (PVNNs) with core layers including Multinomial Logistic Regression (MLR), Fully Connected (FC), convolutional, activation, and batch normalization layers. Extensive experiments across four domains, namely numerical stability, graph node classification, image classification, and genomic sequence learning, demonstrate the stability and effectiveness of PVNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4196. Graph Diffusion Transformers are In-Context Molecular Designers
๐ฌ 22
โญ 4.80
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Inverse Molecular Design In Context Learning Diffusion Models Transformers
In-context learning lets large models adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design, where labeled data are scarce and properties span millions of biological assays and material measurements. We introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts through molecule-score examples instead of texts. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5$\times$ fewer nodes. We pretrain a 0.7B parameter model on datasets covering drugs and materials. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000$\times$ larger and achieves an average rank of 4.10 compared to 6.56-17.95 for 19 baselines. These results position DemoDiff as a molecular foundation model for in-context molecular design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4197. Activation Function Design Sustains Plasticity in Continual Learning
๐ฌ 19
โญ 4.80
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ loss of plasticity continual learning lifelong learning continual reinforcement learning activation functions
In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (loss of plasticity), and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities, Smooth-Leaky and Randomized Smooth-Leaky, and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4198. Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Finetuning
๐ฌ 21
โญ 4.80
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Data Selection; Task-specific model fine-tuning
Task-specific fine-tuning of foundation models is critically dependent on the quality and relevance of the instruction data. While prevailing data selection methods rely exclusively on instruction instances X to approximate the target distribution, we argue that selection should align with the joint distribution of instructions and task-specific labels (X, Y). However, task-specific labels Y are typically unavailable in practice. To address this, we reformulate the task-specific data selection problem and present a novel pipeline that leverages the reasoning capabilities of large language models (LLMs) to infer proxy labels, thereby facilitating joint distribution alignment. Our approach begins by propagating proxy labels from a small target set to a large, unlabeled source corpus. A two-stage filtering process then removes instances with label noise and refines the subset through distribution alignment. This strategy produces more semantically meaningful and task-aware selections than conventional similarity measures based on X alone. Experimental results show that fine-tuning on a subset of only 10K samples, selected from a pool of 300K, achieves performance competitive with or superior to state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4199. OrchestrationBench: LLM-Driven Agentic Planning and Tool Use in Multi-Domain Scenarios
๐ฌ 5
โญ 4.80
๐ datasets and benchmarks
๐ท๏ธ OrchestrationBench Workflow-based Planning Constraint-aware Tool Use
Recent progress in Large Language Models (LLMs) has transformed them from text generators into agentic systems capable of multi-step reasoning, structured planning, and tool use. However, existing benchmarks inadequately capture their ability to orchestrate complex workflows across multiple domains under realistic constraints. To address this, we propose OrchestrationBench, a multilingual (English/Korean) benchmark that systematically evaluates (1) workflow-based planning and (2) constraint-aware tool execution. OrchestrationBench spans 17 representative domains with nearly 100 realistic virtual tools, covering scenarios that require sequential/parallel planning and compliance with business constraints. Unlike previous work, it explicitly disentangles planning evaluation from tool execution evaluation, which assesses tool selection, argument extraction, validation, and rejection handling. Constructed entirely through manual annotation with cultural adaptation, the benchmark ensures authenticity, diversity, and freedom from model-specific biases. Extensive experiments across state-of-the-art models show that function calling performance is relatively consistent, whereas planning capabilities exhibit substantial variation across models, emphasizing the need for structured planning evaluation. As a living benchmark, OrchestrationBench is designed to expand toward new domains, tools, and integrations, enabling rigorous, cross-cultural, and service-ready evaluation of LLM orchestration capabilities. The benchmark is publicly available at GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4200. Temporally Detailed Hypergraph Neural ODE for Type 2 Diabetes Progression Modeling
๐ฌ 11
โญ 4.80
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Disease Progression Modeling Neural ODE Temporally Detailed Hypergraph
Disease progression modeling aims to characterize and predict how a patient's disease complications worsen over time based on longitudinal electronic health records (EHRs). Accurate modeling of disease progression, such as type 2 diabetes, can enhance patient sub-phenotyping and inform effective and timely interventions. However, the problem is challenging due to the need to learn continuous-time dynamics of progression patterns based on irregular-time event samples and patient heterogeneity (e.g., different progression rates and pathways). Existing mechanistic and data-driven methods either lack adaptability to learn from real-world data or fail to capture complex continuous-time dynamics on progression trajectories. To address these limitations, we propose Temporally Detailed Hypergraph Neural Ordinary Differential Equation (TD-HNODE), which represents disease progression on clinically recognized trajectories as a temporally detailed hypergraph and learns the continuous-time progression dynamics via a neural ODE framework. TD-HNODE contains a learnable TD-Hypergraph Laplacian that captures the interdependency of disease complication markers within both intra- and inter-progression trajectories. Experiments on two real-world clinical datasets demonstrate that TD-HNODE outperforms multiple baselines in modeling the progression of type 2 diabetes and related cardiovascular diseases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4201. Interaction-aware Representation Modeling With Co-Occurrence Consistency for Egocentric Hand-Object Parsing
๐ฌ 16
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Egocentric vision human-environment interaction hand-object parsing consistency
Fine-grained understanding of egocentric human-environment interactions is critical for developing next-generation embodied agents.
A fundamental challenge in this area involves accurately parsing hands and active objects.
While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms lack adaptability to diverse categories of contacting objects, impairing the localization and recognition of interactive entities; 2) over-reliance on pixel-level semantic features incorporates interaction-irrelevant noise, degrading segmentation accuracy; and 3) prevailing models are susceptible to "interaction illusion", producing physically inconsistent predictions. To handle these issues, we propose the Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Prototypical Query Generator (PQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The PQG fuses learnable parameters with interaction-relevant context to construct robust and adaptive queries for different active objects. The DFS explicitly combines interactive and semantic cues to filter irrelevant information and generate discriminative interaction embeddings. The CoCo loss incorporates hand-object relationship priors to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4202. PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models
๐ฌ 24
โญ 4.80
๐ optimization
๐ท๏ธ Model Quantization Autoregressive Visual Generation Models
AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at the channel-wise level, (2) highly dynamic activations at the token-wise level, and (3) mismatched distribution information at the sample-wise level. To address these challenges, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS), which mitigates channel-wise outliers by expanding the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization and deriving the optimal scaling factor through differentiation; (2) Static Token-Wise Quantization (STWQ), which leverages inherent properties of ARVG models, namely fixed token length and position-invariant distributions across samples, to address token-wise variance without incurring dynamic calibration overhead; and (3) Distribution-Guided Calibration (DGC), which selects the samples that contribute most to distributional entropy, eliminating the sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family of models to 8-bit and 6-bit while maintaining competitive performance.
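The static token-wise idea can be sketched on its own: because ARVG sequences have a fixed length and roughly sample-invariant per-position statistics, one scale per token position can be calibrated once offline and reused at inference. The symmetric uniform quantizer, function names, and shapes below are assumptions for illustration, not the paper's STWQ implementation.

```python
import torch

def calibrate_position_scales(calib_acts: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """calib_acts: [num_samples, seq_len, channels] calibration activations.
    Returns one static scale per token position: [seq_len]."""
    qmax = 2 ** (n_bits - 1) - 1
    per_pos_absmax = calib_acts.abs().amax(dim=(0, 2))   # max over samples and channels
    return per_pos_absmax.clamp(min=1e-8) / qmax

def quantize_static_tokenwise(x: torch.Tensor, scales: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """x: [batch, seq_len, channels]; scales: [seq_len]. Symmetric uniform fake-quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    s = scales.view(1, -1, 1)
    return (x / s).round().clamp(-qmax - 1, qmax) * s

# Toy usage: calibrate once, then reuse the static per-position scales at inference time.
calib = torch.randn(64, 256, 1024)
scales = calibrate_position_scales(calib)
x_q = quantize_static_tokenwise(torch.randn(2, 256, 1024), scales)
```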
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4203. QUEST: A robust attention formulation using query-modulated spherical attention
๐ฌ 22
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ attention transformers model robustness
The Transformer model architecture has become one of the most widely used in deep learning and the attention mechanism is at its core. The standard attention formulation uses a softmax operation applied to a scaled dot product between query and key vectors. We explore the role played by norms of the queries and keys, which can cause training instabilities when they arbitrarily increase. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space, while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be easily used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method's generality. We show that (1) QUEST trains without instabilities, (2) produces models with improved performance, and (3) yields models that are robust to data corruptions and adversarial attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4204. Pretraining with hierarchical memories: separating long-tail and common knowledge
๐ฌ 26
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language models pretraining memory long-tail knowledge reasoning forgetting
The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.
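As a rough, hedged sketch of the general pattern (a small anchor feed-forward block plus a context-dependent block fetched from a much larger parametric memory bank and added to the computation): the retrieval rule, shapes, and class names below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MemoryAugmentedFFN(nn.Module):
    """Small base FFN + one context-selected block from a large feed-forward memory bank."""

    def __init__(self, d_model=256, d_ff=512, n_blocks=1024, block_ff=64):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Memory bank: n_blocks small FFNs stored as stacked weight tensors.
        self.mem_in = nn.Parameter(torch.randn(n_blocks, d_model, block_ff) * 0.02)
        self.mem_out = nn.Parameter(torch.randn(n_blocks, block_ff, d_model) * 0.02)
        self.router = nn.Linear(d_model, n_blocks)  # decides which block to fetch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model]; pick one memory block per sequence from its mean embedding.
        idx = self.router(x.mean(dim=1)).argmax(dim=-1)           # [batch]
        w_in, w_out = self.mem_in[idx], self.mem_out[idx]         # fetched block weights
        mem = torch.einsum("bsd,bdf->bsf", x, w_in).relu()
        mem = torch.einsum("bsf,bfd->bsd", mem, w_out)
        return self.base(x) + mem                                 # anchor model + fetched memory

out = MemoryAugmentedFFN()(torch.randn(2, 16, 256))
```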
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4205. Robust Equation Structure learning with Adaptive Refinement
๐ฌ 24
โญ 4.80
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Symbolic Regression Genetic Programming Equation Discovery Large Language Model AI for Science
Symbolic regression (SR) aims to automate scientific discovery, but often truncates the hypothetico-deductive cycle, focusing on hypothesis and experiment while lacking systematic analysis. We introduce RESTART, a framework that closes this loop by adding a principled analysis stage to diagnose and correct structural errors. RESTART features two core mechanisms: a short-term refinement process that uses boosting to identify unexplained signals and guide an LLM toward targeted corrections, and a long-term structure library that distills successful refinements into reusable code snippets for cumulative knowledge. On LLM-SRBench across Physics, Biology, and Materials Science, RESTART achieves lower error and higher accuracy than state-of-the-art baselines. It also generalizes robustly, recovering near-exact functional forms on out-of-distribution data, representing a significant advance toward fully automated scientific discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4206. Best-of-Majority: Minimax-Optimal Strategy for Pass@k Inference Scaling
๐ฌ 20
โญ 4.80
๐ learning theory
๐ท๏ธ Inference-time scaling Pass@k
LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-N (BoN). For difficult tasks, this single-shot selection often underperforms. Consequently, evaluations commonly report Pass@$k$: the agent may submit up to $k$ responses, and only the best of them is used when computing regret. Motivated by this, we study inference scaling in the more general Pass@$k$ inference setting, and prove that neither majority voting nor BoN exhibits the desirable scaling with $k$ and the sampling budget $N$. Combining the advantages of majority voting and BoN, we propose a new inference strategy called Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. We prove that when the sampling budget is $N=\tilde\Omega(C^*)$, the regret of BoM is $O(\epsilon_{\mathrm{opt}}+\sqrt{\epsilon_{\mathrm{RM}}^2C^*/k})$, where $C^*$ is the coverage coefficient, $\epsilon_{\mathrm{RM}}$ is the estimation error of the reward model, and $\epsilon_{\mathrm{opt}}$ is the estimation error of reward at the optimal response. We further establish a matching lower bound, certifying that our algorithm is minimax optimal. Beyond optimality, BoM has a key advantage: unlike majority voting and BoN, its performance does not degrade when increasing $N$. Experimental results of inference on math problems show BoM outperforming both majority voting and BoN.
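A minimal sketch of the BoM selection rule as described above (frequency filter, then top-$k$ by reward); the canonicalisation of responses, the frequency threshold, and the reward function are placeholders rather than the paper's exact procedure.

```python
from collections import Counter

def best_of_majority(samples, reward_fn, k, min_freq=2):
    """samples: list of N model responses (already canonicalised, e.g. extracted final answers).
    reward_fn: maps a response to a scalar reward-model score.
    Returns up to k responses for Pass@k submission."""
    counts = Counter(samples)
    # Step 1 (majority filter): keep only responses that occur frequently among the N samples.
    frequent = [r for r, c in counts.items() if c >= min_freq] or list(counts)
    # Step 2 (best-of): rank the surviving distinct responses by reward and submit the top k.
    return sorted(frequent, key=reward_fn, reverse=True)[:k]

# Toy usage with a hypothetical reward function.
answers = ["42", "42", "41", "42", "17", "17", "13"]
print(best_of_majority(answers, reward_fn=lambda r: float(len(r)), k=2))
```

The frequency filter is what prevents a single high-reward but rarely sampled (and possibly reward-hacked) response from being submitted, which is the stated advantage over plain BoN as $N$ grows.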
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4207. Textual Bayes: Quantifying Uncertainty in LLM-Based Systems
๐ฌ 16
โญ 4.80
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ LLMs large language models uncertainty quantification Bayes Bayesian Bayesian inference Metropolis-Hastings prompt engineering autoprompting TextGrad
Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem, one that limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model's textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference, a difficult problem even for well-studied data modalities, we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4208. Naming to Learn: Class Incremental Learning for Vision-Language Model with Unlabeled Data
๐ฌ 18
โญ 4.80
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ continual learning incremental learning vision-language model
Class Incremental Learning (CIL) enables models to adapt to evolving data distributions by learning new classes over time without revisiting previous data. While recent methods utilizing pre-trained models have shown promising results, they often assume access to fully labeled data for each incremental task, which is often impractical. In this paper, we instead tackle a more realistic scenario in which only unlabeled data and the class-name set are available for each new class. Although one could generate pseudo labels with a vision-language model and apply existing CIL methods, the inevitable noise in these pseudo labels tends to aggravate catastrophic forgetting. To overcome this challenge, we propose a method named N2L employing a regression objective with mean squared error loss, which can be solved in a recursive manner. To refine the pseudo labels, N2L applies feature dimensionality reduction to the extracted image features and iteratively updates the labels using a classifier trained on these reduced features. Furthermore, a bi-level weight adjustment strategy is proposed to downweight low-confidence pseudo labels via intra-class adjustment and compensate for pseudo-label class imbalance through inter-class adjustment. This incremental learning with adjustment can be solved recursively, yielding identical performance to joint training with unlabeled data and thereby mitigating forgetting. Our theoretical analysis supports the effectiveness of the pseudo label refinement process, and experiments on various datasets demonstrate that our proposed method outperforms SOTA methods. Code is provided in the appendix.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4209. Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search
๐ฌ 28
โญ 4.80
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ LLM Large Language Model Uncertainty Quantification Beam Search
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
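A minimal sketch of consistency-based uncertainty with beam-search candidates; here `generate_candidates` is a placeholder for whatever beam-search decoding the serving stack exposes, and exact-match agreement is only one of several possible consistency measures.

```python
from collections import Counter

def consistency_confidence(candidates: list) -> float:
    """Fraction of candidates agreeing with the most common answer (1.0 = full agreement)."""
    counts = Counter(candidates)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(candidates)

def answer_with_uncertainty(question: str, generate_candidates, num_beams: int = 8):
    """generate_candidates(question, n) should return n decoded answers, e.g. the top-n beams
    from beam search rather than n multinomial samples (the substitution studied above)."""
    candidates = generate_candidates(question, num_beams)
    answer = Counter(candidates).most_common(1)[0][0]
    uncertainty = 1.0 - consistency_confidence(candidates)
    return answer, uncertainty

# Toy usage with a stub standing in for beam-search decoding.
stub = lambda q, n: ["Paris"] * (n - 2) + ["Lyon"] * 2
print(answer_with_uncertainty("Capital of France?", stub))
```

Beam search deterministically returns distinct high-probability candidates, which is why it avoids both the duplicate answers and the run-to-run variance that multinomial sampling introduces in short-form QA.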
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4210. Scaling Speech Tokenizers with Diffusion Autoencoders
๐ฌ 18
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Speech Tokenizer Diffusion Autoencoder Codec ASR Speech Language Model
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech.
Experiments show that SiTok outperforms strong baselines on both reconstruction and understanding tasks, at an extremely low token rate of 12.5 Hz and a bit-rate of 200 bits-per-second.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4211. Turbo-DDCM: Fast and Flexible Zero-Shot Diffusion-Based Image Compression
๐ฌ 12
โญ 4.80
๐ generative models
๐ท๏ธ image compression diffusion models diffusion-based image compression zero-shot diffusion-based image compression
While zero-shot diffusion-based compression methods have seen significant progress in recent years, they remain notoriously slow and computationally demanding. This paper presents an efficient zero-shot diffusion-based compression method that runs substantially faster than existing methods, while maintaining performance that is on par with the state-of-the-art techniques. Our method builds upon the recently proposed Denoising Diffusion Codebook Models (DDCMs) compression scheme. Specifically, DDCM compresses an image by sequentially choosing the diffusion noise vectors from reproducible random codebooks, guiding the denoiser's output to reconstruct the target image. We modify this framework with *Turbo-DDCM*, which efficiently combines a large number of noise vectors at each denoising step, thereby significantly reducing the number of required denoising operations. This modification is also coupled with an improved encoding protocol. Furthermore, we introduce two flexible variants of Turbo-DDCM, a priority-aware variant that prioritizes user-specified regions and a distortion-controlled variant that compresses an image based on a target PSNR rather than a target BPP. Comprehensive experiments position Turbo-DDCM as a compelling, practical, and flexible image compression scheme.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4212. SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis
๐ฌ 28
โญ 4.80
๐ datasets and benchmarks
๐ท๏ธ Causal Inference Survival Analysis Treatment Effect Datasets and Benchmarks
Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from causal survival forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4213. LDT: Layer-Decomposition Training Makes Networks More Generalizable
๐ฌ 22
โญ 4.80
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Domain generalization
Domain generalization methods can effectively enhance network performance on test samples with unknown distributions by isolating gradients between unstable and stable parameters. However, existing methods employ relatively coarse-grained partitioning of stable versus unstable parameters, leading to misclassified unstable parameters that degrade network feature processing capabilities. We first provide a theoretical analysis of gradient perturbations caused by unstable parameters. Based on this foundation, we propose Layer-Decomposition Training (LDT), which conducts fine-grained layer-wise partitioning guided by parameter instability levels, substantially improving parameter update stability. Furthermore, to address gradient amplitude disparities within stable layers and unstable layers respectively, we introduce a Dynamic Parameter Update (DPU) strategy that adaptively determines layer-specific update coefficients according to gradient variations, optimizing feature learning efficiency. Extensive experiments across diverse tasks (super-resolution, classification) and architectures (Transformer, Mamba, CNN) demonstrate LDT's superior generalization capability. Our code is available at ***.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4214. A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization
๐ฌ 5
โญ 4.80
๐ optimization
๐ท๏ธ Convergence Theory Non-convex Optimization Adaptive Optimization Quantization Low-bit Training
The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their full-precision counterparts provided mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weights and second-moment quantization due to its reliance on $\beta_2 \to 1$, while Muon requires weaker error control and is thus potentially more robust. These results narrow the gap between empirical success and theoretical understanding of low-precision training methods. Numerical experiments on synthetic and real-world data corroborate our theory.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4215. Text2Grad: Reinforcement Learning from Natural Language Feedback
๐ฌ 28
โญ 4.80
๐ reinforcement learning
๐ท๏ธ Natural Language Feedback Fine-Grained Policy Optimization Reinforcement Learning for Language Models
Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards on answers while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available in the supplementary ZIP file.
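A minimal sketch of the span-level idea: turn phrase-aligned critiques into a per-token reward vector and weight the policy's token log-probabilities by it. The alignment format, reward values, and the simple REINFORCE-style loss are assumptions for illustration, not the paper's exact objective.

```python
import torch

def span_rewards_to_token_rewards(seq_len: int, spans) -> torch.Tensor:
    """spans: list of (start, end, reward) token spans derived from aligned textual critiques.
    Returns a [seq_len] reward vector (0 for tokens no critique mentions)."""
    r = torch.zeros(seq_len)
    for start, end, reward in spans:
        r[start:end] = reward
    return r

def span_policy_loss(token_logprobs: torch.Tensor, token_rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss: raise log-probs of praised spans, lower those of criticised ones.
    token_logprobs: [seq_len] log pi(y_t | y_<t, x) for the generated response."""
    return -(token_rewards * token_logprobs).sum()

# Toy usage: a critique praised tokens 3..7 and criticised tokens 10..14.
logp = torch.randn(16, requires_grad=True)
rewards = span_rewards_to_token_rewards(16, [(3, 7, +1.0), (10, 14, -1.0)])
loss = span_policy_loss(logp, rewards)
loss.backward()
```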
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4216. A Scalable Constant-Factor Approximation Algorithm for $W_p$ Optimal Transport
๐ฌ 12
โญ 4.80
๐ optimization
๐ท๏ธ optimal transport p-Wasserstein min-cost matching
Let $(X,d)$ be a metric space and let $\mu,\nu$ be discrete distributions supported on finite point sets $A,B \subseteq X$. For any $p \in [1,\infty]$, the $W_p$-distance between $\mu$ and $\nu$, $W_p(\mu, \nu)$, is defined as the $p$-th root of the minimum cost of transporting the mass from $\mu$ to $\nu$, where moving a unit of mass from $a \in A$ to $b \in B$ incurs a cost of $d^p(a,b)$. We give a (Las Vegas) randomized algorithm that always computes a $(4+\epsilon)$-approximate $W_p$ optimal-transport (OT) plan in $O(n^2 + (n^{3/2}\epsilon^{-1}\log^2\Delta)^{1+o(1)})$ expected time, for all $p \in [1,\infty]$, where $\epsilon > 0$ is an arbitrarily small constant and $\Delta$ is the spread of $A \cup B$. The best previous result achieved an $O(\log n)$-approximation in $O(pn^2)$ time, but only for constant $p$. Our algorithm significantly improves the approximation factor and, importantly, is the first quadratic-time method that extends to the $W_\infty$-distance. In contrast, additive approximation methods such as Sinkhorn are efficient only for constant $p$ and fail to handle $p=\infty$. Finally, we show that obtaining a relative approximation factor better than $2$ in $O(n^2)$ time would resolve the long-standing open problem of computing a perfect matching in an arbitrary bipartite graph in quadratic time.
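To ground the definition, here is a small brute-force reference for the special case of equal-size supports with uniform masses, where the optimal plan is a perfect matching and can be computed exactly with the Hungarian algorithm. This only illustrates $W_p$ for finite $p$ on tiny inputs; it is not the paper's near-quadratic approximation algorithm and does not cover $p=\infty$.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein_p(A: np.ndarray, B: np.ndarray, p: float) -> float:
    """Exact W_p between uniform distributions on two equal-size point sets A, B of shape [n, dim].
    Moving a unit of mass from a to b costs d(a, b)^p; every point carries mass 1/n."""
    cost = cdist(A, B) ** p                      # pairwise d(a, b)^p
    rows, cols = linear_sum_assignment(cost)     # min-cost perfect matching = optimal plan here
    total = cost[rows, cols].mean()              # average, since each point has mass 1/n
    return total ** (1.0 / p)

# Toy usage: W_1 and W_2 between two small planar point clouds.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
print(wasserstein_p(A, B, 1), wasserstein_p(A, B, 2))
```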
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4217. Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
๐ฌ 15
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Reinforcement learning Reasoning Large Language Model Agent
Large Language Models (LLMs) often struggle with challenging, multi-step reasoning problems due to a fundamental learning gap: Reinforcement Learning with Verifiable Rewards (RLVR) suffers from sparse rewards when correct solutions are rarely sampled, while Supervised Fine-Tuning (SFT) tends to overfit to long demonstrations through rigid token mimicry. To bridge this gap, we introduce Supervised Reinforcement Learning (SRL), a framework that reformulates problem-solving as a sequence of logical actions. SRL trains the model to learn from each action, where the model first generates an internal reasoning monologue and then commits to an action. The model receives dense rewards based on the similarity between its actions and the expert's at each step, providing a richer signal than RLVR. More importantly, by only rewarding the action, SRL allows the model flexibility in its self-generated thought process, promoting stronger reasoning abilities than SFT. On challenging mathematical reasoning benchmarks, SRL significantly outperforms both methods. Furthermore, a curriculum that cold-starts with SRL before refining with RLVR achieves the strongest results. SRL also generalizes effectively to agentic software engineering tasks, establishing it as a robust framework for various reasoning tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4218. Towards Persistent Noise-Tolerant Active Learning of Regular Languages with Class Query
๐ฌ 11
โญ 4.80
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Active Learning Automata Theory Large Language Models Regular languages Value Alignment Preference Modeling
Large Language Models (LLMs) are increasingly deployed in human-AI collaborative decision-making systems, where they are expected to align precise formal representations with ambiguous natural language. However, their ad hoc strategies for resolving ambiguity often lead to hallucinations and inconsistencies. We formalize this setting via probabilistic Minimally Adequate Teachers (pMATs) that (i) answer membership queries with fixed but possibly flipped labels, and (ii) return valid counterexamples to hypothesis equivalence. We present **CAPAL** (**C**lass-query **A**ctive, **P**ersistent-noise-**A**ware **L**earning), an active algorithm for learning deterministic finite automata (DFAs) that remains correct under persistent membership noise without demonstrations. CAPAL augments the classic $L^\star$ loop with two components grounded in our implementation: (1) a *class query* realized as a statistical same-state test that compares disagreements between two prefixes against a noise-floor estimate $\hat{\eta}$ with Hoeffding tolerances; (2) a *discrimination tree* that selects a near-minimal discriminator, keeping the core suffix set small. An efficient micro-bootstrap and cache-reuse scheme estimates $\hat{\eta}$ with few new queries. We prove convergence given a perfect language-equivalence oracle and show substantial membership-query savings in practice. Our evaluation across multiple benchmarks, including RegexLib and KB13, demonstrates that this approach enhances both the efficiency and robustness of DFA learning under noisy oracles, supporting the view of LLMs as fallible yet useful collaborators for synthesizing verifiable formal artifacts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4219. Is In-Context Learning Learning?
๐ฌ 14
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs in-context learning
In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training.
This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL fits the definition of learning; however, its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that, empirically, ICL is limited in its ability to learn and generalise to unseen tasks. Namely, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's _ad-hoc_ encoding is not a robust mechanism for learning, suggesting limited all-purpose generalisability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4220. XIL: Cross-Expanding Incremental Learning
๐ฌ 28
โญ 4.80
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Class-incremental learning Continual learning Image Classification
Class-Incremental Learning (CIL) traditionally assumes that all tasks share a similar domain distribution, limiting its applicability in real-world scenarios where data arrive from evolving environments.
We introduce a new problem setting, Cross-Expanding Incremental Learning (XIL), which extends CIL by requiring models to handle class-incremental data across distinct domains and to expand class-domain associations bidirectionally.
In this setting, new classes should be integrated into previously seen domains, while earlier classes are extended to newly encountered ones, a capability we refer to as bidirectional domain transferability (BiDoT).
To address XIL, we present a new framework, Semantic Expansion through Evolving Domains (XEED), which leverages domain-specialized prompts, residual-guided representation modulation, and evolving prototype embeddings to expand class semantics across previously encountered domains.
We further introduce the BiDoT Score, a novel metric for quantifying the degree of BiDoT.
Extensive experiments on benchmark datasets with significant domain shifts demonstrate that XEED outperforms existing CIL baselines by a large margin in both standard accuracy and BiDoT scores, establishing a strong foundation for realistic continual learning under domain-evolving conditions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4221. IGC-Net for conditional average potential outcome estimation over time
๐ฌ 26
โญ 4.80
๐ causal reasoning
๐ท๏ธ causal inference potential outcomes treatment effects healthcare
Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. However, many existing methods for this task fail to properly adjust for time-varying confounding and thus yield biased estimates. There are only a few neural methods with proper adjustments, but these have inherent limitations (e.g., division by propensity scores that are often close to zero), which result in poor performance. As a remedy, we introduce the iterative G-computation network (IGC-Net). Our IGC-Net is a novel, neural end-to-end model which adjusts for time-varying confounding in order to estimate conditional average potential outcomes (CAPOs) over time. Specifically, our IGC-Net is the first neural model to perform fully regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our IGC-Net across various experiments. In sum, this work represents a significant step towards personalized decision-making from electronic health records.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4222. Efficient Morphology-Control Co-Design via Stackelberg PPO under Non-Differentiable Leader-Follower Interfaces
๐ฌ 28
โญ 4.80
๐ reinforcement learning
๐ท๏ธ Morphology-Control Co-Design Stackelberg Game Policy Gradient Proximal Policy Optimization Non-Differentiable Leader-Follower Interactions Reinforcement Learning
Morphology-control co-design concerns the coupled optimization of an agent's body structure and control policy. A key challenge is that evaluating each candidate morphology requires extensive rollouts to re-optimize control and assess quality, leading to high computational costs and slow convergence. This challenge is compounded by the non-differentiable interaction between morphology and control, stemming from discrete design choices and rollout-based evaluation, which blocks gradient flow across the morphology-control interface and forces reliance on costly rollout-driven optimization. To address these challenges, we highlight that the co-design problem can be formulated as a novel variant of a Stackelberg Markov game, a hierarchical framework where the leader specifies the morphology and the follower adapts the control. Building on this formulation, we propose \emph{Stackelberg Proximal Policy Optimization (Stackelberg PPO)}, a policy gradient method that leverages the intrinsic coupling between leader and follower to reduce repeated control re-optimization and enable more efficient optimization under non-differentiable interfaces. Experiments across diverse co-design tasks demonstrate that Stackelberg PPO outperforms standard PPO in both stability and final performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4223. MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
๐ฌ 19
โญ 4.80
๐ foundation or frontier models, including LLMs
๐ท๏ธ Medical visual reasoning Medical VQA
Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4224. AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching
๐ฌ 5
โญ 4.80
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ sound separation audio-visual alignment
Video Query Sound Separation (VQSS) aims to isolate target sounds conditioned on visual queries while suppressing off-screen interference, a task central to audiovisual understanding. However, existing methods often fail under conditions of homogeneous interference and overlapping soundtracks, due to limited temporal modeling and weak audiovisual alignment.
We propose \textbf{AlignSep}, the first generative VQSS model based on flow matching, designed to address common issues such as spectral holes and incomplete separation. To better capture cross-modal correspondence, we introduce a series of temporal consistency mechanisms that guide the vector field estimator toward learning robust audiovisual alignment, enabling accurate and resilient separation in complex scenes.
As a \textit{multi-conditioned generation} task, VQSS presents unique challenges that differ fundamentally from traditional flow matching setups. We provide an in-depth analysis of these differences and their implications for generative modeling. To systematically evaluate performance under realistic and difficult conditions, we further construct \textbf{VGGSound-Hard}, a challenging benchmark composed entirely of separation cases with homogeneous interference and strong reliance on temporal visual cues.
Extensive experiments across multiple benchmarks demonstrate that AlignSep achieves state-of-the-art performance both quantitatively and perceptually, validating its practical value for real-world applications. More results and audio examples are available at: \url{https://AlignSep.github.io}.
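For readers unfamiliar with flow matching, the sketch below shows the generic conditional flow-matching objective (linear interpolation path, constant target velocity) that generative separators of this kind build on. It is not AlignSep's architecture; the network, dimensions, and conditioning features are placeholders.

```python
# Minimal sketch of a conditional flow-matching training objective
# (generic linear-interpolation form).
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Tiny MLP v_theta(x_t, t, cond) predicting the flow velocity."""
    def __init__(self, dim, cond_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: clean target (e.g. separated-source features), cond: query features."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.shape[0], 1)            # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # point on the linear interpolation path
    target_v = x1 - x0                        # its constant velocity
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

model = VectorField(dim=64, cond_dim=32)
x1, cond = torch.randn(8, 64), torch.randn(8, 32)
loss = flow_matching_loss(model, x1, cond)
loss.backward()
```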
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4225. Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields
๐ฌ 26
โญ 4.80
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ implicit neural representation scene representation network ensemble simulation scientific simulation
Implicit Neural Representations (INRs) have emerged as powerful surrogates for large-scale scientific simulations, but their practical application is often hindered by a fundamental trade-off: high-fidelity MLP-based models are computationally expensive and slow to query, while fast embedding-based models lack expressive power. To resolve this, we propose the Decoupled Representation Refinement (DRR) paradigm. DRR leverages a deep refiner network in a one-time, offline process to encode rich representations into a compact and efficient embedding structure. This approach decouples slow neural networks with high representational capacity from the fast inference path. We introduce DRR-Net, a simple network that validates this paradigm, and a novel data augmentation strategy, Variational Pairs (VP) for improving INRs under complex tasks like high-dimensional surrogate modeling. Experiments on several ensemble simulation datasets demonstrate that our approach achieves state-of-the-art fidelity, while being up to 27$\times$ faster at inference than high-fidelity baselines and remaining competitive with the fastest models. The DRR paradigm offers an effective strategy for building powerful and practical neural field surrogates and general-purpose INRs, with a minimal compromise between speed and quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4226. Nef-Net+: Adapting Electrocardio Panorama in the wild
๐ฌ 17
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ ECG representation Cardiac Diagnosis
Conventional multi-lead electrocardiogram (ECG) systems capture cardiac signals from a fixed set of anatomical viewpoints defined by lead placement. However, certain cardiac conditions (e.g. Brugada syndrome) require additional, non-standard viewpoints to reveal diagnostically critical patterns that may be absent in standard leads. To systematically overcome this limitation, Nef-Net was recently introduced to reconstruct a continuous electrocardiac field, enabling virtual observation of ECG signals from arbitrary views (termed Electrocardio Panorama). Despite its promise, Nef-Net operates under idealized assumptions and faces in-the-wild challenges, such as long-duration ECG modeling, robustness to device-specific signal artifacts, and suboptimal lead placement calibration.
This paper presents Nef-Net+, an enhanced framework for realistic panoramic ECG synthesis that supports arbitrary-length signal synthesis from any desired view, generalizes across ECG devices, and compensates for operator-induced deviations in electrode placement. These capabilities are enabled by a newly designed model architecture that performs direct view transformation, incorporating a workflow comprising offline pretraining, device calibration tuning steps as well as an on-the-fly calibration step for patient-specific adaptation. To rigorously evaluate panoramic ECG synthesis, we construct a new Electrocardio Panorama benchmark, called Panobench, comprising 9,369 recordings with 48 views per subject, capturing the full spatial variability of cardiac electrical activity. Experimental results show that Nef-Net+ delivers substantial improvements over Nef-Net, yielding an increase of around 6 dB in PSNR in real-world settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4227. Accelerated co-design of robots through morphological pretraining
๐ฌ 12
โญ 4.67
๐ applications to robotics, autonomy, planning
๐ท๏ธ robot co-design universal control differentiable simulation embodied intelligence
The co-design of robot morphology and neural control typically requires using reinforcement learning to approximate a unique control policy gradient for each body plan, demanding massive amounts of training data to measure the performance of each design. Here we show that a universal, morphology-agnostic controller can be rapidly and directly obtained by gradient-based optimization through differentiable simulation. This process of morphological pretraining allows the designer to explore non-differentiable changes to a robot's physical layout (e.g. adding, removing and recombining discrete body parts) and immediately determine which revisions are beneficial and which are deleterious using the pretrained model. We term this process "zero-shot evolution" and compare it with the simultaneous co-optimization of a universal controller alongside an evolving design population. We find the latter results in _diversity collapse_, a previously unknown pathology whereby the population (and thus the controller's training data) converges to similar designs that are easier to steer with a shared universal controller. We show that zero-shot evolution with a pretrained controller quickly yields a diversity of highly performant designs, and by fine-tuning the pretrained controller on the current population throughout evolution, diversity is not only preserved but significantly increased as superior performance is achieved. Videos viewable at this website: https://gilded-macaron-5a75e3.netlify.app
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4228. FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
๐ฌ 11
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language models group relative policy optimization speculative decoding acceleration
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://anonymous.4open.science/r/GRPO_speculative-2CD3.
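As background, the sketch below shows simplified greedy speculative decoding: a cheap draft model proposes a block of tokens, the target model's greedy predictions are computed for every position of the block, and the longest agreeing prefix plus one corrected token is accepted. It is a generic illustration of the mechanism that such rollout acceleration builds on, not the paper's concurrency-aware scheduler; `draft_next` and `target_next` are hypothetical stand-ins for real model calls, and a real system would verify the whole block in one batched forward pass.

```python
# Simplified greedy speculative decoding loop (generic sketch).
from typing import Callable, List

def speculative_generate(prefix: List[int],
                         draft_next: Callable[[List[int]], int],
                         target_next: Callable[[List[int]], int],
                         block_size: int = 4,
                         max_new: int = 16) -> List[int]:
    tokens = list(prefix)
    while len(tokens) < len(prefix) + max_new:
        # 1) Draft proposes `block_size` tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(block_size):
            nxt = draft_next(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2) Target's greedy token at each position of the proposed block
        #    (index block_size is the "bonus" correction token).
        verified = [target_next(tokens + proposal[:j]) for j in range(block_size + 1)]
        # 3) Accept the longest prefix where draft and target agree, then
        #    append the target's token at the first disagreement (or the bonus).
        n_accept = 0
        while n_accept < block_size and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens += proposal[:n_accept]
        tokens.append(verified[n_accept])
    return tokens[: len(prefix) + max_new]

# Toy usage: both "models" continue the sequence t -> t + 1, so every block is accepted.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1
print(speculative_generate([0], draft, target))   # [0, 1, 2, ..., 16]
```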
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4229. CellDuality: Unlocking Biological Reasoning in LLMs with Self-Supervised RLVR
๐ฌ 7
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Reinforcement Learning Biological Reasoning Foundation Models Single-Cell Biology
Developing generalist large language models (LLMs) capable of complex biological reasoning is a central challenge in computational biology. While existing LLMs excel at predictive tasks like cell type annotation and logically-constrained problems, enabling open-ended and mechanistic reasoning remains a challenge. A promising direction is Reinforcement Learning from Verifiable Rewards (RLVR), which has been shown to significantly enhance complex reasoning in general domains like mathematics and code synthesis. However, its application in biology is hindered, as most biological outcomes are non-verifiable. For example, verifying a generated gene sequence is usually infeasible. In this paper, we introduce CellDuality, a self-supervised framework that equips LLM agents with robust reasoning in single-cell biology. Our framework is built on the principle of complementary task duality, a self-verification process that leverages a bidirectional reasoning loop. First, the model performs a forward reasoning task by predicting a biological outcome (e.g., a cell's response to a drug). Then, in a complementary inverse task, it must reason backward from its own prediction to reconstruct the initial conditions (e.g., the original drug perturbation). The fidelity of this reconstruction serves as an intrinsic reward signal, creating a feedback loop that enforces logical and biological consistency. We use these intrinsic rewards to align the base LLM via reinforcement learning, without requiring ground-truth verification labels. We demonstrate that CellDuality achieves state-of-the-art performance and provides coherent biological explanations across a diverse suite of single-cell reasoning tasks. Critically, on the challenging out-of-distribution perturbation prediction benchmark, our self-supervised approach significantly outperforms the standard fine-tuning baseline and narrows the performance gap to a supervised RLVR baseline. Our work showcases a new path toward scalable training of biological foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4230. Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts
๐ฌ 17
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ mixture of experts visual prompt tuning theory parameter-efficient fine-tuning pre-trained model
Visual Prompt Tuning (VPT) has proven effective for parameter-efficient adaptation of pre-trained vision models to downstream tasks by inserting task-specific learnable prompt tokens. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on the recently established connection between Mixture of Experts (MoE) and prompt-based methods, wherein each attention head can be conceptualized as a composition of multiple MoE models, we reinterpret VPT as the introduction of new *prompt experts* into these MoE structures. We identify a key limitation in existing VPT frameworks: the *restricted functional expressiveness* of prompt experts, which remain static and thus limited in their adaptability. To address this, we propose **Visual Adaptive Prompt Tuning (VAPT)**, a novel method that endows prompt experts with enhanced expressiveness while preserving parameter efficiency. Empirical evaluations on VTAB-1K and FGVC demonstrate that VAPT achieves *substantial performance improvements*, surpassing fully fine-tuned baselines by **7.34%** and **1.04%**, respectively. Moreover, VAPT consistently outperforms VPT while *requiring fewer additional parameters*. Furthermore, our theoretical analysis indicates that VAPT achieves optimal sample efficiency. Collectively, these results underscore the theoretical grounding and empirical advantages of our approach.
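As background, plain visual prompt tuning amounts to prepending a few learnable tokens to the patch embeddings of a frozen encoder; the adaptive prompt experts proposed here build on that mechanism. Below is a minimal PyTorch sketch of the baseline idea only (dimensions and layer counts are arbitrary, and this is not the VAPT method itself).

```python
# Minimal sketch of visual prompt tuning: learnable prompt tokens are
# prepended to patch embeddings of a frozen transformer encoder.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, embed_dim=192, n_prompts=10, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        for p in self.encoder.parameters():      # backbone stays frozen
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens):             # (B, N, D) patch embeddings
        B = patch_tokens.shape[0]
        prompts = self.prompts.expand(B, -1, -1)
        x = torch.cat([prompts, patch_tokens], dim=1)       # prepend prompt tokens
        return self.encoder(x)[:, self.prompts.shape[1]:]   # drop prompt outputs

tokens = torch.randn(2, 196, 192)
out = PromptedEncoder()(tokens)
print(out.shape)   # torch.Size([2, 196, 192])
```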
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4231. UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
๐ฌ 8
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Speech-to-Speech Translation
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://uniss-s2st.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4232. Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning
๐ฌ 22
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ Test Time Scaling Reasoning Interpretability Representational Analysis
Reasoning models improve their problem-solving ability through inference-time scaling, allocating more compute via longer token budgets. Identifying which reasoning traces are likely to succeed remains a key opportunity: reliably predicting productive paths can substantially reduce wasted computation and improve overall efficiency. We introduce Latent-Trajectory signals that characterize the temporal evolution of a model's internal representations during the generation of intermediate reasoning tokens. By analyzing both the extent and temporal course of latent representational change, as well as its alignment with the final state, we show that these signals are strong predictors of solution accuracy, outperforming conventional output-based confidence measures. We use latent-trajectory signals to guide answer selection across multiple sampled generations, demonstrating that they make test-time scaling more effective and efficient, reducing token usage by up to 70% while preserving and even improving accuracy by 2.6% on average in comparison with majority voting. Finally, we show that these signals often emerge early in the reasoning trace, which enables early selection and allocation of compute to the most promising candidates during generation. Our findings contribute not only practical strategies for inference-time efficiency, but also a deeper interpretability perspective on how reasoning processes are represented and differentiated in latent space.
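A rough way to operationalize such latent-trajectory signals: stack the hidden states of each sampled trace over generation steps, summarize how far and how coherently the representation moves toward its final state, and select the trace with the best score. The sketch below is one plausible instantiation on dummy arrays; the exact signals and weighting used in the paper may differ.

```python
# Generic sketch: score reasoning traces by the temporal evolution of hidden states.
import numpy as np

def trajectory_score(hiddens: np.ndarray) -> float:
    """hiddens: (T, D) hidden states of one trace across generation steps."""
    deltas = np.diff(hiddens, axis=0)                  # step-wise movement
    drift = np.linalg.norm(deltas, axis=1).sum()       # how far the state travels overall
    final_dir = hiddens[-1] - hiddens[0]
    final_dir /= (np.linalg.norm(final_dir) + 1e-8)
    # Mean cosine between each step and the overall start-to-end direction.
    cos = (deltas @ final_dir) / (np.linalg.norm(deltas, axis=1) + 1e-8)
    return cos.mean() - 0.01 * drift                   # favor aligned, economical traces

rng = np.random.default_rng(0)
traces = [rng.normal(size=(T, 64)) for T in (20, 35, 50)]   # 3 sampled generations
best = max(range(len(traces)), key=lambda i: trajectory_score(traces[i]))
print("selected trace:", best)
```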
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4233. Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts
๐ฌ 6
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial attacks for Video MoE; Robustness of Video MoE
Mixture-of-Experts (MoE) has demonstrated strong performance in video understanding tasks, yet its adversarial robustness remains underexplored. Existing attack methods often treat MoE as a unified architecture, overlooking the independent and collaborative weaknesses of key components such as routers and expert modules. To fill this gap, we propose Temporal Lipschitz-Guided Attacks (TLGA) to thoroughly investigate component-level vulnerabilities in video MoE models. We first design attacks on the router, revealing its independent weaknesses. Building on this, we introduce Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which collaboratively perturb both routers and experts. This joint attack significantly amplifies adversarial effects and exposes the Achilles' Heel (collaborative weaknesses) of the MoE architecture. Based on these insights, we further propose Joint Temporal Lipschitz Adversarial Training (J-TLAT). J-TLAT performs joint training to further defend against collaborative weaknesses, enhancing component-wise robustness. Our framework is plug-and-play and reduces inference cost by more than 60% compared with dense models. It consistently enhances adversarial robustness across diverse video datasets and model architectures, effectively mitigating both the independent and collaborative weaknesses of MoE. Code will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4234. A Fair Bayesian Inference through Matched Gibbs Posterior
๐ฌ 3
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Algorithmic fairness Bayesian inference Gibbs posterior
With the growing importance of trustworthy AI, algorithmic fairness has emerged as a critical concern.
Among various fairness notions, group fairness, which measures the model bias between sensitive groups, has received significant attention.
While many group-fair models have focused on satisfying group fairness constraints, model uncertainty has received relatively little attention, despite its importance for robust and trustworthy decision-making.
To address this, we adopt a Bayesian framework to capture model uncertainty in fair model training.
We first define group-fair posterior distributions and then introduce a fair variational Bayesian inference.
Then we propose a novel distribution termed matched Gibbs posterior, as a proxy distribution for the fair variational Bayesian inference by employing a new group fairness measure, the matched deviation.
A notable feature of matched Gibbs posterior is that it approximates the posterior distribution well under the fairness constraint without requiring heavy computation.
Theoretically, we show that the matched deviation has a strong relation to existing group fairness measures, highlighting desirable fairness guarantees.
Computationally, by treating the matching function in the matched deviation as a learnable parameter, we develop an efficient MCMC algorithm.
Experiments on real-world datasets demonstrate that matched Gibbs posterior outperforms other methods in balancing uncertainty-fairness and utility-fairness trade-offs, while also offering additional desirable properties.
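A Gibbs posterior generally takes the form $\pi(\theta) \propto \exp(-\lambda\, L(\theta))\, p(\theta)$; the matched Gibbs posterior adds a fairness term based on the matched deviation. The toy below samples such a posterior with random-walk Metropolis, substituting a simple demographic-parity gap for the paper's matched deviation purely for illustration; all constants and the data are arbitrary.

```python
# Toy Gibbs posterior with a fairness penalty, sampled by random-walk Metropolis.
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 3
X = rng.normal(size=(n, d))
s = (rng.random(n) < 0.5).astype(float)              # sensitive attribute
y = (X[:, 0] + 0.5 * s + rng.normal(size=n) > 0).astype(float)

def sigmoid(z): return 1 / (1 + np.exp(-z))

def neg_log_gibbs(theta, lam=20.0, gamma=5.0):
    p = sigmoid(X @ theta)
    risk = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)).mean()
    dp_gap = abs(p[s == 1].mean() - p[s == 0].mean())  # stand-in fairness measure
    prior = 0.5 * (theta @ theta)                      # standard normal prior
    return lam * (risk + gamma * dp_gap) + prior

theta = np.zeros(d)
samples, cur = [], neg_log_gibbs(theta)
for _ in range(5000):
    prop = theta + 0.05 * rng.normal(size=d)           # random-walk proposal
    new = neg_log_gibbs(prop)
    if np.log(rng.random()) < cur - new:                # Metropolis accept/reject
        theta, cur = prop, new
    samples.append(theta.copy())

post = np.array(samples[2000:])
print("posterior mean weights:", post.mean(axis=0))
```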
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4235. Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
๐ฌ 9
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Large Language Model Fine-Grained Visual Recognition Reinforcement Learning
Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise in working in knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4236. On the Convergence of Two-Layer Kolmogorov-Arnold Networks with First-Layer Training
๐ฌ 11
โญ 4.67
๐ learning theory
๐ท๏ธ Kolmogorov-Arnold Networks (KANs) Overparameterization Neural Tangent Kernel (NTK) Gradient Descent
Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to traditional neural networks, offering enhanced interpretability based on the Kolmogorov-Arnold representation theorem. While their empirical success is growing, a theoretical understanding of their training dynamics remains nascent. This paper investigates the optimization of a two-layer KAN in the overparameterized regime, focusing on a simplified yet insightful setting where only the first-layer coefficients are trained via gradient descent.
Our main result establishes that, provided the network is sufficiently wide, this training method is guaranteed to converge to a global minimum and achieve zero training error. Furthermore, we derive a novel, fine-grained convergence rate that explicitly connects the optimization speed to the structure of the data labels through the eigenspectrum of the KAN Tangent Kernel (KAN-TK). Our analysis reveals a key advantage of this architecture: guaranteed convergence is achieved with a hidden layer width of $m=\mathcal{O}(n^2)$, a significant polynomial improvement over the $m=\mathcal{O}(n^6)$ requirement for classic two-layer neural networks using ReLU activation functions and analyzed within the same Tangent Kernel framework. We validate our theoretical findings with numerical experiments that corroborate our predictions on convergence speed and the impact of label structure.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4237. Beyond Aggregation: Guiding Clients in Heterogeneous Federated Learning
๐ฌ 13
โญ 4.67
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Federated learning; Healthcare
Federated learning (FL) is increasingly adopted in domains like healthcare, where data privacy is paramount.
A fundamental challenge in these systems is statistical heterogeneity: the fact that data distributions vary significantly across clients (e.g., different hospitals may treat distinct patient demographics).
While current FL algorithms focus on aggregating model updates from these heterogeneous clients, the potential of the central server remains under-explored.
This paper is motivated by a healthcare scenario: could a central server not only coordinate model training but also guide a new patient to the hospital best equipped for their specific condition? We generalize this idea to propose a novel paradigm for FL systems where the server actively guides the allocation of new tasks or queries to the most appropriate client.
To enable this, we introduce a density ratio model and empirical likelihood-based framework that simultaneously addresses two goals: (1) learning effective local models on each client, and (2) finding the best matching client for a new query.
Empirical results demonstrate the framework's effectiveness on benchmark datasets, showing improvements in both model accuracy and the precision of client guidance compared to standard FL approaches. This work opens a new direction for building more intelligent and resource-efficient FL systems that leverage heterogeneity as a feature, not just a bug.
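One way to make the "guide a new query to the best client" idea concrete is density-ratio matching via the probabilistic-classifier trick: for each client, train a classifier that separates that client's data from the pooled data and use p/(1-p) as a ratio estimate. The sketch below is a generic illustration under that assumption and does not implement the paper's density ratio model or empirical-likelihood framework; the data and routing rule are invented for the example.

```python
# Generic sketch: route a new query to the client whose data it most resembles,
# using classifier-based density-ratio estimates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
clients = [rng.normal(loc=mu, size=(300, 2)) for mu in ([0, 0], [3, 0], [0, 3])]
pooled = np.vstack(clients)

def ratio_model(client_data):
    # Classifier distinguishing this client's data (label 1) from the pool (label 0).
    X = np.vstack([client_data, pooled])
    y = np.r_[np.ones(len(client_data)), np.zeros(len(pooled))]
    return LogisticRegression(max_iter=1000).fit(X, y)

models = [ratio_model(c) for c in clients]

def route(query):
    scores = []
    for m in models:
        p = m.predict_proba(query[None])[0, 1]
        scores.append(p / (1 - p + 1e-12))      # density-ratio estimate
    return int(np.argmax(scores))

print(route(np.array([2.8, 0.2])))   # expected to route to the client centered at (3, 0)
```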
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4238. Incentive-Aligned LLM Summaries
๐ฌ 22
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM summarization incentive alignment truthfulness retrieval-augmented generation (RAG) peer prediction
Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and are vulnerable to adversarial content. We introduce Truthful Text Summarization (TTS), an incentive-aligned framework that improves factual robustness without ground-truth labels. TTS (i) decomposes a draft synthesis into atomic claims, (ii) elicits each source's stance on every claim, (iii) scores sources with an adapted multi-task peer-prediction mechanism that rewards informative agreement, and (iv) filters unreliable sources before re-summarizing. We establish formal guarantees that align a source's incentives with informative honesty, making truthful reporting the utility-maximizing strategy. Experiments show that TTS improves factual accuracy and robustness while preserving fluency, aligning exposure with informative corroboration and disincentivizing manipulation.
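The peer-prediction step can be pictured on a small stance matrix: score each source by its agreement with peers on the same claim, minus the agreement expected if claims were unrelated, and filter low scorers before re-summarizing. The sketch below is a didactic stand-in for that idea, not the paper's actual multi-task mechanism or guarantees; the stance matrix is invented.

```python
# Simplified peer-agreement scoring over a source-by-claim stance matrix
# (+1 support, -1 contradict, 0 abstain).
import numpy as np

stances = np.array([
    [ 1,  1, -1,  1],    # source 0
    [ 1,  1, -1,  0],    # source 1
    [-1, -1,  1, -1],    # source 2 (contrarian / possibly adversarial)
    [ 1,  0, -1,  1],    # source 3
])
n_src, n_claims = stances.shape

def score(i):
    peers = [j for j in range(n_src) if j != i]
    # Agreement with peers on the same claim.
    same = np.mean([(stances[i] * stances[j] > 0).mean() for j in peers])
    # Expected agreement across unrelated claim pairs (penalizes blind correlation).
    shuffled = np.mean([(np.outer(stances[i], stances[j]) > 0).mean() for j in peers])
    return same - shuffled

scores = np.array([score(i) for i in range(n_src)])
keep = scores > 0
print("scores:", scores.round(2), "kept sources:", np.where(keep)[0])
```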
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4239. AUHead: Realistic Emotional Talking Head Generation via Action Units Control
๐ฌ 35
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Talking Head Generation Diffusion model
Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques.
Our implementation is available at https://anonymous.4open.science/r/AUHead-3761.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4240. Intrinsic Explanation of Random Subspace Method for Enhanced Security Applications
๐ฌ 10
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Feature attribution Certified Robustness Jailbreak Attack
Random subspace method has wide security applications such as providing certified defenses against adversarial and backdoor attacks, and building robustly aligned LLMs against jailbreaking attacks. However, the explanation of random subspace methods lacks sufficient exploration. Existing state-of-the-art feature attribution methods, such as Shapley value and LIME, are computationally impractical and lack robustness guarantees when applied to random subspace methods. In this work, we propose EnsembleSHAP, an intrinsically faithful and robust feature attribution for random subspace methods that reuses its computational byproducts. Specifically, our feature attribution method is 1) computationally efficient, 2) maintains essential properties of effective feature attribution (such as local accuracy), and 3) offers guaranteed robustness against attacks on feature attribution methods. To the best of our knowledge, this is the first work to establish provable robustness against explanation-preserving attacks. We also perform comprehensive evaluations for our explanation's effectiveness when faced with different empirical attacks, including backdoor attacks, adversarial attacks, and jailbreak attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4241. The Geometry and Topology of Circuits: the Manifolds of Modular Addition
๐ฌ 11
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability representation learning geometry topology manifolds universality platonic representation hypothesis
The Clock and Pizza interpretations, associated with architectures using either uniform or learnable attention, were introduced to argue that different architectural designs can yield distinct circuits for modular addition.
In this work, we show that this is not the case, and that both the uniform and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations.
Our methodology goes beyond the interpretation of individual neurons and weights.
Instead, we identify all of the neurons corresponding to each learned representation and then study the collective group of neurons as one entity.
This method reveals that each learned representation is a manifold that we can study utilizing tools from topology.
Based on this insight, we can statistically analyze the learned representations across hundreds of circuits to demonstrate the similarity between learned modular addition circuits that arise naturally from common deep learning paradigms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4242. High-dimensional Mean-Field Games by Particle-based Flow Matching
๐ฌ 8
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ mean-field games particle-based flow matching simulation-free
Mean-field games (MFGs) study the Nash equilibrium of systems with a continuum of interacting agents, which can be formulated as the fixed-point of optimal control problems. They provide a unified framework for a variety of applications, including optimal transport (OT) and generative models. Despite their broad applicability, solving high-dimensional MFGs remains a significant challenge due to fundamental computational and analytical obstacles.
In this work, we propose a particle-based deep Flow-Matching (FM) method to tackle high-dimensional deterministic MFG computation.
In each iteration of our proximal best response scheme, particles are updated using first-order information, and a flow neural network is trained to match the velocity of the sample trajectories in a simulation-free manner.
Theoretically, in the optimal control setting, we prove that our scheme converges to a stationary point sublinearly, and upgrade to linear (exponential) convergence under additional convexity assumptions.
Our proof uses FM to induce an Eulerian coordinate (density-based) from a Lagrangian one (particle-based), and this also leads to certain equivalence results between the two formulations for MFGs when the Eulerian solution is sufficiently regular.
Our method demonstrates promising performance on non-potential MFGs and high-dimensional OT problems cast as MFGs through a relaxed terminal-cost formulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4243. Evaluating Data Influence in Meta Learning
๐ฌ 12
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ Data Attribution Influence Function Meta Learning
As one of the most fundamental models, meta learning aims to effectively address few-shot learning challenges. However, it still faces significant issues related to the training data, such as training inefficiencies due to numerous low-contribution tasks in large datasets and substantial noise from incorrect labels. Thus, training data attribution methods are needed for meta learning. However, the dual-layer structure of meta learning complicates the modeling of training data contributions because of the interdependent influence between meta parameters and task-specific parameters, making existing data influence evaluation tools inapplicable or inaccurate. To address these challenges, based on the influence function, we propose a general data attribution evaluation framework for meta learning within the bilevel optimization framework. Our approach introduces task influence functions (task-IF) and instance influence functions (instance-IF) to accurately assess the impact of specific tasks and individual data points in closed forms. This framework comprehensively models data contributions across both the inner and outer training processes, capturing the direct effects of data points on meta parameters as well as their indirect influence through task-specific parameters. We also provide several strategies to enhance computational efficiency and scalability. Experimental results demonstrate the framework's effectiveness in training data evaluation via several downstream tasks.
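For reference, the single-level influence function that task-IF and instance-IF generalize is the classical one below; the paper's contribution is propagating such quantities through the bilevel (inner/outer) structure of meta learning. The notation here is the standard one and is not taken from the paper.

```latex
% Standard (single-level) influence function; the bilevel extension additionally
% propagates through the inner, task-specific optimization.
\[
\hat\theta_{-z} - \hat\theta \;\approx\; \frac{1}{n}\, H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z,\hat\theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n}\nabla_\theta^2 \ell(z_i,\hat\theta),
\]
\[
\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_\theta \ell(z_{\text{test}},\hat\theta)^{\top} H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z,\hat\theta).
\]
```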
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4244. HATSolver: Learning Gröbner Bases with Hierarchical Attention Transformers
๐ฌ 7
โญ 4.67
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Hierarchical Attention Transformer Groebner Basis Symbolic Computation Multivariate Polynomial Equations
At NeurIPS 2024, Kera (2311.12904) introduced the use of transformers for computing Groebner bases, a central object in computer algebra with numerous practical applications. In this paper, we improve this approach by applying Hierarchical Attention Transformers (HATs) to solve systems of multivariate polynomial equations via Groebner bases computation. The HAT architecture incorporates a tree-structured inductive bias that enables the modeling of hierarchical relationships present in the data and thus achieves significant computational savings compared to conventional flat attention models. We generalize to arbitrary depths and include a detailed computational cost analysis. Combined with curriculum learning, our method solves instances that are much larger than those in Kera (2311.12904).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4245. DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning
๐ฌ 17
โญ 4.67
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Few-Shot Learning Vision-Language Alignment Large Language Models Reinforcement Learning
Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4246. SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback
๐ฌ 17
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Synthetic Data Prompt Optimization
Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4247. VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks
๐ฌ 28
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ video-qa captioning RL
The "Reason-Then-Respond" paradigm, enhanced by Reinforcement Learning, has shown great promise in advancing Multimodal Large Language Models. However, its application to the video domain has led to specialized models that excel at either question answering (QA) or captioning tasks, but struggle to master both. Naively combining reward signals from these tasks results in mutual performance degradation, which we attribute to a conflict between their opposing task natures. To address this challenge, we propose a novel training framework built upon two intermediate proxy tasks: DarkEventInfer, which presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues; and MixVidQA, which presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. These proxy tasks compel the model to simultaneously develop both holistic, divergent understanding and precise, convergent reasoning capabilities. Embodying this framework, we present VidBridge-R1, the first versatile video reasoning model that effectively bridges the paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model, demonstrating the efficacy of our approach in fostering more generalizable and powerful video understanding models. All code, models, and data will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4248. Fractional-Order Spiking Neural Network
๐ฌ 31
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ spiking neural networks fractional order differential equations
Spiking Neural Networks (SNNs) draw inspiration from biological neurons to enable brain-like computation, demonstrating effectiveness in processing temporal information with energy efficiency and biological realism.
Most existing SNNs are based on neural dynamics such as the (leaky) integrate-and-fire (IF/LIF) models, which are described by \emph{first-order} ordinary differential equations (ODEs) with Markovian characteristics.
This means the potential state at any time depends solely on its immediate past value, potentially limiting network expressiveness.
Empirical studies of real neurons, however, reveal long-range correlations and fractal dendritic structures, suggesting non-Markovian behavior better modeled by \emph{fractional-order} ODEs.
Motivated by this, we propose a \emph{fractional-order} spiking neural network (\emph{f}-SNN) framework that strictly generalizes integer-order SNNs and captures long-term dependencies in membrane potential and spike trains via fractional dynamics, enabling richer temporal patterns. We also release an open-source toolbox to support the \emph{f}-SNN framework, applicable to diverse architectures and real-world tasks. Experimentally, fractional adaptations of established SNNs into the \emph{f}-SNN framework achieve superior accuracy, comparable energy efficiency, and improved robustness to noise, underscoring the promise of \emph{f}-SNNs as an effective extension of traditional SNNs.
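To see what a fractional-order neuron adds over a first-order LIF, the toy below simulates a single neuron with a Gruenwald-Letnikov approximation of the fractional derivative, where the membrane update depends on the entire weighted history rather than only the previous step. Parameters and the spike-and-reset rule are arbitrary choices for illustration, not the paper's f-SNN toolbox.

```python
# Toy fractional-order leaky integrate-and-fire neuron:
# D^alpha v = (-(v - v_rest) + R*I) / tau, discretized with Gruenwald-Letnikov weights.
import numpy as np

alpha, dt, tau, R = 0.8, 1e-3, 0.02, 1.0
v_rest, v_thresh, v_reset = 0.0, 1.0, 0.0
T = 300
I = np.full(T, 2.5)                        # constant input current

# Gruenwald-Letnikov binomial weights: w_0 = 1, w_k = w_{k-1} * (1 - (alpha+1)/k).
w = np.empty(T)
w[0] = 1.0
for k in range(1, T):
    w[k] = w[k - 1] * (1 - (alpha + 1) / k)

v = np.zeros(T)
spikes = []
for n in range(1, T):
    drift = (-(v[n - 1] - v_rest) + R * I[n]) / tau
    memory = np.dot(w[1:n + 1], v[n - 1::-1])     # weighted sum over the full history
    v[n] = dt ** alpha * drift - memory
    if v[n] >= v_thresh:                          # spike and reset
        spikes.append(n)
        v[n] = v_reset

print(f"{len(spikes)} spikes; first at step {spikes[0] if spikes else None}")
```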
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4249. AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
๐ฌ 17
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Synthetic data Computer-use agents Scalable
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18\% success at difficulty level 1 to just 4\% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of \$0.60 per trajectory, orders of magnitude cheaper than human annotations. Code is available in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4250. PAS: Estimating the target Accuracy before domain adaptation
๐ฌ 17
โญ 4.67
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Transferability estimation Domain adaptation
The goal of domain adaptation is to make predictions for unlabeled samples from a target domain with the help of labeled samples from a different but related source domain. The performance of domain adaptation methods is highly influenced by the choice of source domain and pre-trained feature extractor. However, the selection of source data and pre-trained model is not trivial due to the absence of a labeled validation set for the target domain and the large number of available pre-trained models. In this work, we propose PAS, a novel score designed to estimate the transferability of a source domain set and a pre-trained feature extractor to a target classification task before actually performing domain adaptation. PAS leverages the generalization power of pre-trained models and assesses source-target compatibility based on the pre-trained feature embeddings. We integrate PAS into a framework that indicates the most relevant pre-trained model and source domain among multiple candidates, thus improving target accuracy while reducing the computational overhead.
Extensive experiments on image classification benchmarks demonstrate that PAS correlates strongly with actual target accuracy and consistently guides the selection of the best-performing pre-trained model and source domain for adaptation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4251. VERINA: Benchmarking Verifiable Code Generation
๐ฌ 17
โญ 4.67
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ code generation formal verification verifiable code generation AI for math theorem proving AI for code
Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation---jointly generating code, specifications, and proofs of code-specification alignment---offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains.
The best model, OpenAI o3, achieves a 72.6% code correctness rate, 52.3% for specification soundness and completeness, and a mere 4.9% proof success rate (based on one trial per task).
We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4252. StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering
๐ฌ 28
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Controllable Stylized and Truthful Generation Representation Editing
Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model's core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose \textbf{StyliTruth}, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model's representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
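The deflation idea can be illustrated with plain linear algebra: estimate a basis for truth-relevant directions, remove that component from the style steering vector, and steer with the residual. The sketch below uses random stand-in directions and a single global vector, whereas the paper works with learned subspaces and adaptive, token-level vectors.

```python
# Minimal sketch: keep a style steering direction out of a "truth" subspace
# by projecting onto its orthogonal complement.
import numpy as np

rng = np.random.default_rng(0)
d = 512
truth_dirs = rng.normal(size=(d, 4))             # assumed truth-relevant directions
style_vec = rng.normal(size=d)                   # raw style steering vector

# Orthonormal basis U of the truth subspace via QR decomposition.
U, _ = np.linalg.qr(truth_dirs)                  # (d, 4), orthonormal columns

# Deflate: remove the component of style_vec lying inside span(U).
style_safe = style_vec - U @ (U.T @ style_vec)

def steer(hidden, alpha=4.0):
    """Add the deflated style direction to a hidden state at inference time."""
    return hidden + alpha * style_safe / np.linalg.norm(style_safe)

print("overlap with truth subspace before/after:",
      np.linalg.norm(U.T @ style_vec).round(3),
      np.linalg.norm(U.T @ style_safe).round(6))
```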
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4253. A Sharp KL-Convergence Analysis for Diffusion Models under Minimal Assumptions
๐ฌ 16
โญ 4.67
๐ learning theory
๐ท๏ธ diffusion models probability flow ODEs score based generative models convergence analysis
Diffusion-based generative models have emerged as highly effective methods for synthesizing high-quality samples. Recent works have focused on analyzing the convergence of their generation process with minimal assumptions, either through reverse SDEs or probability flow ODEs. The best known guarantees, without any smoothness assumptions, for the KL divergence so far achieve a linear dependence on the data dimension $d$ and an inverse quadratic dependence on accuracy level $\varepsilon$. In this work, we present a refined analysis for the standard Exponential Integrator discretization that improves the dependence on $\varepsilon$, at the same time maintaining the linear dependence on $d$. Following recent works on higher order/randomized midpoint discretizations, we model the generation process as a composition of two steps: a reverse ODE step followed by a smaller noising step, which leads to better dependence on step size. We then provide a novel analysis which achieves linear dependence on $d$ for the ODE discretization error without any smoothness assumptions.
Specifically, we introduce a general ODE-based counterpart of the stochastic localization argument from Benton et al. and develop new proof techniques to bound second-order spatial derivatives of the score function -- terms that do not arise in previous diffusion analyses and cannot be handled by existing techniques. Leveraging this framework, we prove that $\tilde{O}\left(\tfrac{d \log^{3/2}(1/\delta)}{\varepsilon}\right)$ steps suffice to approximate the target distribution (corrupted by Gaussian noise of variance $\delta$) to within $O(\varepsilon^2)$ in KL divergence, improving upon the previous best result requiring $\tilde{O}\left(\tfrac{d \log^2(1/\delta)}{\varepsilon^2}\right)$ steps.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4254. Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
๐ฌ 3
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement Learning Large Language Model Math Reasoning
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs) to elicit stronger reasoning. Yet, most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (**AsyPPO**), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. **AsyPPO** employs a set of lightweight *mini-critics*, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, **AsyPPO** leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. Across multiple reasoning benchmarks, **AsyPPO** consistently improves learning stability and performance over strong baselines, e.g., GRPO, achieving performance gains of $> 6$% on *Qwen3-4b-Base* and about $3$% on *Qwen3-8b-Base* and *Qwen3-14b-Base* over classic PPO. Such results highlight the importance of architectural innovations in critics for scalable, efficient algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4255. LLMs as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments
๐ฌ 13
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ game reasoning multimodal qa vision-language grounding benchmark situated reasoning tabletop games board games
We introduce **LudoBench**, a multimodal reasoning benchmark that evaluates whether vision-enabled large language models (LMs) can acquire, integrate, and reason over heterogeneous game knowledge in mainstream analog tabletop games. Unlike prior works that emphasize deep strategic mastery, LudoBench targets an initial reasoning challenge uninitiated gamers face: *correctly comprehending a new tabletop strategy game for the first time*.
We examine whether, given a visual depiction of a tabletop scene and a corresponding ruleset, a model can correctly answer grounded questions about the pictured scenario. Concretely, LudoBench tests three cumulative situated game-comprehension capabilities: (1) *Environment Perception*, (2) *Heterogeneous Rules Integration*, and (3) *Short-horizon Optimization*, to progressively stress-test the foundational reasoning required for real-world game comprehension.
Evaluating frontier LMs on three diverse strategy games, we find that even the strongest models achieve only ~68% accuracy on simple environment perception tasks and fall below 10% on situated multi-step comprehension puzzles that hobbyist gamers can routinely solve.
Our extensive failure analysis and knowledge-ablation experiments reveal that *models largely fail to comprehend rich cross-modal reference knowledge* and are subsequently unable to apply this knowledge to messy and unfamiliar situated environments. Our findings highlight the many steps remaining for current methods to succeed on complex multimodal reasoning in the real world.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4256. Fine-Grained Activation Steering: Steering Less, Achieving More
๐ฌ 15
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Activation Steering Large Language Models Fine-Grained Intervention
Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)-level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analyses show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at a finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.
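A minimal version of dimension-level steering: rank activation dimensions by the mean difference between contrastive positive and negative samples, and shift only the top-k dimensions at inference time. This is an illustrative simplification; AUSteer's actual AU selection and adaptive strength assignment are more involved.

```python
# Sketch of dimension-level ("atomic unit") steering from contrastive samples.
import numpy as np

rng = np.random.default_rng(0)
d, n = 1024, 256
pos_acts = rng.normal(loc=0.3, size=(n, d))      # activations on desirable behavior
neg_acts = rng.normal(loc=0.0, size=(n, d))      # activations on undesirable behavior

momentum = pos_acts.mean(0) - neg_acts.mean(0)   # per-dimension contrast
k = 32                                           # steer only the top-k dimensions
au_idx = np.argsort(-np.abs(momentum))[:k]

def steer(block_activation, strength=1.0):
    out = block_activation.copy()
    out[au_idx] += strength * momentum[au_idx]    # leave all other dimensions intact
    return out

h = rng.normal(size=d)
print("dimensions changed:", int((steer(h) != h).sum()), "of", d)
```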
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4257. RAR: Reversing Visual Attention Re-Sinking for Unlocking Potential in Multimodal Large Language Models
๐ฌ 16
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ MLLMs visual attention sink
Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet they frequently exhibit suboptimal output layers, where intermediate decoder layers outperform the final ones, signaling underutilized model capacity. In this work, we delve into the root causes and attribute this issue to the Visual Attention Re-sinking phenomenon, precipitated by attention gradient sparsity driven by textual supervision dominance. This degradation causes attention heads to evolve into sink heads that prioritize low-semantic backgrounds, thereby disrupting modality fusion, neglecting visual information, and biasing outputs toward textual priors, ultimately impairing model performance. To mitigate this, we introduce a parameter-free Sink Attention Dynamic Sparsification (SADS) framework that dynamically identifies and retains all vision heads (concentrating visual attention on semantically relevant regions) while sparsifying sink heads, preserving essential global context through a shared head. Integrated into diverse MLLMs, our framework yields substantial performance gains across 20 benchmarks spanning five task categories (visual grounding, general VQA, OCR-related VQA, vision-centric tasks, and visual hallucination mitigation), surpassing supervised fine-tuning while boosting inference speed by 10.3\%. This approach offers a novel avenue for maximizing MLLM capabilities.
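A rough illustration of the head-level idea, assuming per-layer attention weights and a known mask of visual-token positions are available; how SADS actually scores and sparsifies heads is not reproduced here, so treat this purely as a sketch.

```python
import numpy as np

def sads_mask(attn, visual_mask, keep_ratio=0.5):
    """attn: (num_heads, q_len, k_len) attention weights for one layer.
    visual_mask: boolean (k_len,) marking visual-token positions.
    Heads with high attention mass on visual tokens are kept as-is;
    the remaining 'sink' heads are sparsified, except one shared head
    retained to preserve global context (an illustrative choice).
    """
    vis_mass = attn[..., visual_mask].sum(axis=-1).mean(axis=-1)  # (num_heads,)
    order = np.argsort(-vis_mass)
    n_keep = max(1, int(keep_ratio * attn.shape[0]))
    keep = np.zeros(attn.shape[0], dtype=bool)
    keep[order[:n_keep]] = True
    shared = order[n_keep] if n_keep < attn.shape[0] else order[-1]
    keep[shared] = True                          # one shared sink head survives
    out = attn.copy()
    out[~keep] = 0.0                             # sparsify remaining sink heads
    return out, keep

# toy usage
rng = np.random.default_rng(1)
attn = rng.random((8, 4, 10)); attn /= attn.sum(-1, keepdims=True)
vis = np.zeros(10, dtype=bool); vis[:6] = True
new_attn, kept = sads_mask(attn, vis, keep_ratio=0.5)
```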
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4258. When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining
๐ฌ 13
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ unlearnable examples data privacy
Unlearnable Examples (UEs) are introduced as a data protection strategy that generates imperceptible perturbations to mislead models into learning spurious correlations rather than real semantics. In this paper, we reveal a fundamental vulnerability of UEs that emerges when learning starts from a pretrained model. Specifically, our empirical analysis shows that even when data are protected by carefully crafted perturbations, pretraining priors still allow the model to bypass the shortcuts introduced by UEs and capture semantic information from the data, thereby nullifying unlearnability. To counter this effect, we propose $\textbf{BAIT}$ ($\textbf{B}$inding $\textbf{A}$rtificial perturbations to $\textbf{I}$ncorrect $\textbf{T}$argets), a novel biโlevel optimization formulation in which the inner level mirrors standard UE objectives, while the outer level enforces a dynamic association of perturbations with incorrect labels, deliberately misleading pretraining priors and preventing them from aligning with true semantics. This mislabel-perturbation binding mechanism blocks the pretrained model from readily establishing the true label-data relationship, so the learning process cannot quickly rely on image semantics and instead remains dependent on the perturbations. Extensive experiments on standard benchmarks and multiple pretrained backbones demonstrate that our approach produces UEs that remain effective in the presence of pretraining priors.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4259. Context and Diversity Matter: The Emergence of In-Context Learning in World Models
๐ฌ 10
โญ 4.67
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ In-Context Learning; World Models
The capability of predicting environmental dynamics underpins both biological neural systems and general embodied AI in adapting to their surroundings. Yet prevailing approaches rest on static world models that falter when confronted with novel or rare configurations. We investigate in-context learning (ICL) of world models, shifting attention from zero-shot performance to the growth and asymptotic limits of the world model. Our contributions are three-fold: (1) we formalize ICL of a world model and identify two core mechanisms: environment recognition (ER) and environment learning (EL); (2) we derive error upper-bounds for both mechanisms that expose how the mechanisms emerge; and (3) we empirically confirm that distinct ICL mechanisms exist in the world model, and we further investigate how data distribution and model architecture affect ICL in a manner consistent with theory. These findings demonstrate the potential of self-adapting world models and highlight the key factors behind the emergence of EL/ER, most notably the necessity of long context and diverse environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4260. Prima.cpp: Fast 30-70B LLM Inference on Heterogeneous and Low-Resource Home Clusters
๐ฌ 14
โญ 4.67
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Distributed LLM system on-device inference low-resource and heterogeneous devices home AI
On-device inference offers privacy, offline use, and instant response, but consumer hardware restricts large language models (LLMs) to low throughput and capability. To overcome this challenge, we present prima.cpp, a distributed on-device inference system that runs 30-70B LLMs on consumer home clusters with mixed CPUs/GPUs, insufficient RAM/VRAM, slow disks, Wi-Fi links, and heterogeneous OSs. We introduce pipelined-ring parallelism (PRP) to overlap disk I/O with compute and communication, and address the prefetch-release conflict in mmap-based offloading. We further propose Halda, a heterogeneity-aware scheduler that co-optimizes per-device CPU/GPU workloads and device selection under RAM/VRAM constraints. On four consumer home devices, a 70B model reaches 674 ms/token TPOT with <6% memory pressure, and a 32B model with speculative decoding achieves 26 tokens/s. Compared with llama.cpp, exo, and dllama, our proposed prima.cpp achieves 5-17ร lower TPOT, supports fine-grained model sizes from 8B to 70B, ensures broader cross-OS and quantization compatibility, and remains OOM-free, while also being Wi-Fi tolerant, privacy-preserving, and hardware-independent. The code is available at https://anonymous.4open.science/r/prima-cpp.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4261. MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
๐ฌ 3
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Unified Multimodal Large Language Models understanding generation hybrid tokenizer
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4262. Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment
๐ฌ 15
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Watermark VLM Dataset Copyright Protection
As Retrieval-Augmented Generation (RAG) evolves into service-oriented platforms (RAG-as-a-Service) with shared knowledge bases, protecting the copyright of contributed data becomes essential. Existing watermarking methods in RAG focus solely on textual knowledge, leaving image knowledge unprotected. In this work, we propose \textit{AQUA}, the first watermark framework for image knowledge protection in Multimodal RAG systems. \textit{AQUA} embeds semantic signals into synthetic images using two complementary methods: acronym-based triggers and spatial relationship cues. These techniques ensure that watermark signals survive indirect propagation from the image retriever to the textual generator while remaining efficient, effective, and imperceptible. Experiments across diverse models and datasets show that \textit{AQUA} enables robust, stealthy, and reliable copyright tracing, filling a key gap in multimodal RAG protection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4263. Learning-Time Encoding Shapes Unlearning in LLMs
๐ฌ 9
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large language model unlearning
As large language models (LLMs) are increasingly deployed in the real world, the ability to ``unlearn'', or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how the learning-time encoding of knowledge impacts the effectiveness of unlearning factual knowledge.
We conduct two studies: (i) examining how paraphrased descriptions influence unlearning performance, and (ii) analyzing unlearning when multiple facts are embedded within the same training text chunk.
Our empirical study reveals two important implications: a new perspective for interpreting unlearning performance and practical strategies for improving LLM unlearning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4264. Prior-free Tabular Test-time Adaptation
๐ฌ 3
โญ 4.67
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Tabular Test-time Adaptation Prior-free
Deep neural networks (DNNs) have been effectively deployed in tabular data modeling for various applications. However, these models suffer severe performance degradation when distribution shifts exist between training and test tabular data. While test-time adaptation (TTA) serves as a promising solution to distribution shifts, existing TTA methods primarily focus on visual modalities and demonstrate poor adaptation when directly applied to the tabular modality. Recent efforts have proposed tabular-specific TTA approaches to mitigate distribution shifts on tabular data. Nevertheless, these methods inherently assume access to the source domain or prior knowledge and fail to fundamentally address feature shift while overlooking unique characteristics of tabular data, leading to suboptimal adaptation. In this paper, we focus on the problem of \textit{prior-free tabular test-time adaptation}, where no access to source data or any prior knowledge is allowed, and we propose a novel method, \underline{P}rior-\underline{F}ree \underline{T}abular \underline{T}est-\underline{T}ime \underline{A}daptation (PFT$_3$A), which has three designs to simultaneously address label shift and feature shift without source-domain or prior access. Specifically, PFT$_3$A contains the \textit{Class Prior Estimating} module for estimating source-target class priors to calibrate predictions, eliminating dependency on the source class prior and mitigating label shift; the \textit{Robust Feature Learning} module for learning robust features by aligning source-like and target-like features to mitigate feature shift; and the \textit{Representative Subspace Exploration} module for eliminating redundant features by projecting features into a subspace to enhance feature alignment. Extensive experiments demonstrate the effectiveness and generalization of PFT$_3$A in tabular TTA tasks. The implementation is at \url{https://anonymous.4open.science/r/PFT3A/README.md}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4265. SAFER: Risk-Constrained Sample-then-Filter in Large Language Models
๐ฌ 15
โญ 4.67
๐ generative models
๐ท๏ธ Question answering Calibration Uncertainty Quantification
As large language models (LLMs) are increasingly deployed in risk-sensitive applications such as real-world open-ended question answering (QA), ensuring the trustworthiness of their outputs has become critical. Existing selective conformal prediction (SCP) methods provide statistical guarantees by constructing prediction sets with a constrained miscoverage rate for correct answers. However, prior works unrealistically assume that admissible answers for all instances can be obtained via finite sampling, even for open-ended QA scenarios that lack a fixed and finite solution space. To address this, we introduce a two-stage risk control framework comprising abstention-aware **SA**mpling and conformalized **F**ilt**ER**ing (SAFER). Firstly, on a held-out calibration set, SAFER calibrates a sampling budget within the maximum sampling cap, using the ClopperโPearson exact method at a user-desired risk level (i.e., the maximum allowable miscoverage rate of the sampling sets). If the risk level cannot be satisfied within the cap, we abstain; otherwise, the calibrated sampling budget becomes the minimum requirements at test time. Then, we employ calibration instances where correct answers are attainable under the calibrated budget and apply the conformal risk control method to determine a statistically valid uncertainty threshold, which filters unreliable distractors from the candidate set for each test data point. In this stage, SAFER introduces an additional risk level to guide the calculation of the threshold, thereby controlling the risk of correct answers being excluded. We evaluate SAFER on three free-form QA datasets utilizing five popular LLMs, and demonstrate that it rigorously constrains two-stage miscoverage risks at test time.
Furthermore, we show that SAFER is compatible with various task-specific admission criteria and calibration-test split ratios, highlighting its robustness and high data efficiency.
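The first stage described above hinges on a Clopper-Pearson bound over a calibration set. A small sketch, assuming that for each calibration question we know the smallest sample count at which an admissible answer appeared; the scipy beta quantile gives the exact one-sided bound. Names and the abstention rule are illustrative.

```python
import numpy as np
from scipy.stats import beta

def cp_upper(k, m, delta=0.05):
    """One-sided Clopper-Pearson upper confidence bound on a failure rate,
    given k failures out of m calibration instances."""
    if k >= m:
        return 1.0
    return beta.ppf(1.0 - delta, k + 1, m - k)

def calibrate_budget(cover_counts, max_cap, alpha=0.1, delta=0.05):
    """cover_counts[i]: smallest number of samples at which calibration
    question i was covered (np.inf if never covered within max_cap).
    Returns the smallest budget whose miscoverage upper bound <= alpha,
    or None (abstain) if no budget within the cap qualifies."""
    m = len(cover_counts)
    for n in range(1, max_cap + 1):
        failures = int(np.sum(cover_counts > n))
        if cp_upper(failures, m, delta) <= alpha:
            return n
    return None  # abstain: risk level unattainable within the sampling cap

# toy usage with simulated coverage counts
rng = np.random.default_rng(2)
counts = rng.geometric(p=0.4, size=500).astype(float)
print(calibrate_budget(counts, max_cap=20, alpha=0.1))
```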
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4266. Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency
๐ฌ 27
โญ 4.67
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Linear Probing Attentive Probing Vision Transformers Vision-Language Models Evaluation Neural Networks
As fine-tuning becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol. Yet, the standard linear probing fails to adequately reflect the potential of models whose pre-training optimizes representations of patch tokens rather than an explicit global representation. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains underexplored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy vs. parameter efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on this, we propose efficient probing (EP), a simple yet effective multi-query cross-attention mechanism that eliminates redundant projections and reduces the number of trainable parameters. Despite its simplicity, EP outperforms linear probing and prior attentive probing approaches across seven benchmarks, generalizes well to diverse pre-training paradigms, delivers strong low-shot and layer-wise gains, and remains effective even when combined with parameter-efficient fine-tuning methods. Beyond evaluation, our analysis uncovers emerging properties of EP, such as complementary attention maps, which open new directions for leveraging probing beyond protocol design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4267. ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving
๐ฌ 10
โญ 4.67
๐ applications to robotics, autonomy, planning
🏷️ Autonomous Driving End-to-End World model
The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories prevent world models from realizing their full potential. In this paper, we propose a Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of BEV features, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World only takes temporal residuals as the input to make more precise predictions of the dynamic objects' future spatial distribution. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose a Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representations) and the future BEV features. This enables effective utilization of future road conditions and also alleviates world-model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art performance on planning accuracy. Code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4268. Emergence of Spatial Representation in an Actor-Critic Agent with Hippocampus-Inspired Sequence Generator
๐ฌ 8
โญ 4.67
๐ applications to neuroscience & cognitive science
๐ท๏ธ neuroscience deep reinforcement learning place cells navigation
Sequential activation of place-tuned neurons in an animal during navigation is typically interpreted as reflecting the sequence of input from adjacent positions along the trajectory. More recent theories about such place cells suggest sequences to arise from abstract cognitive objectives like planning. Here, we propose a mechanistic and parsimonious interpretation to complement these ideas: hippocampal sequences arise from intrinsic recurrent circuitry that propagates activity without sustained input, acting as a temporal memory buffer for extremely sparse inputs.
We implement a minimal sequence generator inspired by neurobiology and pair it with an actorโcritic learner for egocentric visual navigation. Our agent reliably solves a continuous maze without explicit geometric cues, with performance depending on the length of the recurrent sequence. Crucially, the model outperforms LSTM cores under sparse input conditions (16 channels, $\sim2.5$% activity), but not under dense input, revealing a strong interaction between representational sparsity and memory architecture.
Hidden units develop localized place fields, distance-dependent spatial kernels, and task-dependent remapping, while inputs to the generator orthogonalize and spatial information increases across layers. These phenomena align with neurobiological data and are causal to performance. Together, our results show that sparse input synergizes with sequence-generating dynamics, providing both a mechanistic account of place cell sequences in the mammalian hippocampus and a simple inductive bias for reinforcement learning based on sparse egocentric inputs in navigation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4269. CP-Agent: ContextโAware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations
๐ฌ 27
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Cell painting microscopy; Biomedicine; Multimodal reasoning; LLM; Agent
Cell Painting combines multiplexed fluorescent staining, highโcontent imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drugโdisease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CPโAgent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agentโs potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening---streamlining iterative cycles of hypothesis generation in drug discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4270. A One-shot Framework for Directed Evolution of Antibodies
๐ฌ 5
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ One-shot learning matching directed evolution antibodies structure-embeddings
Improving antibody binding to an antigen without antibody-antigen structure information or antigen-specific data remains a critical challenge in therapeutic protein design. In this work, we propose \textbf{\textsc{AffinityEnhancer}}, a framework to improve the affinity of an antibody in a one-shot setting. In the \emph{oneโshot} setting, we start from a single lead sequenceโnever fineโtuning on it or using its structure in complex with the antigen or epitope/paratope informationโand seek variants that reliably boost affinity. During training, \textsc{AffinityEnhancer} utilizes pairs of related sequences with higher versus lower measured binding in a pan-antigen dataset comprising diverse โenvironmentsโ (antigens) and a shared structure-aware module that learns to transform lowโaffinity sequences into highโaffinity ones, effectively distilling consistent, causal features that drive binding. By incorporating pretrained sequence-structure embeddings and a sequence decoder, our method enables robust generalization to entirely new antibody seeds. Across multiple unseen internal and public seeds, \textsc{AffinityEnhancer} identifies key affinity enhancing mutations on the paratope, outperforms existing structureโconditioned and inpainting approaches, achieving substantial (in silico) affinity gains in true, oneโshot experiments without ever seeing antigen data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4271. ManipEvalAgent: Promptable and Efficient Evaluation Framework for Robotic Manipulation Policies
๐ฌ 19
โญ 4.67
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotic manipulation Evaluation Agent
In recent years, robotic manipulation policies have made substantial progress. However, evaluating these policies typically requires large-scale sampling in simulation benchmarks, leading to high time costs. Moreover, existing evaluation pipelines are usually fixed, do not account for user needs, and report only a single scalar score, lacking interpretability. In contrast, human experts can quickly form an intuitive impression of a policy's capabilities from just a handful of executions. We therefore propose ManipEvalAgent, an efficient, promptable, and dynamic multi-round evaluation framework for robotic manipulation policies. The framework conducts small-batch, multi-round evaluations and adaptively plans subsequent evaluation steps based on intermediate observations from each round. Via code generation, it constructs tasks and evaluation functions within the simulator. By generating evaluation functions and leveraging vision-language models (VLMs) for video understanding, ManipEvalAgent provides user-instruction-centric, fine-grained analysis. Our approach offers three key advantages: (1) efficiency, with no need for massive sampling; (2) promptability, planning the evaluation process according to user queries; and (3) interpretability, providing diagnostic text that goes beyond a single score. Across multiple settings, our evaluation method significantly shortens the overall time compared with traditional simulation benchmarks, while reaching conclusions comparable to those from large-scale simulation benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4272. Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval
๐ฌ 14
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Information Retrieval LLM Reasoning Reinforcement Learning
With the growing popularity of LLM agents and RAG, it has become increasingly important to retrieve documents that are essential for solving a task, even when their connection to the task is indirect or implicit. Addressing this problem requires fine-grained reasoning to accurately assess the relevance between the task and each candidate document. This capability, however, poses a significant challenge for existing IR techniques. Despite recent progress in reasoning-enhanced IR, existing approaches still face significant challenges in applicability, scalability, and efficiency. In this work, we propose **Retro\***, a novel approach for reasoning-intensive document retrieval. Our method introduces a rubric-based relevance **scoring mechanism**, enabling the model to reason about the relationship between a task and a document based on explicitly defined criteria, thereby producing a fine-grained, interpretable relevance score. Retro\* also supports **test-time scaling** by combining multiple reasoning trajectories via score integration, which produces more reliable relevance estimates. To optimize Retro\*'s reasoning capabilities, we introduce a novel **reinforcement learning** algorithm tailored for its relevance scoring mechanism, which employs two composite rewards to fully exploit the trajectories of each training sample. Our experiments show that Retro\* outperforms existing document retrieval methods with notable advantages, leading to **state-of-the-art** performance on the BRIGHT benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4273. MAPSS: Manifold-based Assessment of Perceptual Source Separation
๐ฌ 19
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Audio Source Separation Perceptual Quality Assessment Uncertainty Quantification SelfโSupervised Representation Manifold Learning
Objective assessment of audio source-separation systems still fails to match subjective human perception, especially when interference from competing talkers and distortion of the target signal interact. We introduce Perceptual Separation (PS) and Perceptual Match (PM), a complementary pair of measures that, by design, isolate these leakage and distortion factors.
Our intrusive approach generates a set of fundamental distortions, e.g., clipping, notch filter, and pitch shift from each reference waveform signal in the mixture. Distortions, references, and system outputs from all sources are independently encoded by a pre-trained self-supervised model, then aggregated and embedded with a manifold learning technique called diffusion maps, which aligns Euclidean distances on the manifold with dissimilarities of the encoded waveform representations.
On this manifold, PM captures the selfโdistortion of a source by measuring distances from its output to its reference and associated distortions, while PS captures leakage by also accounting for distances from the output to nonโattributed references and distortions.
Both measures are differentiable and operate at a resolution as high as 75 frames per second, allowing granular optimization and analysis.
We further derive, for both measures, frame-level deterministic error radius and non-asymptotic, high-probability confidence intervals.
Experiments on English, Spanish, and music mixtures show that, against 14 widely used measures, PS and PM almost always rank first or second in linear and rank correlations with subjective human mean-opinion scores.
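A compact numpy sketch of the diffusion-maps embedding step, assuming the encoded representations of references, distortions, and outputs are already available as feature vectors; the kernel-bandwidth choice and the exact PS/PM distance computations are simplified away, so this only illustrates the manifold-embedding idea.

```python
import numpy as np

def diffusion_map(features, n_components=3, eps=None):
    """Embed feature vectors with diffusion maps.

    features: (n_items, d) encoded representations. A Gaussian kernel is
    row-normalized into a Markov matrix, and the leading nontrivial
    eigenvectors (scaled by their eigenvalues) give coordinates on which
    Euclidean distance approximates the diffusion distance.
    """
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    eps = eps or np.median(d2[d2 > 0])          # simple bandwidth heuristic
    K = np.exp(-d2 / eps)
    P = K / K.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # skip the trivial constant eigenvector (eigenvalue 1)
    return vecs[:, 1:n_components + 1] * vals[1:n_components + 1]

# toy usage: two well-separated clusters stay separated in the embedding
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (20, 16)), rng.normal(3, 0.3, (20, 16))])
emb = diffusion_map(X, n_components=2)
```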
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4274. VisualPrompter: Semantic-Aware Prompt Optimization with Visual Feedback for Text-to-Image Synthesis
๐ฌ 14
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ prompt engineering image generation diffusion model text-to-image synthesis
The notable gap between user-provided and model-preferred prompts poses a significant challenge for generating high-quality images with text-to-image models, compelling the need for prompt engineering.
Current studies on prompt engineering can effectively enhance the style and aesthetics of generated images.
However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs.
In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences.
VisualPrompter utilizes an automatic self-reflection module that identifies absent concepts in the generated images, followed by a target-specific prompt optimization mechanism which revises the prompts in a fine-grained manner.
By deconstructing prompts, introducing new elements at the atomic semantic level, and then reassembling them, our model is able to maintain semantic consistency and integrity throughout the optimization process.
Extensive experiments demonstrate the effectiveness of VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation.
Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4275. Cutting the Skip: Training Residual-Free Transformers
๐ฌ 15
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ Vision Transformers Skip Connections Network Conditioning
Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without residual (skip) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why residuals improve conditioning and revealing that their stabilization benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong residual baselines on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4276. Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers
๐ฌ 10
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ transformer ai4math ai4science sum-of-squares polynomial
Certifying nonnegativity of polynomials is a well-known NP-hard problem with direct applications spanning non-convex optimization, control, robotics, and beyond. A sufficient condition for nonnegativity is the Sum-of-Squares (SOS) property, i.e., that the polynomial can be written as a sum of squares of other polynomials. In practice, however, certifying the SOS criterion remains computationally expensive and often involves solving a Semidefinite Program (SDP), whose dimensionality grows quadratically in the size of the monomial basis of the SOS expression; hence, various methods to reduce the size of the monomial basis have been proposed. In this work, we introduce the first learning-augmented algorithm to certify the SOS criterion. To this end, we train a Transformer model that predicts an almost-minimal monomial basis for a given polynomial, thereby drastically reducing the size of the corresponding SDP. Our overall methodology comprises three key components: efficient training dataset generation of over 100 million SOS polynomials, design and training of the corresponding Transformer architecture, and a systematic fallback mechanism to ensure correct termination, which we analyze theoretically. We validate our approach on over 200 benchmark datasets, achieving speedups of over $100\times$ compared to state-of-the-art solvers and enabling the solution of instances where competing approaches fail. Our findings provide novel insights towards transforming the practical scalability of SOS programming.
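For a fixed monomial basis, the SOS check this abstract refers to is a small semidefinite feasibility problem: find a PSD Gram matrix Q with m(x)^T Q m(x) matching the polynomial's coefficients. A toy cvxpy sketch for p(x) = x^4 + 2x^2 + 1 with the hand-picked basis m(x) = [1, x, x^2] (in the paper this basis would be predicted by the Transformer); the fallback mechanism is omitted.

```python
import cvxpy as cp

# Certify p(x) = x^4 + 2x^2 + 1 as a sum of squares over m(x) = [1, x, x^2].
# We look for a PSD Gram matrix Q with m(x)^T Q m(x) = p(x).
Q = cp.Variable((3, 3), symmetric=True)
constraints = [
    Q >> 0,                       # Q must be positive semidefinite
    Q[0, 0] == 1,                 # constant term
    2 * Q[0, 1] == 0,             # coefficient of x
    2 * Q[0, 2] + Q[1, 1] == 2,   # coefficient of x^2
    2 * Q[1, 2] == 0,             # coefficient of x^3
    Q[2, 2] == 1,                 # coefficient of x^4
]
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(prob.status)                # 'optimal' => an SOS certificate exists
```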
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4277. SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model
๐ฌ 10
โญ 4.67
๐ generative models
๐ท๏ธ visual autoregressive model
Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditionalโunconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256 ร 256 among autoregressive models.
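A loose sketch of certainty-weighted guidance with a running normalizer, recast at the logit level for a single decoding step; the paper distributes perturbations over the token sequence itself, so this is only an approximation of the idea with made-up hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softcfg_logits(cond_logits, uncond_logits, certainties, base_scale=4.0):
    """Certainty-weighted classifier-free guidance for one decoding step.

    cond_logits / uncond_logits: (vocab,) logits with and without the condition.
    certainties: max-softmax probabilities of tokens generated so far; their
    running sum acts as a simple 'step normalization' so the cumulative
    guidance perturbation stays bounded as the sequence grows.
    """
    cert = certainties[-1] if certainties else 1.0
    norm = max(1.0, sum(certainties))            # step normalization
    scale = base_scale * cert / norm
    return uncond_logits + scale * (cond_logits - uncond_logits)

# toy usage over a short decode loop
rng = np.random.default_rng(3)
certs = []
for _ in range(5):
    cond, uncond = rng.normal(size=256), rng.normal(size=256)
    probs = softmax(softcfg_logits(cond, uncond, certs))
    certs.append(float(probs.max()))             # record this step's certainty
```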
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4278. Global and Local Topology-Aware Graph Generation via Dual Conditioning Diffusion
๐ฌ 13
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ generative model; AI for science; conditioning method
Graph generation plays an important role in various domains such as molecular design, protein prediction, and drug discovery. However, generating graph-structured data poses challenges due to the complex dependencies inherent in graphs, spanning from intricate local substructures to broad global topologies. Although recent advances in graph-generative models have made notable progress, most existing methods still leverage the node-level generative paradigms and struggle with graphs that exhibit pronounced sparsity and complicated multiscale relationships. To address these challenges, we propose a unified latent diffusion model that jointly learns local and global topological information, enabling effective and efficient graph generation. Besides, our approach introduces a dual conditioning mechanism designed to promote dynamic interaction between local and global information, equipping the generative model with global and local awareness to better capture the coupled dependencies within graphs. Our method can largely promote the joint modeling of global and local information and substantially improve the quality of the generated graphs. Extensive experiments consistently demonstrate the effectiveness of our proposed method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4279. AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
๐ฌ 34
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mathematical reasoning Large Language Models Reinforcement Learning Agent
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5ร speedup and making efficient RL training feasible on ultra-long sequences with scenarios with massive tool calls. Extensive evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25, substantially outperforming frontier openโsource models of comparable size. Specifically, AgentMath-30B-A3B attains 90.6\%, 86.4\%, and 73.8\% accuracy respectively, surpassing OpenAI-o3-mini and Claude-Opus-4.0-Thinking while remaining competitive with OpenAI-o3, Gemini-2.5-Pro, and DeepSeek-R1-671B-0528. These results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4280. Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
๐ฌ 14
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ data filtering model tampering unlearning robustness open-weight open-source safety biorisk
Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text -- outperforming existing post-training baselines by over an order of magnitude -- with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.
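A toy two-stage filtering sketch in the spirit of the pipeline described above: a cheap keyword blocklist pass followed by a classifier-score threshold. The blocklist terms, the scorer, and the threshold are all placeholders, not the authors' actual pipeline.

```python
import re

# Illustrative blocklist only; the real pipeline's term lists are not public here.
BLOCKLIST = [r"\baerosoliz\w*", r"\bvirulence factor\b", r"\bselect agent\b"]
BLOCK_RE = re.compile("|".join(BLOCKLIST), flags=re.IGNORECASE)

def filter_corpus(docs, classifier_score, threshold=0.5):
    """Two-stage pretraining-data filter (illustrative).

    Stage 1: drop documents matching a cheap keyword blocklist.
    Stage 2: drop survivors whose topic-classifier score exceeds `threshold`.
    `classifier_score` is any callable mapping a document to a risk score in
    [0, 1]; here it stands in for whatever model the real pipeline uses.
    """
    kept = []
    for doc in docs:
        if BLOCK_RE.search(doc):
            continue                       # stage 1: keyword hit
        if classifier_score(doc) >= threshold:
            continue                       # stage 2: classifier flags the doc
        kept.append(doc)
    return kept

# toy usage with a dummy scorer
docs = ["a recipe for sourdough bread", "notes on aerosolization of pathogens"]
print(filter_corpus(docs, classifier_score=lambda d: 0.0))
```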
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4281. TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
๐ฌ 15
โญ 4.67
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Uncertainty Estimation Large Language Models LLM Reasoning
While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning.
In this paper, we propose a **Tok**en-level **U**ncertainty estimation framework for **R**easoning (**TokUR**) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning.
Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses.
Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the modelโs reasoning performance at test time.
These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
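A small numpy sketch of the perturbation idea on a toy linear "LM head": draw a few low-rank random weight perturbations, average the resulting predictive distributions, and use per-token predictive entropy as the uncertainty signal. The rank, noise scale, and aggregation rule are assumptions made for illustration.

```python
import numpy as np

def token_uncertainty(hidden, W, k_samples=8, rank=4, sigma=0.01, rng=None):
    """Token-level uncertainty via low-rank random weight perturbation.

    hidden: (seq_len, d) final hidden states for the generated tokens.
    W: (d, vocab) output projection of a toy model. For each draw, W is
    perturbed by a rank-`rank` random update sigma * U @ V, the predictive
    distribution is recomputed, and the per-token entropy of the averaged
    distribution serves as the uncertainty estimate.
    """
    rng = rng or np.random.default_rng(0)
    d, vocab = W.shape
    probs = []
    for _ in range(k_samples):
        U = rng.normal(size=(d, rank)) / np.sqrt(rank)
        V = rng.normal(size=(rank, vocab)) / np.sqrt(vocab)
        logits = hidden @ (W + sigma * U @ V)
        z = logits - logits.max(axis=-1, keepdims=True)
        p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)
        probs.append(p)
    p_bar = np.mean(probs, axis=0)                       # predictive distribution
    ent = -(p_bar * np.log(p_bar + 1e-12)).sum(axis=-1)  # per-token uncertainty
    return ent, float(ent.mean())                        # token- and response-level

# toy usage
rng = np.random.default_rng(4)
h, W = rng.normal(size=(12, 64)), rng.normal(size=(64, 1000)) * 0.1
per_token, response_u = token_uncertainty(h, W, rng=rng)
```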
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4282. Graph-Theoretic Intrinsic Reward: Guiding RL with Effective Resistance
๐ฌ 16
โญ 4.67
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Intrinsic Motivation Goal Conditioned RL Effective Resistance
Exploration of dynamic environments with sparse rewards is a significant challenge in Reinforcement Learning, often leading to inefficient exploration and brittle policies. To address this, we introduce a novel graph-based intrinsic reward using Effective Resistance, a metric from spectral graph theory. This reward formulation guides the agent to seek configurations that are directly correlated with successful goal-reaching states. We provide theoretical guarantees, proving that our method not only learns a robust policy but also achieves faster convergence by serving as a variance reduction baseline to the standard discounted reward formulation. We perform extensive empirical analysis across several challenging environments and demonstrate that our approach significantly outperforms state-of-the-art baselines, with improvements of up to 59% in success rate, 56% in timesteps taken to reach the goal, and 4 times more accumulated reward. We augment all of the supporting lemmas and theoretically motivated hyperparameter choices with corresponding experiments.
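Effective resistance itself is a standard spectral-graph quantity computable from the pseudoinverse of the Laplacian; the sketch below computes it and turns it into a simple goal-directed intrinsic reward. The specific reward shaping is an illustrative guess, not the paper's exact formulation.

```python
import numpy as np

def effective_resistance(adj):
    """Pairwise effective resistance of a weighted undirected graph.

    Uses the standard identity R_ij = L+_ii + L+_jj - 2 L+_ij, where L+ is
    the Moore-Penrose pseudoinverse of the graph Laplacian.
    """
    deg = np.diag(adj.sum(axis=1))
    L = deg - adj
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

def intrinsic_reward(adj, state, goal, scale=1.0):
    """Toy intrinsic reward: larger when the current state is electrically
    'closer' to the goal. The exact shaping used in the paper is not public
    here, so this is only an illustrative choice."""
    R = effective_resistance(adj)
    return scale * (-R[state, goal])

# toy usage on a 4-node path graph 0-1-2-3
A = np.zeros((4, 4)); A[0, 1] = A[1, 2] = A[2, 3] = 1; A += A.T
print(intrinsic_reward(A, state=1, goal=3))   # less negative than from state 0
```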
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4283. Riemannian Optimization on Relaxed Indicator Matrix Manifold
๐ฌ 17
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Optimization Clustering Graph Cut
The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix that, compared with other existing relaxations, can flexibly incorporate class information. We prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide several methods of Retraction, including a fast Retraction method to obtain geodesics. We point out that the RIM manifold is a generalization of the doubly stochastic manifold, and optimization on it is much faster than existing methods on the doubly stochastic manifold, which have a complexity of \( \mathcal{O}(n^3) \), while RIM manifold optimization is \( \mathcal{O}(n) \) and often yields better results. We conducted extensive experiments, including image denoising, with millions of variables to support our conclusion, and applied the RIM manifold to Ratio Cut, for which we provide a rigorous convergence proof and achieve clustering results that outperform state-of-the-art methods. Our code is presented in Appendix H.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4284. Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards
๐ฌ 13
โญ 4.67
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Resource Allocation Large Language model post training
Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks.
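Under an independent-Bernoulli variance model, the budget-constrained variance-minimization problem described above has a closed-form, Neyman-style solution: rollouts proportional to sqrt(p(1-p)). A minimal sketch, with the Gaussian-process success-probability predictor replaced by a generic input array; the rounding rule is an assumption.

```python
import numpy as np

def vip_allocate(p_hat, budget, n_min=1):
    """Allocate a rollout budget across prompts to shrink gradient variance.

    p_hat: predicted per-prompt success probabilities (assumed to come from
    some smoother standing in for the paper's Gaussian process). Minimizing
    sum_i p_i(1-p_i)/n_i subject to sum_i n_i = budget gives the Neyman-style
    rule n_i proportional to sqrt(p_i (1 - p_i)).
    """
    std = np.sqrt(np.clip(p_hat * (1 - p_hat), 1e-8, None))
    n = np.maximum(n_min, np.floor(budget * std / std.sum()).astype(int))
    # hand out any leftover rollouts to the highest-variance prompts
    leftover = budget - n.sum()
    for i in np.argsort(-std)[:max(leftover, 0)]:
        n[i] += 1
    return n

# toy usage: near-certain prompts (p ~ 0 or ~ 1) receive few rollouts
p = np.array([0.05, 0.5, 0.55, 0.95, 0.99])
print(vip_allocate(p, budget=32))
```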
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4285. AutoCode: LLMs as Problem Setters for Competitive Programming
๐ฌ 13
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Competitive Programming Test Case Generation Problem Generation
Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4286. Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO
๐ฌ 13
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ large language model robustness post training
Large language models (LLMs) have demonstrated remarkable and steadily improving performance across a wide range of tasks.
However, LLM performance may be highly sensitive to prompt variations, especially in scenarios with limited openness or strict output formatting requirements, indicating insufficient robustness.
In real-world applications, user prompts provided to LLMs often contain imperfections, which may undermine the quality of the model's responses.
To address this issue, previous work has primarily focused on preprocessing prompts, employing external tools or even LLMs to refine prompt formulations in advance.
However, these approaches overlook the intrinsic robustness of LLMs, and their reliance on external components introduces additional computational overhead and uncertainty.
In this work, we propose a Contrastive Learning-based Inverse Direct Preference Optimization (CoIPO) method that minimizes the discrepancy between the label-aligned logits produced by the model under a clean prompt and its noisy counterpart, and conduct a detailed analysis using mutual information theory.
We augment the FLAN dataset by constructing paired prompts, each consisting of a clean prompt and its corresponding noisy version for training.
Additionally, to evaluate its effectiveness, we develop NoisyPromptBench, a benchmark derived from and extending the existing PromptBench.
Experimental results conducted on NoisyPromptBench demonstrate that our proposed method achieves a significant improvement in average accuracy over the current state-of-the-art approaches.
The source code of CoIPO, the pair-wise FLAN datasets, and NoisyPromptBench have been released at https://anonymous.4open.science/r/CoIPO-61D6.
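A bare-bones sketch of the consistency objective described in this abstract: penalize disagreement between the label-aligned log-probabilities under the clean prompt and its noisy counterpart. The full inverse-DPO formulation and mutual-information analysis are not reproduced; the names and the squared-gap loss are illustrative.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def consistency_loss(clean_logits, noisy_logits, label_ids):
    """Penalize disagreement of label-aligned logits under clean vs. noisy prompts.

    clean_logits / noisy_logits: (seq_len, vocab) logits for the same target
    continuation under the clean prompt and its noisy counterpart.
    label_ids: (seq_len,) gold target token ids. The squared gap between the
    per-token gold-token log-probabilities is a simple stand-in for the
    paper's contrastive inverse-DPO objective.
    """
    pos = np.arange(len(label_ids))
    lp_clean = log_softmax(clean_logits)[pos, label_ids]
    lp_noisy = log_softmax(noisy_logits)[pos, label_ids]
    return float(np.mean((lp_clean - lp_noisy) ** 2))

# toy usage
rng = np.random.default_rng(8)
clean, noisy = rng.normal(size=(6, 50)), rng.normal(size=(6, 50))
labels = rng.integers(0, 50, size=6)
print(consistency_loss(clean, noisy, labels))
```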
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4287. Confident and Adaptive Generative Speech Recognition via Conformal Risk Control
๐ฌ 9
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ automatic speech recognition conformal prediction conformal risk control large language models ASR error correction uncertainty quantification adaptive hypothesis selection N-best hypotheses LoRA fine-tuning word error rate statistical guarantees generative error correction
Automatic Speech Recognition (ASR) systems frequently produce transcription errors due to acoustic variability, which require post-processing correction methods. Recent approaches leverage Large Language Models (LLMs) for generative ASR error correction using N-best hypotheses but rely on fixed set sizes regardless of input complexity and do not provide performance guarantees. We propose an adaptive framework that dynamically determines the optimal number of hypotheses for each input using risk control. This mechanism leverages ASR confidence scores and applies Learn Then Test (LTT) to control the expected relative word error rate degradation compared to the best achievable performance for a given model and hypothesis set. Experimental results demonstrate that our approach provides theoretical guarantees with high-probability bounds while matching or exceeding fixed-size correction baselines and requiring fewer hypotheses on average, achieving substantial computational savings under diverse acoustic conditions.
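A minimal Learn-Then-Test sketch for picking the hypothesis-set size: scan candidate sizes in a fixed sequence and keep shrinking while a Hoeffding p-value still certifies that the relative-WER-degradation risk stays below the target level. The risk definition, the bound, and the scan order are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def hoeffding_pvalue(risks, alpha):
    """P-value for H0: E[risk] > alpha, for risks bounded in [0, 1]."""
    m, r_hat = len(risks), float(np.mean(risks))
    if r_hat >= alpha:
        return 1.0
    return float(np.exp(-2.0 * m * (alpha - r_hat) ** 2))

def ltt_select_n(risk_table, alpha=0.1, delta=0.05):
    """Fixed-sequence Learn-Then-Test over candidate hypothesis-set sizes.

    risk_table[n]: per-calibration-instance relative-WER-degradation values
    (clipped to [0, 1]) when the corrector sees the top-n hypotheses.
    Candidates are scanned from largest to smallest n, shrinking n while the
    null 'risk exceeds alpha' is still rejected at level delta.
    """
    chosen = None
    for n in sorted(risk_table, reverse=True):
        if hoeffding_pvalue(risk_table[n], alpha) <= delta:
            chosen = n          # still valid: try an even smaller set next
        else:
            break               # stop at the first failure (fixed sequence)
    return chosen

# toy usage: risk shrinks as more hypotheses are provided
rng = np.random.default_rng(5)
table = {n: np.clip(rng.normal(0.22 - 0.05 * n, 0.05, 1000), 0, 1)
         for n in range(1, 7)}
print(ltt_select_n(table, alpha=0.1))
```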
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4288. Natural Language PDDL (NL-PDDL) for Open-world Goal-oriented Commonsense Regression Planning in Embodied AI
๐ฌ 9
โญ 4.67
๐ applications to robotics, autonomy, planning
๐ท๏ธ Open-world Planning Lifted Regression Embodied Agents
Planning in open-world environments, where agents must act with partially observed states and incomplete knowledge, is a central challenge in embodied AI. Open-world planning involves not only sequencing actions but also determining what information the agent needs to sense to enable those actions. Existing approaches using Large Language Models (LLM) and Vision-Language Models (VLM) cannot reliably plan over long horizons and complex goals, where they often hallucinate and fail to reason causally over agent-environment interactions. Alternatively, classical PDDL planners offer correct and principled reasoning, but fail in open-world settings: they presuppose complete models and depend on exhaustive grounding over all objects, states, and actions; they cannot address misalignment between goal specifications (e.g., โheat the breadโ) and action specifications (e.g., โtoast the breadโ); and they do not generalize across modalities (e.g., text, vision). To address these core challenges: (i) we extend symbolic PDDL into a flexible natural language representation that we term NL-PDDL, improving accessibility for non-expert users as well as generalization over modalities; (ii) we generalize regression-style planning to NL-PDDL with commonsense entailment reasoning to determine what needs to be observed for goal achievement in partially-observed environments with potential goalโaction specification misalignment; and (iii) we leverage the lifted specification of NL-PDDL to facilitate open-world planning that avoids exhaustive grounding and yields a time and space complexity independent of the number of ground objects, states, and actions. Our experiments in three diverse domains โ classical Blocksworld and the embodied ALFWorld environment with both textual and visual states โ show that NL-PDDL substantially outperforms existing baselines, is more robust to longer horizons and more complex goals, and generalizes across modalities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4289. OrthoSolver: A Neural Proper Orthogonal Decomposition Solver For PDEs
๐ฌ 17
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Partial differential equations Neural operator Information-theoretic Proper orthogonal decomposition
Proper Orthogonal Decomposition (POD) is a cornerstone reduced-order modeling technique for accelerating the solution of partial differential equations (PDEs) by extracting energy-optimal orthogonal bases. However, POD's inherent linear assumption limits its expressive power for complex nonlinear dynamics, and its snapshot-based fixed bases generalize poorly to unseen scenarios. Meanwhile, emerging deep learning solvers have explored integrating decomposition architectures, yet their purely data-driven nature lacks essential physical priors and leads to modal collapse, where decomposed modes lose discriminative power.
To address these challenges, we revisit POD from an information-theoretic perspective. We theoretically establish that POD's classical energy-maximization criterion is, in essence, a principle of maximizing mutual information. Guided by this insight, we propose OrthoSolver, a neural POD framework that generalizes this core information-theoretic principle to the nonlinear domain. OrthoSolver iteratively and adaptively extracts a set of compact and expressive nonlinear basis modes by directly maximizing their mutual information with the data field. Furthermore, an orthogonality regularization is imposed to preserve the diversity of the learned modes and effectively mitigate mode collapse. Extensive experiments on seven PDE benchmarks demonstrate that OrthoSolver consistently outperforms state-of-the-art deep learning baselines.
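For reference, the classical POD baseline that OrthoSolver generalizes is simply an SVD of the snapshot matrix, truncated at a target energy fraction. The sketch below implements that standard linear baseline, not the neural method.

```python
import numpy as np

def pod_basis(snapshots, energy=0.99):
    """Classical POD: energy-optimal orthogonal basis from snapshot data.

    snapshots: (n_dof, n_snapshots) matrix whose columns are PDE solution
    snapshots. Returns the leading left singular vectors capturing the
    requested fraction of the total energy (sum of squared singular values).
    """
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(cum, energy) + 1)
    return U[:, :r], s[:r]

def pod_project(snapshots, basis):
    """Reduced-order coefficients and relative reconstruction error."""
    coeffs = basis.T @ snapshots
    recon = basis @ coeffs
    err = np.linalg.norm(snapshots - recon) / np.linalg.norm(snapshots)
    return coeffs, err

# toy usage on synthetic low-rank data
rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 200))
basis, sv = pod_basis(X, energy=0.999)
_, rel_err = pod_project(X, basis)
print(basis.shape, f"{rel_err:.2e}")
```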
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4290. 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion Models
๐ฌ 10
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ RNA Inverse Design Reinforcement Learning RNA Structure
The inverse design of RNA three-dimensional (3D) structures is crucial for engineering functional RNAs in synthetic biology and therapeutics. While recent deep learning approaches have advanced this field, they are typically optimized and evaluated using Native Sequence Recovery (NSR), a limited surrogate for structural fidelity, as different sequences can fold into similar 3D structures, and high NSR does not necessarily indicate correct folding. To address this limitation, we propose a novel two-stage framework that directly optimizes for 3D structural similarity. First, we develop and pre-train a GNN-based generative diffusion model conditioned on the target 3D structure, achieving a 9% improvement in NSR over state-of-the-art methods. Then, we fine-tune the model using an improved policy gradient algorithm with four task-specific reward functions based on 3D self-consistency metrics. Experimental results show that our approach improves structural similarity by over 100% across all metrics and discovers designs that are distinct from native sequences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4291. Towards a Universally Transferable Acceleration Method for Density Functional Theory
๐ฌ 10
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Density Functional Theory E(3)-equivariant networks
Recently, sophisticated deep learning-based approaches have been developed for generating efficient initial guesses to accelerate the convergence of density functional theory (DFT) calculations.
While the actual initial guesses are often density matrices (DM), quantities that can be converted into density matrices also qualify as alternative forms of initial guesses.
Hence, existing works mostly rely on the prediction of the Hamiltonian matrix for obtaining high-quality initial guesses.
However, the Hamiltonian matrix is both numerically difficult to predict and intrinsically non-transferable, hindering the application of such models in real scenarios.
In light of this, we propose a method that constructs DFT initial guesses by predicting the electron density in a compact auxiliary basis representation using E(3)-equivariant neural networks.
Trained exclusively on small molecules with up to 20 atoms, our model achieves an average 33.3% reduction in SCF iterations for molecules three times larger (up to 60 atoms).
This result is particularly significant given that baseline Hamiltonian-based methods fail to generalize, often increasing the iteration count by over 80\% or failing to converge entirely on these larger systems. Furthermore, we demonstrate that this acceleration is robustly scalable: the model successfully accelerates calculations for systems with up to 900 atoms (polymers and polypeptides) without retraining.
To the best of our knowledge, this work represents the first robust candidate for a universally transferable DFT acceleration method.
We are also releasing the SCFbench dataset and its accompanying code to facilitate future research in this promising direction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4292. LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
๐ฌ 16
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Long Context Reinforcement Learning with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding: the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model toward identifying relevant evidence. We formally prove that the outcome-only reward leads to exponentially vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications.
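A minimal sketch of the reward shaping this abstract argues for, assuming the context reward can be computed as the overlap between the passages the model cites and gold evidence; the function name, the overlap form, and the mixing weight are illustrative assumptions rather than LongRLVR's exact definition.

```python
def long_rlvr_reward(pred_answer: str, gold_answer: str,
                     cited_ids: set, gold_evidence_ids: set, lam: float = 0.5) -> float:
    """Sparse answer reward plus a dense, verifiable context (grounding) reward."""
    # Outcome reward: exact-match correctness of the final answer.
    answer_r = 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0
    # Context reward: fraction of the gold evidence the model actually grounded on.
    context_r = (len(cited_ids & gold_evidence_ids) / len(gold_evidence_ids)
                 if gold_evidence_ids else 0.0)
    return answer_r + lam * context_r
```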
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4293. Reward Is Enough: LLMs Are In-Context Reinforcement Learners
๐ฌ 15
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test-Time Scaling Inference-Time Improvement LLMs RL
Reinforcement learning (RL) is a human-designed framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges in LLMs at inference time, a phenomenon known as in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, called ICRL prompting. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning for self-improvement on a given task. After the LLM generates a response at the current round, we give numerical scalar feedback on the response, called the reward. At the next round, we prompt the LLM again with the same task and a context consisting of all previous responses and rewards. We observe that the quality of the LLM's response increases as the context grows. In other words, the LLM is able to maximize the scalar reward signal at inference time, just like an RL algorithm. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Surprisingly, in some experiments, the reward signals are generated by the LLM itself, yet performance improvements are still observed from ICRL prompting, offering a new paradigm for test-time scaling.
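The loop below is a minimal sketch of ICRL prompting as described: generate, score, append the (response, reward) pair to the context, and re-prompt. `llm_generate` and `reward_fn` are placeholders for an LLM call and a scalar scorer, and the prompt wording is an assumption.

```python
def icrl_prompting(task: str, llm_generate, reward_fn, rounds: int = 5):
    """Run ICRL prompting: the model sees its own past responses and their rewards."""
    history = []                       # (response, reward) pairs fed back as context
    best = None
    for _ in range(rounds):
        context = "\n\n".join(f"Previous response:\n{r}\nReward: {s:.3f}"
                              for r, s in history)
        prompt = f"{task}\n\n{context}\n\nGive a response with a higher reward."
        response = llm_generate(prompt)
        reward = reward_fn(response)   # scalar feedback; may even come from the LLM itself
        history.append((response, reward))
        if best is None or reward > best[1]:
            best = (response, reward)
    return best                        # highest-reward response found at inference time
```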
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4294. Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
๐ฌ 13
โญ 4.67
๐ reinforcement learning
๐ท๏ธ offline RL consistency models diffusion models
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step diffusion sampling while generating higher-reward action trajectories through decoupled training and noise-free reward guidance. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long horizon planning benchmarks demonstrate that our approach can achieve a $9.7\%$ improvement over previous state-of-the-art while leveraging CTM (Kim et al., 2023) to offer up to $142\times$ speedup over diffusion counterparts in inference time.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4295. In Good GRACES: Principled Teacher Selection for Knowledge Distillation
๐ฌ 14
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Knowledge distillation Directional coverage Gradient variance Cross Validation Best Teacher prediction
Knowledge distillation is an efficient strategy to use data generated by large teacher language models to train smaller "capable" student models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be when post-training a student model to solve math problems. GRACE efficiently measures distributional properties of student gradients, and it can be computed without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE measures leave-one-out stability in gradient-based algorithms, directly connecting it to the generalization performance of distilled student models. On GSM8K and MATH, GRACE correlates strongly (up to 86%) with the performance of the distilled Llama and OLMo students. In particular, training on the GRACE-selected teacher provides at least a 6% improvement over naively using the best-performing teacher. We further demonstrate the utility of GRACE in providing guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify the most compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
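The abstract does not spell out the GRACE formula, so the toy proxy below should be read only as a gradient-distribution score in the same spirit: per-example student gradients on a teacher's data are compared with the leave-one-out mean gradient, and the average agreement is returned. The model, loss, and aggregation are all assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_agreement_score(student: nn.Module, inputs: torch.Tensor,
                             targets: torch.Tensor) -> float:
    """Toy leave-one-out gradient-agreement score over one teacher's distillation set."""
    grads = []
    for x, y in zip(inputs, targets):
        student.zero_grad()
        F.cross_entropy(student(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads.append(torch.cat([p.grad.flatten() for p in student.parameters()]))
    G = torch.stack(grads)                          # (n, num_params) per-example gradients
    total = G.sum(dim=0)
    sims = [F.cosine_similarity(G[i], (total - G[i]) / (len(G) - 1), dim=0)
            for i in range(len(G))]
    return float(torch.stack(sims).mean())          # higher = more stable teacher signal
```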
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4296. Multi-Agent Guided Policy Optimization
๐ฌ 19
โญ 4.67
๐ reinforcement learning
๐ท๏ธ multi-agent reinforcement learning teacher-student learning centralized training with decentralized execution
Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi-Agent Reinforcement Learning (MARL).
However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees.
We propose Multi-Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution.
MAGPO uses an autoregressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability.
We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments.
Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi-agent learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4297. GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
๐ฌ 17
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ benchmark real-world tasks RL environments model evaluation reinforcement learning AI impacts dataset evals benchmarks multi-modal computer use agents long-horizon tasks AI artificial intelligence ML machine learning deep learning LLMs language models
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable knowledge-work tasks. GDPval covers the majority of Department of Labor O*NET Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improves model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service to facilitate future research in understanding real-world model capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4298. Exploring the Potential of Encoder-free Architectures in 3D LMMs
๐ฌ 30
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-modal Large Language Model 3D Understanding Large Language Model
Encoder-free architectures have been preliminarily explored in 2D Large Multimodal Models (LMMs), yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D LMMs. These long-standing challenges include the failure to adapt to varying point cloud resolutions during inference and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the pre-trained encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. We also present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, **ENEL**. Our 7B model rivals the state-of-the-art model, PointLLM-PiSA-13B, achieving 57.91%, 61.0%, and 55.20% on the classification, captioning, and VQA tasks, respectively. Our results show that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4299. Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation
๐ฌ 12
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ 3D Aware Feature Distillation Visual Foundation Models 3D Gaussian Splatting Feed-Forward 3D Reconstruction DINOv2
Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing the slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4300. Only Brains Align with Brains: Cross-Region Patterns Expose Limits of Normative Models
๐ฌ 9
โญ 4.67
๐ applications to neuroscience & cognitive science
๐ท๏ธ brain alignment benchmarking representational similarity analysis video models
Neuroscientists and computer vision scientists alike have relied on model-brain alignment benchmarks to find parallels between artificial and biological vision systems. These benchmarks rank models according to alignment measures (AM) such as representational similarity analysis (RSA) and linear predictivity (LP). However, recent works have revealed a number of problems with these rankings, such as their sensitivity towards the choice of AM, raising the deeper conceptual question of what it means for a model to be "brain-aligned."
Here, we introduce the notion of *alignment patterns*, characteristic patterns of alignment between brain regions, and posit that models should reproduce these patterns in order to be considered brain-aligned.
First, we apply a standard benchmarking pipeline to a broad spectrum of vision models on the BOLD-Moments video fMRI dataset across visual regions of interest (ROIs).
We find that, while this pipeline can identify nominally best predictive models, many other models fall within subject-level variability and are therefore practically equivalent in terms of brain alignment.
We then apply our complementary relational criterion: an ROI-aligned model should reproduce that ROI's cross-region alignment pattern. We find that, while these patterns are highly stable across brains of different subjects, even top-ranked models fail to capture them. Notably, models that appear practically equivalent in predictive accuracy diverge sharply under the relational criterion, revealing both the limited discriminative power of existing evaluation pipelines and the potential of alignment-pattern analysis to increase that power.
Finally, we argue for a principled distinction between brain-predictivity and brain-alignment. For applications such as digital twins, prediction performance may suffice; but for understanding the inductive biases of the visual system, models should satisfy stricter distributional and relational criteria.
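One way to make the relational criterion concrete is the RSA-style sketch below: each ROI gets a representational dissimilarity matrix (RDM), and its alignment pattern is the vector of correlations between its RDM and every other ROI's RDM. The metric choices (correlation distance, Spearman) are common defaults assumed for illustration, not necessarily the paper's.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def alignment_patterns(responses: dict) -> dict:
    """responses: ROI name -> (n_stimuli, n_voxels) array.

    Returns each ROI's cross-region alignment pattern: the Spearman correlations
    of its RDM with the RDMs of all ROIs (including itself).
    """
    rois = list(responses)
    # Vectorized upper-triangle RDM per ROI, using correlation distance.
    rdms = {r: pdist(responses[r], metric="correlation") for r in rois}
    return {r: np.array([spearmanr(rdms[r], rdms[s])[0] for s in rois]) for r in rois}

# A model would then be judged by whether its own pattern for a target ROI
# matches the subject-stable pattern computed from brains alone.
```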
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4301. Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping
๐ฌ 26
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ Online HD Map Construction Temporal Stability Benchmarking Autonomous Driving Evaluation Metrics
As one of the fundamental intermediate modules in autonomous driving, online high-definition (HD) maps have attracted significant attention due to their cost-effectiveness and real-time capabilities. Since vehicles always cruise in highly dynamic environments, spatial displacement of onboard sensors inevitably causes shifts in real-time HD mapping results, and such instability poses fundamental challenges for downstream tasks. However, existing online map construction models tend to prioritize improving each frame's mapping accuracy, while the mapping stability has not yet been systematically studied. To fill this gap, this paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models and variants show that accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. We further analyze the impact of key model design choices on both criteria, identifying architectural and training factors that contribute to high accuracy, high stability, or both. To encourage broader focus on stability, we will release a public benchmark. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems. The benchmark toolkit, code, and models will be available at https://stablehdmap.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4302. Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation
๐ฌ 15
โญ 4.67
๐ generative models
๐ท๏ธ Diffusion Models Alignment
Diffusion models have achieved impressive results in generating high-quality images. Yet, they often struggle to faithfully align the generated images with the input prompts. This limitation is associated with synchronous denoising, where all pixels simultaneously evolve from random noise to clear images. As a result, during generation, the prompt-related regions can only reference the unrelated regions at the same noise level, failing to obtain clear context and ultimately impairing text-to-image alignment. To address this issue, we propose asynchronous diffusion models, a novel framework that allocates distinct timesteps to different pixels and reformulates the pixel-wise denoising process. By dynamically modulating the timestep schedules of individual pixels, prompt-related regions are denoised more gradually than unrelated regions, thereby allowing them to leverage clearer inter-pixel context. Consequently, these prompt-related regions achieve better alignment in the final images. Extensive experiments demonstrate that our asynchronous diffusion models can significantly improve text-to-image alignment across diverse prompts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4303. CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
๐ฌ 19
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Dynamic Simulation; Visual Dynamics Grounding; Unsupervised Learning
Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, and this dependence limits their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario that involves unsupervised learning of cloth dynamics from sparse multi-view visual observations.
We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. To enable unsupervised learning of cloth dynamics, we develop a three-stage training framework for CloDS. Moreover, to address the challenges posed by large non-linear deformations and severe self-occlusions in CDG, we introduce a dual-position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting. It jointly considers the absolute and relative position of Gaussian components.
Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at https://anonymous.4open.science/r/CloDS_ICLR/. Visualization results are available at https://anonymous.4open.science/r/CloDS_video_ICLR/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4304. AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design
๐ฌ 21
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ instruction dataset large language models antibody design
Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4305. MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
๐ฌ 8
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Proactive Interaction Video Dialogue Video MLLM
Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to the current frame of a streaming video. To overcome difficulties in previous methods such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL-based training method that encourages timely and accurate responses without requiring precise response time annotations. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on the ProactiveVideoQA benchmark.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4306. ASIDE: Architectural Separation of Instructions and Data in Language Models
๐ฌ 13
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language models instruction-data-separation conditional embedding mechanism LLM safety prompt injections
Despite their remarkable performance, large language models lack elementary safety features, making them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as the root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of token embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instructions and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) achieves substantially higher instruction-data separation without performance loss and (2) makes the models more robust on prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations.
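As a sketch of the parameter-free idea, the snippet below applies a fixed orthogonal map (a 90-degree rotation in each pair of embedding coordinates, chosen here purely for illustration) to data-token embeddings while leaving instruction tokens untouched; the specific rotation ASIDE actually uses is defined in the paper.

```python
import torch

def rotate_pairs_90(x: torch.Tensor) -> torch.Tensor:
    """Fixed orthogonal map: rotate every (even, odd) coordinate pair by 90 degrees."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

def separate_embeddings(token_emb: torch.Tensor, is_data: torch.Tensor) -> torch.Tensor:
    """token_emb: (batch, seq, d) with d even; is_data: (batch, seq) boolean mask.

    Data tokens get rotated embeddings, instruction tokens keep the originals.
    """
    return torch.where(is_data.unsqueeze(-1), rotate_pairs_90(token_emb), token_emb)
```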
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4307. HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature
๐ฌ 10
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Knowledge Graphs Representation Learning Graph Neural Networks Geometric Deep Learning Scientific Text Mining
Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches face persistent challenges: they struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical and logically constrained nature of scientific knowledge. While general-purpose large language models (LLMs) offer some adaptability, they are computationally expensive and yield inconsistent accuracy on specialized, domain-heavy tasks such as scientific knowledge graph construction. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic "turns" in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. To the best of our knowledge, this is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler and more interpretable alternative to hyperbolic methods. To address data scarcity, we also release SPHERE, a large-scale, multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on benchmarks such as SciERC, SciER, and SPHERE, improving named entity recognition (NER) by 8.08\% and relation extraction (RE) by 5.99\% on the official out-of-distribution test sets. In zero-shot settings, the gains are even more pronounced, with improvements of 10.76\% for NER and 26.2\% for RE, marking a significant step toward reliable and scalable scientific knowledge graph construction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4308. Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment
๐ฌ 29
โญ 4.67
๐ learning on time series and dynamical systems
๐ท๏ธ Data quality assessment Data selection Time series data Large language models
High-quality time series (TS) data are essential for ensuring TS model performance, rendering research on rating TS data quality indispensable. Existing methods have shown promising rating accuracy within individual domains, primarily by extending data quality rating techniques such as influence functions and Shapley values to account for temporal characteristics. However, they neglect the fact that real-world TS data can span vastly different domains and exhibit distinct properties, hampering the accurate and efficient rating of diverse TS data.
In this paper, we propose TSRating, a novel and unified framework for rating the quality of time series data crawled from diverse domains. TSRating leverages LLMs' inherent ample knowledge, acquired during their extensive pretraining, to comprehend and discern quality differences in diverse TS data. We verify this by devising a series of prompts to elicit quality comparisons from LLMs for pairs of TS samples. We then fit a dedicated rating model, termed TSRater, that converts the LLMs' judgments into efficient quality predictions, so that new TS samples can be rated through TSRater's inference alone. To ensure cross-domain adaptability, we develop a meta-learning scheme to train TSRater on quality comparisons collected from nine distinct domains. To improve training efficiency, we employ signSGD for inner-loop updates, thus circumventing the demanding computation of hypergradients. Extensive experimental results on eleven benchmark datasets across three time series tasks, each using both conventional TS models and TS foundation models, demonstrate that TSRating outperforms baselines in terms of estimation accuracy, efficiency, and domain adaptability.
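A minimal sketch of how pairwise LLM judgments can be distilled into an efficient rater: a small network scores each time-series window and is trained with a Bradley-Terry style loss so that the LLM-preferred sample scores higher. The architecture and window length are placeholders, and the paper's meta-learning across domains and signSGD inner loop are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTSRater(nn.Module):
    """Toy scalar quality rater over fixed-length time-series windows."""
    def __init__(self, window: int = 96):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(window, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, window)
        return self.net(x).squeeze(-1)                     # (batch,) quality scores

def pairwise_judgment_loss(rater: TinyTSRater, preferred: torch.Tensor,
                           rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss on LLM comparisons: preferred windows should score higher."""
    margin = rater(preferred) - rater(rejected)
    return F.softplus(-margin).mean()                      # equals -log sigmoid(margin)
```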
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4309. JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models
๐ฌ 16
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ AudioLM ALM Benchmark Dataset Jailbreak Attacks
Large Audio Language Models (LALMs) integrate the audio modality directly into the model, rather than converting speech into text and inputting text to Large Language Models (LLMs).
While jailbreak attacks on LLMs have been extensively studied, the security of LALMs with audio modalities remains largely unexplored.
Currently, there is a lack of an adversarial audio dataset and a unified framework specifically designed to evaluate and compare attacks and LALMs.
In this paper, we present JALMBench, a comprehensive benchmark to assess the safety of LALMs against jailbreak attacks.
JALMBench includes a dataset containing 11,316 text samples and 245,355 audio samples (>1,000 hours).
It supports 12 mainstream LALMs, 4 text-transferred and 4 audio-originated attack methods, and 5 defense methods.
Using JALMBench, we provide an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and architecture.
Additionally, we explore mitigation strategies for the attacks at both the prompt level and the response level.
We find that LALM safety is strongly influenced by modality and architectural choices,
demonstrating that text-based safety alignment can partially transfer to audio inputs and that an interleaved audio-text strategy enables more robust cross-modal generalization of safety.
Moreover, current general-purpose moderation against jailbreaks only slightly improves security, and we call on the community to explore more defense methods for LALMs.
Our work is the first to systematically uncover these design principles, providing a roadmap for building resilient multimodal language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4310. Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method
๐ฌ 11
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ unsupervised domain adaptation source free personalization videos facial expression recognition domain translation
Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario, where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation. Extensive experiments on four challenging video FER benchmark datasets, BioVid, StressID, BAH, and AffWild2, show that PFT consistently outperforms state-of-the-art SFDA methods, providing a cost-effective approach that is suitable for real-world, privacy-sensitive FER applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4311. Expertise Can Be Helpful for Reinforcement Learning-based Macro Placement
๐ฌ 21
โญ 4.67
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Chip Placement Reinforcement Learning
Chip placement determines the locations of electronic components on a chip layout, which directly impacts performance, power, and area (PPA) metrics, and thus is a critical step in electronic design automation (EDA). As modern chips scale to accommodate millions of components, manual placement by human experts becomes infeasible, necessitating the use of automated algorithms.
Recently, reinforcement learning (RL) has emerged as a promising approach for automating macro placement, owing to its high optimization efficiency and potential for generalization.
Despite their promise, existing RL-based methods often neglect the value of expert knowledge accumulated through years of engineering practice. They tend to optimize oversimplified proxy objectives, resulting in suboptimal placements that deviate significantly from expert-designed solutions.
To bridge this gap, we propose a novel RL-based placement framework that integrates EDA domain expertise from two complementary perspectives: (1) *Expert Knowledge Injection*: Incorporating well-established placement knowledge, such as dataflow guidance, periphery bias, macro grouping, and I/O keepout constraints, to guide the learning process toward human-level solutions. (2) *Expert Workflow Imitation*: Emulating the post-refinement process of human experts (i.e., updating the design iteratively based on backend PPA feedback) to progressively optimize timing metrics by employing preference optimization.
Experiments on the ICCAD 2015 and OpenROAD benchmarks demonstrate that our method achieves substantial improvements in PPA metrics (e.g., 12.84% in total negative slack and 20.79% in worst negative slack compared to the runner-up method on average), outperforming advanced analytical, black-box optimization, and RL-based methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4312. UrbanGS: Efficient and Scalable Architecture for Geometrically Accurate Large-Scene Reconstruction
๐ฌ 13
โญ 4.67
๐ learning on graphs and other geometries & topologies
๐ท๏ธ 3DGS; 3D surface reconstruction
While 3D Gaussian Splatting (3DGS) delivers high-quality, real-time rendering for bounded scenes, its extension to large-scale urban environments introduces critical challenges in geometric consistency, memory efficiency, and computational scalability. We present UrbanGS, a scalable reconstruction framework that effectively addresses these challenges for city-scale applications.
We propose a Depth-Consistent D-Normal Regularization module. In contrast to existing approaches that rely solely on monocular normal estimators, which effectively update rotation parameters but poorly optimize other geometric attributes, our method integrates D-Normal constraints with external depth supervision. This enables comprehensive updates of all geometric parameters. By further incorporating an adaptive confidence weighting mechanism based on gradient consistency and inverse depth deviation, our approach significantly enhances multi-view depth alignment and geometric coherence.
To improve scalability, we introduce a Spatially Adaptive Gaussian Pruning (SAGP) strategy, which dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy. Additionally, a unified partitioning and view assignment scheme is designed to eliminate boundary artifacts and optimize computational load. Extensive experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency, offering a systematic solution for high-fidelity large-scale scene reconstruction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4313. PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs
๐ฌ 11
โญ 4.67
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ KV Cache Quantization
Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques.
However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache.
Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios.
However, directly applying existing methods to long-CoT LLMs causes significant performance degradation for the following two reasons:
(1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error.
(2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss.
We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues in two ways:
(1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks.
(2) To increase the calibration length without additional overhead, we propose a new calibration strategy that leverages short calibration data with positional interpolation to approximate the data distribution of long-context data.
Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget and achieves 2.73-5.18$\times$ the throughput of the original 16-bit LLMs.
Our code will be released soon.
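The sketch below shows the two ingredients in their simplest form: symmetric uniform quantization of a KV tensor at a chosen bit-width, and a block-wise bit assignment that gives sensitive blocks more bits. The shapes, the per-token scaling axis, and the example bit map are assumptions; PM-KVQ's actual progressive schedule and sensitivity measurement are described in the paper.

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int):
    """Symmetric uniform quantization along the last dimension."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(scale.dtype) * scale

# Block-wise memory allocation (hypothetical): more sensitive blocks keep 8 bits, others
# drop to 4; a progressive schedule would lower these gradually as decoding proceeds.
bit_widths = {0: 8, 1: 8, 2: 4, 3: 4}
kv = torch.randn(4, 128, 64)                       # (blocks, tokens, head_dim) toy cache
quantized_cache = {b: quantize_symmetric(kv[b], n) for b, n in bit_widths.items()}
```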
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4314. ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
๐ฌ 16
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Autoformalization Automated Theorem Proving
Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from treating autoformalization as a simplistic translation task, which lacks the mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5\% of cases.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4315. Learning to Reason in Structured In-context Environments with Reinforcement Learning
๐ฌ 16
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model reinforcement learning llm reasoning structured in-context environment
Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays an important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the **S**tructured **I**n-context **E**nvironment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that the SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explore learning in information-limited partial SIEs and find that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance. Our code is available at https://anonymous.4open.science/r/SIE_ICLR-EE6F
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4316. DND: Boosting Large Language Models with Dynamic Nested Depth
๐ฌ 6
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model
We introduce Dynamic Nested Depth (DND), a novel method that improves performance for off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of the given transformer layer, DND identifies more critical tokens with a router and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, DND boosts the performance of the dense Qwen3-1.7B, Llama3.2-1B, and Gemma3-1B by 1.88%, 2.61%, and 2.50% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal increase in parameters and compute.
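A rough sketch of the nested-depth idea under simplifying assumptions: a linear router scores tokens after a layer, the top fraction is passed through the same layer again (here over the selected tokens only, which ignores the remaining context), and the update is gated by the router score so selection stays differentiable. The real DND routing loss and threshold control are described in the paper.

```python
import torch
import torch.nn as nn

class NestedDepthWrapper(nn.Module):
    """Wraps a transformer layer so its hardest tokens get one extra pass."""
    def __init__(self, layer: nn.Module, d_model: int, reprocess_ratio: float = 0.25):
        super().__init__()
        self.layer, self.ratio = layer, reprocess_ratio
        self.router = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq, d_model)
        h = self.layer(x)
        scores = torch.sigmoid(self.router(h)).squeeze(-1)   # per-token criticality
        k = max(1, int(self.ratio * h.shape[1]))
        idx = scores.topk(k, dim=1).indices                  # hardest k tokens per sample
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, h.shape[-1])
        selected = torch.gather(h, 1, gather_idx)
        reviewed = self.layer(selected)                       # extra "review" pass
        gate = torch.gather(scores, 1, idx).unsqueeze(-1)     # keeps routing differentiable
        updated = gate * reviewed + (1 - gate) * selected
        return h.scatter(1, gather_idx, updated)
```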
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4317. Designing Affine-Invariant Neural Networks for Photometric Corruption Robustness and Generalization
๐ฌ 15
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computer Vision Robust neural network invariance equivariance biological imaging microscopy classification object localization
Standard Convolutional Neural Networks are notoriously sensitive to photometric variations, a critical flaw that data augmentation only partially mitigates without offering formal guarantees. We introduce the *Scale-Equivariant Shift-Invariant* (*SEqSI*) model, a novel architecture that achieves intensity scale equivariance and intensity shift invariance by design, enabling full invariance to global intensity affine transformations with appropriate post-processing. By strategically prepending a single shift-invariant layer to a scale-equivariant backbone, *SEqSI* provides these formal guarantees while remaining fully compatible with common components like ReLU. We benchmark *SEqSI* against *Standard*, *Scale-Equivariant* (*SEq*), and *Affine-Equivariant* (*AffEq*) models on 2D and 3D image-classification and object-localization tasks. Our experiments demonstrate that *SEqSI*'s architectural properties provide certified robustness to affine intensity transformations and enhance generalization across non-affine corruptions and domain shifts in challenging real-world applications like biological image analysis. This work establishes *SEqSI* as a practical and principled approach for building photometrically robust models without major trade-offs.
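The guarantee the abstract describes can be reproduced in miniature, assuming it follows from composing a zero-mean (shift-removing) first layer with a bias-free, positively homogeneous backbone; the real SEqSI layers are defined in the paper, but this sketch passes the corresponding invariance check.

```python
import torch
import torch.nn as nn

class ShiftInvariantLayer(nn.Module):
    """S(a*x + b) = a*S(x): subtracting the per-image mean removes any global shift."""
    def forward(self, x):                              # x: (batch, C, H, W)
        return x - x.mean(dim=(-2, -1), keepdim=True)

def scale_equivariant_backbone(channels=1, width=16, out=10):
    """Bias-free convs + ReLU satisfy B(a*x) = a*B(x) for any a > 0."""
    return nn.Sequential(
        nn.Conv2d(channels, width, 3, padding=1, bias=False), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1, bias=False), nn.ReLU(),
        nn.Conv2d(width, out, 3, padding=1, bias=False),
    )

model = nn.Sequential(ShiftInvariantLayer(), scale_equivariant_backbone())
x = torch.rand(2, 1, 32, 32)
a, b = 3.0, 0.7
# The output is equivariant to global affine intensity changes a*x + b; a
# scale-invariant post-processing step would then yield full invariance.
assert torch.allclose(model(a * x + b), a * model(x), atol=1e-4)
```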
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4318. Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
๐ฌ 13
โญ 4.67
๐ generative models
๐ท๏ธ Diffusion Model
Human visual preferences are inherently multi-dimensional, encompassing aspects of aesthetics, detail fidelity, and semantic alignment. However, existing open-source preference datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions (e.g., compositional) but are deficient in others (e.g., details) are simply marked as "winner" or "loser". We theoretically demonstrate that this compression of multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide the optimization process in Diffusion Direct Preference Optimization (DPO). To address this label noise from conflicting multi-dimensional preferences, we propose Semi-DPO, a semi-supervised learning approach. We treat pairs with consistent preferences across all dimensions as clean labeled data, while those with conflicting signals are considered noisy unlabeled data. Our method first trains a model on a clean, consensus-filtered data subset. This model then acts as its own implicit classifier to generate pseudo-labels for the larger, noisy set, which are used to iteratively refine the model's alignment. This approach effectively mitigates label noise and enhances image generation quality, achieving better alignment with multi-dimensional human preferences. Experimental results demonstrate that Semi-DPO significantly improves alignment with multi-dimensional human preferences, achieving state-of-the-art performance without requiring additional human annotation or the need to train a dedicated reward model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4319. Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward
๐ฌ 19
โญ 4.67
๐ reinforcement learning
๐ท๏ธ Data Efficiency Reinforcement Learning with Verifiable Reward
Recent advances in large language models (LLMs) have utilized reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities.
However, scaling these methods typically requires massive data and extensive rollout computations, leading to high training costs and low data efficiency.
To mitigate this issue, we propose DEPO, a Data-Efficient Policy Optimization approach that combines optimized strategies for both offline and online data selection.
In the offline phase, we curate a high-quality subset of training data based on multiple objectives, including diversity, influence, and difficulty.
During online RLVR training, we propose a sample-level explorability metric to dynamically filter out samples with low exploration potential, thereby substantially reducing rollout computation costs.
Additionally, we employ a replay mechanism for under-explored samples to ensure sufficient training, which enhances the final convergence performance.
Experiments on five reasoning benchmarks show that DEPO consistently outperforms existing methods in both offline and online data selection scenarios.
Notably, using only 20% of the training data, our approach achieves a 1.85 $\times$ speed-up on AIME24 and a 1.66 $\times$ speed-up on AIME25 compared to GRPO trained on the full dataset.
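As an illustration of online filtering by exploration potential, the toy metric below uses the variance of a prompt's rollout rewards: prompts whose rollouts all succeed or all fail carry no GRPO-style advantage signal and are skipped. This is an assumed stand-in for DEPO's actual explorability metric, not its definition.

```python
import numpy as np

def explorability(rewards: np.ndarray) -> float:
    """Toy sample-level explorability score from one prompt's rollout rewards.

    Prompts whose rollouts are all correct or all incorrect yield zero variance and
    hence no group-relative advantage, so they contribute little to learning.
    """
    return float(np.var(rewards))

def filter_batch(prompt_rollout_rewards: dict, threshold: float = 1e-3) -> list:
    """Keep only prompts whose rollouts still show exploration potential."""
    return [p for p, r in prompt_rollout_rewards.items()
            if explorability(np.asarray(r, dtype=float)) > threshold]

# Example: prompt "a" is saturated (all correct), prompt "b" still carries signal.
batch = {"a": [1, 1, 1, 1], "b": [0, 1, 0, 1]}
print(filter_batch(batch))   # -> ['b']
```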
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4320. Beyond Structure: Invariant Crystal Property Prediction with Pseudo-Particle Ray Diffraction
๐ฌ 27
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Crystal representation Invariance Pseudo-Particle Diffraction Reciprocal space Long-term interaction
Crystal property prediction, governed by quantum mechanical principles, is computationally prohibitive to solve exactly for large many-body systems using traditional density functional theory. While machine learning models have emerged as efficient approximations for large-scale applications, their performance is strongly influenced by the choice of atomic representation. Although modern graph-based approaches have progressively incorporated more structural information, they often fail to capture long-term atomic interactions due to finite receptive fields and local encoding schemes. This limitation leads to distinct crystals being mapped to identical representations, hindering accurate property prediction. To address this, we introduce PRDNet, which leverages unique reciprocal-space diffraction in addition to graph representations. To enhance sensitivity to elemental and environmental variations, we employ a data-driven pseudo-particle to generate a synthetic diffraction pattern. PRDNet ensures full invariance to crystallographic symmetries. Extensive experiments are conducted on Materials Project, JARVIS-DFT, and MatBench, demonstrating that the proposed model achieves state-of-the-art performance. The code can be found in the supplemental materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4321. DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining
๐ฌ 15
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Restoration Lie Groups Positional Bias
Videos captured in the wild often suffer from rain streaks, blur, and noise. In addition, even slight changes in camera pose can amplify cross-frame mismatches and temporal artifacts. Existing methods rely on optical flow or heuristic alignment, which are computationally expensive and less robust. To address these challenges, Lie groups provide a principled way to represent continuous geometric transformations, making them well-suited for enforcing spatial and temporal consistency in video modeling. Building on this insight, we propose DeLiVR, an efficient video deraining method that injects spatiotemporal Lie-group differential biases directly into the attention scores of the network. Specifically, the method introduces two complementary components. First, a rotation-bounded Lie relative bias predicts the in-plane angle of each frame using a compact prediction module, in which normalized coordinates are rotated and compared with base coordinates to achieve geometry-consistent alignment before feature aggregation. Second, a differential group displacement computes angular differences between adjacent frames to estimate a velocity. These biases are combined with temporal decay and a banded attention mask to emphasize short-range reliable relations while suppressing long-range noise. DeLiVR achieves sharper details, fewer rain remnants, and stronger temporal coherence on both synthetic and real rainy benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4322. NIMO: a Nonlinear Interpretable MOdel
๐ฌ 18
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ linear model lasso interpretability feature effect deep learning
Deep learning has achieved remarkable success across many domains, but it has also created a growing demand for interpretability in model predictions. Although many explainable machine learning methods have been proposed, post-hoc explanations lack guaranteed fidelity and are sensitive to hyperparameter choices, highlighting the appeal of inherently interpretable models. For example, linear regression provides clear feature effects through its coefficients. However, such models are often outperformed by more complex neural networks (NNs) that usually lack inherent interpretability. To address this dilemma, we introduce NIMO, a framework that combines inherent interpretability with the expressive power of neural networks. Building on simple linear regression, NIMO is able to provide flexible and intelligible feature effects. Notably, we develop an optimization method based on parameter elimination that allows for optimizing the NN parameters and linear coefficients effectively and efficiently. By relying on adaptive ridge regression we can easily incorporate sparsity as well. We show empirically that our model can provide faithful and intelligible feature effects while maintaining good predictive performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4323. Behavioral Embeddings of Programs: A Quasi-Dynamic Approach for Optimization Prediction
๐ฌ 13
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Program Representation Compiler Optimization Behavioral Embedding
Learning effective numerical representations, or embeddings, of programs is a fundamental prerequisite for applying machine learning to automate and enhance compiler optimization. Prevailing paradigms, however, present a dilemma. Static representations, derived from source code or intermediate representation (IR), are efficient and deterministic but offer limited insight into how a program will behave or evolve under complex code transformations. Conversely, dynamic representations, which rely on runtime profiling, provide profound insights into performance bottlenecks but are often impractical for large-scale tasks due to prohibitive overhead and inherent non-determinism. This paper transcends this trade-off by proposing a novel quasi-dynamic framework for program representation. The core insight is to model a program's optimization sensitivity. We introduce the Program Behavior Spectrum, a new representation generated by probing a program's IR with a diverse set of optimization sequences and quantifying the resulting changes in its static features. To effectively encode this high-dimensional, continuous spectrum, we pioneer a compositional learning approach. Product Quantization is employed to discretize the continuous reaction vectors into structured, compositional sub-words. Subsequently, a multi-task Transformer model, termed PQ-BERT, is pre-trained to learn the deep contextual grammar of these behavioral codes. Comprehensive experiments on two representative compiler optimization tasks, Best Pass Prediction and -Oz Benefit Prediction, demonstrate that our method outperforms state-of-the-art static baselines. Our code is publicly available at https://anonymous.4open.science/r/PREP-311F/.
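The discretization step can be sketched with standard product quantization: the behavior-spectrum vector is split into sub-vectors, each sub-space is clustered, and a program becomes a short sequence of centroid indices (the compositional sub-words) that a Transformer such as PQ-BERT can consume. The codebook size and clustering routine here are generic choices, not necessarily the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(spectra: np.ndarray, n_subspaces: int = 8, n_codes: int = 256):
    """spectra: (n_programs, d) behavior spectra with d divisible by n_subspaces.

    Returns (n_programs, n_subspaces) integer codes plus one codebook per sub-space.
    """
    n, d = spectra.shape
    sub_dim = d // n_subspaces
    codes = np.empty((n, n_subspaces), dtype=np.int64)
    codebooks = []
    for m in range(n_subspaces):
        sub = spectra[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(sub)
        codes[:, m] = km.labels_
        codebooks.append(km.cluster_centers_)
    return codes, codebooks   # codes are the "behavioral sub-words" fed to the Transformer
```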
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4324. Scalable Chain of Thoughts via Elastic Reasoning
๐ฌ 18
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test time scaling Large language models Length control
Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases, thinking and solution, with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Elastic Reasoning offers a principled and practical solution to the pressing challenge of controllable reasoning at scale. Code is available in the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
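A hedged sketch of the test-time two-budget idea above: thinking and solution receive separate token budgets, and the solution phase always runs even when the thinking budget is exhausted. Here `generate` is a hypothetical stand-in for any LLM decoding call, and the tags, budgets, and dummy generator are illustrative, not the paper's exact protocol.

```python
def budgeted_reasoning(generate, prompt, think_budget=256, solution_budget=128):
    # Phase 1: thinking, hard-capped at think_budget tokens.
    thinking = generate(prompt + "<think>", max_new_tokens=think_budget, stop="</think>")
    # If the budget ran out mid-thought, close the thinking block explicitly.
    if "</think>" not in thinking:
        thinking = thinking + "\n</think>"
    # Phase 2: solution, conditioned on whatever thinking fit in the budget.
    solution = generate(prompt + "<think>" + thinking + "\n<answer>",
                        max_new_tokens=solution_budget, stop="</answer>")
    return thinking, solution

# Dummy generator for demonstration only: echoes a canned response truncated to the budget.
def dummy_generate(prompt, max_new_tokens, stop):
    canned = "step 1 ... step 2 ... </think>" if "<answer>" not in prompt else "42 </answer>"
    return " ".join(canned.split()[:max_new_tokens])

print(budgeted_reasoning(dummy_generate, "What is 6*7?", think_budget=4, solution_budget=8))
```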
4325. GRL-SNAM: Geometric Reinforcement Learning with Differential Hamiltonians for Navigation and Mapping in Unknown Environments
๐ฌ 8
โญ 4.67
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning generalized Hamiltonian manifolds Differential Policy Optimization
We present GRL-SNAM, a geometric reinforcement learning framework for Simultaneous Navigation and Mapping in unknown environments. GRL-SNAM differs from traditional SLAM and other reinforcement learning methods by relying exclusively on local sensory observations without constructing a global map. Our approach formulates navigation and mapping as coupled dynamics on generalized Hamiltonian manifolds: sensory inputs are translated into local energy landscapes that encode reachability, obstacle barriers, and deformation constraints, while policies for sensing, planning, and reconfiguration evolve stagewise under Differential Policy Optimization (DPO). A reduced Hamiltonian serves as an adaptive score function, updating kinetic/potential terms, embedding barrier constraints, and continuously refining trajectories as new local information arrives. We evaluate GRL-SNAM on procedurally generated 2D deformable-robot navigation tasks, where a hyperelastic robot learns to squeeze through narrow gaps, detour around obstacles, and generalize to unseen environments, comparing against local reactive baselines (PF, CBF, staged DWA) and global A* references (rigid, clearance-aware) under identical stagewise sensing constraints. GRL-SNAM shows superior path quality while using minimal map coverage, preserves clearance, generalizes to unseen layouts, and demonstrates that Hamiltonian-structured RL enables high-quality navigation through minimal exploration via local energy refinement rather than global mapping.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4326. Positional Encoding Field
๐ฌ 9
โญ 4.67
๐ generative models
๐ท๏ธ Positional Encoding Novel View Synthesis Geometry-Aware Generation Image Editing
Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
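One plausible way to picture a depth-aware positional encoding, as in the PE-Field entry above, is to extend the usual per-axis sinusoidal encoding with a third (depth) axis. The sketch below shows only that structure; it does not reproduce the paper's hierarchical sub-patch encodings, and the dimensions are arbitrary.

```python
import numpy as np

def sinusoidal(pos, dim):
    """Standard 1D sinusoidal encoding of a scalar position into `dim` channels."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000.0 ** (2 * i / dim))
    ang = pos[..., None] * freq
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def pe_field(x, y, depth, dim_per_axis=32):
    """Concatenate per-axis encodings so each patch token carries (x, y, depth) information."""
    return np.concatenate([sinusoidal(x, dim_per_axis),
                           sinusoidal(y, dim_per_axis),
                           sinusoidal(depth, dim_per_axis)], axis=-1)

# Example: a 4x4 grid of patches with a per-patch depth value.
xs, ys = np.meshgrid(np.arange(4, dtype=float), np.arange(4, dtype=float), indexing="ij")
depth = np.full_like(xs, 2.5)
pe = pe_field(xs, ys, depth)   # shape (4, 4, 96)
print(pe.shape)
```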
4327. Tighter Performance Theory of FedExProx
๐ฌ 11
โญ 4.67
๐ optimization
๐ท๏ธ distributed optimization extrapolation
We revisit FedExProx -- a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies -- based on gradient diversity and Polyak stepsizes -- again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Łojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
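A small numpy sketch of the extrapolated proximal-averaging iteration behind FedExProx, on consistent client quadratics where each client's proximal step has a closed form. The stepsize and extrapolation parameter are illustrative defaults, not the tuned values analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_clients = 5, 4
# Each client holds a quadratic f_i(x) = 0.5 * ||A_i x - b_i||^2.
A = [rng.normal(size=(8, d)) for _ in range(n_clients)]
x_star = rng.normal(size=d)
b = [Ai @ x_star for Ai in A]   # consistent system: all clients share a minimizer

def prox(Ai, bi, x, gamma):
    # prox_{gamma f_i}(x) = argmin_z f_i(z) + ||z - x||^2 / (2 gamma), closed form for quadratics.
    return np.linalg.solve(Ai.T @ Ai + np.eye(d) / gamma, Ai.T @ bi + x / gamma)

def fedexprox(x0, gamma=1.0, alpha=1.5, iters=200):
    x = x0.copy()
    for _ in range(iters):
        avg = np.mean([prox(Ai, bi, x, gamma) for Ai, bi in zip(A, b)], axis=0)
        x = x + alpha * (avg - x)   # alpha > 1 extrapolates beyond plain proximal averaging
    return x

x = fedexprox(np.zeros(d))
print(np.linalg.norm(x - x_star))  # shrinks toward zero on this consistent toy problem
```

With `alpha = 1.0` this reduces to plain proximal averaging; `alpha > 1` is the extrapolation whose benefit the paper analyzes.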
4328. Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
๐ฌ 15
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ Reinforcement Fine-Tuning Catastrophic Forgetting Data Distribution Learning Dynamics
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on the open-source multimodal Qwen2.5-VL model series. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but maintains prior knowledge. We study this phenomenon through learning dynamics by examining both the magnitude and direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples naturally aligned with the base model's probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a small magnitude of influence and are well aligned in direction with prior knowledge, allows SFT to preserve prior knowledge better while rapidly learning new tasks. These findings suggest that the distribution of training data, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4329. Beyond Uniformity: Sample and Frequency Meta Weighting for Post-Training Quantization of Diffusion Models
๐ฌ 10
โญ 4.67
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Post-training Quantization LLMs
Post-training quantization (PTQ) is an attractive approach for compressing diffusion models to speed up the sampling process and reduce the memory footprint. Most existing PTQ methods uniformly sample data from various time steps in the denoising process to construct a calibration set for quantization and consider calibration samples equally important during the quantization process. However, treating all calibration samples equally may not be optimal. One notable property of the denoising process of diffusion models is that low-frequency features are primarily recovered in early stages, while high-frequency features are recovered in later stages. However, none of the previous works on quantization for diffusion models considers this property to enhance the effectiveness of quantized models. In this paper, we propose a novel meta-learning approach for PTQ of diffusion models that jointly optimizes the contributions of calibration samples and the weighting of frequency components at each time step for quantizing noise estimation networks. Specifically, our approach automatically learns to assign optimal weights to calibration samples while selectively focusing on mimicking specific frequency components of data generated by the full-precision noise estimation network at each denoising time step. Extensive experiments on CIFAR-10, LSUN-Bedrooms, FFHQ, and ImageNet datasets demonstrate that our approach consistently outperforms state-of-the-art PTQ methods for diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4330. VenusX: Unlocking Fine-Grained Functional Understanding of Proteins
๐ฌ 13
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ protein substructure prediction protein function prediction molecule representation learning pre-trained protein language model fine-grained protein annotation
Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. This study introduces VenusX, the first benchmark designed to assess protein representation learning with a focus on fine-grained intra-protein functional understanding. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open-source databases such as InterPro, BioLiP, and SAbDab. By providing mixed-family and cross-family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in-distribution and out-of-distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open-source models, including pre-trained protein language models, sequence-structure hybrids, structure-based methods, and alignment-based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Our code (https://anonymous.4open.science/r/VenusX-4674), data (https://huggingface.co/collections/anonymous-researcher-123/venusx-68cc5163ade527b0974bab29), and a leaderboard (https://anonymous-researcher-816.github.io/) are provided as open-source resources.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4331. Query-Specific Causal Graph Pruning Under Tiered Knowledge
๐ฌ 3
โญ 4.67
๐ causal reasoning
๐ท๏ธ Causal Inference Graph Pruning Tiered Knowledge
We present a systematic method for pruning edges from causal graphs by leveraging tiered knowledge. We characterize conditions under which edges can be removed from a causal graph while preserving the identifiability of (conditional) causal effects. This result enables causal identification on simplified graphs that are substantially smaller than the original graphs. This approach is particularly valuable when researchers are interested in causal relationships within specific tiers while controlling for broader influences from other tiers without fully specifying them. Building on this, we introduce a \emph{query-specific} causal discovery procedure that takes a causal query as an additional input and recovers a reduced graph tailored to the query from observational data. Through theoretical analysis and empirical studies, we demonstrate that our procedure can achieve exponential speedups compared to the existing method when tiered knowledge is available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4332. Benchmarking LLM Tool-Use in the Wild
๐ฌ 10
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ benchmarking automatic evaluation of datasets evaluation methodologies evaluation metrics reproducibility statistical testing for evaluation
Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently $\textbf{wild}$, being intricate, messy, and flexible. We identify three key challenges from user behaviour: $\textit{compositional tasks}$ that demand efficient orchestration of tool-call topologies, $\textit{implicit intent}$ spread across dialogue turns that require contextual inference, and $\textit{instruction transition}$, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce $\textbf{\textit{WildToolBench}}$, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15\%, indicating a substantial gap in the robustness of LLMs' agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks, but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among $\textit{LLMs}$, $\textit{users}$, and $\textit{tools}$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4333. TrustGen: A Platform of Dynamic Benchmarking on the Trustworthiness of Generative Foundation Models
๐ฌ 23
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ trustworthiness generative model large language model vision-language model dynamic evaluation benchmark
Generative foundation models (GenFMs), such as large language models and text-to-image systems, have demonstrated remarkable capabilities in various downstream applications. As they are increasingly deployed in high-stakes applications, assessing their trustworthiness has become both a critical necessity and a substantial challenge. Existing evaluation efforts are fragmented, rapidly outdated, and often lack extensibility across modalities. This raises a fundamental question: how can we systematically, reliably, and continuously assess the trustworthiness of rapidly advancing GenFMs across diverse modalities and use cases? To address these gaps, we introduce TrustGen, a dynamic and modular benchmarking system designed to systematically evaluate the trustworthiness of GenFMs across text-to-image, large language, and vision-language modalities. TrustGen standardizes trust evaluation through a unified taxonomy of over 25 fine-grained dimensions, including truthfulness, safety, fairness, robustness, privacy, and machine ethics, while supporting dynamic data generation and adaptive evaluation through three core modules: Metadata Curator, Test Case Builder, and Contextual Variator. Taking TrustGen into action to evaluate the trustworthiness of 39 models reveals four key insights. (1) State-of-the-art GenFMs achieve promising overall trust performance, yet significant limitations remain in specific dimensions such as hallucination resistance, fairness, and privacy preservation. (2) Contrary to prevailing assumptions, open-source models now rival and occasionally surpass proprietary systems in trustworthiness metrics. (3) The trust gap among top-performing models is narrowing, likely due to increased industry convergence on best practices. (4) Trustworthiness is not an isolated property; it interacts complexly with other behaviors, such as helpfulness and ethical decision-making. TrustGen is a transformative step toward standardized, scalable, and actionable trustworthiness evaluation, supporting dynamic assessments across diverse modalities and trust dimensions that evolve alongside the generative AI landscape.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4334. DM4CT: Benchmarking Diffusion Models for Computed Tomography Reconstruction
๐ฌ 14
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ benchmark dataset inverse problem computed tomography diffusion models reconstruction
Diffusion models have recently emerged as powerful priors for solving inverse problems. While computed tomography (CT) is theoretically a linear inverse problem, it poses many practical challenges. These include correlated noise, artifact structures, reliance on system geometry, and misaligned value ranges, which make the direct application of diffusion models more difficult than in domains like natural image generation. To systematically evaluate how diffusion models perform in this context and compare them with established reconstruction methods, we introduce DM4CT, a comprehensive benchmark for CT reconstruction. DM4CT includes datasets from both medical and industrial domains with sparse-view and noisy configurations. To explore the challenges of deploying diffusion models in practice, we additionally acquire a high-resolution CT dataset at a high-energy synchrotron facility and evaluate all methods under real experimental conditions. We benchmark nine recent diffusion-based methods alongside seven strong baselines, including model-based, unsupervised, and supervised approaches. Our analysis provides detailed insights into the behavior, strengths, and limitations of diffusion models for CT reconstruction. The real-world dataset is publicly available at zenodo.org/records/15420527, and the codebase is open-sourced at github.com/DM4CT/DM4CT.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4335. Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
๐ฌ 17
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multimodal agent long-term memory
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4336. Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
๐ฌ 12
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ local routing consistency MoE analysis expert offloading
Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference.
To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading which caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand.
While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied.
In this paper, we propose two metrics to measure local routing consistency of MoE models:
(1) **Segment Routing best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and
(2) **Segment Cache best Hit rate (SCH)**, which measures the hit rate of an expert cache that exploits a window of future routing information under a given cache size limit.
We analyze 20 MoE LLMs with diverse sizes and architectures and use toy models to verify key factors related to local routing consistency.
We find a strong trade-off between local routing consistency and *local* load balance, while showing that *global* load balance can coexist with local routing consistency.
Meanwhile, settings like shared experts that decrease expert combination space can lead to low local routing consistency.
We further reveal that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance cache effectiveness and efficiency with cache sizes approximately twice the number of active experts.
These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
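The paper defines SRP and SCH precisely; the sketch below only computes simple greedy proxies from a routing trace (each token represented by the set of experts it activates) to make the segment-level idea concrete. In this simplified single-segment setting with an oracle cache refresh, the two proxies coincide, which is not true of the paper's actual metrics.

```python
from collections import Counter

def segment_routing_performance(trace, group_size):
    """Greedy proxy for SRP: fraction of expert activations in the segment that a single
    fixed group of `group_size` experts (the most frequently used ones) can serve."""
    counts = Counter(e for token_experts in trace for e in token_experts)
    group = {e for e, _ in counts.most_common(group_size)}
    total = sum(counts.values())
    covered = sum(c for e, c in counts.items() if e in group)
    return covered / total if total else 1.0

def segment_cache_hit_rate(trace, cache_size):
    """Proxy for SCH: hit rate of a cache refreshed once per segment with the experts the
    upcoming segment uses most (i.e., it 'peeks' at future routing); here it equals the SRP proxy."""
    return segment_routing_performance(trace, cache_size)

# Example: 6 tokens, top-2 routing over 8 experts.
trace = [{0, 3}, {0, 3}, {1, 3}, {0, 2}, {3, 5}, {0, 3}]
print(segment_routing_performance(trace, group_size=2))  # 0.75 for this toy trace
```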
4337. Random Anchors with Low-rank Decorrelated Learning: A Minimalist Pipeline for Class-Incremental Medical Image Classification
๐ฌ 15
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Medical Image Classification; Feature Calibration; Continual Learning
Class-incremental learning (CIL) in medical image-guided diagnosis requires models to preserve knowledge of historical disease classes while adapting to emerging categories. Pre-trained models (PTMs) with well-generalized features provide a strong foundation, yet most PTM-based CIL strategies, such as prompt tuning, task-specific adapters and model mixtures, rely on increasingly complex designs. While effective in general-domain benchmarks, these methods falter in medical imaging, where low intra-class variability and high inter-domain shifts (from scanners, protocols and institutions) make CIL particularly prone to representation collapse and domain misalignment. Under such conditions, we find that lightweight representation calibration strategies, often dismissed in general-domain CIL for their modest gains, can be remarkably effective for adapting PTMs in medical settings. To this end, we introduce Random Anchors with Low-rank Decorrelated Learning (RA-LDL), a minimalist representation-based framework that combines (a) PTM-based feature extraction with optional ViT-Adapter tuning, (b) feature calibration via frozen Random Anchor projection and a single-session-trained Low-Rank Projection (LRP), and (c) analytical closed-form decorrelated learning. The entire pipeline requires only one training session and minimal task-specific tuning, making it appealing for efficient deployment. Despite its simplicity, RA-LDL achieves consistent and substantial improvements across both general-domain and medical-specific PTMs, and outperforms recent state-of-the-art methods on four diverse medical imaging datasets. These results highlight that minimalist representation recalibration, rather than complex architectural modifications, can unlock the underexplored potential of PTMs in medical CIL. We hope this work establishes a practical and extensible foundation for future research in class-incremental image-guided diagnosis. Code will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4338. KinemaDiff: Towards Diffusion for Coherent and Physically Plausible Human Motion Prediction
๐ฌ 17
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Human Motion Prediction
Stochastic Human Motion Prediction (HMP) has become an essential task in computer vision owing to its capacity to anticipate accurate and diverse future human trajectories. Current diffusion-based techniques typically enforce skeletal consistency by encoding structural priors into network architectures. Although effective in promoting plausible kinematics, this approach provides only indirect control over the generative process and often fails to guarantee strict physical constraint satisfaction. In this work, we propose a structure-aligned and joint-aware diffusion framework that enforces physical constraints by embedding skeletal topology and joint-specific dynamics directly into the diffusion process. Specifically, our framework consists of two key modules, the Joint-Adaptive Noise Generator and the Structure-Aligned Constraint Enforcer. The former, the Joint-Adaptive Noise Generator, infers joint-specific dynamics and injects heterogeneous, instance-aware noise per joint and sample to capture spatial variability and enhance motion diversity. The latter, the Structure-Aligned Constraint Enforcer, encodes skeletal topology by modeling joint connectivity and bone lengths from historical motions, and it constrains each denoising step to preserve anatomical consistency. Through their synergistic operation, these modules grant KinemaDiff direct control over physical realism and motion diversity, addressing the common limitations of indirect structural priors and uniform noise application. Extensive experiments on multiple benchmarks demonstrate the effectiveness of our method, attributable to tailoring the diffusion process through structural alignment and joint-adaptive noise modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4339. ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
๐ฌ 23
โญ 4.67
๐ applications to robotics, autonomy, planning
๐ท๏ธ large language model computer use agent reinforcement learning
We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks; however, it remains challenging due to environmental inefficiency and instability during extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs. We employ ComputerRL on open models GLM-4-9B-0414 and GLM-4.1V-9B-Thinking, and evaluate them on the OSWorld benchmark. The GLM-ComputerRL-9B achieves a new state-of-the-art accuracy of 48.9%, demonstrating significant improvements for general agents in desktop automation. Our code and demos are available at https://computer-rl.vercel.app/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4340. From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
๐ฌ 10
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Models Autoregressive Image Generation Chain-of-Thought
Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4341. EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
๐ฌ 6
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Symbolic Regression symbolic equivalence Monte Carlo Tree Search Deep Reinforcement Learning Large Language Model
Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expression renders the task computationally challenging.
A promising yet underexplored direction for reducing the effective search space and accelerating training lies in *symbolic equivalence*: many expressions, although syntactically different, define the same function; for example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$.
Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning.
We introduce EGG-SR, a unified framework that integrates equality graphs (e-graphs) into diverse symbolic regression algorithms, including Monte Carlo Tree Search (MCTS), deep reinforcement learning (DRL), and large language models (LLMs).
EGG-SR compactly represents equivalent expressions through the proposed EGG module, enabling more efficient learning by:
(1) pruning redundant subtree exploration in EGG-MCTS,
(2) aggregating rewards across equivalence classes in EGG-DRL, and
(3) enriching feedback prompts in EGG-LLM.
Theoretically, we establish that embedding e-graphs tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator.
Empirically, EGG-SR consistently enhances a class of modern symbolic regression algorithms across multiple benchmarks, discovering equations with lower normalized mean squared error.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
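A tiny illustration of reward aggregation over symbolic equivalence classes, related to the EGG-SR entry above. It uses sympy's log expansion as a lightweight stand-in for the paper's e-graphs (an e-graph would handle far more rewrite rules and share subexpressions); the expressions and rewards are the abstract's example with made-up scores.

```python
import sympy as sp
from collections import defaultdict

x1, x2 = sp.symbols("x1 x2", positive=True)

def canonical_key(expr):
    # Lightweight stand-in for an e-graph equivalence class: expand logs to a
    # canonical form and use its exact symbolic representation as the class key.
    return sp.srepr(sp.expand_log(expr, force=True))

candidates = [
    (sp.log(x1**2 * x2**3), 0.9),
    (sp.log(x1**2) + sp.log(x2**3), 0.8),
    (2 * sp.log(x1) + 3 * sp.log(x2), 1.0),
]

# Aggregate rewards over equivalence classes instead of raw expressions.
class_rewards = defaultdict(list)
for expr, reward in candidates:
    class_rewards[canonical_key(expr)].append(reward)

for key, rewards in class_rewards.items():
    print(len(rewards), sum(rewards) / len(rewards))  # one class, mean reward 0.9
```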
4342. Learning Data-Efficient and Generalizable Neural Operators via Fundamental Physics Knowledge
๐ฌ 16
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Operator PDE Fundamental Physics Knowledge
Recent advances in scientific machine learning (SciML) have enabled neural operators (NOs) to serve as powerful surrogates for modeling the dynamic evolution of physical systems governed by partial differential equations (PDEs). While existing approaches focus primarily on learning simulations from the target PDE, they often overlook more fundamental physical principles underlying these equations. Inspired by how numerical solvers are compatible with simulations of different settings of PDEs, we propose a multiphysics training framework that jointly learns from both the original PDEs and their simplified basic forms. Our framework enhances data efficiency, reduces predictive errors, and improves out-of-distribution (OOD) generalization, particularly in scenarios involving shifts of physical parameters and synthetic-to-real transfer. Our method is architecture-agnostic and demonstrates consistent improvements in normalized root mean square error (nRMSE) across a wide range of 1D/2D/3D PDE problems. Through extensive experiments, we show that explicit incorporation of fundamental physics knowledge significantly strengthens the generalization ability of neural operators.
We promise to release models and data upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4343. Non-Collaborative User Simulators for Tool Agents
๐ฌ 25
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ Tool Agent User Simulator Non-collaborative User Dialogue Simulation
Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world.
To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task.
Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns.
Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4344. Structure-Aware Graph Hypernetworks for Neural Program Synthesis
๐ฌ 12
โญ 4.67
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ neural program synthesis weight-space learning meta-learning permutation-equivariant graph networks zero-shot generalization
We study the neural program synthesis of $\textit{parameterized}$ function families through the lens of meta-learning with hypernetworks. Given a user intent $U$, a meta-learner $M_{\phi}$ produces a full weight set $\hat{\theta}=M_{\phi}(U)$ for a target neural network with fixed architecture $S$, and the instantiated network $m_{S,\hat{\theta}}(X)\to Y$ realizes the behavior intended for $U$. Classical hypernetworks typically $\textit{ignore the target network's structure}$ and emit a flat list of weights; as a consequence, they fail to account for $\textit{neuron-permutation symmetry}$ (many distinct parameterizations of $S$ implement the same function), so equivalent solutions are treated as different targets, fragmenting supervision and hurting out-of-distribution generalization. To address this, we propose $\textit{Meta-GNN}$, a hypernetwork that constructs a $\textit{neural graph}$ from the target architecture $S$ and applies $\textbf{structure-aware}$ message passing with parameter-tied encoders and decoders. This design reduces the search space during learning by collapsing equivalence classes of target networks, without loss of expressivity. Empirically, across modular arithmetic ($\textit{AddMod}$-$p$), array operations ($\textit{SumFirst}$-$n$), and inverse-rule tasks from 1D-ARC, $\textit{Meta-GNN}$ substantially improves learning and $\textbf{out-of-distribution generalization}$ compared to classic hypernetworks and direct $(U,X)\to Y$ baselines. Mechanistic analyses reveal $\textit{what is learned}$: on $\textit{AddMod}$-$p$ the synthesized Transformers recover the canonical clock representation and admit a compact closed-form map $U\mapsto\theta$. These results demonstrate that structure-aware Meta-GNNs enable reliable generalization to $\textit{unseen program parameterizations}$, providing a critical advance for the nascent field of neural program synthesis.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4345. TS-DDAE: A novel Temporal-Spectral Denoising Diffusion AutoEncoder for Wireless Signal Recognition Model Pre-training
๐ฌ 9
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Diffusion Wireless Signal Recognition Pre-training
Wireless Signal Recognition (WSR) aims to identify the properties of received signals using Artificial Intelligence (AI) without any prior knowledge, and has been widely used in civil and military radios. The current AI trend of pre-training and fine-tuning has shown great performance, and existing pre-trained WSR models also achieve impressive results. However, they either apply the "mask-reconstruction" pre-training strategy, which may disrupt intricate local dependencies of signals, or overlook latent spectral characteristics. Therefore, in this paper, we follow diffusion models and propose a pre-training framework for WSR, named the Temporal-Spectral Denoising Diffusion AutoEncoder (TS-DDAE), which learns signal representations by corrupting signals with temporal and spectral noise, and then reconstructing the original data with a learned neural network. Moreover, we design a novel neural architecture, named TS-Net, which couples self-attention in the temporal encoder with channel attention in the spectral encoder. Extensive experiments on several datasets and WSR tasks show that TS-DDAE achieves superior performance compared to state-of-the-art (SOTA) baselines, demonstrating its potential to serve as a foundation model for WSR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4346. Prior-aware and Context-guided Group Sampling for Active Probabilistic Subsampling
๐ฌ 16
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Subsampling Active acquisition Accelerated MRI Hyperspectral imaging Top-k sampling
Subsampling significantly reduces the number of measurements, thereby streamlining data processing and transfer overhead, and shortening acquisition time across diverse real-world applications. The recently introduced Active Deep Probabilistic Subsampling (A-DPS) approach jointly optimizes both the subsampling pattern and the downstream task model, enabling instance- and subject-specific sampling trajectories and effective adaptation to new data at inference time. However, this approach does not fully leverage valuable dataset priors and relies on top-1 sampling, which can impede the optimization process. Herein, we enhance A-DPS by integrating a deterministic (fixed) prior-informed sampling pattern derived from the training dataset, along with group-based sampling via top-k sampling, to achieve more robust optimization, a method we call Prior-aware and context-guided Group-based Active DPS (PGA-DPS). We also provide a theoretical analysis supporting improved optimization via group sampling, and validate this with empirical results. We evaluated PGA-DPS on three tasks: classification, image reconstruction, and segmentation, using the MNIST, CIFAR-10, fastMRI knee, and hyperspectral AeroRIT datasets, respectively. In every case, PGA-DPS outperformed A-DPS, DPS, and all other sampling methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4347. From Observations to Events: Event-Aware World Models for Reinforcement Learning
๐ฌ 15
โญ 4.67
๐ reinforcement learning
๐ท๏ธ model-based reinforcement learning online learning reinforcement learning
While model-based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision-making. Motivated by this principle, we propose the Event-Aware World Model (EAWM), a general framework that learns event-aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio-temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, DeepMind Control 500K, and DMC-GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10\%-45\%, setting new state-of-the-art results across benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4348. Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection
๐ฌ 17
โญ 4.67
๐ learning theory
๐ท๏ธ Reinforcement Learning Language Models Reasoning Patterns Training Dynamics
While reinforcement learning (RL) demonstrated remarkable success in enhancing the reasoning capabilities of language models, the training dynamics of RL in LLMs remain unclear. In this work, we provide an explanation of the RL training process through empirical analysis and rigorous theoretical modeling. First, through systematic reasoning-pattern-level and token-level analysis across the RL training process, we show that while different reasoning patterns exhibit relatively stable success rates during training, RL primarily optimizes a sparse subset of critical tokens, thereby reshaping reasoning pattern distributions to affect model performance. Building on these empirical insights, we develop a theoretical framework to understand the training dynamics of RL with two typical rewards: verifiable reward (RLVR) and model's internal feedback (RLIF). For RLVR, we analyze the training dynamics under two special cases: one where models readily converge to optimal reasoning strategies, and another where optimization becomes challenging, revealing that the base model's reasoning quality is crucial for determining convergence behavior. For RLIF, we examine how internal rewards initially improve model performance but can potentially lead to degradation with continued training. Extensive experiments validate our findings, advancing both theoretical understanding and practical applications of RL in language model enhancement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4349. GradPruner: Gradient-guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs
๐ฌ 11
โญ 4.67
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LLM Fine-Tuning Layer Pruning
Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. Meanwhile, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers. Only elements with the same sign are merged to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight well-known downstream-task datasets, including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner achieves a parameter reduction of 40% with only a 0.99% decrease in accuracy. Our code is available at https://anonymous.4open.science/r/LLM-GradPrune-436D.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
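A hedged PyTorch sketch of the gradient-accumulation idea above on a toy layer stack: accumulate per-layer gradient magnitude over the first few fine-tuning steps, then drop the layers with the smallest accumulated mass. The toy linear stack stands in for an LLM's transformer blocks, and the sign-matched merging of pruned layers into their neighbors is omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for a stack of transformer blocks: identically shaped layers.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(6)])
head = nn.Linear(16, 4)
opt = torch.optim.SGD(list(layers.parameters()) + list(head.parameters()), lr=1e-2)

# Accumulate |gradient| per layer over the first few fine-tuning steps (the IGIA idea, simplified).
importance = torch.zeros(len(layers))
for step in range(20):
    x = torch.randn(32, 16)
    y = torch.randint(0, 4, (32,))
    h = x
    for layer in layers:
        h = torch.relu(layer(h))
    loss = nn.functional.cross_entropy(head(h), y)
    opt.zero_grad()
    loss.backward()
    for i, layer in enumerate(layers):
        importance[i] += sum(p.grad.abs().sum() for p in layer.parameters())
    opt.step()

# Prune the layers whose accumulated gradient mass is smallest (40% here).
n_prune = int(0.4 * len(layers))
prune_idx = importance.argsort()[:n_prune].tolist()
kept = [layer for i, layer in enumerate(layers) if i not in prune_idx]
print("pruned layers:", sorted(prune_idx), "| kept:", len(kept), "of", len(layers))
```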
4350. Test-Time Optimization of 3D Point Cloud LLM via Manifold-Aware In-Context Guidance and Refinement
๐ฌ 20
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D point cloud large language model
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in textual and 2D visual reasoning, yet their ability to understand and reason over 3D data remains limited. The issue becomes more challenging for standalone 3D point clouds due to high inter-class confusion. In this work, we propose Point-Graph LLM (PGLLM), a framework that enables more effective 3D point cloud understanding by integrating in-context prompting and score refinement at test time while respecting the manifold of the supporting data. Our method first employs a pre-trained point cloud encoder, which is used to construct a graph whose edges encode visual similarity. Each support point cloud sample is converted to a textual caption via a pre-trained PointLLM. For a test query, the graph is used to retrieve relevant neighbors whose captions serve as contextual demonstrations for a second-stage LLM that performs the final reasoning, a process we term in-context guidance. Furthermore, we introduce a confidence score refinement mechanism based on label propagation to enhance the reliability of LLM predictions for classification and out-of-distribution (OOD) detection tasks. All of the above optimizations are carried out fully at test time. Extensive experiments across diverse 3D datasets and tasks demonstrate that PGLLM consistently improves accuracy and robustness over prior baselines with almost no additional computational cost, showcasing a promising direction toward native 3D reasoning with MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
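A minimal sketch of confidence-score refinement via label propagation over a feature-similarity graph (the standard Zhou et al.-style closed form), which is the general mechanism the PGLLM entry above relies on; the kNN construction and hyperparameters here are illustrative and not necessarily those used by PGLLM.

```python
import numpy as np

def refine_scores(features, scores, k=5, alpha=0.9, sigma=1.0):
    """Refine per-sample class scores by propagating them over a kNN similarity graph:
    F* = (1 - alpha) * (I - alpha * S)^{-1} @ scores, with S the normalized affinity."""
    n = features.shape[0]
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Keep only the k strongest edges per node, symmetrized.
    idx = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    W = W * (mask | mask.T)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = W.sum(1)
    d[d == 0] = 1.0
    S = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, scores)

# Example: 8 samples in 2 classes; noisy scores get smoothed along the feature graph.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.3, (4, 2)), rng.normal(3, 0.3, (4, 2))])
scores = np.vstack([np.tile([0.6, 0.4], (4, 1)), np.tile([0.45, 0.55], (4, 1))])
print(refine_scores(feats, scores).round(2))
```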
4351. STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
๐ฌ 16
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ spatial-temporal video grounding vision language model reinforcement learning
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to the multi-object referring video object segmentation task, achieving a SOTA 47.3% J&F on MeViS.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4352. Transformers as a Measure-Theoretic Associative Memory: A Statistical Perspective
๐ฌ 13
โญ 4.67
๐ learning theory
๐ท๏ธ associative memory learning theory
Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length.
We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures.
Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^I \mu^{(i)}$ and a query $x_{\mathrm{q}}(i^*)$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)}, x_{\mathrm{q}})$.
We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
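The entry above views attention as an integral operator on measures; one common way to write this (not necessarily the paper's exact notation) is

$$\mathrm{Attn}(x_{\mathrm{q}}, \nu) \;=\; \int V(x)\,\frac{\exp\big(\langle Q x_{\mathrm{q}},\, K x\rangle\big)}{\int \exp\big(\langle Q x_{\mathrm{q}},\, K x'\rangle\big)\, \mathrm{d}\nu(x')}\,\mathrm{d}\nu(x),$$

where $Q$, $K$, and $V$ denote the learned query, key, and value maps; this reduces to ordinary softmax attention when $\nu$ is the empirical measure of the context tokens.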
4353. CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers for Causally Constrained Predictions
๐ฌ 12
โญ 4.67
๐ causal reasoning
๐ท๏ธ transformers causal inference causality inductive bias DAGs
Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret/explain. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Transformers (CaTs), a general model class designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). CaTs retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4354. Conformalized Decision Risk Assessment
๐ฌ 17
โญ 4.67
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Conformal prediction inverse optimization risk assessment decision making under uncertainty
High-stakes decisions in healthcare, energy, and public policy have long depended on human expertise and heuristics, but are now increasingly supported by predictive and optimization-based tools. A prevailing paradigm in operations research is predict-then-optimize, where predictive models estimate uncertain inputs and optimization models recommend decisions. However, such approaches often sideline human judgment, creating a disconnect between algorithmic outputs and expert intuition that undermines trust and adoption in practice.
To bridge this gap, we propose CREDO, a framework that, for any candidate decision proposed by human experts, provides a distribution-free upper bound on the probability of suboptimality---informed by both the optimization structure and the data distribution. By combining inverse optimization geometry with conformal generative prediction, CREDO delivers statistically rigorous yet practically interpretable risk certificates. This framework allows human decision-makers to audit and validate their decisions under uncertainty, strengthening the alignment between algorithmic tools and human intuition.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4355. Splat Feature Solver
๐ฌ 14
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Splats Feature Lifting Solver Optimization
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Demo Video, \textbf{Code} and \textbf{demo website} are all inside the supplementary.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
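A compact numpy sketch of the feature-lifting linear inverse problem described above: per-pixel features are modeled as weighted sums of per-splat features, and the splat features are recovered in closed form with a ridge (Tikhonov-style) term for stability. The blending weights and regularizer are illustrative; the paper's Tikhonov Guidance and Post-Lifting Aggregation are more specific than this.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_splats, feat_dim = 200, 20, 8

# W[p, g]: contribution (e.g. alpha-blending weight) of splat g to pixel p -- sparse in practice.
W = rng.random((n_pixels, n_splats)) * (rng.random((n_pixels, n_splats)) < 0.2)
F_true = rng.normal(size=(n_splats, feat_dim))
Y = W @ F_true + 0.01 * rng.normal(size=(n_pixels, feat_dim))  # observed per-pixel features

# Closed-form lifted features with a small Tikhonov term for numerical stability.
lam = 1e-3
F_hat = np.linalg.solve(W.T @ W + lam * np.eye(n_splats), W.T @ Y)
print(np.abs(F_hat - F_true).mean())  # recovery error on this toy problem
```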
4356. CLARC: C/C++ Benchmark for Robust Code Search
๐ฌ 14
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ Code Search Benchmark Robustness
Effective retrieval of code snippets from natural language queries is essential for code reuse and developer productivity. However, current benchmarks are limited: they predominantly focus on Python, lack support for industry-focused languages like C/C++, miss structured categorization, and are susceptible to models that exploit superficial lexical features instead of code semantics. To address these limitations, we introduce CLARC (C/C++ LAnguage Retrieval with Anonymized Code), a benchmark of 1,245 C/C++ query-code pairs that is fully compilable, configurable, and extensible. CLARC systematically categorizes snippets into three groups based on dependency complexity, allowing for a nuanced evaluation of retrieval performance under varying levels of code complexity. CLARC also provides configurable settings, including anonymized identifiers and low-level representations, to evaluate model robustness across different levels of code context and abstraction. Evaluation of six state-of-the-art code search methods shows significant performance drops under identifier anonymization, exposing existing modelsโ persistent reliance on superficial cues. Their poor performance on low-level languages such as Assembly and WebAssembly further reveals limited effectiveness beyond high-level programming languages. We also introduce an automated pipeline for scalable benchmark generation, validated through hypothesis tests, enabling the efficient creation of high-quality code search datasets that can be reused by other dataset builders. Our dataset is publicly available at https://huggingface.co/datasets/ClarcTeam/CLARC.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4357. Generalization of RLVR Using Causal Reasoning as a Testbed
๐ฌ 21
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reinforcement Learning with Verifiable Rewards Generalization Causal Reasoning
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query (associational, interventional, or counterfactual) and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR's effectiveness depends on the model's initial reasoning competence. With sufficient initial competence, RLVR improves an LLM's marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4358. MUSE: Model-Agnostic Tabular Watermarking via Multi-Sample Selection
๐ฌ 38
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Watermark Tabular Generative Model
We introduce MUSE, a novel watermarking paradigm for tabular generative models. Existing approaches often exploit DDIM invertibility to watermark tabular diffusion models, but tabular diffusion models suffer from poor invertibility, leading to degraded performance. To overcome this limitation, we leverage the computational efficiency of tabular generative models and propose a multi-sample selection paradigm, where watermarks are embedded by generating multiple candidate samples and selecting one according to a specialized scoring function.
The key advantages of MUSE include (1) Model-agnostic: compatible with any tabular generative model that supports repeated sampling; (2) Flexible: offers flexible designs to navigate the trade-off between generation quality, detectability, and robustness; (3) Calibratable: theoretical analysis provides principled calibration of watermarking strength, ensuring minimal distortion to the original data distribution.
Extensive experiments on five datasets demonstrate that MUSE substantially outperforms existing methods. Notably, it reduces the distortion rates by 84-88% for fidelity metrics compared with the best performing baselines, while achieving 1.0 TPR@0.1%FPR detection rate.
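The selection mechanism is easy to prototype. The sketch below is a minimal illustration of the multi-sample selection idea, not MUSE's actual scoring function: candidate rows are ranked by a keyed hash score and the best one is kept, and detection applies a simple z-test on the scores. `generate_rows` stands in for any tabular generative model.

```python
import hashlib
import numpy as np

def _score(row, key: str) -> float:
    # Map a generated row to a pseudo-random score in [0, 1) keyed by `key`
    # (a stand-in for MUSE's specialized scoring function).
    h = hashlib.sha256((key + "|" + "|".join(map(str, row))).encode()).hexdigest()
    return int(h[:12], 16) / 16 ** 12

def watermark_table(generate_rows, n_rows: int, m: int, key: str):
    # Multi-sample selection: draw m candidate rows, keep the highest-scoring one.
    table = []
    for _ in range(n_rows):
        candidates = generate_rows(m)
        table.append(max(candidates, key=lambda r: _score(r, key)))
    return table

def detect(table, key: str, threshold: float = 3.0) -> bool:
    # Watermarked rows are the max of m uniform scores (mean m/(m+1) > 1/2),
    # so a one-sided z-test against the uniform null flags the watermark.
    scores = np.array([_score(r, key) for r in table])
    z = (scores.mean() - 0.5) / (np.sqrt(1 / 12) / np.sqrt(len(scores)))
    return z > threshold

# usage (toy generator): generate_rows = lambda m: [np.random.randn(5).round(3).tolist() for _ in range(m)]
```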
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4359. Token Distillation: Attention-Aware Input Embeddings for New Tokens
๐ฌ 12
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ embedding initialization tokenizer vocabulary adaptation
Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary.
New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules.
In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens.
Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
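As a rough illustration of the distillation idea (details assumed, not the paper's exact recipe), the sketch below fits a single input embedding for a new token so that a frozen Hugging Face-style causal LM produces hidden states matching those obtained with the original multi-piece tokenization; `contexts` is a small set of illustrative text prefixes.

```python
import torch

def distill_new_token_embedding(model, tokenizer, new_token, contexts,
                                steps=100, lr=1e-2):
    # Hypothetical sketch: learn one input embedding for `new_token` so the frozen
    # model's last hidden state matches the state reached when the same string is
    # spelled out with the original (multi-piece) tokenization.
    device = next(model.parameters()).device
    for p in model.parameters():
        p.requires_grad_(False)
    dim = model.get_input_embeddings().weight.shape[1]
    new_emb = torch.zeros(dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([new_emb], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        for ctx in contexts:
            # Teacher: context followed by the multi-token spelling of the new token.
            t_ids = tokenizer(ctx + new_token, return_tensors="pt").input_ids.to(device)
            with torch.no_grad():
                teacher = model(t_ids, output_hidden_states=True).hidden_states[-1][:, -1]
            # Student: context embeddings followed by the single learnable embedding.
            c_ids = tokenizer(ctx, return_tensors="pt").input_ids.to(device)
            c_emb = model.get_input_embeddings()(c_ids)
            student_in = torch.cat([c_emb, new_emb.view(1, 1, -1)], dim=1)
            student = model(inputs_embeds=student_in,
                            output_hidden_states=True).hidden_states[-1][:, -1]
            torch.nn.functional.mse_loss(student, teacher).backward()
        opt.step()
    return new_emb.detach()
```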
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4360. Free Point-wise Anomaly Detection via Fold-bifurcation
๐ฌ 26
โญ 4.67
๐ learning on time series and dynamical systems
๐ท๏ธ time-series anomaly detection
Anomaly detection in time series is essential for applications from industrial monitoring to financial risk management. Recent methods --- including forecasting error models, representation learning, augmentation, and weak-label learning --- have achieved strong results for specific anomaly types such as sudden point or gradual collective anomalies. While many prior works report window-level metrics that may mask errors, several recent methods evaluate at the point level as well. Our goal is to use a stricter point-wise protocol to make masking effects explicit. We introduce FOLD (Point-wise Anomaly Detection via fold-bifurcation), a framework that reframes detection as tracking a system's proximity to a critical transition. FOLD extracts stress signals from a forecasting model and integrates them with a fold-bifurcation inspired ODE to produce the risk state, flagging anomalies once it crosses a threshold calibrated on normal data. This requires no anomaly labels and no additional detector training, enabling a parameter-free and efficient detection process. By modeling anomalies as stress accumulation toward a tipping point, FOLD naturally aligns with point-wise detection, providing a unifying and interpretable perspective that complements type-specific methods. Experiments on 40 benchmarks against 34 state-of-the-art baselines show that FOLD achieves competitive or superior performance, with particular strength under strict point-wise evaluation.
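To make the tipping-point intuition concrete, here is a toy fold-bifurcation (saddle-node) risk tracker; the specific ODE, constants, and stress signal are illustrative assumptions rather than FOLD's actual formulation.

```python
import numpy as np

def fold_risk_state(stress, dt: float = 1.0, decay: float = 0.1, clip: float = 10.0):
    # Toy saddle-node normal form dx/dt = (s - decay) + x^2: while the stress s
    # stays below `decay` the state sits on a stable branch, but once s exceeds it
    # the fixed points vanish and x escapes toward large values (a tipping point).
    # `stress` is any per-step stress signal, e.g. a normalized forecast error.
    x = -1.0  # start on the stable branch
    risk = np.empty(len(stress))
    for t, s in enumerate(stress):
        x = np.clip(x + dt * ((s - decay) + x ** 2), -clip, clip)
        risk[t] = x
    return risk

def detect_anomalies(stress_train, stress_test, quantile: float = 0.999):
    # Calibrate a threshold on normal data, then flag test points whose risk
    # state exceeds it (point-wise, label-free). Once the toy state escapes it
    # stays latched at `clip`; FOLD's actual dynamics are more refined.
    thr = np.quantile(fold_risk_state(stress_train), quantile)
    return fold_risk_state(stress_test) > thr
```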
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4361. Discrete Variational Autoencoding via Policy Search
๐ฌ 5
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Discrete Variational Autoencoding Reinforcement Learning
Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not allow for exact differentiable parameterization; therefore, discrete VAEs typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimates, or employ high-variance gradient-free methods such as REINFORCE that have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4362. MnemoDyn: Learning Resting State Dynamics from $40$K FMRI sequences
๐ฌ 18
โญ 4.67
๐ applications to neuroscience & cognitive science
๐ท๏ธ Dynamical system Brain Imaging
We present a dynamical-systems based model for resting-state functional magnetic resonance imaging (rs-fMRI), trained on a dataset of roughly $40$K rs-fMRI sequences covering a wide variety of public and available-by-permission datasets. While most existing proposals use transformer backbones, we utilize multi-resolution temporal modeling of the dynamics across parcellated brain regions. We show that MnemoDyn is compute efficient and generalizes very well across diverse populations and scanning protocols. When benchmarked against current state-of-the-art transformer-based approaches, MnemoDyn consistently delivers superior reconstruction quality.
Overall, we find that with such large-scale pre-training on (non-proprietary) rs-fMRI datasets, we get a highly performant model for various downstream tasks. Our results also provide evidence of the efficacy of the model on small sample size studies which has implications for neuroimaging studies at large where resting state fMRI is a commonly acquired imaging modality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4363. MedAraBench: Large-scale Arabic Medical Question Answering Dataset and Benchmark
๐ฌ 19
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ Dataset Benchmark Large Language Models Arabic Natural Language Processing Medical Question Answering
Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4364. Towards Knowledge-and-Data-Driven Organic Reaction Prediction: RAG-Enhanced and Reasoning-Powered Hybrid System with LLMs
๐ฌ 14
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Organic Reaction Prediction Large Language Models Retrieval-Augmented Generation Chain-of-Thought Reasoning
In organic reaction prediction, many recent approaches, ranging from traditional task-specific models to Large Language Models (LLMs), have demonstrated notable success. However, these methods are inherently data-driven, exhibit constrained interpretability, and have hit fundamental performance bottlenecks. To overcome these limitations, we present Reaction-Thinker, a hybrid, knowledge-and-data-driven system that is enhanced by Retrieval-Augmented Generation (RAG) and powered by advanced reasoning, improving both the interpretability of the prediction process and the explainability of results. We develop a similar-case retrieval database and train a RAG-based LLM through supervised fine-tuning (SFT) to apply both reaction types and similar reaction cases as knowledge. We also construct a reaction reasoning chain-of-thought (CoT) dataset and train a reasoning-based LLM through SFT, then further optimize it using Group Relative Policy Optimization (GRPO). Experimental results show that our method outperforms all compared LLMs and task-specific models, achieving the highest accuracy (Exact Match) and fingerprint similarity (FTS). An ablation study indicates improvements in relative accuracy of 7.5% and 13.9% for RAG and GRPO, respectively. Further analysis of mispredictions reveals limitations in conventional evaluation metrics, which motivates our proposed benchmarking refinement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4365. GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
๐ฌ 32
โญ 4.67
๐ optimization
๐ท๏ธ Quantized large language models low-rank error correction group-shared factorization randomized SVD selective restoration low-latency inference
Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as a standard method for deploying large language models but often degrade accuracy when using low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit.
GlowQ computes the high-precision projection once per input-sharing group and reuses it across its modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by \(5.6\%\) and increases throughput by \(9.6\%\) on average, while reducing perplexity on WikiText-2 by \(0.17\%\) and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by \(23.4\%\) and increasing throughput by \(37.4\%\), while maintaining accuracy within 0.2 percentage points on average.
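A minimal sketch of the group-shared factorization idea (the exact GlowQ procedure is not reproduced here, so the randomized SVD and the least-squares left factors are assumptions): modules that share the same input reuse a single right factor `R`, and each keeps a small per-module left factor.

```python
import numpy as np

def group_shared_lowrank(weights, quantized, rank: int = 16, seed: int = 0):
    # Modules in `weights` share the same input, so we factor their stacked
    # quantization errors with one shared right factor R and per-module left
    # factors L_i, i.e. W_i ~= Q(W_i) + L_i @ R.
    errors = [w - q for w, q in zip(weights, quantized)]   # per-module error E_i
    stacked = np.vstack(errors)                            # (sum d_out_i, d_in)
    # Randomized range finder + small SVD for the shared right factor.
    rng = np.random.default_rng(seed)
    sketch = stacked @ rng.standard_normal((stacked.shape[1], 2 * rank))
    q_basis, _ = np.linalg.qr(sketch)
    _, _, vt = np.linalg.svd(q_basis.T @ stacked, full_matrices=False)
    R = vt[:rank]                                          # shared (rank, d_in)
    lefts = [e @ R.T for e in errors]                      # optimal L_i given orthonormal R
    return lefts, R

# Corrected forward for module i on shared input x:
#   y_i = quantized[i] @ x + lefts[i] @ (R @ x)   # R @ x is computed once per group
```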
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4366. Towards Sampling Data Structures for Tensor Products in Turnstile Streams
๐ฌ 7
โญ 4.67
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ data structures sampling turnstile streams lower bound hardness space complexity
This paper studies the computational challenges of large-scale attention-based models in artificial intelligence by introducing innovative sampling methods in the streaming setting. Inspired by the classical definition of the $\ell_2$ sampler and the recent progress of the attention scheme in Large Language Models (LLMs), we propose the definition of the attention sampler. Our approach significantly reduces the computational burden of traditional attention mechanisms. We demonstrate the effectiveness of the attention sampler from a theoretical perspective, including space and update time. Additionally, our framework exhibits scalability and broad applicability across various model architectures and domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4367. Implicit Regularisation in Diffusion Models: An Algorithm-Dependent Generalisation Analysis
๐ฌ 10
โญ 4.67
๐ generative models
๐ท๏ธ diffusion models generalization stability algorithm dependent learning theory
The success of denoising diffusion models raises important questions regarding their generalisation behaviour, particularly in high-dimensional settings. Notably, it has been shown that when training and sampling are performed perfectly, these models memorise training dataโimplying that some form of regularisation is essential for generalisation. Existing theoretical analyses primarily rely on algorithm-independent techniques such as uniform convergence, heavily utilising model structure to obtain generalisation bounds. In this work, we instead leverage the algorithmic aspects that promote generalisation in diffusion models, developing a general theory of algorithm-dependent generalisation for this setting. Borrowing from the framework of algorithmic stability, we introduce the notion of score stability, which quantifies the sensitivity of score-matching algorithms to dataset perturbations. We derive generalisation bounds in terms of score stability, and apply our framework to several fundamental learning settings, identifying sources of regularisation. In particular, we consider denoising score matching with early stopping (denoising regularisation), sampler-wide coarse discretisation (sampler regularisation), and optimising with SGD (optimisation regularisation). By grounding our analysis in algorithmic properties rather than model structure, we identify multiple sources of implicit regularisation unique to diffusion models that have so far been overlooked in the literature.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4368. Learning Dynamics Feature Representation via Policy Attention for Dynamic Path Planning in Urban Road Networks
๐ฌ 9
โญ 4.67
๐ reinforcement learning
๐ท๏ธ Dynamic Path Planning; Reinforcement Learning; State Representation; Dynamics Feature Representation; Policy Attention Mechanism
Dynamic Path Planning (DPP) in urban road networks faces fundamental challenges, as traffic conditions change rapidly over time and often render planned routes ineffective. Reinforcement Learning (RL) provides an effective way to adaptively handle such uncertainties by incorporating traffic dynamics into state, but its performance crucially depends on how these dynamics are represented. Existing approaches either rely on global traffic information, which ensures decision completeness but suffers from redundancy and high computational cost, or oversimplified local features, which are efficient but often omit critical dynamics and lead to suboptimal paths. To address this, we propose a Dynamics Feature Representation (DFR) framework that progressively refines global traffic dynamics into compact features for RL-based DPP. Specifically, we introduce a policy attention mechanism that identifies a core subset of global dynamics by extracting the top-k shortest paths, and further constructs node-related local features by intersecting with n-hop neighborhoods, enabling near-optimal policy learning. Theoretical analysis demonstrates that DFR guarantees state completeness, while empirical results confirm that, compared to classical baselines and standard RL methods, DFR significantly improves path planning performance and accelerates convergence. This work highlights the central role of feature representation in RL-based DPP and proposes a general framework that balances information sufficiency with computational efficiency, paving the way for scalable dynamic decision-making in real-world transportation systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4369. Explaining Grokking and Information Bottleneck through Neural Collapse Emergence
๐ฌ 15
โญ 4.67
๐ optimization
๐ท๏ธ deep learning grokking information bottleneck neural collapse training dynamics
The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning.
Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds.
However, the mechanisms underlying these phenomena and their relations remain poorly understood.
In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations.
We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set.
By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena.
Finally, we validate our theoretical findings on multiple datasets and architectures.
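The within-class variance measure is straightforward to track over training; the snippet below computes one common neural-collapse-style ratio (an assumed instantiation, not necessarily the paper's exact definition) from penultimate-layer features and labels.

```python
import numpy as np

def within_class_variance_ratio(features, labels):
    # Trace of the within-class scatter divided by trace of the total scatter.
    # Values near 0 indicate collapsed, well-separated class representations;
    # tracking this per epoch reveals the late-phase contraction discussed above.
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    total = ((features - global_mean) ** 2).sum()
    within = 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        within += ((fc - fc.mean(axis=0)) ** 2).sum()
    return within / max(total, 1e-12)
```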
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4370. How does the optimizer implicitly bias the model merging loss landscape?
๐ฌ 13
โญ 4.67
๐ optimization
๐ท๏ธ loss landscape mode connectivity model merging optimization implicit bias
Model merging methods combine models with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which linearly interpolates between model weights, and task arithmetic, which combines task vectors obtained by the difference between fine-tuned and base models. While useful in practice, what properties make merging effective remains poorly understood. This paper explores how the optimization process affects the loss landscape geometry and its impact on merging success. We show that a single quantity --- the effective noise scale --- unifies the impact of optimizer and data choices on model merging. Across architectures and datasets, merging success is a non-monotonic function of effective noise, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale, exhibiting the same qualitative trend. Unlike prior work that connects optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its downstream consequences for model merging, suggesting the possibility of further manipulating the training dynamics to improve mergeability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4371. Teach2Eval: An Interaction-Driven LLMs Evaluation Method via Teaching Effectiveness
๐ฌ 10
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ New Evaluation Method Multi-dimensional Evaluation Large Language Models Data Contamination Teach2Eval
Recent progress in large language models (LLMs) has outpaced the development of effective evaluation methods. Evaluating LLMs with static, task-specific benchmarks is increasingly fragile due to contamination and saturation, and it fails to capture interactive reasoning. We introduce Teach2Eval, which reframes evaluation as teaching: a candidate model guides weaker students, and the students' gains constitute the score. This interaction yields robustness to contamination and exposes orthogonal abilities with fine-grained metrics across Application, Judgment, Guidance, and Reflection. The framework scales automatically by exploiting natural error distributions from weak students, requiring neither bespoke rubrics nor human graders. Across 30 LLMs and 60 datasets, Teach2Eval achieves Spearman correlations above 0.95 with human-preference leaderboards (e.g., Chatbot Arena/LiveBench), surpassing direct baselines, while offering actionable training signals (capability hierarchies, early overfitting) at low cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4372. P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
๐ฌ 19
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ personalized alignment generative reward model test-time user-based scaling
Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose **P-GenRM**, the first **P**ersonalized **Gen**erative **R**eward **M**odel with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art performance on widely-used personalized reward model benchmarks, with an average improvement of ~2.31\%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, Test-time User-based scaling provides an additional ~3\% boost, demonstrating stronger personalized alignment with test-time scalability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4373. ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics
๐ฌ 15
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Molecular dynamics neural operator transformer ai for science equivariant
Molecular dynamics (MD) simulations underpin modern computational drug discovery, materials science, and biochemistry. Recent machine learning models provide high-fidelity MD predictions without the need for repeated quantum-mechanical force calculations, enabling significant speedups over conventional pipelines. Yet many such methods typically enforce strict equivariance and rely on sequential rollouts, thus limiting their flexibility and simulation efficiency. They are also commonly single-task, trained on individual molecules and fixed time frames, which restricts generalization to unseen compounds and extended timesteps. To address these issues, we propose Atomistic Transformer Operator for Molecules (ATOM), a pretrained transformer neural operator for multi-task molecular dynamics. ATOM adopts a quasi-equivariant design that does not require an explicit molecular graph and employs a temporal attention mechanism to enable accurate, parallel decoding of multiple future states. To support operator pretraining across chemicals and timescales, we curate TG80, a large, diverse, and numerically stable MD dataset with over 2.5 million femtoseconds of trajectories across 80 compounds. ATOM achieves state-of-the-art performance on established single-task benchmarks, such as MD17, RMD17, and MD22. After multi-task pretraining on TG80, ATOM shows exceptional zero-shot and robust generalization to unseen molecules across varying time horizons. We believe ATOM represents a significant step toward accurate, efficient, and transferable molecular dynamics models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4374. Some Neural Networks Inherently Preserve Subspace Clustering Structure
๐ฌ 7
โญ 4.67
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Clustering Subspace Neural Networks Activation Functions Preserving Structure
It has long been conjectured and empirically observed that neural networks tend to preserve clustering structure. This paper formalizes this conjecture. Specifically, we establish precise conditions for cluster structure preservation and derive bounds to quantify its extent. Through this analysis we are able to show that certain neural networks are learning parameters that preserve the clustering structure of the original data in their embeddings, without the need to impose mechanisms to promote this behavior. Extensive numerical analysis and experiments validate our results. Our findings offer deeper insight into neural network behavior, explaining why certain data types (such as images, audio, and text) benefit more from deep learning. Beyond theory, our findings guide better initialization, feature encoding, and regularization strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4375. PTNET: A PROPOSAL-CENTRIC TRANSFORMER NETWORK FOR 3D OBJECT DETECTION
๐ฌ 12
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Object Detection Point Clouds Two-stage Transformer
3D object detection from LiDAR point cloud data is important for autonomous driving systems. Recent two-stage 3D object detectors struggle to achieve satisfactory performance due to limitations in proposal quality, stemming from the degradation of geometric detail information in the generated proposal features caused by high sparsity and uneven distribution of point clouds, as well as a lack of effective exploitation of surrounding contextual cues in the independent proposal refinement stage. To this end, we propose a Proposal-centric Transformer Network (PTN), which includes a Hierarchical Attentive Feature Alignment (HAFA) module and a Collaborative Proposal Refinement Module (CPRM). More concretely, to obtain multi-granularity proposal representations, HAFA employs a dual-stream architecture that extracts both coarse-grained voxel features and fine-grained point features to enhance proposal features, then harmonizes them through a feature alignment network in a unified space. The CPRM first generates object queries for all objects and then establishes context-aware interactions to extract complementary information from semantically similar and spatially relevant proposals. PTN achieves promising performance on the large-scale Waymo and KITTI benchmarks, demonstrating its superiority.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4376. PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach
๐ฌ 17
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ agentic misalignment dataset and benchmark LLM safety
Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous capabilities, posing frontier risks to society. Current safety evaluations primarily test for what a model *can* do---its capabilities---without assessing what it *would* do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that **propensity**---the likelihood of a model to pursue harmful actions if empowered---is a critical, yet underexplored, axis of safety evaluation. We present **PropensityBench**, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: self-proliferation, cybersecurity, biosecurity and chemical security.
We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy.
Across open-source and proprietary frontier models, we uncover alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided.
These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4377. Token-Efficient Long-Term Interest Sketching and Internalized Reasoning for LLM-based Recommendation
๐ฌ 12
โญ 4.67
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LLM-based Recommendation Rating Prediction Efficient Reasoning
Large language models (LLMs) can solve complex real-world tasks when prompted to generate chain-of-thought (CoT) reasoning, motivating their use for preference reasoning in recommender systems. However, applying LLM reasoning to recommendation faces two practical challenges. First, LLMs struggle to reason over long, noisy user histories that often span hundreds of items, while truncation discards signals needed to capture long-term interests. Second, in decoder-only architectures, CoT requires generating rationale tokens autoregressively, leading to prohibitive inference latency for real-world deployment. To address these challenges, we propose SIREN, a framework that enables effective LLM-based rating prediction via long-term interest sketching and internalized reasoning. First, instead of prompting raw histories, we build a compact, token-bounded interest sketch that preserves persistent preferences and suppresses noise. Specifically, we encode and cluster item descriptions to discover semantic topics, then compress each user's history into a short list of liked and disliked topics, facilitating LLM reasoning.
Second, we develop an internalized reasoning strategy for efficient inference. We adopt a two-stage training paradigm: (i) train the LLM to reason explicitly for rating prediction with rule-based reinforcement learning, since ground-truth CoTs are unavailable in recommendation; and (ii) learn to internalize CoT into model parameters through hidden alignment. At inference, the LLM directly generates the rating with near-CoT quality.
Extensive experiments show that SIREN reduces average input tokens by $48.7\%$ compared to raw-history prompting, outperforms existing methods while delivering over $100\times$ lower inference latency than CoT-based LLM recommenders. Code and data are available at https://anonymous.4open.science/r/LLM4Rec-C7CF.
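The interest-sketching step can be illustrated with a few lines of clustering code; the snippet below is an assumed instantiation (KMeans over item embeddings and a 5-point rating threshold), not SIREN's actual pipeline, and the resulting topic ids would still need to be verbalized (e.g., via representative item titles) before prompting an LLM.

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def build_interest_sketch(history, item_embeddings, n_topics: int = 32,
                          top_k: int = 5, like_threshold: float = 4.0):
    # `history` is a list of (item_id, rating); `item_embeddings` maps id -> vector.
    # Cluster item embeddings into semantic topics, then keep only the user's most
    # frequent liked and disliked topics instead of the raw, noisy history.
    ids = list(item_embeddings)
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
    topic_of = dict(zip(ids, km.fit_predict(np.stack([item_embeddings[i] for i in ids]))))
    liked, disliked = Counter(), Counter()
    for item_id, rating in history:
        (liked if rating >= like_threshold else disliked)[topic_of[item_id]] += 1
    return {
        "liked_topics": [t for t, _ in liked.most_common(top_k)],
        "disliked_topics": [t for t, _ in disliked.most_common(top_k)],
    }
```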
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4378. Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
๐ฌ 12
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Clinical Decision Making Large Language Models
Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process, however, most applications of LLMs in clinical decision support suffer from one of two limitations: Either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited "out-of-the-box" capabilities of large pre-trained models without performing task-specific training. In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases containing various clinical tests and show the benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency. We will release our code upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4379. Unlocking the Potential of Weighting Methods in Federated Learning Through Communication Compression
๐ฌ 18
โญ 4.67
๐ optimization
๐ท๏ธ Convex optimization Compression Stochastic optimization
Modern machine learning problems are frequently formulated in federated learning domain and incorporate inherently heterogeneous data. Weighting methods operate efficiently in terms of iteration complexity and represent a common direction in this setting. At the same time, they do not address directly the main obstacle in federated and distributed learning -- communication bottleneck. We tackle this issue by incorporating compression into the weighting scheme. We establish the convergence under a convexity assumption, considering both exact and stochastic oracles. Finally, we evaluate the practical performance of the proposed method on real-world problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4380. SwiftTS: A Swift Selection Framework for Time Series Pre-trained Models via Multi-task Meta-Learning
๐ฌ 15
โญ 4.67
๐ learning on time series and dynamical systems
๐ท๏ธ time series forecasting model selection transfer learning
Pre-trained models exhibit strong generalization to various downstream tasks. However, given the numerous models available in the model hub, identifying the most suitable one by individually fine-tuning each candidate is time-consuming. In this paper, we propose \textbf{SwiftTS}, a swift selection framework for time series pre-trained models. To avoid expensive forward propagation through all candidates, SwiftTS adopts a learning-guided approach that leverages historical dataset-model performance pairs across diverse horizons to predict model performance on unseen datasets. It employs a lightweight dual-encoder architecture that embeds time series and candidate models with rich characteristics, computing patchwise compatibility scores between data and model embeddings for efficient selection. To further enhance generalization across datasets and horizons, we introduce a horizon-adaptive expert composition module that dynamically adjusts expert weights, and transferable cross-task learning with cross-dataset and cross-horizon task sampling to enhance out-of-distribution (OOD) robustness. Extensive experiments on 14 downstream datasets and 8 pre-trained models demonstrate that SwiftTS achieves state-of-the-art performance in time series pre-trained model selection. The code and datasets are available at https://anonymous.4open.science/r/SwiftTS-395C.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4381. Can Speech LLMs Think while Listening?
๐ฌ 22
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ SpeechLLM Chain-of-Thought
Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
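The abstract does not spell out the metric, so the snippet below is only one plausible instantiation of an entropy-based "question completeness" signal, using a Hugging Face-style causal LM over the partial transcript; the trigger threshold would be calibrated offline against the accuracy-latency trade-off.

```python
import torch

@torch.no_grad()
def next_token_entropy(model, tokenizer, partial_query: str) -> float:
    # Entropy of the next-token distribution given the partial user query.
    # Lower entropy is taken here as a proxy for a more complete question
    # (an assumption, not the paper's exact definition).
    ids = tokenizer(partial_query, return_tensors="pt").input_ids
    probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())

def should_start_reasoning(entropies, threshold: float) -> bool:
    # Trigger "thinking while listening" once the completeness signal
    # drops below an offline-calibrated threshold.
    return len(entropies) > 0 and entropies[-1] < threshold
```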
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4382. On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
๐ฌ 10
โญ 4.67
๐ reinforcement learning
๐ท๏ธ LLM reasoning reinforcement learning
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface---the choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$)---is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (\textbf{RPG}) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a truncated-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. Notably, RPG is a \emph{stable and scalable} RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) truncated importance sampling, and (c) an iterative reference-policy update scheme.
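For reference, the three per-token KL estimators the abstract refers to have standard forms, shown below for $\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ evaluated on tokens sampled from $\pi$; how RPG weights them in the off-policy setting is the paper's contribution and is not reproduced here.

```python
import torch

def kl_estimators(logp_policy: torch.Tensor, logp_ref: torch.Tensor):
    # Per-token estimators of KL(pi || pi_ref) on tokens sampled from pi,
    # with log-ratio log r = logp_ref - logp_policy:
    #   k1 = -log r          (unbiased, high variance, can be negative)
    #   k2 = (log r)^2 / 2   (biased, low variance)
    #   k3 = r - 1 - log r   (unbiased and pointwise non-negative; the penalty
    #                         commonly used in GRPO-style objectives)
    log_r = logp_ref - logp_policy
    k1 = -log_r
    k2 = 0.5 * log_r ** 2
    k3 = torch.exp(log_r) - 1.0 - log_r
    return k1, k2, k3
```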
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4383. Differentiable Simulation of Hard Contacts with Soft Gradients for Learning and Control
๐ฌ 11
โญ 4.67
๐ applications to robotics, autonomy, planning
๐ท๏ธ differentiable simulation contact forces musculoskeletal control
Contact forces introduce discontinuities into robot dynamics that severely limit the use of simulators for gradient-based optimization. Penalty-based simulators, such as MuJoCo, soften contact resolution to enable gradient computation. However, realistically simulating hard contacts requires stiff solver settings, which leads to incorrect simulator gradients when using automatic differentiation. Conversely, using non-stiff settings strongly increases the sim-to-real gap. We analyze penalty-based simulators to pinpoint why gradients degrade under hard contacts. Building on these insights, we propose DiffMJX, which couples adaptive time integration with penalty-based simulation to substantially improve gradient accuracy. A second challenge is that contact gradients vanish when bodies separate. To address this, we introduce contacts from distance (CFD), which combines penalty-based simulation with straight-through estimation. By applying CFD exclusively in the backward pass, we obtain informative pre-contact gradients while retaining physical realism.
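The "contacts from distance" idea can be sketched as a straight-through autograd function: the forward pass keeps the hard penalty force, while the backward pass differentiates a soft surrogate that is already active within a margin before contact. The constants and the linear surrogate below are assumptions for illustration, not DiffMJX's actual contact model.

```python
import torch

STIFFNESS, MARGIN = 1.0e4, 0.05  # assumed contact parameters for this sketch

class ContactFromDistance(torch.autograd.Function):
    # Straight-through-style contact force: hard penalty in the forward pass,
    # soft "acts within a margin" surrogate only in the backward pass.

    @staticmethod
    def forward(ctx, depth):
        # depth > 0 means the bodies penetrate; the hard force is zero otherwise.
        ctx.save_for_backward(depth)
        return STIFFNESS * torch.clamp(depth, min=0.0)

    @staticmethod
    def backward(ctx, grad_out):
        (depth,) = ctx.saved_tensors
        # The surrogate force already ramps up MARGIN before contact, so separated
        # bodies still receive a non-zero, informative gradient.
        return grad_out * STIFFNESS * (depth > -MARGIN).float()

# usage: force = ContactFromDistance.apply(signed_penetration_depth)
```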
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4384. Toward Complex-Valued Neural Networks for Waveform Generation
๐ฌ 12
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ waveform generation complex-valued neural networks iSTFT-based vocoder generative adversarial network
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding redundant, computationally expensive upsampling. However, current approaches use real-valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex-valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback directly in the complex domain. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values and regularizes the training process. Finally, we propose a block-matrix computation scheme to improve training efficiency by reducing redundant operations. Experiments demonstrate that ComVo achieves higher synthesis quality than comparable real-valued baselines, and that its block-matrix scheme reduces training time by 25%. Audio samples and code are available at https://anonymous7136.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4385. QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture
๐ฌ 16
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ human kinematics quaternion motion vision
Vision-based 3D human motion capture from videos remains a challenge in computer vision.
Traditional 3D pose estimation approaches often ignore the temporal consistency between frames, causing implausible and jittery motion.
The emerging field of kinematics-based 3D motion capture addresses these issues by estimating the temporal transitioning between poses instead.
A major drawback in current kinematics approaches is their reliance on Euler angles.
Despite their simplicity, Euler angles suffer from discontinuity that leads to unstable motion reconstructions, especially in online settings where trajectory refinement is unavailable.
Contrarily, quaternions have no discontinuity and can produce continuous transitions between poses.
In this paper, we propose QuaMo, a novel Quaternion Motions method using quaternion differential equations (QDE) for human kinematics capture.
We utilize the state-space model, an effective system for describing real-time kinematics estimations, with quaternion state and the QDE describing quaternion velocity.
The corresponding angular accelerations are computed from a meta-PD controller with a novel acceleration enhancement that adaptively regulates the control signals as the human quickly changes to a new pose.
Unlike previous work, our QDE is solved under quaternion geometric constraints, which results in more accurate estimations.
Experimental results show that our novel formulation of the QDE with acceleration enhancement accurately estimates 3D human kinematics with no discontinuity and minimal implausible artifacts.
QuaMo outperforms comparable state-of-the-art methods on multiple datasets, namely Human3.6M, Fit3D, SportsPose and a subset of AIST.
The code is made available upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4386. Circuit Insights: Towards Interpretability Beyond Activations
๐ฌ 14
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability automated interpretability explainable AI transcoders large language models circuits
The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4387. T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models
๐ฌ 10
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model math reasoning tool use test-time scaling small language model
Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs).
However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving verification by sLMs underexplored.
In this work, we investigate whether sLMs can reliably verify the output candidates under test-time scaling.
We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking.
To address this limitation, we propose Tool-integrated verification (T1), a two-stage framework that first filters candidates with external tools and then uses an sLM for final verification, offloading memorization-heavy steps to tools such as a code interpreter.
Within T1 we prove that offloading to external tools reduces the memorization burden on sLMs and improves test-time scaling performance.
Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model.
Moreover, T1 improves the verification accuracy of both process reward models (PRMs) and critic models.
Our findings highlight the potential of tool integration to substantially improve the verification abilities of sLMs.
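A minimal sketch of the two-stage pipeline described above; the interfaces `tool_check` and `slm_score` are assumed placeholders for a code-interpreter style check and a small-LM verifier, respectively.

```python
def t1_verify(question, candidates, tool_check, slm_score, top_k: int = 4):
    # Stage 1: a cheap external tool (e.g. re-running the candidate's arithmetic
    # in a code interpreter) filters out candidates that fail deterministic checks.
    survivors = [c for c in candidates if tool_check(question, c)]
    if not survivors:
        # Fall back to all candidates if the tool rejects everything.
        survivors = candidates
    # Stage 2: the small LM verifier scores only the survivors.
    scored = sorted(survivors, key=lambda c: slm_score(question, c), reverse=True)
    return scored[0], scored[:top_k]
```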
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4388. P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
๐ฌ 34
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ MLLMs
Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target two critical causes of visual hallucination: the perceptual bottleneck in attended regions and insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are constructed by directly editing textual outputs without visual signals, and their off-policy nature limits effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm where the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pair construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P$^2$-DPO outperforms even state-of-the-art methods that rely on costly human feedback on benchmarks such as POPE and MMHal-Bench. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing perceptual bottlenecks in attended regions and improving Visual Robustness against degraded inputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4389. Traceable Black-Box Watermarks For Federated Learning
๐ฌ 15
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Learning Watermark Black-box watermark Intellectual property protection
Due to the distributed nature of Federated Learning (FL) systems, each local client has access to the global model, posing a critical risk of model leakage. Existing works have explored injecting watermarks into local models to enable intellectual property protection. However, these methods either focus on non-traceable watermarks or traceable but white-box watermarks. We identify a gap in the literature regarding the formal definition of traceable black-box watermarking and the formulation of the problem of injecting such watermarks into FL systems. In this work, we first formalize the problem of injecting traceable black-box watermarks into FL. Based on the problem, we propose a novel server-side watermarking method, $\mathbf{TraMark}$, which creates a traceable watermarked model for each client, enabling verification of model leakage in black-box settings. To achieve this, $\mathbf{TraMark}$ partitions the model parameter space into two distinct regions: the main task region and the watermarking region. Subsequently, a personalized global model is constructed for each client by aggregating only the main task region while preserving the watermarking region. Each model then learns a unique watermark exclusively within the watermarking region using a distinct watermark dataset before being sent back to the local client. Extensive results across various FL systems demonstrate that $\mathbf{TraMark}$ ensures the traceability of all watermarked models while preserving their main task performance. The code is available at https://anonymous.4open.science/r/TraMark.
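The server-side aggregation step can be sketched as a masked average over client state dicts; the mask layout and the plain mean below are assumptions for illustration, and the subsequent per-client watermark fine-tuning on its distinct trigger set is omitted.

```python
import torch

def tramark_aggregate(client_models, watermark_masks):
    # `client_models`: list of state dicts with identical structure.
    # `watermark_masks[name]`: tensor with 1 inside the watermarking region.
    # For each client, average only the main-task region across all clients and
    # keep that client's own parameters in its watermarking region, so every
    # personalized global model retains a distinct, traceable watermark.
    names = list(client_models[0].keys())
    personalized = []
    for i in range(len(client_models)):
        merged = {}
        for name in names:
            avg = torch.stack([m[name].float() for m in client_models]).mean(dim=0)
            mask = watermark_masks[name]
            merged[name] = mask * client_models[i][name].float() + (1 - mask) * avg
        personalized.append(merged)
    return personalized
```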
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4390. Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs
๐ฌ 8
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ adversarial robustness LLMs machine learning distillation jailbreaks classifier guarded system adversarial attacks safety
Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through \textit{elicitation attacks}. Our elicitation attacks consist of three stages:
(i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models;
(iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40\% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4391. IceCache: Memory-Efficient KV-cache Management for Long-Sequence LLMs
๐ฌ 18
โญ 4.67
๐ generative models
๐ท๏ธ LLM Inference; KV-cache Optimization; Sparse Attention
Key-Value (KV) cache plays a pivotal role in accelerating inference in large language models (LLMs) by storing intermediate attention outputs, thereby avoiding redundant computation during auto-regressive generation. However, the cache's memory footprint scales linearly with sequence length, often resulting in memory bottlenecks on constrained hardware. While prior work has explored offloading KV-cache to the CPU and maintaining a reduced subset on the GPU, these approaches frequently suffer from imprecise token prioritization and degraded performance in long-generation tasks such as multi-turn dialogues and chain-of-thought reasoning. In this paper, we propose a novel KV-cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention, a memory-efficient paging mechanism. By clustering semantically related tokens and organizing them into a hierarchical, dynamically updateable structure, our method improves cache hit rates and memory bandwidth utilization during CPU-GPU transfers. Experimental results show that IceCache achieves over 99\% accuracy with a 256-token budget and still maintains 97\% accuracy with only a 64-token budget, compared to the full KV-cache model. It outperforms existing baselines even while using just 25\% of the KV-cache token budget, demonstrating its superior accuracy in long-sequence scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4392. Not All Documents Are What You Need for Extracting Instruction Tuning Data
๐ฌ 14
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ data extraction data efficient instruction tuning
Instruction tuning improves LLM performance but depends on high-quality training data. Recently, LLMs have been used to synthesize data, enhancing training with seeds like question-answer (QA) pairs. However, this synthesis often results in instruction examples similar to the seeds, lacking diversity and biasing real applications. Thus, we propose to extract instruction tuning data from web corpora, which contain much richer knowledge. The most straightforward strategy is to quickly retrieve domain-specific documents from the corpus and then extract all QA pairs from these documents for tuning LLMs, which has two main limitations. (1) Extracting all QA pairs using LLMs is prohibitively expensive; and (2) These extracted pairs are not all beneficial for downstream applications, and incorporating all of them for tuning may even hurt model performance. To overcome the limitations, we introduce $\texttt{EQUAL}$, an $\textbf{E}$ffective and scalable data extraction framework that iteratively interleaves document selection and extraction of high-$\textbf{QUAL}$ity QA pairs to optimize instruction tuning. $\texttt{EQUAL}$ first clusters the document set based on the embeddings generated by contrastive learning. Then it leverages a multi-armed bandit based strategy to quickly identify document clusters from which high-quality QA pairs can be extracted for training. This iterative framework significantly reduces computational costs while substantially improving model performance. Experiments on AutoMathText, KnowledgePile and StackOverflow across 13 downstream tasks demonstrate that $\texttt{EQUAL}$ reduces computational costs by 5-10$\times$ while improving accuracy by 2.5\% on LLaMA-3.1-8B, Qwen2.5-7B and Mistral-7B. Code and data are available at https://anonymous.4open.science/r/EQUAL-DD20.
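The bandit-driven selection loop can be sketched with a UCB1-style rule over document clusters; the extraction and quality-scoring interface (`extract_and_score`) is an assumed placeholder for the LLM-based QA extraction and filtering described above.

```python
import math
import random

def equal_select(clusters, extract_and_score, rounds: int = 100, c: float = 1.0):
    # `clusters`: dict cluster_id -> list of documents.
    # Each arm is a cluster; its reward is the observed fraction of high-quality
    # QA pairs extracted from a sampled document, so the extraction budget
    # concentrates on promising clusters instead of the whole corpus.
    counts = {k: 0 for k in clusters}
    values = {k: 0.0 for k in clusters}
    collected = []
    for t in range(1, rounds + 1):
        untried = [k for k in clusters if counts[k] == 0]
        arm = untried[0] if untried else max(
            clusters, key=lambda k: values[k] + c * math.sqrt(math.log(t) / counts[k]))
        doc = random.choice(clusters[arm])
        qa_pairs, quality = extract_and_score(doc)  # e.g. LLM extraction + quality filter
        collected.extend(qa_pairs)
        counts[arm] += 1
        values[arm] += (quality - values[arm]) / counts[arm]  # running mean reward
    return collected
```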
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4393. Trade in Minutes! Rationality-Driven Agentic System for Quantitative Financial Trading
๐ฌ 17
โญ 4.67
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Multi-Agent System Algorithmic Trading Mathematical Reflection Large Language Models
Recent advancements in large language models (LLMs) and agentic systems have shown exceptional decision-making capabilities, revealing significant potential for autonomic finance. Current financial trading agents predominantly simulate anthropomorphic roles that inadvertently introduce emotional biases and rely on peripheral information, while being constrained by the necessity for continuous inference during deployment. In this paper, we pioneer the harmonization of strategic depth in agents with the mechanical rationality essential for quantitative trading. Consequently, we present TiMi (Trade in Minutes), a rationality-driven multi-agent system that architecturally decouples strategy development from minute-level deployment. TiMi leverages specialized LLM capabilities of semantic analysis, code programming, and mathematical reasoning within a comprehensive policy-optimization-deployment chain. Specifically, we propose a two-tier analytical paradigm from macro patterns to micro customization, layered programming design for trading bot implementation, and closed-loop optimization driven by mathematical reflection. Extensive evaluations across 200+ trading pairs in stock and cryptocurrency markets empirically validate the efficacy of TiMi in stable profitability, action efficiency, and risk control under volatile market dynamics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4394. MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector
๐ฌ 20
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ learned sparse retrieval multilingual retrieval cross-lingual retrieval neural lexical search
Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions, while achieving 3$\times$ lower retrieval latency and 10$\times$ smaller index size.
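Mass-based pruning of a sparse lexical vector can be illustrated generically (a sketch under assumed shapes, not MILCO's code): keep the largest weights until a target fraction of the total mass is covered, capped at a maximum number of active dimensions.

```python
# Sketch of mass-based pruning for a learned sparse representation.
import numpy as np

def mass_prune(weights, mass=0.9, max_dims=30):
    order = np.argsort(-weights)
    cum = np.cumsum(weights[order]) / weights.sum()
    k = min(int(np.searchsorted(cum, mass)) + 1, max_dims)  # dims needed to reach the mass target
    pruned = np.zeros_like(weights)
    pruned[order[:k]] = weights[order[:k]]
    return pruned

# Toy vocab-sized sparse vector (~1% non-zero entries).
vec = np.abs(np.random.randn(30522)) * (np.random.rand(30522) < 0.01)
print(int((mass_prune(vec) > 0).sum()))   # at most 30 active dimensions remain
```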
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4395. Measuring Bias Amplification in Multi-Agent Systems with Large Language Models
๐ฌ 3
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Multi-Agent System Bias Evaluation
Recent advances in large language models (LLMs) have led to significant progress in mitigating social biases at the individual model level. However, a core vulnerability persists: small, stochastic biases can be amplified through multi-step interaction, leading to skewed system-level outcomes. A promising, yet unverified, hypothesis is that the architectural diversity of multi-agent systems (MAS), where LLM-based agents with different roles and perspectives interact, could naturally mitigate this amplification. In this work, we rigorously test this hypothesis and investigate the phenomenon of bias amplification in MAS across sensitive attributes, including gender, age, and race. We introduce Discrim-Eval-Open, an open-ended, multi-option benchmark designed to measure system-level bias and bypass the performative neutrality of modern LLMs. We further propose novel metrics, including an adaptation of the Gini coefficient, to quantify the extremity of system-wide outputs. Our experiments reveal that iterative bias amplification is a pervasive issue that is not solved by MAS architectures. This amplification persists across various configurations, spanning agent roles, communication topologies, iteration depths, and model types, even when individual agents exhibit minimal bias in isolation. Moreover, we observe a systemic tendency to favor younger age groups, females, and Black communities. Finally, we demonstrate that even the inclusion of objective, neutral inputs can exacerbate bias amplification, exposing a critical vulnerability in system-level robustness. These findings challenge the assumption that architectural complexity alone fosters equity, underscoring the urgent need to address the fundamental dynamics of bias amplification within LLM-based MAS.
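The adapted Gini metric is not specified here; as a plain illustration of the underlying idea (not the paper's exact metric), a standard Gini coefficient over per-group outcome counts can be computed as:

```python
# Standard Gini coefficient over per-group outcome shares; the paper adapts this
# idea to quantify how extreme system-wide outputs are across demographic groups.
import numpy as np

def gini(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

print(round(gini([1, 1, 1, 1]), 3))   # 0.0  -> perfectly even outcomes
print(round(gini([0, 0, 0, 4]), 3))   # 0.75 -> highly skewed outcomes
```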
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4396. Grouping Nodes with known Value Differences: A lossless UCT-based Abstraction Algorithm
๐ฌ 7
โญ 4.67
๐ optimization
๐ท๏ธ Artificial intelligence Sequential decision-making Abstractions MCTS
A core challenge of Monte Carlo Tree Search (MCTS) is its sample efficiency, which can be addressed by building and using state and/or state-action pair abstractions in parallel to the tree search, such that information can be shared among nodes of the same layer. On the Go Abstractions in Upper Confidence bounds applied to Trees (OGA-UCT) is the state-of-the-art MCTS abstraction algorithm for deterministic environments that builds its abstraction using the Abstractions of State-Action Pairs (ASAP) framework, which aims to detect states and state-action pairs with the same value under optimal play by analysing the search graph. ASAP, however, requires two state-action pairs to have the same immediate reward, which is a rigid condition that limits the number of abstractions that can be found and thereby the sample efficiency. In this paper, we break with the paradigm of grouping value-equivalent states or state-action pairs and instead group states and state-action pairs with possibly different values as long as the difference between their values can be inferred. We call this abstraction framework Known Value Difference Abstractions (KVDA), which infers the value differences by analysis of the immediate rewards and modifies OGA-UCT to use this framework instead. The modification is called KVDA-UCT, which detects significantly more abstractions than OGA-UCT, introduces no additional parameter, and outperforms OGA-UCT on a variety of deterministic environments and parameter settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4397. CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation
๐ฌ 4
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Consistent Subject Generation; Text-to-Image; Diffusion
Subject-consistent generation (SCG), aiming to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided in the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4398. Divid: Disentangled Spatial-Temporal Modeling within LLMs for Temporally Grounded Video Understanding
๐ฌ 27
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video understanding; Temporal grounding; VideoLLM
Recent advances in Video LLMs have improved video understanding performance, but temporally grounded understanding in long-form videos remains challenging. Most models encode video frames into a flat sequence of visual tokens, which are then processed together with textual input by the LLM. While effective for short videos, this approach becomes inefficient for long-form videos due to lengthy token sequences that exceed context limits and incur high computational costs. Slow-Fast architectures partially address this by separating temporal and spatial features during encoding, but these features are still processed jointly within the LLM, lacking true spatio-temporal disentanglement. Moreover, spatial features are typically sampled in a query-agnostic manner, risking the loss of task-relevant content. To address these limitations, we propose Divid, a novel dual-branch framework that explicitly disentangles spatial and temporal modeling within the LLM decoder. Specifically, the temporal branch processes densely sampled, low-resolution frames to effectively capture long-range motion dynamics, while the spatial branch selects a sparse set of high-resolution keyframes guided by temporal attention. To unify the two branches, we design a lightweight spatio-temporal soft-router that adaptively fuses temporal and spatial cues at the token level, conditioned on the input query. This disentangled architecture not only improves temporal alignment accuracy but also leads to computational savings by minimizing redundant visual processing. Furthermore, we introduce TempGCap, a large-scale dataset consisting of 559K timestamp-grounded video-text pairs, providing rich temporal supervision. Extensive experiments on temporal grounding and grounded videoQA benchmarks demonstrate the superior performance and efficiency of our proposed Divid.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4399. Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
๐ฌ 17
โญ 4.67
๐ generative models
๐ท๏ธ Diffusion DiT Image Editing Video Editing Color Editing
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free approaches provide broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility. Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4400. Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
๐ฌ 16
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Turn-Level Reward Search Agent Agentic RL
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. These intrinsic rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines, achieving higher accuracy and improved sample efficiency.
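A minimal sketch of the turn-level reward described above (the probability callback is a hypothetical interface, not the IGPO implementation):

```python
# Sketch of information-gain turn rewards: the reward for turn t is the marginal
# increase in the policy's probability of producing the ground-truth answer.
from typing import Callable, List

def turn_rewards(answer_prob: Callable[[int], float], num_turns: int) -> List[float]:
    """answer_prob(t) = policy's probability of the correct answer given the
    trajectory prefix up to turn t (hypothetical callback)."""
    probs = [answer_prob(t) for t in range(num_turns + 1)]  # turn 0 = no interaction yet
    return [probs[t + 1] - probs[t] for t in range(num_turns)]

# Toy example: the probability of the gold answer rises as evidence is retrieved.
trace = [0.05, 0.10, 0.40, 0.65]
print(turn_rewards(lambda t: trace[t], num_turns=3))   # [0.05, 0.30, 0.25]
```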
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4401. Reinforced Latent Reasoning for LLM-based Recommendation
๐ฌ 13
โญ 4.67
๐ learning on time series and dynamical systems
๐ท๏ธ Latent reasoning Recommendation
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as a small set of latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose *Reinforced Latent Reasoning for Recommendation* (LatentR$^3$), a novel end-to-end training framework that leverages reinforcement learning (RL) to optimize latent reasoning without relying on any CoT data. LatentR$^3$ adopts a two-stage training strategy: first, supervised fine-tuning to initialize the latent reasoning module, followed by pure RL training to encourage exploration through a rule-based reward design. Our RL implementation is based on a modified GRPO algorithm, which reduces computational overhead during training and introduces continuous reward signals for more efficient learning. Extensive experiments demonstrate that LatentR$^3$ enables effective latent reasoning without any direct supervision of the reasoning process, significantly improving performance when integrated with different LLM-based recommendation methods. Our codes are available at https://anonymous.4open.science/r/R3-A278/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4402. Memorizing Long-tail Data Can Help Generalization Through Composition
๐ฌ 8
โญ 4.67
๐ learning theory
๐ท๏ธ memorization composition long-tail data
Deep learning has led researchers to rethink the relationship between memorization and generalization. In many settings, memorization does not hurt generalization due to implicit regularization and may help by memorizing long-tailed examples. In this paper, we consider the synergy between memorization and simple composition, i.e., the ability to make correct predictions on combinations of long-tailed features. Theoretically, we show that for a linear setting, memorization together with composition can help the model make correct predictions on rare test examples that require a combination of long-tailed features, even if such combinations were never observed in the training data. Experiments with neural network architectures on simple data show that the theoretical insight extends beyond the linear setting, and we further observe that the composition capability of the model depends on its architecture.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4403. Can Large Language Models Match the Conclusions of Systematic Reviews?
๐ฌ 11
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ Benchmarks Multi-document Reasoning Medical AI
Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, medical specialist, and models across varying sizes (from 7B to 700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4404. GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent System
๐ฌ 16
โญ 4.67
๐ reinforcement learning
๐ท๏ธ Multi-Agent Systems Partial Observability Diffusion Models
In the realm of multi-agent systems, the challenge of partial observability is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose GlobeDiff, a Global State Diffusion algorithm that infers the global state based on local observations. By formulating the state inference process as a multi-modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi-modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4405. TableMaster: A Recipe to Advance Table Understanding with Language Models
๐ฌ 12
โญ 4.67
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Table Understanding Table Reasoning Large Language Model Natural Language Processing
Tables serve as a fundamental format for representing structured relational data. While current language models (LMs) excel at many text-based tasks, they still face challenges in table understanding due to the complex characteristics of tabular data, such as their structured nature. In this paper, we aim to enhance LMs for improved table understanding. We identify four key challenges: 1) difficulty in locating target data, 2) deficiency in table semantics, 3) numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in symbolic reasoning. To address these issues, we propose TableMaster, a recipe and comprehensive framework that integrates multiple solutions to overcome these obstacles. TableMaster first extracts relevant table content and verbalizes it with enriched semantic context. Additionally, we introduce adaptive reasoning, a flexible approach that dynamically adjusts between textual and symbolic reasoning, tailoring the reasoning process to each query. Extensive analyses and experiments demonstrate our findings and the effectiveness of TableMaster. On the WikiTQ dataset, TableMaster achieves an accuracy of 78.13% using GPT-4o-mini, surpassing existing baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4406. A Biologically Plausible Dense Associative Memory with Exponential Capacity
๐ฌ 17
โญ 4.67
๐ applications to neuroscience & cognitive science
๐ท๏ธ Dense Associative Memory Hopfield Network
Krotov and Hopfield (2021) proposed a biologically plausible two-layer associative memory network with memory storage capacity exponential in the number of visible neurons. However, the capacity was only linear in the number of hidden neurons. This limitation arose from the choice of nonlinearity between the visible and hidden units, which enforced winner-takes-all dynamics in the hidden layer, thereby restricting each hidden unit to encode only a single memory. We overcome this limitation by introducing a novel associative memory network with a threshold nonlinearity that enables distributed representations. In contrast to winner-takes-all dynamics, where each hidden neuron is tied to an entire memory, our network allows hidden neurons to encode basic components shared across many memories. Consequently, complex patterns are represented through combinations of hidden neurons. These representations reduce redundancy and allow many correlated memories to be stored compositionally. Thus, we achieve much higher capacity: exponential in the number of hidden units, provided the number of visible units is sufficiently larger than the number of hidden neurons. Exponential capacity arises because all binary states of the hidden units can become stable memory patterns with an appropriately chosen threshold. Moreover, the distributed hidden representation, which has much lower dimensionality than the visible layer, preserves class-discriminative structure, supporting efficient nonlinear decoding. These results establish a new regime for associative memory, enabling high-capacity, robust, and scalable architectures consistent with biological constraints.
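A toy numerical sketch of the distributed-code idea (illustrative only; the threshold value and update rule here are assumptions, not the paper's exact dynamics):

```python
# Two-layer associative memory sketch: hidden units fire whenever their input
# exceeds a threshold (distributed code), rather than via a winner-takes-all rule.
import numpy as np

rng = np.random.default_rng(0)
W = rng.choice([-1.0, 1.0], size=(16, 256))      # hidden-to-visible weights (shared components)

def recall(v, theta=40.0, steps=5):
    for _ in range(steps):
        h = (W @ v > theta).astype(float)        # threshold nonlinearity: several units may be active
        v = np.sign(W.T @ h + 1e-9)              # visible update from the active components
    return v

hidden_code = (rng.random(16) < 0.3).astype(float)          # a distributed hidden pattern
pattern = np.sign(W.T @ hidden_code + 1e-9)                 # the stored visible memory
noisy = pattern.copy()
noisy[rng.choice(256, 20, replace=False)] *= -1              # corrupt 20 bits
print("overlap after recall:", np.mean(recall(noisy) == pattern))
```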
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4407. A Bayesian Nonparametric Framework for Private, Fair, and Balanced Tabular Data Synthesis
๐ฌ 16
โญ 4.67
๐ generative models
๐ท๏ธ Bayesian nonparametric Dirichlet process Differential privacy Tabular data generation
A fundamental challenge in data synthesis is protecting the fairness and privacy of the individual, particularly in data-scarce environments where underrepresented groups are at risk of further marginalization by reproducing the biases inherent in the data modeling process. We introduce a privacy- and fairness-aware framework for a class of generative models, which fuses the conditional generator within the framework of Bayesian nonparametric learning (BNPL). This conditional structure imposes fairness constraints in our generative model by minimizing the mutual information between generated outcomes and protected attributes. Unlike existing methods that primarily focus on sensitive binary-valued attributes, our framework extends seamlessly to non-binary attributes. Moreover, our method provides a systematic solution to class imbalance, ensuring adequate representation of underrepresented protected groups. Our proposed approach offers a scalable, privacy-preserving framework for ethical and equitable data generation, which we demonstrate by theoretical guarantees and extensive experiments on sensitive empirical examples.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4408. Contextual and Seasonal LSTMs for Time Series Anomaly Detection
๐ฌ 22
โญ 4.67
๐ learning on time series and dynamical systems
๐ท๏ธ time series anomaly detection
Univariate time series (UTS), where each timestamp records a single variable, serve as crucial indicators in web systems and cloud servers. Anomaly detection in UTS plays an essential role in both data mining and system reliability management. However, existing reconstruction-based and prediction-based methods struggle to capture certain subtle anomalies, particularly small point anomalies and slowly rising anomalies. To address these challenges, we propose a novel prediction-based framework named Contextual and Seasonal LSTMs (CS-LSTMs). CS-LSTMs are built upon a noise decomposition strategy and jointly leverage contextual dependencies and seasonal patterns, thereby strengthening the detection of subtle anomalies. By integrating both time-domain and frequency-domain representations, CS-LSTMs achieve more accurate modeling of periodic trends and anomaly localization. Extensive evaluations on public benchmark datasets demonstrate that CS-LSTMs consistently outperform state-of-the-art methods, highlighting their effectiveness and practical value in robust time series anomaly detection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4409. M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
๐ฌ 14
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ Chain-of-Thought Multimodal Large Language Models M3CoTBench Benchmark
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such an opaque process lacks a reliable basis for judgment, making it difficult to assist doctors in diagnosis.
To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features (1) a diverse, multi-level difficulty dataset covering 24 examination types, (2) 13 varying-difficulty tasks, (3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and (4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4410. MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interaction Potentials
๐ฌ 13
โญ 4.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Universal Machine Learning Interatomic Potentials Training Efficiency Accuracy
Universal MLIPs (uMLIPs) demonstrate broad applicability across diverse material systems and have emerged as a powerful and transformative paradigm in chemical and computational materials science. Equivariant uMLIPs achieve state-of-the-art accuracy in a wide range of benchmarks by incorporating equivariant inductive bias. However, the reliance on tensor products and high-degree representations makes them computationally costly. This raises a fundamental question: as quantum mechanical-based datasets continue to expand, can we develop a more compact model to thoroughly exploit high-dimensional atomic interactions?
In this work, we present MatRIS (Materials Representation and Interaction Simulation), an invariant uMLIP that introduces attention-based modeling of three-body interactions. MatRIS leverages a novel separable attention mechanism with linear complexity $O(N)$, enabling both scalability and expressiveness. MatRIS delivers accuracy comparable to that of leading equivariant models on a wide range of popular benchmarks (Matbench-Discovery, MatPES, MDR phonon, Molecular dataset, etc.). Taking Matbench-Discovery as an example, MatRIS achieves an F1 score of up to 0.847 while improving training efficiency by 13.0–13.5$\times$ at comparable accuracy. The work indicates that our carefully designed invariant models can match or exceed the accuracy of equivariant models at a fraction of the cost, shedding light on the development of accurate and efficient uMLIPs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4411. EMFuse: Energy-based Model Fusion for Decision Making
๐ฌ 14
โญ 4.67
๐ reinforcement learning
๐ท๏ธ Model Fusion Energy-Based Model Decision Making
Model fusion has emerged as a promising research direction, offering a resource-efficient paradigm that leverages existing pre-trained models to circumvent the need for training from scratch. In this work, we investigate the fusion of models specifically adapted for decision-making tasks. This challenge divides into two distinct, yet related subproblems: the direct fusion of models that act as policies and the fusion of dynamics models that subsequently induce a policy. We suggest that these seemingly divergent subproblems can be unified through the lens of energy-based models (EBMs), which parameterize a conditional distribution via an energy function where lower energy implies higher probability. Our framework, \textbf{EMFuse}, provides this convergence by leveraging the concept of energy as a common currency for fusion. For direct fusion of policies, such as those in language models, the output distribution is commonly softmax (Boltzmann), which essentially defines the negative logarithmic probability as an energy function. For dynamics models, existing works often train a set of models on the same dataset to obtain robust uncertainty estimation; such an ensemble approach leads to an exponential explosion in computational complexity when it comes to dynamics fusion across multiple sets of models. To overcome this, we introduce the Any-step Dynamics Energy-based Transition Model (ADETM), a novel architecture that performs efficient single-model-per-dataset uncertainty estimation with its energy-based backbone, thereby avoiding this computational explosion. Our EMFuse framework surpasses other baselines by 0.34\% to 6.63\% on single/cross domain discrete decision-making benchmarks, and achieves an extra 2.3 to 7.4 normalized points on average in D4RL MuJoCo continuous-control scenarios.
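The "energy as a common currency" view can be illustrated with a generic product-of-experts fusion (not the EMFuse implementation): treat each policy's negative log-probabilities as energies, sum them, and renormalize.

```python
# Generic sketch of energy-based fusion of two categorical policies:
# energy = -log p, fused distribution = softmax(-(E1 + E2)).
import numpy as np

def fuse(p1, p2, eps=1e-12):
    energy = -(np.log(p1 + eps) + np.log(p2 + eps))   # lower energy = higher probability
    logits = -energy
    logits -= logits.max()                            # numerical stability
    w = np.exp(logits)
    return w / w.sum()

policy_a = np.array([0.7, 0.2, 0.1])   # e.g., a softmax policy over 3 actions
policy_b = np.array([0.3, 0.6, 0.1])   # e.g., a policy induced by a dynamics model
print(fuse(policy_a, policy_b))        # product-of-experts: actions both policies favor dominate
```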
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4412. When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
๐ฌ 14
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ reinforcement learning direct preference optimization post training large language models data quality data annotation
Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, **UltraMix**, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30\% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4413. HFSTI-Net: Hierarchical Frequency-spatial-temporal Interactions for Video Polyp Segmentation
๐ฌ 6
โญ 4.67
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Frequency Learning Video Segmentation Medical Segmentation Video Polyp Segmentation
Automatic video polyp segmentation (VPS) is crucial for preventing and treating colorectal cancer by ensuring accurate identification of polyps in colonoscopy examinations. However, its clinical application is hampered by two key challenges: shape collapse, which compromises structural integrity, and episodic amnesia, which causes instability in challenging video sequences. To address these challenges, we present a novel video segmentation network, \emph{HFSTI-Net}, which integrates global perception with spatiotemporal consistency in spatial, temporal, and frequency domains. Specifically, to address shape collapse under low contrast or visual ambiguity, we design a Hierarchical Frequency-spatial Interaction (HFSI) module that fuses spatial and frequency cues for fine-grained boundary localization. Furthermore, we propose a recurrent mask-guided propagation (RMP) module that introduces a dual enhancement mechanism based on feature memory and mask alignment, effectively incorporating spatiotemporal information to alleviate inter-frame inconsistencies and ensuring long-term segmentation stability. Extensive experiments on the SUN-SEG and CVC-612 datasets demonstrate that our method achieves real-time inference and outperforms other state-of-the-art approaches. The codes will be made available upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4414. Learning to Be Uncertain: Pre-training World Models with Horizon-Calibrated Uncertainty
๐ฌ 12
โญ 4.67
๐ reinforcement learning
๐ท๏ธ World Models Unsupervised Pre-training Temporal Relative Embeddings Horizon-Calibrated Uncertainty
Pre-training world models on large, action-free video datasets offers a promising path toward generalist agents, but a fundamental flaw undermines this paradigm. Prevailing methods train models to predict a single, deterministic future, an objective that is ill-posed for inherently stochastic environments where actions are unknown. We contend that a world model should instead learn a structured, probabilistic representation of the future where predictive uncertainty correctly scales with the temporal horizon. To achieve this, we introduce a pre-training framework, **H**orizon-c**A**librated **U**ncertainty **W**orld **M**odel (HAUWM), built on a probabilistic ensemble that predicts frames at randomly sampled future horizons. The core of our method is a Horizon-Calibrated Uncertainty (HCU) loss, which explicitly shapes the latent space by encouraging predictive variance to grow as the model projects further into the future. This approach yields a latent dynamics model that is not only predictive but also equipped with a reliable measure of temporal confidence. When fine-tuned for downstream control, our pre-trained model significantly outperforms state-of-the-art methods across a diverse suite of benchmarks, including MetaWorld, the DeepMind Control Suite, and RoboDesk. These results highlight the critical role of structured uncertainty in robust decision-making.
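The exact HCU loss is not given in this abstract; the sketch below (variable names and the hinge-style penalty are assumptions) illustrates one way predictive variance can be encouraged to grow with the sampled horizon:

```python
# Sketch of a horizon-calibrated uncertainty penalty: a Gaussian NLL on the
# predicted future latent, plus a hinge that discourages predictive variance
# from shrinking as the sampled horizon grows.
import torch

def hcu_loss(mu, log_var, target, horizon, alpha=0.1):
    # mu, log_var, target: (B, D); horizon: (B,) sampled prediction horizons
    nll = 0.5 * (log_var + (target - mu).pow(2) / log_var.exp()).mean()
    var = log_var.exp().mean(dim=1)                        # per-sample mean predictive variance
    v = var[torch.argsort(horizon)]                        # order samples by horizon
    monotonic_penalty = torch.relu(v[:-1] - v[1:]).mean()  # penalize variance decreasing with horizon
    return nll + alpha * monotonic_penalty

B, D = 8, 16
loss = hcu_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
                horizon=torch.randint(1, 10, (B,)).float())
print(loss.item())
```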
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4415. Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving
๐ฌ 12
โญ 4.67
๐ applications to robotics, autonomy, planning
๐ท๏ธ autonomous driving VLA discrete diffusion
End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4416. Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
๐ฌ 25
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Personalization Reasoning
Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches, such as prompt customization or fine-tuning, struggle to reason over implicit preferences, limiting real-world effectiveness. Recent "think-then-generate" methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose **FlyThinker**, an efficient "think-while-generating" framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions, allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4417. Bayesian Neural Networks for Functional ANOVA Model
๐ฌ 3
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ Bayesian Machine Learning Functional ANOVA Model Bayesian Neural Networks Interpretable AI
With the increasing demand for interpretability in machine learning, functional ANOVA decomposition has gained renewed attention as a principled tool for breaking down a high-dimensional function into low-dimensional components that reveal the contributions of different variable groups. Recently, the Tensor Product Neural Network (TPNN) has been developed and applied as basis functions in the functional ANOVA model, referred to as ANOVA-TPNN. A disadvantage of ANOVA-TPNN, however, is that the components to be estimated must be specified in advance, which makes it difficult to incorporate higher-order TPNNs into the functional ANOVA model due to computational and memory constraints. In this work, we propose Bayesian-TPNN, a Bayesian inference procedure for the functional ANOVA model with TPNN basis functions, enabling the detection of higher-order components with reduced computational cost compared to ANOVA-TPNN. We develop an efficient MCMC algorithm and demonstrate that Bayesian-TPNN performs well by analyzing multiple benchmark datasets. Theoretically, we prove that the posterior of Bayesian-TPNN is consistent.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4418. GradShield: Alignment Preserving Finetuning
๐ฌ 13
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ large language model finetuning alignment defense safety
Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards unsafe behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks combined with different levels of harmful data, and evaluate the safety and utility performance of the resulting LLMs under various metrics. Our results show that GradShield outperforms all baseline methods, as it consistently maintains a low Attack Success Rate (ASR) of under $6\%$, while preserving the utility performance.
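The FIHS is not defined in the abstract; a common gradient-alignment heuristic (an assumption for illustration, not necessarily the paper's score) rates a candidate example by how well its loss gradient aligns with that of a known-harmful reference batch:

```python
# Illustrative gradient-alignment filter: score a finetuning example by the cosine
# similarity between its loss gradient and the gradient of a harmful reference batch.
import torch

model = torch.nn.Linear(10, 2)          # stand-in for an LLM's trainable parameters
loss_fn = torch.nn.CrossEntropyLoss()

def grad_vector(x, y):
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

harmful_x, harmful_y = torch.randn(4, 10), torch.randint(0, 2, (4,))
g_harm = grad_vector(harmful_x, harmful_y)

candidate_x, candidate_y = torch.randn(1, 10), torch.randint(0, 2, (1,))
g_cand = grad_vector(candidate_x, candidate_y)

score = torch.nn.functional.cosine_similarity(g_cand, g_harm, dim=0)
print(float(score))   # higher alignment -> more likely to be filtered out
```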
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4419. Unlearning during Training: Domain-Specific Gradient Ascent for Domain Generalization
๐ฌ 29
โญ 4.67
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Unlearning Transfer Learning Domain Generalization
Deep neural networks often exhibit degraded performance under domain shifts due to reliance on domain-specific features. Existing domain generalization (DG) methods attempt to mitigate this during training but lack mechanisms to adaptively correct domain-specific reliance once it emerges. We propose Identify and Unlearn (IU), a model-agnostic module that continually mitigates such reliance post-epoch. We introduce an unlearning score to identify training samples that disproportionately increase model complexity while contributing little to generalization, and an Inter-Domain Variance (IDV) metric to reliably identify domain-specific channels. To suppress the adverse influence of identified samples, IU employs a Domain-Specific Gradient-Ascent (DSGA) procedure that selectively removes domain-specific features while preserving domain-invariant features. Extensive experiments across seven benchmarks and fifteen DG baselines show that IU consistently improves out-of-distribution generalization, achieving average accuracy gains of up to 3.0\%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4420. DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
๐ฌ 15
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ voice conversation model parallel speech-text end-to-end dual-resolution
Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods utilize mainly 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens, in turn better exploiting LLMs' capabilities. Experimental results demonstrate that DrVoice-7B establishes new state-of-the-art (SOTA) on prominent speech benchmarks including OpenAudioBench, VoiceBench, UltraEval-Audio and Big Bench Audio, making it a leading open-source speech foundation model among ∼7B models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4421. High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes
๐ฌ 8
โญ 4.67
๐ learning theory
๐ท๏ธ stochastic gradient descent momentum adaptive step-sizes scaling limits high dimensional probability spiked tensor PCA single index model
We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum and widen the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.
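For reference, the two baseline updates being compared admit the standard generic forms below (the dimension-dependent step-size scaling used in the paper's high-dimensional regime is omitted):

```latex
% SGD with Polyak momentum (SGD-M):
\[
m_{t+1} = \beta\, m_t + \nabla \ell(\theta_t;\, x_{t+1}), \qquad
\theta_{t+1} = \theta_t - \delta\, m_{t+1}.
\]
% Online SGD with a normalized-gradient (adaptive) step:
\[
\theta_{t+1} = \theta_t - \delta\,
\frac{\nabla \ell(\theta_t;\, x_{t+1})}{\lVert \nabla \ell(\theta_t;\, x_{t+1}) \rVert}.
\]
```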
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4422. Influence Dynamics and Stagewise Data Attribution
๐ฌ 10
โญ 4.67
๐ interpretability and explainable AI
๐ท๏ธ Training data attribution influence functions singular learning theory stagewise development phase transitions developmental interpretability Bayesian influence functions
Current training data attribution (TDA) methods treat influence as static, ignoring the fact that neural networks learn in distinct stages. This stagewise development, driven by phase transitions on a degenerate loss landscape, means a sample's importance is not fixed but changes throughout training. In this work, we introduce a developmental framework for data attribution, grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first confirm these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model's progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4423. DeepRAG: Thinking to Retrieve Step by Step for Large Language Models
๐ฌ 14
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ retrieval-augmented generation adaptive retrieve
Large Language Models (LLMs) have shown remarkable reasoning capabilities, while their practical applications are limited by severe factual hallucinations due to limitations in the timeliness, accuracy, and comprehensiveness of their parametric knowledge. Meanwhile, enhancing retrieval-augmented generation (RAG) with reasoning remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling reasonable and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency and boosts answer accuracy by 25.41%, demonstrating its effectiveness in enhancing retrieval-augmented reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4424. Scaling Agents via Continual Pre-training
๐ฌ 14
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ Continual Pre-training Deep Research Agent Agentic Training Data Synthesis
Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4425. Watermark-based Attribution of AI-Generated Images
๐ฌ 10
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Image Watermark Watermark-based Attribution AI-generated Images
Several companies have deployed watermark-based detection to identify AI-generated images. However, attribution--the ability to trace back to the user of a generative AI (GenAI) service who created a given AI-generated image--remains largely unexplored despite its growing importance. In this work, we aim to bridge this gap by conducting the first systematic study on watermark-based, user-level attribution of AI-generated images. Our key idea is to assign a unique watermark to each user of the GenAI service and embed this watermark into the AI-generated images created by that user. Attribution is then performed by identifying the user whose watermark best matches the one extracted from the given image. This approach, however, faces a key challenge: How should watermarks be selected for users to maximize attribution performance? To address the challenge, we first theoretically derive lower bounds on detection and attribution performance through rigorous probabilistic analysis for any given set of user watermarks. Then, we select watermarks for users to maximize these lower bounds, thereby optimizing detection and attribution performance. Our theoretical and empirical results show that watermark-based attribution inherits both the accuracy and (non-)robustness properties of the underlying watermark. Specifically, attribution remains highly accurate when the watermarked AI-generated image is either not post-processed or subjected to common post-processing such as JPEG compression, as well as black-box adversarial post-processing with limited query budgets.
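Attribution by best match can be illustrated with random bitstring watermarks (a generic sketch; the paper selects watermarks to maximize its theoretical bounds rather than choosing them at random):

```python
# Sketch of user-level attribution: decode a bitstring from an image and attribute
# it to the user whose assigned watermark has the highest bitwise match.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_bits = 1000, 64
user_watermarks = rng.integers(0, 2, size=(n_users, n_bits))   # one bitstring per user

def attribute(decoded_bits, threshold=0.7):
    matches = (user_watermarks == decoded_bits).mean(axis=1)    # bitwise accuracy per user
    best = int(np.argmax(matches))
    # below the detection threshold, declare the image not generated by this service
    return best if matches[best] >= threshold else None

true_user = 42
decoded = user_watermarks[true_user].copy()
flip = rng.choice(n_bits, size=6, replace=False)                # post-processing corrupts some bits
decoded[flip] ^= 1
print(attribute(decoded))   # 42
```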
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4426. Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme
๐ฌ 14
โญ 4.67
๐ generative models
๐ท๏ธ population dynamics JKO scheme inverse problem
Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce iJKOnet, an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional *end-to-end* adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4427. CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
๐ฌ 12
โญ 4.67
๐ datasets and benchmarks
๐ท๏ธ AI Safety Red-Teaming Safety Alignment Korean Red-Teaming
Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing meaningful, context-aware safety benchmarks across diverse cultures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4428. One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
๐ฌ 19
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ world modeling programmatic RL probabilistic program symbolic rule learning intrinsically motivated and open-ended learning
Symbolic world modeling is the task of inferring and representing the transitional dynamics of an environment as an executable program. Previous research on symbolic world modeling has focused on simple, deterministic environments with abundant data and human-provided guidance. We address the more realistic and challenging problem of learning a symbolic world model in a complex, stochastic environment with severe constraints: a limited interaction budget where the agent has only "one life" to explore a hostile environment, and no external guidance in the form of human-provided, environment-specific rewards or goals. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, allowing it to remain silent on irrelevant aspects of the world state and predict only the attributes it directly governs. This creates a dynamic computation graph that routes both inference and optimization only through relevant laws for each transition, avoiding the scaling challenges that arise when all laws must contribute to predictions about a complex, hierarchical state space, and enabling accurate learning of stochastic dynamics even when most rules are inactive at any given moment. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the popular Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also demonstrate the world model's utility for planning, where rollouts simulated within the world model successfully identify superior strategies in multi-step goal-oriented tasks. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
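As a rough illustration of the precondition-effect law structure described above (a toy sketch of ours; the names, example laws, and state layout are hypothetical, not taken from the paper):

```python
# Toy sketch: a law stays silent unless its precondition holds, and predicts
# only the attributes it governs; active laws jointly form the next-state prediction.
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]

@dataclass
class Law:
    name: str
    precondition: Callable[[State], bool]
    effect: Callable[[State], Dict[str, Any]]      # stochastic partial update

    def maybe_apply(self, state: State) -> Dict[str, Any]:
        return self.effect(state) if self.precondition(state) else {}

laws = [
    Law("drink_water",
        precondition=lambda s: s["action"] == "drink" and s["near_water"],
        effect=lambda s: {"thirst": min(9, s["thirst"] + 6)}),
    Law("night_spawn",
        precondition=lambda s: not s["daylight"],
        effect=lambda s: {"zombie_nearby": random.random() < 0.3}),
]

def predict_next(state: State) -> State:
    nxt = dict(state)
    for law in laws:                               # only active laws touch the prediction
        nxt.update(law.maybe_apply(state))
    return nxt

print(predict_next({"action": "drink", "near_water": True, "thirst": 2,
                    "daylight": False, "zombie_nearby": False}))
```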
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4429. DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense
๐ฌ 11
โญ 4.67
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial Robustness Transferability of Adversarial Attacks Randomized Defenses Gradient Consensus
Deep neural networks remain highly vulnerable to adversarial examples, and most defenses collapse once gradients can be reliably estimated. We identify \emph{gradient consensus}--the tendency of randomized transformations to yield aligned gradients--as a key driver of adversarial transferability. Attackers exploit this consensus to construct perturbations that remain effective across transformations. We introduce \textbf{DRIFT} (Divergent Response in Filtered Transformations), a stochastic ensemble of lightweight, learnable filters trained to actively disrupt gradient consensus. Unlike prior randomized defenses that rely on gradient masking, DRIFT enforces \emph{gradient dissonance} by maximizing divergence in Jacobian- and logit-space responses while preserving natural predictions. Our contributions are threefold: (i) we formalize gradient consensus and provide a theoretical analysis linking consensus to transferability; (ii) we propose a consensus-divergence training strategy combining prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness; and (iii) we show that DRIFT achieves substantial robustness gains on ImageNet across CNNs and Vision Transformers, outperforming state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under adaptive white-box, transfer-based, and gradient-free attacks. DRIFT delivers these improvements with negligible runtime and memory cost, establishing gradient divergence as a practical and generalizable principle for adversarial defense.
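As an illustration of the gradient-consensus notion the abstract formalizes, one simple diagnostic (ours, not DRIFT's training objective) is the mean pairwise cosine similarity of input gradients computed under different random transformations; `model` and `transforms` below are placeholders for a PyTorch classifier and a list of input transformations.

```python
# Sketch: measure how aligned input gradients are across random transformations.
import torch
import torch.nn.functional as F

def gradient_consensus(model, x, y, transforms):
    grads = []
    for t in transforms:
        xt = t(x).detach().requires_grad_(True)
        loss = F.cross_entropy(model(xt), y)
        g, = torch.autograd.grad(loss, xt)
        grads.append(g.flatten(1))                    # (B, num_features)
    sims = [F.cosine_similarity(grads[i], grads[j], dim=1).mean()
            for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return torch.stack(sims).mean()   # close to 1 = aligned (transferable) gradients
```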
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4430. DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
๐ฌ 19
โญ 4.67
๐ foundation or frontier models, including LLMs
๐ท๏ธ MCTS RLVR
Although Reinforcement Learning with Verifiable Rewards (RLVR) has become an essential component for developing advanced reasoning skills in language models, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search (MCTS) directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95\% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4431. Gradient-Based Diversity Optimization with Differentiable Top-$k$ Objective
๐ฌ 13
โญ 4.67
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Diversity Optimization Gradient-based learning Recommendation
Predicting relevance is a pervasive problem across digital platforms, covering social media, entertainment, and commerce. However, when optimized solely for relevance and engagement, many machine-learning models amplify data biases and produce homogeneous outputs, reinforcing filter bubbles and content uniformity. To address this issue, we introduce a pairwise top-k diversity objective with a differentiable smooth-ranking approximation, providing a model-agnostic way to incorporate diversity optimization directly into standard gradient-based learning. Building on this objective, we cast relevance and diversity as a joint optimization problem, we analyze the resulting gradient trade-offs, and propose two complementary strategies: direct optimization, which modifies the learning objective, and indirect optimization, which reweights training data. Both strategies can be applied either when training models from scratch or when fine-tuning existing relevance-optimized models. We use recommendation as a natural evaluation setting where scalability and diversity are critical, and show through extensive experiments that our methods consistently improve diversity with negligible accuracy loss. Notably, fine-tuning with our objective is especially efficient, requiring only a few gradient steps to encode diversity at scale.
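A minimal sketch of the idea (ours, not the paper's exact objective): replace the hard top-$k$ mask with smooth inclusion weights so that a pairwise diversity term becomes differentiable in the relevance scores and can be added to a standard relevance loss.

```python
# Sketch of a differentiable top-k diversity term via a softmax relaxation.
import torch

def soft_topk_diversity(scores, item_embs, k, tau=0.1):
    """scores: (n,) relevance logits; item_embs: (n, d); returns a scalar to maximize."""
    w = torch.softmax(scores / tau, dim=0) * k            # soft top-k inclusion weights
    dist = torch.cdist(item_embs, item_embs, p=2)         # (n, n) pairwise distances
    pair_w = w[:, None] * w[None, :]
    return (pair_w * dist).sum() / (pair_w.sum() + 1e-8)  # weighted mean pairwise distance

# usage: total_loss = relevance_loss - lam * soft_topk_diversity(scores, embs, k=10)
```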
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4432. AC-Sampler: Accelerate and Correct Diffusion Sampling with Metropolis-Hastings Algorithm
๐ฌ 17
โญ 4.67
๐ generative models
๐ท๏ธ Diffusion model Metropolis-Hastings Algorithm Langevin Dynamics
Diffusion-based generative models have recently achieved state-of-the-art performance in high-fidelity image synthesis. These models learn a sequence of denoising transition kernels that gradually transform a simple prior distribution into a complex data distribution. However, requiring many transitions not only slows down sampling but also accumulates approximation errors.
We introduce the Accelerator-Corrector Sampler (AC-Sampler), which accelerates and corrects diffusion sampling without fine-tuning. It generates samples directly from intermediate timesteps using the Metropolis-Hastings (MH) algorithm while correcting them to target the true data distribution. We derive a tractable density ratio for arbitrary timesteps with a discriminator, enabling computation of MH acceptance probabilities. Theoretically, our method yields samples better aligned with the true data distribution than the original model distribution. Empirically, AC-Sampler achieves FID 2.38 with only 15.8 NFEs, compared to the base sampler's FID 3.23 with 17 NFEs on unconditional CIFAR-10. On CelebA-HQ 256×256, it attains FID 6.6 with 98.3 NFEs. AC-Sampler can be combined with existing acceleration and correction techniques, demonstrating its flexibility and broad applicability.
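For intuition, here is a hedged sketch of the corrector step: a discriminator trained to separate data from model samples has a logit that approximates the log density ratio, which is enough to form an MH acceptance probability. The symmetric-proposal assumption and the per-sample logit interface are our simplifications; AC-Sampler's actual proposal and timestep handling are more involved.

```python
# Sketch: MH accept/reject using a discriminator logit as log p_data / p_model.
import torch

@torch.no_grad()
def mh_accept(x_cur, x_prop, discriminator):
    """x_cur, x_prop: (B, ...) sample batches; discriminator(x) -> (B,) logits."""
    log_ratio_cur = discriminator(x_cur)       # ~ log p_data(x_cur) / p_model(x_cur)
    log_ratio_prop = discriminator(x_prop)
    log_alpha = (log_ratio_prop - log_ratio_cur).clamp(max=0.0)     # (B,)
    accept = torch.rand_like(log_alpha).log() < log_alpha           # u < alpha
    mask = accept.view(-1, *([1] * (x_cur.dim() - 1)))              # broadcast to sample shape
    return torch.where(mask, x_prop, x_cur)
```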
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4433. Enhancing Language Model Reasoning with Structured Multi-Level Modeling
๐ฌ 16
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Chain-of-Thought reasoning Direct Preference Optimization Process supervision Twisted Sequential Monte Carlo Large language models
Inference-time scaling enhances a modelโs reasoning by extending its chain-of-thought (CoT). However, existing approaches typically rely on a single policy trained with outcome-reward reinforcement learning (RL), which often suffers from long-horizon plan failures, i.e., the implicit plan drifts away from any valid strategy. This problem is particularly severe for smaller language models (LMs) with long CoTs due to their limited capacity. To address this, we propose Multi-Level Reasoning (MLR), which reformulates long-CoT generation as a two-level stochastic process. Specifically, MLR employs two policies: a high-level planner that generates step descriptors (abstract subgoals) and a low-level executor that produces detailed content conditioned on these descriptors. The planner then generates the next subgoal based on the summarized current step, forming an alternating planโexecute loop. To maintain scalability, we adopt a minimal design, where the base model serves as the low-level policy and a lightweight LoRA module implements the high-level policy. For training, we observe that outcome-reward RL is inefficient and weakly informative for long trajectories (e.g., those exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision. This yields more effective training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks show that, with only 10% SFT data and 5% of preference data, MLR outperforms both the DeepSeek-R1 distillation and the outcome-reward RL baselines across multiple base models and tasks. More importantly, MLR exhibits slower performance degradation on long-horizon reasoning, demonstrating stronger robustness under extended CoT generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4434. Noise Tolerance of Distributionally Robust Learning
๐ฌ 11
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Distributional Robustness Wasserstein Distance Deep Learning Operator Learning
Given the importance of building robust machine learning models, considerable efforts have recently been put into developing training strategies that achieve robustness to outliers and adversarial attacks. Yet, a major aspect that remains an open problem is systematic robustness to global forms of noise such as those that come from measurements and quantization. Hence, we propose in this work an approach to train regression models from data with additive forms of noise, leveraging the Wasserstein distance as a loss function. Importantly, our approach is agnostic to the model structure, unlike the increasingly popular Wasserstein Distributionally Robust Learning paradigm (WDRL) which, we show, does not achieve improved robustness when the regression function is not convex or Lipschitz. We provide a theoretical analysis of the scaling of the regression functions in terms of the variance of the noise, for both formulations and show consistency of the proposed loss function. Lastly, we conclude with numerical experiments on physical PDE Benchmarks and electric grid data, demonstrating competitive performance with an order of magnitude reduction in computational cost.
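As a toy illustration of the kind of loss in question (not the paper's exact formulation): in one dimension, the empirical Wasserstein-1 distance between equal-size batches of predictions and targets reduces to comparing their sorted values, which can be dropped in as a training loss.

```python
# Sketch: 1-D empirical Wasserstein-1 distance as a regression loss.
import torch

def wasserstein1_1d(pred, target):
    """W1 between the empirical distributions of two equal-length 1-D batches."""
    return (torch.sort(pred).values - torch.sort(target).values).abs().mean()
```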
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4435. Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning
๐ฌ 18
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Sequential Decision Making Combinatorial Generalization Representation Learning
While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally correlated states are properly encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. We formalize this notion by demonstrating how encouraging long-range temporal consistency via successor representations (SR) can facilitate generalization. We then propose a simple yet effective representation learning objective, $\text{BYOL-}\gamma$ for GCBC, which theoretically approximates the successor representation in the finite MDP case through self-predictive representations, and achieves competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4436. Understanding and Improving Hyperbolic Deep Reinforcement Learning
๐ฌ 28
โญ 4.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning representation learning hyperbolic space hyperbolic deep learning
The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré ball and hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (1) stable critic training through a categorical value loss instead of regression; (2) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (3) using a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4437. Any-Order Any-Subset AutoRegressive Model
๐ฌ 15
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model permutation language model
We propose Any-order Any-subset Autoregressive modeling (A3), a novel sequence generation framework that generalizes standard autoregressive (AR) factorization to support the prediction of arbitrary token groups in any order. A3 overcomes the limitations of conventional left-to-right decoding by enabling flexible groupwise generation while preserving probabilistic rigor and training stability. Our design combines a two-stream attention architecture with a progressive training strategy, allowing both efficient parallel decoding and robust modeling of diverse dependency structures. Empirical results demonstrate that A3 achieves a superior trade-off between generation speed, flexibility, and quality compared to state-of-the-art AR and diffusion-based methods. This work offers a unified approach for a flexible, efficient, and novel language modeling paradigm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4438. Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy
๐ฌ 15
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient LLM CoT compression
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned without significant degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.
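A minimal sketch of the pruning rule described above (our illustration): score each reasoning step by the mean token entropy the model assigned while generating it, then drop the lowest-entropy steps. The input format `step_token_dists` and the `keep_ratio` value are assumptions for illustration.

```python
# Sketch: step-entropy scoring and low-entropy step pruning.
import numpy as np

def step_entropy(token_dists):
    """token_dists: (num_tokens, vocab_size) probabilities; mean token entropy."""
    p = np.clip(token_dists, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def prune_low_entropy_steps(steps, step_token_dists, keep_ratio=0.2):
    ents = np.array([step_entropy(d) for d in step_token_dists])
    k = max(1, int(round(keep_ratio * len(steps))))
    keep = set(np.argsort(ents)[-k:].tolist())        # indices of the highest-entropy steps
    return [s for i, s in enumerate(steps) if i in keep]
```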
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4439. Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models
๐ฌ 28
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ time series foundation models transformation adaptation
Large time series models (LTMs) have recently demonstrated powerful capabilities for universal forecasting. However, these models still struggle to address the variety and nonstationarity of time series, resulting in an unsatisfying balance between forecasting performance and generalizability. Instead of breeding unceasingly new models for diverse domains, this paper proposes a novel framework, time-series adaptive transformation optimization (TATO), that enables a frozen pre-trained LTM to fit various downstream domains through an empirically optimal time-series transformation pipeline. Three representative types of time series transformations, including context slicing, scale normalization, and outlier correction, are constructed to help LTMs fit the target domain. A two-stage ranking is also designed to ensure the optimization's robustness by filtering out transformation pipelines with shortcomings in specific metrics. Extensive evaluations on state-of-the-art pre-trained LTMs and widely used datasets demonstrate that TATO performs universally well and significantly enhances domain-adaptive forecasting performance, achieving a maximum MSE reduction of 68.4% and an average decrease of 16.0%. In most cases, the time required by TATO to optimize time series transformation pipelines is under 2 minutes, making it practical for real-world applications. We have published the source code of TATO at https://anonymous.4open.science/r/TATO-D55C.
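To make the three transformation families concrete, here is an illustrative sketch (ours) of one candidate pipeline; TATO's search and two-stage ranking over many such pipelines are not reproduced.

```python
# Sketch: context slicing, outlier correction, and scale normalization composed
# into a single pipeline applied before a frozen LTM sees the series.
import numpy as np

def slice_context(x, length):                  # context slicing: keep the most recent points
    return x[-length:]

def correct_outliers(x, z=3.0):                # outlier correction: winsorize at z std devs
    mu, sd = x.mean(), x.std() + 1e-8
    return np.clip(x, mu - z * sd, mu + z * sd)

def normalize_scale(x):                        # scale normalization: z-score the window
    mu, sd = x.mean(), x.std() + 1e-8
    return (x - mu) / sd, (mu, sd)

def apply_pipeline(series, context_len=96, z=3.0):
    x = slice_context(np.asarray(series, dtype=float), context_len)
    x = correct_outliers(x, z=z)
    return normalize_scale(x)                  # feed to the frozen LTM; de-normalize its forecast
```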
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4440. Aligning Deep Implicit Preferences by Learning to Reason Defensively
๐ฌ 5
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Preference Alignment Reward Modeling as Reasoning Process Supervision
Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, current methods face a dual challenge: they fail to infer users' deep implicit preferences (including unstated goals, semantic context and risk tolerances), and they lack the defensive reasoning required to navigate real-world ambiguity. This cognitive gap leads to responses that are superficial, brittle and short-sighted. To address this, we propose Critique-Driven Reasoning Alignment (CDRA), which reframes alignment from a scalar reward-matching task into a structured reasoning process. First, to bridge the preference inference gap, we introduce the DeepPref benchmark. This dataset, comprising 3000 preference-query pairs across 20 topics, is curated by simulating a multi-faceted cognitive council that produces critique-annotated reasoning chains to deconstruct query semantics and reveal latent risks. Second, to instill defensive reasoning, we introduce the Personalized Generative Process Reward Model (Pers-GenPRM), which frames reward modeling as a personalized reasoning task. It generates a critique chain to evaluate a response's alignment with user preferences before outputting a final score based on this rationale. Ultimately, this interpretable, structured reward signal guides the policy model through Critique-Driven Policy Alignment, a process-level online reinforcement learning algorithm integrating both numerical and natural language feedback. Experiments demonstrate that CDRA excels at discovering and aligning with users' true preferences while executing robust reasoning. Our code and dataset are available at https://anonymous.4open.science/r/Deep-pref-9DE9.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4441. DIFFSPARSE: Accelerating Diffusion Transformers with Learned Token Sparsity
๐ฌ 19
โญ 4.50
๐ optimization
๐ท๏ธ Diffusion Acceleration Sparsity
Diffusion models demonstrate outstanding performance in image generation, but their multi-step inference mechanism incurs immense computational cost. Previous works accelerate inference by leveraging layer or token cache techniques to reduce computational cost. However, these methods fail to achieve strong acceleration in few-step diffusion transformer models due to inefficient feature caching strategies, manually designed sparsity allocation, and the practice, in these token-cache methods, of retaining complete forward computations at several steps.
To tackle these challenges, we propose a differentiable layer-wise sparsity optimization framework for diffusion transformer models, leveraging token caching to reduce token computation costs and enhance acceleration. Our method optimizes layer-wise sparsity allocation in an end-to-end manner through a learnable network combined with a dynamic programming solver. Additionally, our proposed two-stage training strategy eliminates the need for full-step processing in existing methods, further improving efficiency.
We conducted extensive experiments on a range of diffusion-transformer models, including DiT-XL/2, PixArt-$\alpha$, FLUX, and Wan2.1. Across these architectures, our method consistently improves efficiency without degrading sample quality. For example, on PixArt-$\alpha$ with 20 sampling steps, we reduce computational cost by 54% while achieving generation metrics that surpass those of the original model, substantially outperforming prior approaches. These results demonstrate that our method delivers large efficiency gains while often improving generation quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4442. M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
๐ฌ 20
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ GUI Agent; Vision-Language Models; Data synthesis; Monte Carlo Tree Search
Graphical User Interface (GUI) agents are pivotal to advancing intelligent human-computer interaction paradigms. Constructing powerful GUI agents necessitates the large-scale annotation of high-quality user-behavior trajectory data (\textit{i.e.}, intent-trajectory pairs) for training. However, manual annotation methods and current GUI agent data mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M$^2$-Miner, the first low-cost and automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS). For better data mining efficiency and quality, we present a collaborative multi-agent framework, comprising InferAgent, OrchestraAgent, and JudgeAgent for guidance, acceleration, and evaluation. To further enhance the efficiency of mining and enrich intent diversity, we design an intent recycling strategy to extract extra valuable interaction trajectories. Additionally, a progressive model-in-the-loop training strategy is introduced to improve the success rate of data mining. Extensive experiments have demonstrated that the GUI agent fine-tuned using our mined data achieves state-of-the-art performance on several commonly used mobile GUI benchmarks. Our work will be released to facilitate community research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4443. MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models
๐ฌ 19
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ molecular dynamics generative models proteins flow matching markov state models
Molecular Dynamics (MD) is a powerful computational microscope for probing protein functions. However, the need for fine-grained integration and the long timescales of biomolecular events make MD computationally expensive. To address this, several generative models have been proposed to generate surrogate trajectories at lower cost. Yet, these models typically learn a fixed-lag transition density, causing the training signal to be dominated by frequent but uninformative transitions. We introduce a new class of generative models, **MSM Emulators**, which instead learn to sample transitions across discrete states defined by an underlying Markov State Model (MSM). We instantiate this class with Markov Space Flow Matching (MarS-FM), whose sampling offers more than two orders of magnitude speedup compared to implicit- or explicit-solvent MD simulations. We benchmark MarS-FM's ability to reproduce MD statistics through structural observables such as RMSD, radius of gyration, and secondary structure content. Our evaluation spans protein domains (up to 500 residues) with significant chemical and structural diversity, including unfolding events, and enforces strict sequence dissimilarity between training and test sets to assess generalization. Across all metrics, MarS-FM outperforms existing methods, often by a substantial margin.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4444. Imitating the Truth: Attention-aware Truth-Guided Enhancement for Hallucination Mitigation in Large Vision-Language Models
๐ฌ 26
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Hallucination Mitigation Large Vision-Language Models Attention Intervention
Large Vision-Language Models (LVLMs) achieve impressive multimodal reasoning but remain prone to hallucinations, generating content inconsistent with visual evidence. Existing mitigation methods often rely on auxiliary modules or coarse decoding-time adjustments, overlooking the fine-grained dynamics that distinguish truthful (real) tokens from hallucinatory ones. In this paper, we introduce \textbf{AGE (Attention-aware Truth-Guided Enhancement)}, a training-free framework that performs fine-grained, layer-wise interventions guided by attention patterns of real tokens. Our analysis reveals that real and hallucinated tokens follow distinct stage-specific attention behaviors, and hallucinations emerge when models fail to reproduce these behaviors. AGE addresses this by introducing two lightweight interventions: (i) Imitating the image attention, derived from discrepancies between real and hallucinated tokens, and (ii) Imitating the text attention when semantic grounding is required. Extensive experiments on widely used benchmarks, including COCO Image Captioning, POPE, and MME, demonstrate that AGE consistently mitigates hallucinations across diverse LVLMs such as LLaVA, MiniGPT-4, and mPLUG-Owl2, without additional training or loss of fluency. Our results highlight that imitating truth-grounded attention dynamics is a simple yet powerful principle to improve the reliability of LVLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4445. Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation
๐ฌ 5
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ MLLMs Alignment LVLM Hallucination
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from the visual evidence. Existing mitigation strategies either demand costly supervision during training or introduce additional latency at inference. Recent vision-enhancement methods attempt to address this by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, leading to interference from background regions and distracting the model from critical cues. To overcome this challenge, we propose an **A**daptive v**I**sual **R**einforcement framework for MLLMs, dubbed **AIR**. AIR consists of two main components: prototype-based token reduction, which condenses the large pool of visual tokens into a compact subset to suppress redundancy, and OT-guided patch reinforcement, which quantifies the alignment between hidden state and patch embeddings to selectively integrate the most consistent patches into the feed-forward layers. As a result, AIR enhances the model's reliance on salient visual information and effectively mitigates hallucination. Extensive experiments across representative MLLMs demonstrate that AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective and independent solution for building reliable MLLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4446. ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
๐ฌ 11
โญ 4.50
๐ generative models
๐ท๏ธ Temporal Consistency Video Generation
Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce **ConsisDrive**, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4447. Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
๐ฌ 27
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Unlearning Robustness Zeroth-Order Optimization
Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often *fragile*: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the *optimizer*, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the "*grade*" of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (*e.g.,* gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a *hybrid optimizer* that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
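For concreteness, the kind of "downgraded" zeroth-order update discussed above can be sketched as a two-point random-direction (SPSA-style) gradient estimate; this is a generic sketch, not the paper's hybrid optimizer, and `loss_fn` is an assumed handle on the unlearning objective.

```python
# Sketch: one zeroth-order update step; parameters are edited in place.
import torch

@torch.no_grad()
def zeroth_order_step(params, loss_fn, lr=1e-4, mu=1e-3):
    u = [torch.randn_like(p) for p in params]
    for p, d in zip(params, u):                      # theta + mu * u
        p.add_(mu * d)
    loss_plus = float(loss_fn(params))
    for p, d in zip(params, u):                      # theta - mu * u
        p.sub_(2 * mu * d)
    loss_minus = float(loss_fn(params))
    g_hat = (loss_plus - loss_minus) / (2 * mu)      # directional derivative estimate
    for p, d in zip(params, u):                      # restore theta, then step along -g_hat * u
        p.add_(mu * d)
        p.sub_(lr * g_hat * d)
    return 0.5 * (loss_plus + loss_minus)
```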
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4448. Learning to Generate Unit Test via Adversarial Reinforcement Learning
๐ฌ 20
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Code generation Reinforcement Learning
Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate unit test generation, yet methods for training LLMs to produce high-quality unit tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning (RL) framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via RL: (1) the unit test generator is trained to maximize a discrimination reward, encouraging it to produce tests that reveal faults in the code generator's solutions; and (2) the code generator is trained to maximize a code reward, encouraging it to produce solutions that pass the unit tests generated by the unit test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL show higher quality compared to unit tests generated by the same model trained via supervised fine-tuning on ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models like GPT-4.1 and GPT-4o in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for unit test generation.
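A hedged sketch of the adversarial reward pairing described above (our reading; the exact reward shaping, and the use of a reference solution to reject invalid tests, are illustrative assumptions rather than details from the paper). `run_test(code, test) -> bool` stands in for a sandboxed execution helper.

```python
# Sketch: discrimination reward for the test generator, code reward for the code generator.
def discrimination_reward(test, reference_solution, candidate_solutions, run_test):
    """Reward tests that the reference solution passes but candidate solutions fail."""
    if not run_test(reference_solution, test):
        return 0.0                                    # invalid test
    failures = sum(not run_test(c, test) for c in candidate_solutions)
    return failures / max(len(candidate_solutions), 1)

def code_reward(solution, generated_tests, run_test):
    """Reward solutions by the fraction of generated tests they pass."""
    return sum(run_test(solution, t) for t in generated_tests) / max(len(generated_tests), 1)
```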
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4449. Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets
๐ฌ 14
โญ 4.50
๐ optimization
๐ท๏ธ Large Language Model Fine-tuning Token-level Data
Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence level, which introduces token-level noise that degrades final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
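A minimal sketch of the gradient-masking step described above: per-token losses for tokens flagged as noisy are zeroed, so those tokens contribute no gradient. The three attribute scorers that would produce `keep_mask` are assumed to exist externally.

```python
# Sketch: token-level loss masking during fine-tuning.
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, keep_mask):
    """logits: (B, T, V); labels: (B, T); keep_mask: (B, T) bool, False = noisy token."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    ).reshape(labels.shape)
    keep = keep_mask.float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```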
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4450. Toward Practical Equilibrium Propagation: Brain-inspired Recurrent Neural Network with Feedback Regulation and Residual Connections
๐ฌ 14
โญ 4.50
๐ learning theory
๐ท๏ธ biologically plausible learning equilibrium propagation brain-inspired network structure residual connection feedback regulation
Brain-like intelligent systems need brain-like learning methods. Equilibrium Propagation (EP) is a biologically plausible learning framework with strong potential for brain-inspired computing hardware. However, existing implementations of EP suffer from instability and prohibitively high computational costs. Inspired by the structure and dynamics of the brain, we propose a biologically plausible Feedback-regulated REsidual recurrent neural network (FRE-RNN) and study its learning performance in the EP framework. Feedback regulation enables rapid convergence by attenuating feedback signals and reducing the disturbance of the feedback path to the feedforward path. This improved convergence reduces the computational cost and training time of EP by orders of magnitude, delivering performance on par with backpropagation (BP) in benchmark tasks. Meanwhile, residual connections with brain-inspired topologies help alleviate the vanishing gradient problem that arises when feedback pathways are weak in deep RNNs. Our approach substantially enhances the applicability and practicality of EP. The techniques developed here also offer guidance for implementing in-situ learning in physical neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4451. CUPID: A Plug-in Framework for Joint Aleatoric and Epistemic Uncertainty Estimation with a Single Model
๐ฌ 23
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ uncertainty estimation model interpretability trustworthy AI
Accurate estimation of uncertainty in deep learning is critical for deploying models in high-stakes domains such as medical diagnosis and autonomous decision-making, where overconfident predictions can lead to harmful outcomes. In practice, understanding the reason behind a modelโs uncertainty and the type of uncertainty it represents can support risk-aware decisions, enhance user trust, and guide additional data collection. However, many existing methods only address a single type of uncertainty or require modifications and retraining of the base model, making them difficult to adopt in real-world systems. We introduce CUPID (Comprehensive Uncertainty Plug-in estImation moDel), a general-purpose module that jointly estimates aleatoric and epistemic uncertainty without modifying or retraining the base model. CUPID can be flexibly inserted into any layer of a pretrained network. It models aleatoric uncertainty through a learned Bayesian identity mapping and captures epistemic uncertainty by analyzing the modelโs internal responses to structured perturbations. We evaluate CUPID across a range of tasks, including classification, regression, and out-of-distribution detection. The results show that it consistently delivers competitive performance while offering layer-wise insights into the origins of uncertainty. By making uncertainty estimation modular, interpretable, and model-agnostic, CUPID supports more transparent and trustworthy AI.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4452. Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
๐ฌ 11
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ LLMs Layer Relevance Mechanistic Interpretability Structured Pruning
Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field. In this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. On the other hand, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance.
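For concreteness, the two relevance proxies contrasted above can be sketched as follows; `eval_accuracy` and `run_with_layer_skipped` are assumed helpers (not an existing API), the latter returning the model with one decoder layer replaced by the identity.

```python
# Sketch: cosine-similarity relevance vs. accuracy-drop relevance for one layer.
import torch.nn.functional as F

def cosine_relevance(hidden_in, hidden_out):
    """1 - mean cosine similarity between a layer's input and output hidden states."""
    return 1.0 - F.cosine_similarity(
        hidden_in.flatten(1), hidden_out.flatten(1), dim=1).mean().item()

def accuracy_drop_relevance(model, layer_idx, eval_set, eval_accuracy, run_with_layer_skipped):
    """The metric the paper advocates: accuracy lost when the layer is removed."""
    return eval_accuracy(model, eval_set) - eval_accuracy(
        run_with_layer_skipped(model, layer_idx), eval_set)
```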
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4453. TRIDENT: Cross-Domain Trajectory Spatio-Temporal Representation via Distance-Preserving Triplet Learning
๐ฌ 29
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Spatiotemporal representation learning Trajectory analysis Cross-domain generalization Triplet loss Distance metric learning self-supervised representation learning
We present the TRIplet-based Distance-preserving Embedding Network for Trajectories (TRIDENT), a spatio-temporal representation framework for compressing and retrieving trajectories across scales, from badminton courts to large-scale urban environments. Existing methods often assume smooth, continuous motion, but real trajectories exhibit event-driven annotation, abrupt direction changes, GPS errors, irregular sampling, and domain shifts, exposing prior models' inefficiency, limited generalization, and inability to robustly integrate temporal order with spatial sequence structure. TRIDENT addresses these challenges by combining GCN spatial embeddings with temporal features in a Dual-Attention Encoder (DAEncoder), along with a Nonlinear Tanh-Projection Attention Pooling (NTAP) module that preserves local order and robustness under noise. For metric learning, we introduce a Distance-preserving Multi-kernel Triplet Loss (DMT) to preserve pairwise spatio-temporal distances in the native feature space and their rank order within the embedding, thereby reducing geometry distortion and improving cross-domain generalization. Experiments on urban mobility and badminton datasets show that TRIDENT outperforms strong baselines in retrieval accuracy, efficiency, and cross-domain generalization. Furthermore, the learned embeddings capture spatio-temporal sequence patterns, facilitating tactical analysis of badminton rallies via silhouette-guided spectral clustering that provides more actionable insights than direct trajectory classification. An anonymous repo with code and data is in the supplement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4454. Task-Adaptive Parameter-Efficient Fine-Tuning for Weather Foundation Models
๐ฌ 32
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Weather Foundation Model Parameter-Efficient Fine-Tuning Earth Science
While recent advances in machine learning have equipped Weather Foundation Models (WFMs) with substantial generalization capabilities across diverse downstream tasks, the escalating computational requirements associated with their expanding scale increasingly hinder practical deployment. Current Parameter-Efficient Fine-Tuning (PEFT) methods, designed for vision or language tasks, fail to address the unique challenges of weather downstream tasks, such as variable heterogeneity, resolution diversity, and spatiotemporal coverage variations, leading to suboptimal performance when applied to WFMs. To bridge this gap, we introduce WeatherPEFT, a novel PEFT framework for WFMs incorporating two synergistic innovations. First, during the forward pass, Task-Adaptive Dynamic Prompting (TADP) dynamically injects the embedding weights within the encoder to the input tokens of the pre-trained backbone via internal and external pattern extraction, enabling context-aware feature recalibration for specific downstream tasks. Furthermore, during backpropagation, Stochastic Fisher-Guided Adaptive Selection (SFAS) not only leverages Fisher information to identify and update the most task-critical parameters, thereby preserving invariant pre-trained knowledge, but also introduces randomness to stabilize the selection. We demonstrate the effectiveness and efficiency of WeatherPEFT on three downstream tasks, where existing PEFT methods show significant gaps versus Full-Tuning, and WeatherPEFT achieves performance parity with Full-Tuning using fewer trainable parameters. The code of this work is available at https://anonymous.4open.science/r/WeatherPEFT-A068.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4455. Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization
๐ฌ 5
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Transformer Hopfield Energy Principled Model Design
Transformer-based models have achieved remarkable success, but their core components, Transformer layers, are largely heuristics-driven and engineered from the bottom up, calling for a prototypical model with high interpretability and practical competence. To this end, we conceptualize a principled, top-down approach grounded in energy-based interpretation. Specifically, we formalize token dynamics as a joint maximum likelihood estimation on the hypersphere, featuring two properties: semantic alignment in the high-dimensional space and distributional uniformity in the low-dimensional space. By quantifying them with extended Hopfield energy functions, we instantiate this idea as a constrained energy minimization problem, which enables designs of symmetric attention and feedforward modules with RMS normalization. We further present *Hyper-Spherical Energy Transformer* (Hyper-SET), a recurrent-depth alternative to vanilla Transformers naturally emerging from iterative energy optimization on the hypersphere. With shared parameters across layers, Hyper-SET can scale to arbitrary depth with fewer parameters. Theoretically grounded and compact, it achieves competitive or superior performance across diverse tasks, including Sudoku solving, image classification, and masked image modeling. We also design novel variations under the proposed general principle, such as linear attention and gated feedforward layer, and showcase its scalability with depth-wise LoRA. Our results highlight Hyper-SET as a step toward interpretable and principled Transformer design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4456. Towards Learned Optimization Free Lunch
๐ฌ 19
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ learned optimization meta-learning
Learned optimizers are powerful alternatives to hand-designed rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, the prior work VeLO scaled meta-training to 4,000 TPU months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, but it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture, and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO's compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL 1.3B) which is six orders of magnitude larger than its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution ML workloads and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. In all, this work paves the way for practically applicable _learnable_ optimization algorithms, unlocking exploration of richer meta-training and data curation recipes to further improve performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4457. Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
๐ฌ 22
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ finetuning data stealing open-source LLMs backdoor training
Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific models. Surprisingly, we reveal a new and concerning risk that accompanies this practice: the provider of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popular open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3\% of the downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9\% in more ideal settings. We further investigate several defense strategies, but none achieve satisfactory effectiveness in mitigating the risk. Overall, we highlight the urgency of this newly identified data-breaching risk in fine-tuning, and we hope follow-up research will make further progress on addressing it.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4458. DPQuant: Efficient and Private Model Training via Dynamic Quantization Scheduling
๐ฌ 9
โญ 4.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ differential privacy quantization
Differentially-Private SGD (DP-SGD) is a powerful technique to protect user privacy when using sensitive data to train neural networks. During training, converting model weights and activations into low-precision formats, i.e., quantization, can drastically reduce training times, energy consumption, and cost, and is thus a widely used technique. In this work, we demonstrate that quantization causes significantly higher accuracy degradation in DP-SGD compared to regular SGD. We observe that this is caused by noise injection in DP-SGD, which amplifies quantization variance, leading to disproportionately large accuracy degradation. To address this challenge, we present DPQuant, a dynamic quantization framework that adaptively selects a changing subset of layers to quantize at each epoch. Our method combines two key ideas that effectively reduce quantization variance: (i) probabilistic sampling of the layers that rotates which layers are quantized every epoch, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator to identify layers that can be quantized with minimal impact on model quality. This estimator consumes a negligible fraction of the overall privacy budget, preserving DP guarantees. Empirical evaluations on ResNet18, ResNet50, and DenseNet121 across a range of datasets demonstrate that DPQuant consistently outperforms static quantization baselines, achieving near Pareto-optimal accuracy-compute trade-offs and up to $2.21\times$ theoretical throughput improvements on low-precision hardware, with less than 2% drop in validation accuracy.
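A rough sketch (ours) of the per-epoch selection idea: layers with lower privately estimated loss sensitivity are more likely to be quantized in a given epoch. The DP sensitivity estimator itself is not reproduced; the values passed in are assumed to come from it.

```python
# Sketch: rotating, sensitivity-weighted choice of which layers to quantize this epoch.
import numpy as np

def sample_layers_to_quantize(sensitivity, budget, rng=None):
    """sensitivity: per-layer scores (higher = more fragile); budget: #layers to quantize."""
    rng = rng or np.random.default_rng()
    inv = 1.0 / (np.asarray(sensitivity, dtype=float) + 1e-8)
    probs = inv / inv.sum()
    return rng.choice(len(probs), size=budget, replace=False, p=probs)

# usage each epoch: quantize_set = set(sample_layers_to_quantize(sens_scores, budget=4))
```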
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4459. Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials
๐ฌ 29
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ AI for Materials Deep Learning Electronic-Structure Hamiltonian Prediction
Deep learning methods for electronic-structure Hamiltonian prediction have offered significant computational efficiency advantages over traditional density functional theory (DFT), yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to the generalization performance. In this work, we contribute on both the methodology and dataset sides to advance a universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose $NextHAM$, a neural E(3)-symmetry and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed by the initial charge density of DFT, as informative input descriptors that enable the model to effectively capture prior knowledge of electronic structures. Second, we present a neural Transformer architecture with strict E(3)-Symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of "ghost states" caused by the large condition number of the overlap matrix. On the dataset side, we curate a broad-coverage large benchmark, namely Materials-HAM-SOC, comprising $17,000$ material structures spanning more than $60$ elements from six rows of the periodic table and explicitly incorporating spin-orbit coupling (SOC) effects, providing high-quality data resources for training and evaluation. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy in predicting Hamiltonians and band structures, with the spin off-diagonal blocks reaching sub-$\mu$eV accuracy. These results establish NextHAM as a universal and highly accurate deep learning model for electronic-structure prediction, delivering DFT-level precision with dramatically improved computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4460. Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
๐ฌ 27
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Generative Models Human Motion Synthesis Representation Alignment Pose-Free Animation
Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence.
Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4461. VQ-Transplant: Efficient VQ-Module Integration for Pre-trained Visual Tokenizers
๐ฌ 30
โญ 4.50
๐ generative models
๐ท๏ธ VQ-Transplant Plug-and-play integration Computational cost reduction Pre-trained tokenizers
Vector Quantization (VQ) underpins modern discrete visual tokenization. However, training quantization modules for state-of-the-art VQ-based models demands significant computational resources, which in practice all but prevents the development of novel, cutting-edge VQ techniques under resource constraints. To address this limitation, we propose VQ-Transplant, a simple framework that enables plug-and-play integration of new VQ modules into frozen, pre-trained tokenizers by replacing their native VQ modules. Crucially, the proposed transplantation process preserves all encoder-decoder parameters, obviating the need for costly end-to-end retraining when modifying the quantization method. To mitigate decoder-quantization mismatch, we introduce a lightweight decoder adaptation strategy (trained for only 5 epochs on ImageNet-1k) to align feature priors with the new quantization space. In our empirical evaluation, we find that VQ-Transplant achieves near state-of-the-art reconstruction fidelity for industry-level models like VAR while reducing the training cost by 95%. VQ-Transplant democratizes quantization research by enabling resource-efficient integration of novel VQ techniques while matching industry-level reconstruction performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4462. Parallel Token Generation for Language Models
๐ฌ 15
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ transformer autoregressive model multi-token prediction generative model large language models
Autoregressive transformers are the backbone of modern large language models. Despite their success, inference remains slow due to strictly sequential prediction. Prior attempts to predict multiple tokens per step typically impose independence assumptions across tokens, which limits their ability to match the full expressiveness of standard autoregressive models. In this work, we break this paradigm by proposing an efficient and universal framework to jointly predict multiple tokens in a single transformer call, without limiting the representational power. Inspired by ideas from inverse autoregressive normalizing flows, we convert a series of random variables deterministically into a token sequence, incorporating the sampling procedure into a trained model. This allows us to train parallelized models both from scratch and by distilling an existing autoregressive model. Empirically, our distilled model matches its teacher's output for an average of close to 50 tokens on toy data and 5 tokens on a coding dataset, all within a single forward pass.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4463. RAVEN: End-to-end Equivariant Robot Learning with RGB Cameras
๐ฌ 16
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotic Manipulation Policy Learning Equivariance
Recent work has shown that equivariant policy networks can achieve strong performance on robot manipulation tasks with limited human demonstrations. However, existing equivariant methods typically require structured inputs, such as 3D point clouds or top-down camera views, which prevents their use in low-cost setups or dynamic environments. In this work, we propose the first $\mathrm{SE}(3)$-equivariant policy learning framework that operates with only RGB image observations. The key insight is to treat image-based data as collections of rays that, unlike 2D pixels, transform under 3D roto-translations. Extensive experiments in both simulation with diverse robot configurations and real-world settings demonstrate that our method consistently surpasses strong baselines in both performance and efficiency.
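A tiny illustration of the core observation, under standard pinhole-camera assumptions (the helpers and ray construction below are illustrative, not the paper's pipeline): each pixel yields a ray with a 3D origin and direction, and an SE(3) group element acts on that ray directly, unlike on raw 2D pixels.

```python
import numpy as np

def pixel_rays(K_inv, pixels, cam_pose):
    """Build per-pixel rays from a pinhole camera. pixels: (N, 2); cam_pose: (R, t) camera-to-world."""
    R, t = cam_pose
    homog = np.concatenate([pixels, np.ones((pixels.shape[0], 1))], axis=1)  # (N, 3) homogeneous pixels
    dirs = (homog @ K_inv.T) @ R.T                                           # back-project, rotate to world frame
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    origins = np.broadcast_to(t, dirs.shape)                                 # all rays start at the camera center
    return origins, dirs

def transform_rays(origins, dirs, R_g, t_g):
    """Apply an SE(3) element g = (R_g, t_g): origins are roto-translated, directions only rotated."""
    return origins @ R_g.T + t_g, dirs @ R_g.T
```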
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4464. $\textit{MADFormer}$: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation
๐ฌ 9
โญ 4.50
๐ generative models
๐ท๏ธ Autoregressive Diffusion Continuous Image Generation
Recent progress in multimodal generation has increasingly combined autoregressive (AR) and diffusion-based approaches, leveraging their complementary strengths: AR models capture long-range dependencies and produce fluent, context-aware outputs, while diffusion models operate in continuous latent spaces to refine high-fidelity visual details. However, existing hybrids often lack systematic guidance on how and why to allocate model capacity between these paradigms. In this work, we introduce $\textit{MADFormer}$, a Mixed Autoregressive and Diffusion Transformer that serves as a testbed for analyzing AR-diffusion trade-offs. $\textit{MADFormer}$ partitions image generation into spatial blocks, using AR layers for one-pass global conditioning across blocks and diffusion layers for iterative local refinement within each block. Through controlled experiments on FFHQ-1024 and ImageNet, we identify two key insights: (1) block-wise partitioning significantly improves performance on high-resolution images, and (2) vertically mixing AR and diffusion layers yields better quality-efficiency balances---improving FID by up to 75\% under constrained inference compute. Our findings offer practical design principles for future hybrid generative models. Code and models will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4465. Cross-Embodied Co-Design for Dexterous Hands
๐ฌ 10
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Co-Design Manipulation Robotics Cross Embodiment Robot Hands Robot Learning Reinforcement Learning Hardware Design
Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4466. Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration
๐ฌ 22
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Collaboration Multi-Agent LLM
With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model's domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo.
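A minimal end-to-end sketch of the pipeline described above, assuming hypothetical helpers `call_llm(agent, prompt)` and `embed(text)`; the relevance scores and the pooling rule are illustrative stand-ins rather than the authors' implementation.

```python
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / max(den, 1e-12)

def graph_of_agents(query, agents, call_llm, embed, k=3):
    # 1) Node sampling: keep the k agents whose model cards best match the query.
    q = embed(query)
    ranked = sorted(agents, key=lambda a: cosine(embed(a["model_card"]), q), reverse=True)[:k]

    # 2) Initial responses; edges follow the relevance ordering of the responses.
    responses = {a["name"]: call_llm(a, query) for a in ranked}
    order = sorted(ranked, key=lambda a: cosine(embed(responses[a["name"]]), q), reverse=True)

    # 3) Directed message passing: more relevant agents help refine less relevant ones...
    for i, src in enumerate(order):
        for dst in order[i + 1:]:
            prompt = (f"Question: {query}\nA peer answered:\n{responses[src['name']]}\n"
                      f"Improve your answer:\n{responses[dst['name']]}")
            responses[dst["name"]] = call_llm(dst, prompt)
    # ...then reverse passing lets the refined answers flow back to the top-ranked agents.
    for i, dst in enumerate(order):
        for src in order[i + 1:]:
            prompt = (f"Question: {query}\nA peer answered:\n{responses[src['name']]}\n"
                      f"Refine your answer:\n{responses[dst['name']]}")
            responses[dst["name"]] = call_llm(dst, prompt)

    # 4) Graph pooling (max-style): return the answer closest to the query embedding.
    return max(responses.values(), key=lambda r: cosine(embed(r), q))
```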
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4467. Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions
๐ฌ 8
โญ 4.50
๐ reinforcement learning
๐ท๏ธ neural network localized learning reinforcement learning biologically plausible learning
The Forward-Forward (FF) Algorithm is a recently proposed learning procedure for neural networks that employs two forward passes instead of the traditional forward and backward passes used in backpropagation. However, FF remains largely confined to supervised settings, leaving a gap in domains such as RL where learning signals arise more naturally. In this work, inspired by FF's goodness function using layer activity statistics, we introduce Action-conditioned Root mean squared Q-Functions (ARQ), a novel value estimation method that applies a goodness function and action conditioning for local RL using temporal difference learning. Despite its simplicity and biological grounding, our approach achieves superior performance compared to state-of-the-art local backprop-free RL methods on the MinAtar and DeepMind Control Suite benchmarks, while also outperforming algorithms trained with backpropagation on most tasks.
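A toy sketch of the idea, assuming the Q-value of a state-action pair is the root mean square of one layer's activations when a one-hot action is appended to the input, and that each layer is trained with its own TD target; these specifics are illustrative rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARQLayer(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.fc = nn.Linear(state_dim + n_actions, hidden)
        self.n_actions = n_actions

    def goodness(self, state, action):
        a = F.one_hot(action, self.n_actions).float()          # action conditioning via one-hot input
        h = torch.relu(self.fc(torch.cat([state, a], dim=-1)))
        return h.pow(2).mean(dim=-1).sqrt()                    # RMS of activations acts as Q(s, a)

    def q_all(self, state):
        batch = state.shape[:1]
        return torch.stack([self.goodness(state, torch.full(batch, i, dtype=torch.long))
                            for i in range(self.n_actions)], dim=-1)

def td_update(layer, opt, s, a, r, s_next, done, gamma=0.99):
    with torch.no_grad():                                      # bootstrap target from the same layer
        target = r + gamma * (1 - done) * layer.q_all(s_next).max(dim=-1).values
    loss = (layer.goodness(s, a) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```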
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4468. Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints
๐ฌ 13
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ 3D manipulation Imitation Learning Coarse-to-fine Policy
Hierarchical coarse-to-fine policies, where a coarse branch predicts a region of interest to guide a fine-grained action predictor, have demonstrated significant potential in robotic 3D manipulation tasks, especially in enhancing sample efficiency and enabling more precise manipulation.
However, even augmented with pre-trained models, these hierarchical policies still suffer from generalization issues.
To enhance generalization to novel instructions and environment variations, we propose Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation.
Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability.
Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12\% higher average success rate than the SOTA method while using only 1/5 of the training trajectories.
In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4469. Learning AND-OR Templates for Compositional Representation in Art and Design
๐ฌ 9
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ AND-OR Template Compositional Template Representation Semi-Supervised Learning Maximum-Entropy
This work proposes a compositional AND-OR template for art and design that encodes the part-relation-geometry organization of images in a structured and interpretable form. Within a maximum-entropy log-linear model, we define a unified consistency score as log-likelihood gain against a reference distribution and decompose it into term-level evidence, enabling an evidence-to-prescription mapping for actionable composition guidance. Learning is performed by a penalized EM-style block-pursuit with sparsity and local mutual exclusivity: object templates are learned first and reused as scene terminals to induce scene templates. A semi-supervised structural expansion, which is triggered by matching gain and structural-consistency thresholds, bootstraps new branches from unlabeled, high-quality images. Evaluations on a curated compositional dataset and AVA/AADB themes show strong agreement with expert paradigms, interpretable parse trees, and competitive performance with deep baselines while exhibiting higher alignment with human ratings. The learned templates also act as lightweight structural conditions to steer AIGC generation and layout design. Overall, the framework delivers a transferable structural prior with favorable data/parameter efficiency and a unified pathway for explainable visual assessment and generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4470. Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
๐ฌ 23
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ hierarchical reasoning LLM reinforcement learning
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling" and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose Hierarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. Our extensive experiments validate that HICRA significantly outperforms strong baselines, and offer deep insights into how reasoning advances through the lens of strategic exploration.
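A hedged sketch of what concentrating credit on planning tokens could look like inside a clipped, GRPO-style objective; the `planning_mask` (how planning tokens are identified) and the weighting rule are assumptions, not the authors' algorithm.

```python
import torch

def hierarchy_aware_loss(logprobs, old_logprobs, advantages, planning_mask, alpha=2.0, clip=0.2):
    """logprobs, old_logprobs, advantages, planning_mask: (batch, seq_len) tensors."""
    weights = torch.where(planning_mask.bool(),
                          alpha * torch.ones_like(advantages),   # amplify the signal on planning tokens
                          torch.ones_like(advantages))
    ratio = (logprobs - old_logprobs).exp()
    unclipped = ratio * advantages * weights
    clipped = ratio.clamp(1 - clip, 1 + clip) * advantages * weights
    return -torch.minimum(unclipped, clipped).mean()
```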
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4471. Knowledge Externalization: Reversible Unlearning and Modular Retrieval in Multimodal Large Language Models
๐ฌ 22
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine unlearning Multimodal Large Language Model
Multimodal Large Language Models (MLLMs) achieve remarkable cross-modal understanding by training on vast web-scale datasets, but inadvertently internalize sensitive personal and proprietary information. Existing machine unlearning methods address this by irreversibly altering model parameters to permanently erase knowledge. This destructive paradigm conflicts with modern privacy regulations that mandate auditable, reversible, and user-controllable data management. To address these challenges, we propose Knowledge Externalization, a novel framework for reversible and modular knowledge management in MLLMs. We first propose Dual-Stream Memory Tuning, a method that transfers targeted knowledge from a model's internal parameters into external memory tokens. To mitigate gradient interference when externalizing multiple concepts, we further introduce Soft Orthogonal Weighting, a technique that preserves the independence of each token. Our resulting framework demonstrates three key capabilities: (i) It achieves effective forgetting of target concepts within the base model, while enabling high-fidelity knowledge restoration using the corresponding memory token. (ii) It supports continuous knowledge editing, allowing the information stored within an external token to be dynamically updated post-externalization. (iii) It displays a remarkable emergent ability for compositionality, where multiple memory tokens (including edited ones) can be freely combined to simultaneously recover knowledge corresponding to each concept. Our source code will be released in the near future.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4472. GaussianFusion: Unified 3D Gaussian Representation for Multi-Modal Fusion Perception
๐ฌ 13
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Gaussian Representation BEV Representation Detection Occupancy
The bird's-eye view (BEV) representation enables multi-sensor features to be fused within a unified space, serving as the primary approach for achieving comprehensive multi-task perception. However, the discrete grid representation of BEV leads to significant detail loss and limits feature alignment and cross-modal information interaction in multimodal fusion perception. In this work, we break from the conventional BEV paradigm and propose a new universal framework for multi-task multi-modal fusion based on 3D Gaussian representation. This approach naturally unifies multi-modal features within a shared and continuous 3D Gaussian space, effectively preserving edge and fine texture details. To achieve this, we design a novel forward-projection-based multi-modal Gaussian initialization module and a shared cross-modal Gaussian encoder that iteratively updates Gaussian properties based on an attention mechanism. GaussianFusion is inherently a task-agnostic model, with its unified Gaussian representation naturally supporting various 3D perception tasks. Extensive experiments demonstrate the generality and robustness of GaussianFusion. On the nuScenes dataset, it outperforms the 3D object detection baseline BEVFusion by 2.6 NDS. Its variant surpasses GaussFormer on 3D semantic occupancy with 1.55 mIoU improvement while using only 30% of the Gaussians and achieving a 450% speedup.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4473. Learning to Generate Stylized Handwritten Text via a Unified Representation of Style, Content, and Noise
๐ฌ 10
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ handwriting text generation flow matching in-contaext generation
Handwritten Text Generation (HTG) seeks to synthesize realistic and personalized handwriting by modeling stylistic and structural traits. While recent diffusion-based approaches have advanced generation fidelity, they typically rely on auxiliary style or content encoders with handcrafted objectives, leading to complex training pipelines and limited interaction across factors. In this work, we present InkSpire, a diffusion transformer based model that unifies style, content, and noise within a shared latent space. By eliminating explicit encoders, InkSpire streamlines optimization while enabling richer feature interaction and stronger in-context generation. To further enhance flexibility, we introduce a multi-line masked infilling strategy that allows training directly on raw text-line images, together with a revised positional encoding that supports arbitrary-length multi-line synthesis and fine-grained character editing. Moreover, InkSpire is trained on a bilingual ChineseโEnglish corpus, enabling a single model to handle both Chinese and English handwriting generation with high fidelity and stylistic diversity, thereby overcoming the need for language-specific systems. Extensive experiments on IAM and ICDAR2013 demonstrate that InkSpire achieves superior structural accuracy and stylistic diversity compared to prior state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4474. What Matters for Batch Online Reinforcement Learning in Robotics?
๐ฌ 8
โญ 4.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning autonomous improvement imitation learning
The ability to learn from large batches of autonomously collected data for policy improvement---a paradigm we refer to as batch online reinforcement learning---holds the promise of enabling truly scalable robot learning by significantly reducing the need for human effort of data collection while getting benefits from self-improvement. Yet, despite the promise of this paradigm, it remains challenging to achieve due to algorithms not being able to learn effectively from the autonomous data. For example, prior works have applied imitation learning and filtered imitation learning methods to the batch online RL problem, but these algorithms often fail to efficiently improve from the autonomously collected data or converge quickly to a suboptimal point. This raises the question of what matters for effective batch online reinforcement learning in robotics. Motivated by this question, we perform a systematic empirical study of three axes---(i) algorithm class, (ii) policy extraction methods, and (iii) policy expressivity---and analyze how these axes affect performance and scaling with the amount of autonomously collected data. Through our analysis, we make several observations. First, we observe that the use of Q-functions to guide batch online RL significantly improves performance over imitation-based methods. Building on this, we show that an implicit method of policy extraction---via choosing the best action in the distribution of the policy---is necessary over traditional explicit policy extraction methods from offline RL. Next, we show that an expressive policy class is preferred over less expressive policy classes. Based on this analysis, we propose a general recipe for effective batch online RL. We then show a simple addition to the recipe, namely using temporally-correlated noise to obtain more diversity, results in further performance gains. Our recipe obtains significantly better performance and scaling compared to prior methods.
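A minimal sketch of two ingredients from the recipe above, under assumed interfaces (`policy.sample(state, n)` draws candidate actions, `q(state, action)` scores them): implicit policy extraction picks the best sampled action under the Q-function, and Ornstein-Uhlenbeck noise supplies the temporally correlated exploration used during autonomous collection.

```python
import numpy as np

def implicit_policy_extraction(policy, q, state, n_samples=32):
    """Pick the best action among samples from an expressive policy, rather than training an explicit argmax policy."""
    candidates = policy.sample(state, n_samples)              # (n_samples, action_dim)
    scores = np.array([q(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise for more diverse autonomous rollouts."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        self.x += -self.theta * self.x * self.dt + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        return self.x.copy()
```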
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4475. Patching Gaps In LLM Reasoning With Interventional Training
๐ฌ 24
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM reasoning intervention SFT RL
Reinforcement learning (RL) training of large language models (LLMs) is limited by the policy's ability to generate rollouts with non-zero rewards: without such rewards, the policy is not updated and learning is stalled on hard problems, which are problems that the policy consistently fails to sample any correct rollouts for. We find that many hard problems remain unsolved due to the repeated generation of incorrect intermediate steps in a long reasoning trace; identifying and fixing these requires performing better \emph{credit assignment}. But existing approaches for credit assignment are either impractical or impose a substantial data-writing burden on oracles (\textit{e.g.}, humans). In this paper, we introduce \textbf{Interventional Training} (InT), a framework that leverages single-step oracle interventions to improve LLM reasoning. Given a reasoning attempt and ground-truth answer, the oracle detects and then provides language feedback on a single intermediate reasoning step, which is much cheaper than obtaining a full reasoning trace. InT then \emph{patches} the LLM by running supervised fine-tuning on the on-policy rollout up to the error, followed by the correction from the oracle. RL on this patched model now generates counterfactual traces, and with merely $\approx 100$ interventions from the oracle, InT solves 16\% more hard test problems that were previously unsolved (only zero rewards) and also improves performance across multiple standard evals.
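An illustrative sketch of how a single-step intervention could be turned into a supervised fine-tuning example: the on-policy prefix up to the flagged error becomes the context and the oracle's corrected step becomes the target. The `oracle` interface is an assumption.

```python
def build_patch_example(prompt, rollout_steps, oracle):
    """rollout_steps: list of reasoning steps sampled from the current policy."""
    err_idx, corrected_step = oracle(prompt, rollout_steps)   # single-step intervention (index + feedback)
    prefix = rollout_steps[:err_idx]                          # keep the on-policy prefix before the error
    return {
        "input": prompt + "\n" + "\n".join(prefix),
        "target": corrected_step,                             # supervise only the patched step
    }
```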
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4476. TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations
๐ฌ 26
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Causal Discovery Benchmark Robustness Time-Series Causality
Causal Discovery (CD) is a powerful framework for scientific inquiry. Yet, its practical adoption is hindered by a reliance on strong, often unverifiable assumptions and a lack of robust performance assessment. To address these limitations and advance empirical CD evaluation, we present **TCD-Arena**, a modularized and extendable testing kit to assess the robustness of time series CD algorithms against stepwise more severe assumption violations. For demonstration, we conduct an extensive empirical study comprising over 50 million individual CD attempts and reveal nuanced robustness profiles for 27 distinct assumption violations. Further, we investigate CD ensembles and find that they can boost general robustness, which has implications for real-world applications. With this, we strive to ultimately facilitate the development of CD methods that are reliable for a diverse range of synthetic and potentially real-world data conditions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4477. Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
๐ฌ 21
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Inverse Reinforcement Learning LLM Alignment Group Relative Policy Optimization
Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based, training a reward model on preference pairs and optimizing with reinforcement learning (RL), or reward-free, directly fine-tuning on ranked outputs. Recent research shows that well-tuned reward-based pipelines remain the most robust, and that single-response demonstrations can outperform pairwise preference data.
However, two key challenges remain: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains.
To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning.
We first train category-specific reward models via IRL, using a balanced safety dataset covering seven harmful categories as demonstrations.
Then we enhance Group Relative Policy Optimization (GRPO) by introducing dynamic reward scaling, which adjusts rewards according to task difficulty: data-level hardness measured by text-encoder cosine similarity and model-level responsiveness measured by reward gaps.
Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
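A hedged sketch of dynamic reward scaling in a GRPO-style pipeline, assuming data-level hardness is proxied by cosine similarity between the prompt and a category centroid and model-level responsiveness by the within-group reward gap; the exact scaling rule in the paper may differ.

```python
import numpy as np

def scale_rewards(rewards, prompt_emb, centroid_emb, lam=0.5):
    rewards = np.asarray(rewards, dtype=float)                 # rewards of one rollout group
    sim = float(prompt_emb @ centroid_emb /
                (np.linalg.norm(prompt_emb) * np.linalg.norm(centroid_emb) + 1e-12))
    hardness = 1.0 - sim                                       # data-level hardness (lower similarity = harder)
    gap = rewards.max() - rewards.min()                        # model-level responsiveness
    scale = 1.0 + lam * hardness + lam * (1.0 - min(gap, 1.0)) # boost hard, unresponsive prompts
    return scale * rewards

def group_relative_advantages(scaled_rewards):
    r = np.asarray(scaled_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)                   # GRPO group-normalized advantages
```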
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4478. Reinforcement Unlearning via Group Relative Policy Optimization
๐ฌ 19
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine Unlearning Group Relative Policy Optimization Reinforcement Learning Large Language Models Preference Optimization
During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Today's unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, allowing safe and consistent unlearning. Our approach achieves up to 46$\times$ lower token usage per target than state-of-the-art methods, while improving fluency by +5.48\% and adversarial robustness by +12.02\% over the base model. Extensive evaluation on the Real World Knowledge Unlearning (RWKU) benchmark shows that PURGE reaches 11\% unlearning effectiveness while preserving 98\% of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
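A toy version of an intrinsic unlearning reward that penalizes any mention of forbidden concepts, as described above; the penalty weight and the optional fluency bonus are illustrative assumptions.

```python
import re

def purge_reward(response, forbidden_terms, fluency_score=None, penalty=1.0, bonus=0.1):
    text = response.lower()
    leaks = sum(len(re.findall(r"\b" + re.escape(t.lower()) + r"\b", text))
                for t in forbidden_terms)                      # count mentions of forbidden concepts
    reward = -penalty * leaks
    if fluency_score is not None:                              # optionally keep the model fluent while forgetting
        reward += bonus * fluency_score
    return reward
```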
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4479. NFT: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
๐ฌ 4
โญ 4.50
๐ generative models
๐ท๏ธ Reinforcement Learning Supervised Learning GRPO LLM
Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling verification-driven training through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) --- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an \textit{implicit} negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs' generations.
We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that, by additionally leveraging negative feedback, NFT significantly improves over SL baselines such as rejection fine-tuning, and matches or even surpasses leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they have entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4480. DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs
๐ฌ 18
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Graph Scheduling Distributed Systems
We study the problem of assigning operations in a dataflow graph to devices to minimize execution time in a work-conserving system, with emphasis on complex machine learning workloads. Prior learning-based approaches face three limitations: (1) reliance on bulk-synchronous frameworks that under-utilize devices, (2) learning a single placement policy without modeling the system dynamics, and (3) depending solely on reinforcement learning in pre-training while ignoring optimization during deployment. We propose DOPPLER, a three-stage framework with two policies: SEL for selecting operations and PLC for placing them on devices. DOPPLER consistently outperforms baselines by reducing execution time and improving sampling efficiency through faster per-episode training. Our results show that DOPPLER achieves up to 52.7% lower execution times than the best baseline.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4481. Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration
๐ฌ 23
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Trustworthy LLMs Alignment for Honesty
Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, the latter demands costly, large-scale labeling.
We introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. This design substantially reduces annotation requirements while improving generalization across tasks. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals.
Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations ($\sim$0.18\% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
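A two-stage sketch in the spirit of elicitation-then-calibration, assuming self-consistency agreement as the cheap stage-one signal and Platt scaling on a small labeled pool as the stage-two calibrator (the calibrator choice is an assumption).

```python
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_consistency(sampled_answers):
    """Stage 1: agreement rate among sampled answers serves as cheap confidence supervision."""
    counts = Counter(sampled_answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(sampled_answers)

def calibrate(raw_confidences, correctness_labels):
    """Stage 2: map elicited confidence to calibrated probability using a small set of correctness labels."""
    X = np.asarray(raw_confidences, dtype=float).reshape(-1, 1)
    y = np.asarray(correctness_labels, dtype=int)
    return LogisticRegression().fit(X, y)

# calibrated_p = calibrate(conf_1k, labels_1k).predict_proba(np.asarray(conf_new).reshape(-1, 1))[:, 1]
```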
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4482. Learning to Solve Orienteering Problem with Time Windows and Variable Profits
๐ฌ 4
โญ 4.50
๐ optimization
๐ท๏ธ Learning to optimize vehicle routing problem orienteering problem
The orienteering problem with time windows and variable profits (OPTWVP) is common in many real-world applications and involves continuous time variables. Current approaches fail to develop an efficient solver for this orienteering problem variant with discrete and continuous variables. In this paper, we propose a learning-based two-stage DEcoupled discrete-Continuous optimization with Service-time-guided Trajectory (DeCoST), which aims to effectively decouple the discrete and continuous decision variables in the OPTWVP problem, while enabling efficient and learnable coordination between them. In the first stage, a parallel decoding structure is employed to predict the path and the initial service time allocation. The second stage optimizes the service times through a linear programming (LP) formulation and provides a long-horizon learning of service time estimation. We rigorously prove the global optimality of the second-stage solution. Experiments on OPTWVP instances demonstrate that DeCoST outperforms both state-of-the-art constructive solvers and the latest meta-heuristic algorithms in terms of solution quality and computational efficiency, achieving up to 6.6x inference speedup on instances with fewer than 500 nodes. Moreover, the proposed framework is compatible with various constructive solvers and consistently enhances the solution quality for OPTWVP.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4483. Scaling Attention via Feature Sparsity
๐ฌ 30
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Self-Attention Sparse Representation
Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: \emph{feature sparsity}. We propose \textbf{Sparse Feature Attention (SFA)}, where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from $\Theta(n^2 d)$ to $\Theta(n^2 k^2/d)$. To make this efficient at scale, we introduce \textbf{FlashSFA}, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and reducing FLOPs and KV-cache by nearly 50\%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss.
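A plain-numpy sketch of the feature-sparsity idea: queries and keys are reduced to k-sparse codes (top-k by magnitude) and each attention score sums only over overlapping nonzero features, so each pair touches at most k features per side. FlashSFA's IO-aware kernel is not reproduced here.

```python
import numpy as np

def topk_sparsify(x, k):
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]    # indices of the k largest-magnitude features
    vals = np.take_along_axis(x, idx, axis=-1)
    return idx, vals

def sparse_attention_scores(Q, K, k):
    qi, qv = topk_sparsify(Q, k)                               # (n, k) per-row sparse codes
    ki, kv = topk_sparsify(K, k)
    n, d = Q.shape
    scores = np.zeros((n, K.shape[0]))
    for a in range(n):
        q_map = dict(zip(qi[a].tolist(), qv[a].tolist()))
        for b in range(K.shape[0]):
            scores[a, b] = sum(q_map.get(j, 0.0) * v           # only overlapping features contribute
                               for j, v in zip(ki[b].tolist(), kv[b].tolist()))
    return scores / np.sqrt(d)
```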
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4484. MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
๐ฌ 19
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ MCP Safety Benchmark MCP Attack
Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present \textbf{MCP-SafetyBench}, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments. Our benchmark is available at https://anonymous.4open.science/r/MCP-SafetyBench-5738.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4485. PGRF-Net: A Prototype-Guided Relational Fusion Network for Diagnostic Multivariate Time-Series Anomaly Detection
๐ฌ 13
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multivariate Timeseries Anomaly Detection Time-Series Diagnostics Prototype Learning Relational Time-Series Modeling
Multivariate time-series anomaly detection (MTSAD) faces a critical trade-off between detection performance and model transparency. We propose PGRF-Net, a novel architecture designed to achieve state-of-the-art performance while providing actionable diagnostic insights. At its core, PGRF-Net uses a Multi-Faceted Evidence Extractor that combines prototype learning with the discovery of dynamic relational structures between variables. This extractor generates four distinct types of anomaly evidence: predictive deviation, structural changes in learned variable dependencies, contextual deviation from normal-behavior prototypes, and the magnitude of localized spike events. This evidence is then processed by an Adaptive Evidence Fusion Network, which learns to weigh each source via data-driven gating. PGRF-Net is trained via a two-stage unsupervised strategy for robust extractor learning and subsequent fusion tuning. Extensive experiments on five widely-used public MTSAD benchmarks demonstrate its superior detection performance. Crucially, by decomposing the final anomaly score into these four evidence types, our model facilitates diagnostic analysis, offering a practical step towards more transparent MTSAD.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4486. Sample-efficient and Scalable Exploration in Continuous-Time RL
๐ฌ 13
โญ 4.50
๐ reinforcement learning
๐ท๏ธ continuous-time reinforcement learning model-based RL intrinsic rewards epistemic uncertainty exploration-exploitation trade-off
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
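A minimal sketch of the optimistic objective, with an ensemble of learned dynamics models standing in for the GP/BNN epistemic uncertainty (an assumption) and a simple search over candidate actions.

```python
import numpy as np

def combrl_action(state, candidate_actions, ensemble_step, reward_fn, beta=1.0):
    """Pick the action maximizing extrinsic reward plus an epistemic-uncertainty bonus."""
    best_a, best_val = None, -np.inf
    for a in candidate_actions:
        preds = np.stack([step(state, a) for step in ensemble_step])   # (ensemble, state_dim) next-state predictions
        mean_next = preds.mean(axis=0)
        epistemic = preds.std(axis=0).mean()                           # ensemble disagreement as uncertainty
        val = reward_fn(state, a, mean_next) + beta * epistemic
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```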
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4487. Reward Model Routing in Alignment
๐ฌ 22
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-armed bandits Preference optimization Reward model Online DPO
Reinforcement learning from human or AI feedback (RLHF/RLAIF) has become the standard paradigm for aligning large language models (LLMs). However, most pipelines rely on a single reward model (RM), limiting alignment quality and risking overfitting. Recent work explores RM routing (dynamically selecting an RM from a candidate pool to exploit complementary strengths while maintaining $O(1)$ RM calls), but existing methods suffer from cold-start and insufficient exploration. We propose a hybrid routing framework that combines offline learning of RM strengths with online Bayesian selection. In the offline stage, a multi-task router is trained on preference data to estimate per-RM reliability. In the online stage, a Bayesian Thompson sampling router performs per-query RM selection, initializing RM-specific weight vectors with offline embeddings as Gaussian priors and adaptively updating their posteriors with online rewards to adapt to the evolving policy distribution. Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show that our framework consistently outperforms individual RMs, RM ensembling, and existing routing methods.
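A hedged sketch of the online stage: one Bayesian linear model per reward model (RM) with Thompson sampling over query features, priors initialized from offline-learned RM embeddings; interface names and shapes are assumptions.

```python
import numpy as np

class RMRouter:
    def __init__(self, offline_embeddings, dim, prior_var=1.0, noise_var=1.0):
        # offline_embeddings: dict mapping RM name -> offline embedding (prior mean), each of length dim
        self.A = {m: np.eye(dim) / prior_var for m in offline_embeddings}        # posterior precision
        self.b = {m: emb / prior_var for m, emb in offline_embeddings.items()}   # precision-weighted mean
        self.noise_var = noise_var

    def select(self, x):
        """Thompson sampling: sample a weight vector per RM, route the query to the highest score."""
        best, best_score = None, -np.inf
        for m in self.A:
            cov = np.linalg.inv(self.A[m])
            w = np.random.multivariate_normal(cov @ self.b[m], cov)
            score = float(w @ x)
            if score > best_score:
                best, best_score = m, score
        return best

    def update(self, m, x, reward):
        """Bayesian linear-regression posterior update for the RM that was chosen."""
        self.A[m] += np.outer(x, x) / self.noise_var
        self.b[m] += reward * x / self.noise_var
```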
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4488. On the Computational Limits of AI4S-RL: A Unified $\varepsilon$-$N$ Analysis
๐ฌ 21
โญ 4.50
๐ reinforcement learning
๐ท๏ธ AI for Science Reinforcement Learning PDE Control Surrogate Modeling
Recent work increasingly adopts AI for Science (AI4S) models to replace expensive PDE solvers as simulation environments for reinforcement learning (RL), enabling faster training in complex physical control tasks. However, using approximate simulators introduces modeling errors that affect the learned policy. In this paper, we introduce a unified $\varepsilon$-$N$ framework that quantifies the minimal computational cost $N^*(\varepsilon)$ required for an AI4S model to ensure that tabular RL can estimate the value function with unbiasedness, with probability at least $1 - \delta$. This characterization allows us to connect surrogate accuracy, grid resolution, and RL policy quality under a shared probabilistic language. We analyze how the discretization level $K$ of AI4S and RL space governs both PDE surrogate error and RL lattice approximation error, and we employ spectral theory and Sobolev estimates to derive optimal grid strategies that minimize total cost while preserving learning fidelity. Our theory reveals that different systems, such as ODE- and PDE-governed environments, require different allocations of effort between physical simulation and RL optimization. Overall, our framework offers a principled foundation for designing efficient, scalable, and cost-aware AI4S-RL systems with provable learning guarantees.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4489. Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs
๐ฌ 25
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Machine Unlearning
Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4490. Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning
๐ฌ 25
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Language Models Planning PDDL LLM Tool Use
Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning. In contrast, Planning Domain Definition Language (PDDL) planners excel at long-horizon formal planning, but cannot interpret visual inputs. Recent works combine these complementary advantages by enabling VLMs to turn visual planning problems into PDDL files for formal planning. However, while VLMs can generate PDDL problem files satisfactorily, they struggle to accurately generate the PDDL domain files, which describe all the planning rules. As a result, prior methods rely on human experts to predefine domain files or on constant environment access for refinement. We propose VLMFP, a Dual-VLM-guided framework that can autonomously generate both PDDL problem and domain files for formal visual planning. VLMFP introduces two VLMs to ensure reliable PDDL file generation: a SimVLM that simulates action consequences based on input rule descriptions, and a GenVLM that generates and iteratively refines PDDL files by comparing the PDDL and SimVLM execution results. VLMFP unleashes multiple levels of generalizability: the same generated PDDL domain file works for all the different instances under the same problem, and VLMs generalize to different problems with varied appearances and rules. We evaluate VLMFP with 6 grid-world domains and test its generalization to unseen instances, appearances, and game rules. On average, SimVLM accurately describes 95.5\% and 82.6\% of scenarios, simulates 85.5\% and 87.8\% of action sequences, and correctly judges goal reaching in 82.4\% and 85.6\% of cases for seen and unseen appearances, respectively. With the guidance of SimVLM, VLMFP can generate PDDL files that reach 70.0\% and 54.1\% valid plans for unseen instances in seen and unseen appearances, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4491. Go-Browse: Training Web Agents with Structured Exploration
๐ฌ 16
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ LLM Agents Web Agents Synthetic Data Exploration
One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current state-of-the-art results for sub-10B parameter models by 2.9%.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4492. LiveClin: A Live Clinical Benchmark without Leakage
๐ฌ 30
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ MultiModal Medical Benchmark
The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for the faithful replication of clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. We find that the era of "free lunch" improvements from simple model scaling is over, as newer models do not consistently outperform their predecessors. Furthermore, our analysis uncovers distinct reasoning weaknesses across model classes. LiveClin thus provides a continuously evolving, clinically-grounded framework to steer the development of medical LLMs towards greater reliability and real-world utility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4493. Steering Embedding Models with Geometric Rotation: Mapping Semantic Relationships Across Languages and Models
๐ฌ 17
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Embedding Models Steering Vectors High Dimensional Geometry LLMs Unit Hypersphere
Understanding how language and embedding models encode semantic relationships is fundamental to model interpretability.
While early word embeddings exhibited intuitive vector arithmetic ("king" - "man" + "woman" = "queen"), modern high-dimensional text representations lack straightforward interpretable geometric properties.
We introduce Rotor-Invariant Shift Estimation (RISE), a geometric approach that represents semantic-syntactic transformations as consistent rotational operations in embedding space, leveraging the manifold structure of modern language representations.
RISE operations transfer across both languages and models with little loss in performance, suggesting the existence of analogous cross-lingual geometric structure.
We compare and evaluate RISE using two baseline methods, three embedding models, three datasets, and seven morphologically diverse languages in five major language groups.
Our results demonstrate that RISE consistently maps discourse-level semantic-syntactic transformations with distinct grammatical features (e.g., negation and conditionality) across languages and models.
This work provides the first systematic demonstration that discourse-level semantic-syntactic transformations correspond to consistent geometric operations in multilingual embedding spaces, empirically supporting the Linear Representation Hypothesis at the sentence level.
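A small sketch of a rotation-style steering operation on the unit hypersphere: fit an orthogonal map between paired source/transformed sentence embeddings (orthogonal Procrustes is used here as a stand-in for the paper's rotor estimation) and apply it to steer new embeddings.

```python
import numpy as np

def fit_rotation(X, Y):
    """X, Y: (n_pairs, d) unit-normalized embeddings of source and transformed sentences."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt                                   # orthogonal R such that R @ x_i approximates y_i

def steer(embedding, R):
    v = R @ embedding
    return v / (np.linalg.norm(v) + 1e-12)          # project back onto the unit hypersphere
```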
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4494. Leveraging Explanation to Improve Generalization of Meta Reinforcement Learning
๐ฌ 31
โญ 4.50
๐ reinforcement learning
๐ท๏ธ meta-reinforcement learning generalization theory
A common and effective human strategy to improve a poor outcome is to first identify prior experiences most relevant to the outcome and then focus on learning from those experiences. This paper investigates whether this human strategy can improve generalization of meta-reinforcement learning (MRL). MRL learns a meta-prior from a set of training tasks such that the meta-prior can adapt to new tasks in a distribution. However, the meta-prior usually has imbalanced generalization, i.e., it adapts well to some tasks but adapts poorly to others. We propose a two-stage approach to improve generalization. The first stage identifies "critical" training tasks that are most relevant to achieve good performance on the poorly adapted tasks. The second stage improves generalization by encouraging the meta-prior to pay more attention to the critical tasks. We use conditional mutual information to mathematically formalize the notion of "paying more attention". We formulate a bilevel optimization problem to maximize the conditional mutual information by augmenting the critical tasks and propose an algorithm to solve the bilevel optimization problem. We theoretically guarantee that (1) the algorithm converges at the rate of $O(1/\sqrt{K})$ and (2) the generalization improves after the task augmentation. We use two real-world experiments, two MuJoCo experiments, and a Meta-World experiment to validate the algorithm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4495. pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
๐ฌ 13
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-Modal Adapter Personalized Federated Fine-Tuning Few-Shot Learning of Vision Language Models
Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during communication rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4496. PRISM: Progressive Robust Learning for Open-World Continual Category Discovery
๐ฌ 19
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Continual Category Discovery Generalized category discovery Domain shift
Continual Category Discovery (CCD) aims to leverage models trained on known categories to automatically discover novel category concepts from continuously arriving streams of unlabeled data, while retaining the ability to recognize previously known classes. Despite recent progress, existing methods often assume that data across all stages are drawn from a single, stationary distribution, a condition rarely satisfied in open-world scenarios. In this paper, we challenge this stationary-distribution assumption by introducing the Open-World Continual Category Discovery (OW-CCD) setting. We address this challenge with PRISM (\underline{P}rogressive \underline{R}obust d\underline{I}scovery under \underline{S}trea\underline{M}ing data), an adaptive continual discovery framework consisting of three key components. First, inspired by spectral properties, we develop a high-frequency-driven category separation technique that exploits high-frequency components, which preserve more global information, to distinguish known from unknown categories. Second, for known categories, we design a sparse assignment matching strategy, which performs proximal sparse sample-to-label matching to assign reliable cluster labels to known-class samples. Finally, to better recognize novel categories, we propose an invariant knowledge transfer module that enforces domain-invariant category relation consistency, thereby facilitating robust knowledge transfer from known to unknown classes under domain shifts. Extensive experiments on the SSB-C and DomainNet benchmarks demonstrate that our method significantly outperforms state-of-the-art CCD approaches, highlighting its effectiveness and superiority.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4497. Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
๐ฌ 29
โญ 4.50
๐ generative models
๐ท๏ธ Diffusion model 3D generation Graph diffusion
Rigged 3D assets are fundamental to 3D deformation and animation. However,
existing 3D generation methods face challenges in generating animatable geometry,
while rigging techniques lack fine-grained structural control over skeleton
creation. To address these limitations, we introduce Stroke3D, a novel framework
that directly generates rigged meshes from user inputs: 2D drawn strokes and a
descriptive text prompt. Our approach pioneers a two-stage pipeline that separates
the generation into: 1) Controllable Skeleton Generation, where we employ a
Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent
space, where a Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding.
The generation process is conditioned on both the text for semantics and the 2D
strokes for explicit structural control, with the VAE's decoder reconstructing the
final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig
and SKA-DPO, where we then synthesize a textured mesh conditioned on the
generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh
model by augmenting its training data with TextuRig, a dataset of textured and
rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ
a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh
alignment score, to further improve geometric fidelity. Together, our framework
enables a more intuitive workflow for creating ready-to-animate 3D content. To
the best of our knowledge, our work is the first to generate rigged 3D meshes
conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that
Stroke3D produces plausible skeletons and high-quality meshes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4498. CoDA: From Text-to-Image Diffusion Models to Truly Training-Free Dataset Distillation
๐ฌ 27
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Dataset Distillation Text-to-Image Diffusion Model Core Distribution Alignment
Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the ``intrinsic core distribution'' of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4\% at the 50-images-per-class (IPC) setup on ImageNet-1K.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4499. Tools are under-documented: Simple Document Expansion Boosts Tool Retrieval
๐ฌ 12
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ tool retrieval information retrieval
Large Language Models (LLMs) have recently demonstrated strong capabilities in tool use, yet progress in tool retrieval remains hindered by incomplete and heterogeneous tool documentation.
To address this challenge, we introduce Tool-DE, a new benchmark and framework that systematically enriches tool documentation with structured fields to enable more effective tool retrieval, together with two dedicated models, Tool-Embed and Tool-Rank. We design a scalable document expansion pipeline that leverages both open- and closed-source LLMs to generate, validate, and refine enriched tool profiles at low cost, producing large-scale corpora with 50k instances for embedding-based retrievers and 200k for rerankers. On top of this data, we develop two models specifically tailored for tool retrieval: Tool-Embed, a dense retriever, and Tool-Rank, an LLM-based reranker. Extensive experiments on ToolRet and Tool-DE demonstrate that document expansion substantially improves retrieval performance, with Tool-Embed and Tool-Rank achieving new state-of-the-art results on both benchmarks. We further analyze the contribution of individual fields to retrieval effectiveness, as well as the broader impact of document expansion on both training and evaluation. Overall, our findings highlight both the promise and limitations of LLM-driven document expansion, positioning Tool-DE, along with the proposed Tool-Embed and Tool-Rank, as a foundation for future research in tool retrieval.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4500. Unifying Diffusion and Autoregression for Generalizable Vision-Language-Action Model
๐ฌ 25
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Vision-language-action model Robotic manipulation
A central objective of manipulation policy design is to enable robots to comprehend human instructions and predict generalized actions in unstructured environments. Recent autoregressive vision-language-action (VLA) approaches discretize actions into bins to exploit the pretrained reasoning and generation paradigms of vision-language models (VLMs). While these models achieve efficient and scalable training, the discretization undermines the continuity required for precise control. In contrast, diffusion-based VLA methods incorporate an additional diffusion head to predict continuous actions, but they rely solely on feature representations extracted from the VLM, without leveraging the pretrained large language model (LLM) as an expert for iterative action generation. To integrate the complementary strengths of autoregressive and diffusion generation, we introduce HybridVLA, which innovatively leverages a shared LLM backbone to perform iterative action prediction through both paradigms. Specifically, a collaborative training recipe is proposed, incorporating diffusion denoising into the next-token prediction process and mitigating interference between the two generation paradigms. With this recipe, we find these two action prediction methods not only reinforce each other but also exhibit varying strengths across different scenarios. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses both predictions, leading to more robust control. HybridVLA outperforms previous state-of-the-art VLA methods by 17\% and 19\% in mean success rate on simulation and real-world tasks, respectively, while demonstrating generalization to unseen configurations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4501. Parameter-Efficient Reinforcement Learning using Prefix Optimization
๐ฌ 8
โญ 4.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning with verifiable rewards parameter efficient tuning
Reinforcement Learning with Verifiable Rewards (RLVR) is a leading approach for tuning language models on mathematical reasoning tasks. However, it remains unclear whether RLVR's gains stem from genuine reasoning improvements or simply from steering the model toward answer formats that already appear in the reference distribution. Inspired by recent evidence \citep{zhao2025echo,yue2025does}, we study this question by optimizing only the first $k$ tokens (e.g. $k=32$) of each solution, generating the remainder of the response from the reference model. We study two methods for prefix optimization: a naive algorithm that clusters prefixes and selects the best one (Prefix Clustering), and a method that optimizes the prefix by finetuning a lightweight adapter model with RL (Prefix-RL). We show that tuning only the first $k$ tokens can significantly improve the accuracy on math, suggesting that at least some of the gains from RL are due to upweighting a preferable solution strategy. Our results suggest that simple prefix optimization methods can provide an efficient alternative to RL, delivering substantial improvements across different models and benchmarks for a tiny fraction of the compute required for standard RL, and that these gains are robust across prefix lengths and random seeds.
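A minimal sketch of the Prefix Clustering idea, assuming candidate prefixes come with precomputed embeddings and reward estimates (e.g., accuracy of completions sampled from the reference model); the clustering and selection rule below are illustrative, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_prefix(prefixes, embeddings, rewards, n_clusters=8, seed=0):
    """Cluster candidate prefixes in embedding space and return a representative
    prefix from the cluster with the highest mean reward (illustrative sketch).
    Assumes len(prefixes) >= n_clusters."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    labels = km.labels_
    # Pick the cluster whose members achieve the highest mean reward.
    best_cluster = max(range(n_clusters),
                       key=lambda c: np.mean([r for r, l in zip(rewards, labels) if l == c]))
    # Within the best cluster, return the prefix closest to the centroid.
    idx = [i for i, l in enumerate(labels) if l == best_cluster]
    centroid = km.cluster_centers_[best_cluster]
    dists = np.linalg.norm(np.asarray(embeddings)[idx] - centroid, axis=1)
    return prefixes[idx[int(np.argmin(dists))]]
```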
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4502. RepSpec: Structural Re-parameterized Draft Model Training for Speculative Decoding
๐ฌ 12
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reparameterization Speculative Decoding
As the parameter size of large language models (LLMs) continues to grow, the latency of autoregressive inference increases due to memory-bound computational inefficiency. To address this, speculative decoding has been proposed, where a large target model verifies multiple tokens generated in parallel by a smaller draft model. However, the performance of speculative decoding is fundamentally limited by the draft model's capacity, which stems from the parameter gap between the two models. To overcome this limitation, we propose RepSpec, which combines structural re-parameterization with draft model training. During training, redundant linear structures are introduced and later merged into the backbone network during inference, thus enhancing the draft model's training effectiveness without increasing inference cost. By applying our method to improve the current state-of-the-art approach, EAGLE, we achieve a significant improvement in accepted sequence length. Furthermore, considering the specific characteristics of the speculative decoding scenario, we explore a hybrid training strategy that combines linear and nonlinear structures, which yields a further improvement in acceptance length.
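The merge step at the heart of structural re-parameterization can be illustrated for the simplest case of a parallel auxiliary linear branch; the sketch below folds the branch into the backbone weight at inference time. The actual redundant structures used by RepSpec may differ.

```python
import torch
import torch.nn as nn

class RepLinear(nn.Module):
    """Training-time: y = W_main x + W_aux x, expressed as two parallel branches.
    Inference-time: the auxiliary branch is merged into the main weight, so the
    extra capacity costs nothing at decode time."""
    def __init__(self, dim):
        super().__init__()
        self.main = nn.Linear(dim, dim)
        self.aux = nn.Linear(dim, dim)   # redundant branch, used only during training
        self.merged = False

    def forward(self, x):
        if self.merged:
            return self.main(x)
        return self.main(x) + self.aux(x)

    @torch.no_grad()
    def merge(self):
        self.main.weight += self.aux.weight   # fold the branch into the backbone
        self.main.bias += self.aux.bias
        self.merged = True
```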
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4503. Contextual Causal Bayesian Optimisation
๐ฌ 14
โญ 4.50
๐ optimization
๐ท๏ธ Bayesian Optimization Causality Optimal Control
We introduce a unified framework
for contextual and causal Bayesian optimisation,
which aims to design intervention policies
maximising the expectation of a target variable.
Our approach leverages both observed contextual
information and known causal graph structures
to guide the search. Within this framework, we
propose a novel algorithm that jointly optimises
over policies and the sets of variables on which
these policies are defined. This thereby extends and
unifies two previously distinct approaches:
Causal Bayesian Optimisation and Contextual
Bayesian Optimisation, while also addressing
their limitations in scenarios that yield
suboptimal results. We derive worst-case and
instance-dependent high-probability regret bounds
for our algorithm. We report experimental results
across diverse environments, corroborating that
our approach achieves sublinear regret and reduces
sample complexity in high-dimensional settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4504. Enhancing Complex Symbolic Logical Reasoning of Large Language Models via Sparse Multi-Agent Debate
๐ฌ 32
โญ 4.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Logical Reasoning Symbolic AI Multi-agent System Large Language Models
Large language models (LLMs) struggle with complex logical reasoning. Previous work has primarily explored single-agent methods, whose performance remains fundamentally limited by the capabilities of a single model. To our knowledge, this paper is the first to introduce a multi-agent approach specifically aimed at enhancing the logical reasoning abilities of LLMs. Considering the respective strengths and weaknesses of symbolic and natural language reasoning, we propose a multi-agent framework where individual agents reason in both symbolic and natural languages and then engage in a debate. To ensure the accuracy of symbolic translation, we also leverage multiple agents to translate and debate in different symbolic languages. Due to the prohibitive communication and token costs of multi-turn interactions, we further propose an adaptive sparse communication strategy to ensure efficiency. Specifically, our method prunes unnecessary communication by assessing agent confidence and information gains, allowing each agent to selectively maintain its memory with the most valuable outputs from other agents to help generate answers. Extensive experiments demonstrate that not only does our multi-agent debate framework outperform previous methods on logical reasoning tasks, but our sparse communication approach also outperforms fully-connected communication while reducing token costs by 25%, improving both effectiveness and efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4505. Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
๐ฌ 20
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM reasoning Reinforcement Learning
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient mechanism to address the above limitations via decomposing each iteration into three stages: a short fast trajectory of inner steps on the same batch, a reposition step to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on math reasoning benchmarks. It also achieves up to 4.93$\times$ fewer rollouts and a 4.19$\times$ reduction in wall-clock time to match GRPO's best accuracy.
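A hedged reading of the three-stage schedule in code: a few fast gradient steps on one batch, a reposition step implemented here as a simple interpolation back toward the pre-fast-phase parameters (an assumption about the exact rule), and one final slow step.

```python
import copy
import torch

def sfpo_step(policy, optimizer, batch, loss_fn, fast_steps=3, reposition=0.5):
    """Sketch of a slow-fast update: several fast inner steps on the same batch,
    a reposition step that pulls parameters back toward the starting point to
    limit off-policy drift, then one final slow correction step."""
    start = copy.deepcopy(policy.state_dict())           # anchor before the fast phase
    for _ in range(fast_steps):                           # fast trajectory on one batch
        optimizer.zero_grad()
        loss_fn(policy, batch).backward()
        optimizer.step()
    with torch.no_grad():                                 # reposition toward the anchor
        for name, p in policy.named_parameters():
            p.lerp_(start[name], reposition)
    optimizer.zero_grad()                                 # final slow correction
    loss_fn(policy, batch).backward()
    optimizer.step()
```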
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4506. Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
๐ฌ 14
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Discrete Flow Matching Guidance Posterior Sampling
Guidance provides a simple and effective framework for posterior sampling by steering the generation process towards the desired distribution. When modeling discrete data, existing approaches mostly focus on guidance with the first-order Taylor approximation to improve the sampling efficiency. However, such an approximation is inappropriate in discrete state spaces since the approximation error could be large. A novel guidance framework for discrete data is proposed to address this problem: we derive the exact transition rate for the desired distribution given a learned discrete flow matching model, leading to guidance that only requires a single forward pass in each sampling step, significantly improving efficiency. This unified framework is general, encompassing existing guidance methods as special cases, and it can also be seamlessly applied to masked diffusion models. We demonstrate the effectiveness of our proposed guidance on energy-guided simulations and preference alignment on text-to-image generation and multimodal understanding tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4507. WholeBodyVLA: Towards Unified Latent VLA for Whole-body Loco-manipulation Control
๐ฌ 18
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Humanoid Robots Vision-Language-Action Model Locomotion and Manipulation Whole-Body Control Reinforcement Learning
Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient data collection pipeline is devised to augment the dataset and scale the benefits. To more precisely execute the desired locomotion commands, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is the first of its kind to enable large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks. Code and checkpoints will be made public.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4508. FARI: Robust One-Step Inversion for Watermarking in Diffusion Models
๐ฌ 27
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ digital watermark diffusion models inversion
Inversion-based watermarking is a promising approach to authenticate diffusion-generated images, yet practical use is bottlenecked by inversion that is both slow and error-prone. While the primary challenge in the watermarking setting is robustness against external distortions, existing approaches over-optimize internal truncation error, and because that error scales with the sampler step size, they are inherently confined to high-NFE (number of function evaluations) regimes that cannot meet the dual demands of speed and robustness. In this work, we have two key observations: (i) the inversion trajectory has markedly lower curvature than the forward generation path does, making it highly compressible and amenable to low-NFE approximation; and (ii) in inversion for watermark verification, the trade-off between speed and truncation error is less critical, since external distortions dominate the error. A faster inverter provides a dual benefit: it is not only more efficient, but it also enables end-to-end adversarial training to directly target robustness, a task that is computationally prohibitive for the original, lengthy inversion trajectories. Building on this, we propose **FARI** (**F**ast **A**symmetric **R**obust **I**nversion), a one-step inversion framework paired with lightweight adversarial LoRA fine-tuning of the denoiser for watermark extraction. While consolidation slightly increases internal error, FARI delivers large gains in both speed and robustness: with ~20 minutes of fine-tuning on a single NVIDIA RTX A6000 GPU, it surpasses 50-step DDIM inversion on watermark-verification robustness while dramatically reducing inference time.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4509. Better Bounds for the Distributed Experts Problem
๐ฌ 10
โญ 4.50
๐ learning theory
๐ท๏ธ distributed algorithms learning with experts communication complexity reinforcement learning
In this paper, we study the distributed experts problem, where $n$ experts are distributed across $s$ servers for $T$ timesteps. The loss of each expert at each time $t$ is the $\ell_p$ norm of the vector that consists of the losses of the expert at each of the $s$ servers at time $t$. The goal is to minimize the regret $R$, i.e., the loss of the distributed protocol compared to the loss of the best expert, amortized over all $T$ timesteps, while using the minimum amount of communication. We give a protocol that achieves regret roughly $R\gtrsim\frac{1}{\sqrt{T}\cdot\text{poly}\log(nsT)}$, using $\mathcal{O}\left(\frac{n}{R^2}+\frac{s}{R^2}\right)\cdot\max(s^{1-2/p},1)\cdot\text{poly}\log(nsT)$ bits of communication, which improves on previous work.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4510. The Curious Case of In-Training Compression of State Space Models
๐ฌ 33
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ State Space Models Sequence modeling Model compression Dynamical Systems
State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for measuring the energy of each state, as well as for the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs during training, where only dimensions of high influence are identified and preserved. Our approach applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance.
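The control-theoretic machinery referenced here is classical balanced truncation: Gramians give the Hankel singular values, and states with low Hankel energy are discarded. A small NumPy/SciPy sketch for a stable discrete-time LTI system follows; the paper applies this lens during training, which this standalone routine does not capture.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, cholesky, svd

def balanced_truncation(A, B, C, rank):
    """Reduce a stable discrete-time LTI system x_{t+1} = A x_t + B u_t, y_t = C x_t
    to `rank` states, keeping directions with the largest Hankel singular values.
    Assumes the Gramians are positive definite (controllable and observable system)."""
    P = solve_discrete_lyapunov(A, B @ B.T)      # controllability Gramian
    Q = solve_discrete_lyapunov(A.T, C.T @ C)    # observability Gramian
    Lp = cholesky(P, lower=True)
    Lq = cholesky(Q, lower=True)
    U, s, Vt = svd(Lq.T @ Lp)                    # s = Hankel singular values
    S_inv_sqrt = np.diag(s[:rank] ** -0.5)
    T = Lp @ Vt[:rank].T @ S_inv_sqrt            # truncated balancing transform
    Ti = S_inv_sqrt @ U[:, :rank].T @ Lq.T
    return Ti @ A @ T, Ti @ B, C @ T, s          # reduced (A, B, C) and all HSVs
```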
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4511. MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance
๐ฌ 17
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement learning (RL) Large language models (LLMs) Memory Graph LLM-derived priors Sample Efficiency Sparse-Reward Environments
Reinforcement learning (RL) agents often face high sample complexity in sparse or delayed reward settings, due to limited prior knowledge. Conversely, large language models (LLMs) can provide subgoal structures, plausible trajectories, and abstract priors that support early learning. Yet heavy reliance on LLMs introduces scalability issues and risks dependence on unreliable signals, motivating ongoing efforts to integrate LLM guidance without compromising RL's autonomy. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early learning. This graph stores decision-relevant information, such as trajectory segments and subgoal decompositions, and is co-constructed from the agent's high-return experiences and LLM outputs, amortizing LLM queries into a persistent memory instead of relying on continuous real-time supervision. From this structure, we derive a utility signal that softly adjusts advantage estimation to refine policy updates without altering the underlying reward function. As training progresses, the agent's policy surpasses the initial LLM-derived priors, and the utility term decays, leaving long-term convergence guarantees intact. We show theoretically that this utility-based shaping improves early-stage learning in sparse-reward settings. Empirically, MIRA outperforms RL baselines and reaches returns comparable to methods that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4512. Diversity-Enhanced Reasoning for Subjective Questions
๐ฌ 17
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM subjective reasoning diversity-enhanced training
Large Reasoning Models (LRMs) with long chain-of-thought capabilities, optimized via reinforcement learning with verifiable rewards (RLVR), excel at **objective reasoning** tasks like mathematical problem solving and code generation.
However, RLVR is known for degrading generation diversity, which causes LRMs to fall short on **subjective reasoning** that has multiple answers depending on different role perspectives.
While recent studies recognize the importance of diversity-enhanced training in objective reasoning, limited attention has been given to subjective tasks.
In this paper, we find that subjective reasoning can be improved by introducing perspective diversity and token-level diversity, with the former providing a coherent scaffolding anchored to a real-world stakeholder group and the latter broadening the answer search space.
We propose MultiRole-R1, a diversity-enhanced training framework featuring an unsupervised data construction pipeline that synthesizes reasoning chains incorporating various role perspectives.
It also employs reinforcement learning via Group Relative Policy Optimization with reward shaping, taking diversity as a reward signal in addition to verifiable reward.
Trained solely on subjective tasks, MultiRole-R1 increases in-domain and out-of-domain accuracy by 14.1% and 7.64%, respectively, and even enhances performance on advanced math reasoning such as AIME 2024.
We further show that diversity is a more consistent indicator of accuracy than reasoning length.
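One simple instantiation of a token-level diversity signal within a GRPO group is a distinct-n style bonus, rewarding rollouts whose n-grams are not covered by the rest of the group; this is an illustrative assumption, not necessarily the shaping term used in MultiRole-R1.

```python
def diversity_bonus(group_responses, n=2):
    """Fraction of n-grams in each tokenized response that do not appear anywhere
    else in the group: higher means the rollout contributes novel token patterns."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    grams = [ngrams(r) for r in group_responses]
    bonuses = []
    for i, g in enumerate(grams):
        others = set().union(*(grams[:i] + grams[i + 1:])) if len(grams) > 1 else set()
        bonuses.append(len(g - others) / max(len(g), 1))
    return bonuses

# Usage sketch: total_reward = verifiable_reward + lambda_div * bonus  (per rollout)
```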
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4513. Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature
๐ฌ 22
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ task arithmetic model editing task vector
Task Arithmetic (TA) provides a modular and scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors, but existing approaches typically require external task data, which conflicts with TA's modularity and availability constraints like privacy concerns. We propose a data-free approach by framing representation drift regularization as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature (KFAC) to obtain practical regularizers. Our method is data-free, has constant complexity with respect to the number of tasks, and improves performance on TA benchmarks.
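To make the general idea tangible, the sketch below merges per-task weight matrices of a single linear layer under a Kronecker-factored curvature metric, assuming per-task factors $(A_i, G_i)$ are already available (e.g., from a data-free approximation). It illustrates how KFAC turns curvature-weighted combination into a solvable linear system for small layers; it is not the paper's algorithm.

```python
import numpy as np

def kfac_weighted_merge(weights, A_factors, G_factors):
    """Merge per-task weight matrices W_i (d_out x d_in) under the KFAC metric
    sum_i || G_i^{1/2} (W - W_i) A_i^{1/2} ||_F^2, with A_i (d_in x d_in) and
    G_i (d_out x d_out) symmetric positive definite. The minimizer solves
    (sum_i A_i kron G_i) vec(W) = sum_i vec(G_i W_i A_i); feasible for small layers."""
    d_out, d_in = weights[0].shape
    M = sum(np.kron(A, G) for A, G in zip(A_factors, G_factors))
    rhs = sum((G @ W @ A).reshape(-1, order="F")
              for W, A, G in zip(weights, A_factors, G_factors))
    w = np.linalg.solve(M, rhs)
    return w.reshape(d_out, d_in, order="F")
```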
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4514. Agentic Context Engineering: Learning Comprehensive Contexts for Self-Improving Language Models
๐ฌ 26
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Context Engineering Self-Improving LLMs
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation, modifying model inputs with instructions, strategies, or evidence, rather than weight updates.
While prior methods improve usability, they often suffer from a brevity bias, discarding domain-specific insights in favor of short summaries, and from context collapse, where iterative rewriting erodes details over time.
Building on the adaptive memory introduced by Dynamic Cheatsheet, we present ACE (**A**gentic **C**ontext **E**ngineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation.
ACE prevents collapse by applying structured, incremental updates that preserve detailed knowledge and scale with long-context models.
Across agentic and domain-specific benchmarks, ACE consistently outperforms strong baselines, improving application performance by 9.0\% while reducing adaptation latency and rollout cost.
Notably, ACE can adapt effectively without labeled supervision, instead leveraging natural execution feedback, and on the AppWorld leaderboard it matches the top-1-ranked production-level agent while using a smaller open-source model.
These results demonstrate that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4515. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
๐ฌ 12
โญ 4.50
๐ generative models
๐ท๏ธ Generative Model; Video Generation; World Modeling
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods.
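The two objectives are simple to write down: angular alignment as a cosine loss between projected diffusion features and frozen geometric features, and scale alignment as a regression from normalized diffusion features to the unnormalized geometric targets. The projection heads and tensor shapes below are assumptions.

```python
import torch
import torch.nn.functional as F

def geometry_forcing_losses(diff_feats, geo_feats, proj_head, scale_head):
    """diff_feats: intermediate video-diffusion features, shape (B, N, D_d);
    geo_feats: frozen features from a geometric foundation model, shape (B, N, D_g);
    proj_head / scale_head: learned maps from D_d to D_g (assumed linear layers)."""
    pred = proj_head(diff_feats)                           # map to geometric feature dim
    angular = 1.0 - F.cosine_similarity(pred, geo_feats, dim=-1).mean()
    # Scale alignment: regress unnormalized geometric features from normalized diffusion features.
    pred_scale = scale_head(F.normalize(diff_feats, dim=-1))
    scale = F.mse_loss(pred_scale, geo_feats)
    return angular, scale
```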
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4516. Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models
๐ฌ 22
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Adversarial Attacks Vision-Language Models Test-time Defense
Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.
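A minimal sketch of how the dominant shift direction could be estimated at test time: encode the input under a set of transformations and take the normalized mean feature shift. The transform set, the averaging estimator, and the correction step in the usage comment are assumptions, not the DBD pipeline.

```python
import torch

@torch.no_grad()
def estimate_defense_direction(encode, image, transforms):
    """Dominant direction of feature shifts under a set of input transformations.
    `encode` maps an image to a feature vector of shape (D,); `transforms` is a
    list of callables applied to the input (illustrative mean-shift estimator)."""
    base = encode(image)
    shifts = torch.stack([encode(t(image)) - base for t in transforms])  # (T, D)
    return torch.nn.functional.normalize(shifts.mean(0), dim=0)

# Usage sketch (assumption): robust_feat = base + alpha * defense_dir,
# with alpha chosen via a DB-score-like criterion.
```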
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4517. PhaseFormer: From Patches to Phases for Efficient and Effective Time Series Forecasting
๐ฌ 20
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ time series forecasting nonstationary efficiency
Periodicity is a fundamental characteristic of time series data and has long played a central role in forecasting. Recent deep learning methods strengthen the exploitation of periodicity by treating patches as basic tokens, thereby improving predictive effectiveness. However, their efficiency remains a bottleneck due to large parameter counts and heavy computational costs. This paper provides, for the first time, a clear explanation of why patch-level processing is inherently inefficient, supported by strong evidence from real-world data. To address these limitations, we introduce a phase perspective for modeling periodicity and present an efficient yet effective solution, PhaseFormer.
PhaseFormer features phase-wise prediction through compact phase embeddings and efficient cross-phase interaction enabled by a lightweight routing mechanism. Extensive experiments demonstrate that PhaseFormer achieves state-of-the-art performance with around 1k parameters, consistently across the evaluated benchmark datasets. Notably, it excels on large-scale and complex datasets, where models with comparable efficiency often struggle. This work marks a significant step toward truly efficient and effective time series forecasting.
Code is available at this repository: https://anonymous.4open.science/r/ICLR26-PhaseFormer-17678.
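A toy illustration of the phase perspective, assuming the period is known: fold the series so that each column collects one phase, then forecast each phase from its own history. PhaseFormer replaces the per-phase trend fit below with compact phase embeddings and a routing mechanism, so this is only a sketch of the viewpoint.

```python
import numpy as np

def phase_fold(series, period):
    """Fold a 1-D series into shape (n_cycles, period): column j collects all
    observations at phase j of the period."""
    n_cycles = len(series) // period
    return np.asarray(series[:n_cycles * period]).reshape(n_cycles, period)

def forecast_next_cycle(series, period):
    """Toy phase-wise forecaster: predict each phase of the next cycle with a
    least-squares trend fit over that phase's own history (one tiny model per phase)."""
    folded = phase_fold(series, period)              # (n_cycles, period)
    t = np.arange(folded.shape[0])
    preds = []
    for j in range(period):                          # independent per-phase trend fit
        slope, intercept = np.polyfit(t, folded[:, j], deg=1)
        preds.append(slope * folded.shape[0] + intercept)
    return np.array(preds)                           # next cycle, one value per phase
```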
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4518. Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
๐ฌ 26
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Multimodal Models Meta-learning Soft Prompts Test-Time Adaptation
Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new visual question answering (VQA) tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, does not always improve monotonically when increasing the number of examples. We hypothesize that this happens because the LMM is overwhelmed by extraneous information in the image embeddings that is irrelevant to the downstream task. To address this, we propose a meta-learning approach that induces few-shot capabilities in LMMs through a fixed set of soft prompts distilled from task-relevant visual features, which are adapted at test time using a small number of examples. We facilitate this distillation through an attention-mapper module that can be easily integrated with any LMM architecture and is jointly learned with soft prompts. Evaluation on the VL-ICL Bench shows that our method successfully achieves task adaptation in low-data regimes with just a few gradient steps, outperforming ICL by 21.2%. Comparisons with parameter-efficient finetuning methods demonstrate that meta-learning further enhances this adaptation by 7.7% for various VQA tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4519. SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows
๐ฌ 13
โญ 4.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Generative models Normalizing Flows Symmetries Equivariant Models
Deep generative models have recently garnered significant attention across various fields, from physics to chemistry, where sampling from unnormalized Boltzmann-like distributions represents a fundamental challenge. In particular, autoregressive models and normalizing flows have become prominent due to their appealing ability to yield closed-form probability densities. Moreover, it is well-established that incorporating prior knowledge, such as symmetries, into deep neural networks can substantially improve training performance. In this context, recent advances have focused on developing symmetry-equivariant generative models, achieving remarkable results.
Building upon these foundations, this paper introduces Symmetry-Enforcing Stochastic Modulation (SESaMo). Similar to equivariant normalizing flows, SESaMo enables the incorporation of inductive biases (e.g., symmetries) into normalizing flows through a novel technique called \textit{stochastic modulation}. This approach enhances the flexibility of the generative model by enforcing exact symmetries while, for the first time, enabling the model to learn broken symmetries during training.
Our numerical experiments benchmark SESaMo in different scenarios, including an 8-Gaussian mixture model and physically relevant field theories, such as the $\phi^4$ theory and the Hubbard model.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4520. Uncertainty-driven Embedding Convolution
๐ฌ 35
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Probabilistic embeddings Embedding convolution Uncertainty-aware similarity
Text embeddings are essential components in modern NLP pipelines. While numerous embedding models have been proposed, their performance varies across domains. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble weights based on embedding uncertainty, grounded in a Bayes-optimal solution under a surrogate loss. Additionally, UEC employs an uncertainty-aware similarity function that directly incorporates uncertainty into the similarity scoring, providing a theoretically grounded and efficient surrogate to distributional distances. Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
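A stripped-down version of uncertainty-driven combination, assuming each embedding model yields a mean vector and a per-dimension variance: combine by precision weighting and attenuate similarity by total uncertainty. The paper's Bayes-optimal weights and similarity surrogate are richer than this illustration.

```python
import numpy as np

def precision_weighted_combine(means, variances, eps=1e-8):
    """means, variances: lists of (D,) arrays, one per embedding model.
    Returns the precision-weighted mean and the combined per-dimension variance."""
    precisions = [1.0 / (v + eps) for v in variances]
    total_prec = np.sum(precisions, axis=0)
    combined_mean = np.sum([p * m for p, m in zip(precisions, means)], axis=0) / total_prec
    return combined_mean, 1.0 / total_prec

def uncertainty_aware_similarity(m1, v1, m2, v2):
    """Cosine similarity attenuated by total uncertainty (illustrative surrogate)."""
    cos = float(m1 @ m2 / (np.linalg.norm(m1) * np.linalg.norm(m2) + 1e-8))
    return cos / (1.0 + float(np.mean(v1) + np.mean(v2)))
```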
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4521. RECODE: A Benchmark for Research Code DEvelopment with Interactive Human Feedback
๐ฌ 31
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Code generation
Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative and feedback-driven nature of realistic workflows of scientific research development. To address this gap, we present RECODE, a benchmark of 102 tasks from research papers and repositories that evaluates LLMs through multi-turn interactions with human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4522. A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
๐ฌ 26
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Finetuning Large Language Models
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Reasoning Models to tackle complex tasks.
However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by generating numerous responses and learning from fragmented reward signals, blind to the overall reward patterns.
Fortunately, verifiable rewards make the natural language description of the reward function possible, and meanwhile, LLMs have demonstrated strong in-context learning ability.
This motivates us to explore if Large Reasoning Models can benefit from a **motivation** of the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning.
In this paper, we introduce ***M**otivation-**e**nhanced **R**einforcement **F**inetuning* (**MeRF**), an intuitive yet effective method enhancing reinforcement finetuning of LLMs by involving \emph{``telling LLMs rules of the game''}.
Specifically, **MeRF** directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective.
This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward.
Empirical evaluations demonstrate that **MeRF** achieves substantial performance gains over RLVR baseline.
Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.
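Mechanically, the motivation amounts to prepending a natural-language description of the verifiable reward to each rollout prompt; a trivial sketch follows, with illustrative wording that need not match the paper's template.

```python
def build_motivated_prompt(question, reward_spec):
    """Prepend an in-context description of the reward function to the task prompt."""
    return (
        "You are being optimized with the following scoring rules:\n"
        f"{reward_spec}\n\n"
        "Solve the problem so as to maximize this score.\n\n"
        f"Problem: {question}"
    )

# Illustrative reward specification for a math task (an assumption, not the paper's wording):
# "+1 if the final answer inside \\boxed{} is correct; +0.1 for a well-formed \\boxed{...}; 0 otherwise."
```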
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4523. Pyramid Patchification Flow for Visual Generation
๐ฌ 16
โญ 4.50
๐ generative models
๐ท๏ธ visual generation flow matching pyramidal patchification
Diffusion Transformers (DiTs) typically use the same patch size for $\operatorname{Patchify}$ across timesteps, enforcing a constant token budget throughout sampling. In this paper, we introduce Pyramidal Patchification Flow (PPFlow), which reduces the number of tokens for high-noise timesteps to improve the sampling efficiency. The idea is simple: use larger patches at higher-noise timesteps and smaller patches at lower-noise timesteps. The implementation is easy: share the DiT's transformer blocks across timesteps, and learn separate linear projections for different patch sizes in $\operatorname{Patchify}$ and $\operatorname{Unpatchify}$. Unlike Pyramidal Flow, which operates on pyramid representations, our approach operates over full latent representations, eliminating trajectory ``jump points'' and thus avoiding re-noising tricks for sampling. Training from pretrained SiT-XL/2 requires only $+8.9\%$ additional training FLOPs and delivers $2.02\times$ denoising speedups while preserving image generation quality; training from scratch achieves comparable sampling speedup, e.g., $2.04\times$ speedup in SiT-B. Training from the text-to-image model FLUX.1, PPFlow achieves $1.61 - 1.86 \times$ speedup from 512 to 2048 resolution with comparable quality.
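A minimal sketch of timestep-dependent patchification: one projection per patch size (a strided convolution is exactly a per-patch linear map), with a threshold rule assigning coarse patches to high-noise timesteps. Hidden size, patch sizes, and the switching rule are assumptions.

```python
import torch
import torch.nn as nn

class PyramidalPatchify(nn.Module):
    """Timestep-dependent patchify: larger patches (fewer tokens) at high noise,
    smaller patches at low noise, with a separate linear projection per size."""
    def __init__(self, in_ch=4, hidden=1152, patch_sizes=(4, 2)):
        super().__init__()
        self.patch_sizes = patch_sizes
        self.proj = nn.ModuleDict({
            str(p): nn.Conv2d(in_ch, hidden, kernel_size=p, stride=p) for p in patch_sizes
        })

    def forward(self, latents, t, t_switch=0.5):
        # Assumed rule: the high-noise half of the trajectory uses the coarse patch size.
        p = self.patch_sizes[0] if t >= t_switch else self.patch_sizes[1]
        tokens = self.proj[str(p)](latents)             # (B, hidden, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2), p     # (B, N_tokens, hidden)
```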
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4524. Cut Less, Fold More: Model Compression through the Lens of Projection Geometry
๐ฌ 13
โญ 4.50
๐ optimization
๐ท๏ธ model compression structured pruning model folding projection geometry
Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate over 1,000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training). We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-to-high compression. The gap narrows and occasionally reverses under specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.
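The geometric contrast is easy to reproduce on a random weight matrix: axis-aligned pruning keeps a subset of columns, while folding clusters columns and replaces each by its centroid, and the two can be compared at equal compressed size by parameter reconstruction error. This toy comparison is illustrative only and is not the paper's evaluation protocol.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_columns(W, k):
    """Axis-aligned projection: zero out all but the k largest-norm columns."""
    keep = np.argsort(np.linalg.norm(W, axis=0))[-k:]
    P = np.zeros_like(W)
    P[:, keep] = W[:, keep]
    return P

def fold_columns(W, k, seed=0):
    """Low-rank projection via clustering: replace each column by its cluster centroid."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(W.T)
    return km.cluster_centers_[km.labels_].T

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
k = 32  # both methods keep k "directions"
err_prune = np.linalg.norm(W - prune_columns(W, k))
err_fold = np.linalg.norm(W - fold_columns(W, k))
print(f"pruning error {err_prune:.2f} vs folding error {err_fold:.2f}")
```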
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4525. Non-Clashing Teaching in Graphs: Algorithms, Complexity, and Bounds
๐ฌ 16
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Machine teaching non-clashing teaching VC-dimension graphs parameterized complexity treedepth vertex cover number
Kirkpatrick et al. [ALT 2019] and Fallat et al. [JMLR 2023] introduced non-clashing teaching and proved that it is the most efficient batch machine teaching model satisfying the collusion-avoidance benchmark established in the seminal work of Goldman and Mathias [COLT 1993]. Recently, (positive) non-clashing teaching was thoroughly studied for balls in graphs, yielding numerous algorithmic and combinatorial results. In particular, Chalopin et al. [COLT 2024] and Ganian et al. [ICLR 2025] gave an almost complete picture of the complexity landscape of the positive variant, showing that it is tractable only for restricted graph classes due to the non-trivial nature of the problem and concept class.
In this work, we consider (positive) non-clashing teaching for closed neighborhoods in graphs. This concept class is not only extensively studied in various related contexts, but it also exhibits broad generality, as any finite binary concept class can be equivalently represented by a set of closed neighborhoods in a graph. In comparison to the works on balls in graphs, we provide improved algorithmic results, notably including FPT algorithms for more general classes of parameters, and we complement these results by deriving stronger lower bounds. Lastly, we obtain combinatorial upper bounds for wider classes of graphs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4526. Learning from Historical Activations in Graph Neural Networks
๐ฌ 26
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Deep Learning; Graph Neural Networks; Graph Pooling
Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains such as social networks, molecular chemistry, and more. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as \emph{historical graph activations}. This gap is particularly pronounced in cases where a node's representation can shift significantly over the course of many graph neural layers, and is worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HistoGraph, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, our HistoGraph leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HistoGraph consistently improves over traditional techniques, with particularly strong robustness in deep GNNs.
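A minimal reading of the two-stage aggregation: attend over each node's per-layer activation history first, then attend over nodes to form the graph descriptor. The module below is an assumption-laden sketch rather than the authors' layer (it omits, for instance, the graph-structure conditioning).

```python
import torch
import torch.nn as nn

class HistoPool(nn.Module):
    """Two-stage attention pooling over historical GNN activations.
    Input: H of shape (num_nodes, num_layers, dim), one feature per node per layer."""
    def __init__(self, dim):
        super().__init__()
        self.layer_score = nn.Linear(dim, 1)   # stage 1: layer-wise attention per node
        self.node_score = nn.Linear(dim, 1)    # stage 2: node-wise attention

    def forward(self, H):
        a_layer = torch.softmax(self.layer_score(H), dim=1)        # (N, L, 1)
        node_feat = (a_layer * H).sum(dim=1)                       # (N, dim)
        a_node = torch.softmax(self.node_score(node_feat), dim=0)  # (N, 1)
        return (a_node * node_feat).sum(dim=0)                     # graph descriptor (dim,)
```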
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4527. Transformers with Endogenous In-Context Learning: Bias Characterization and Mitigation
๐ฌ 39
โญ 4.50
๐ learning theory
๐ท๏ธ In-Context Learning Hidden Confounder Debiasing
In-context learning (ICL) enables pre-trained transformers (TFs) to perform few-shot learning across diverse tasks, fostering growing research into its underlying mechanisms. However, existing studies typically assume a causally-sufficient regime, overlooking spurious correlations and prediction bias introduced by hidden confounders (HCs). As HCs commonly exist in real-world cases, current ICL understandings may not align with actual data structures. To fill this gap, we contribute a pioneering theoretical analysis of a novel problem setup termed ICL-HC, which characterizes the effect of HCs on the pre-training of TFs and the subsequent ICL prediction. Our theoretical results show that pre-trained TFs exhibit a prediction bias proportional to the confounding strength. To mitigate such prediction bias, we further propose a gradient-free debiasing method named Double-Debiasing (DDbias) that collects and prompts with extremely few unconfounded examples, correcting pre-trained TFs toward unbiased ICL predictions. Extensive experiments on regression tasks across diverse TF architectures and data generation protocols verify both our theoretical results and the effectiveness of the proposed DDbias method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4528. Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection
๐ฌ 30
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Machine-generated Text Detection Markov-aware Calibration Raw Detection Score
While machine-generated texts (MGTs) offer great convenience, they also pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGT generation process. To address this, we theoretically and empirically reveal two regularities of contextual detection scores that can aid calibration: Neighbor Similarity and Initial Instability. We then propose a Markov-informed score calibration strategy that models these relationships using Markov random fields, and implements it as a lightweight component via a mean-field approximation, allowing our method to be seamlessly integrated into existing detectors. Extensive experiments in various real-world scenarios, such as cross-LLM and paraphrasing attacks, demonstrate significant gains over baselines with negligible computational overhead. The code is available at \url{https://anonymous.4open.science/r/MRF-Enhance}.
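A small sketch of how mean-field smoothing over a chain-structured MRF on token scores could look, with the Initial Instability handled by down-weighting the first few tokens; the coupling strength, warm-up rule, and final aggregation are assumptions rather than the paper's exact update.

```python
import numpy as np

def calibrate_scores(scores, coupling=0.5, warmup=5, warmup_weight=0.3, iters=10):
    """Mean-field smoothing of token-level detection scores on a chain MRF:
    each token's score is pulled toward its neighbours; early tokens are trusted less."""
    s = np.asarray(scores, dtype=float)
    w = np.ones_like(s)
    w[:min(warmup, len(s))] = warmup_weight          # Initial Instability: down-weight early tokens
    q = s.copy()
    for _ in range(iters):
        left = np.roll(q, 1)
        left[0] = q[0]
        right = np.roll(q, -1)
        right[-1] = q[-1]
        q = (w * s + coupling * (left + right)) / (w + 2 * coupling)
    return float(np.mean(q))                         # calibrated document-level score
```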
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4529. Reliable Evaluation of MRI Motion Correction: Dataset and Insights
๐ฌ 10
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ 3D MRI motion correction Accelerated MRI Dataset Evaluation approach Image reconstruction
Correcting motion artifacts in scientific and medical imaging is important, as they significantly impact image quality.
However, evaluating deep learning-based and classical motion correction methods remains fundamentally difficult due to the lack of accessible ground-truth target data.
To address this challenge, we study three evaluation approaches: real-world evaluation based on reference scans, simulated motion, and reference-free evaluation, each with its merits and shortcomings.
To enable evaluation with real-world motion artifacts, we release PMoC3D, a dataset consisting of unprocessed $\textbf{P}$aired $\textbf{Mo}$tion-$\textbf{C}$orrupted $\textbf{3D}$ brain MRI data.
To advance evaluation quality, we introduce MoMRISim, a feature-space metric trained for evaluating motion reconstructions.
We assess each evaluation approach and find real-world evaluation together with MoMRISim, while not perfect, to be most reliable.
Evaluation based on simulated motion systematically exaggerates algorithm performance, and reference-free evaluation overrates oversmoothed deep learning outputs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4530. PE-SGD: Differentially Private Deep Learning via Evolution of Gradient Subspace for Text
๐ฌ 22
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Differential Privacy Private Evolution Generation Model
Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants like DP-Adam ensure data privacy by injecting noise into per-sample gradients. Although effective with large private datasets, their performance degrades significantly when private training data is limited. Recent works leverage public data to learn a gradient subspace and project noisy private sample gradients onto this subspace, achieving improved performance. However, they have overlooked two crucial aspects: the limitation of using a fixed projection subspace throughout training and the importance of choosing where to inject noise. Therefore, we propose Private Evolution aided Stochastic Gradient Descent (***PE-SGD***), a differentially private training framework effective for scenarios with limited private data. ***PE-SGD*** uses an evolutionary strategy to update the gradient projection subspace during the training process. We also identify a more effective noise injection point for better alignment between the approximate DP-protected gradient and the real private gradient. This enables ***PE-SGD*** to outperform DP-SGD and other baselines, particularly in the regime of limited private data and small privacy budgets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4531. RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours
๐ฌ 16
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Precipitation Forecasting Probabilistic Forecasting High-Resolution Forecasting
We present a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, overcoming the limitations of radar-only deep learning models with short forecast lead times. Our model efficiently integrates multiple data sources - including radar, satellite, and physics-based numerical weather prediction (NWP) - while capturing long-range interactions, resulting in accurate forecasts with robust uncertainty quantification through consistent probabilistic maps. Featuring a compact architecture, it enables more efficient training and faster inference than existing models. Extensive experiments demonstrate that our model surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe, ensuring a balance between accuracy, interpretability, and computational efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4532. DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD
๐ฌ 19
โญ 4.50
๐ optimization
๐ท๏ธ Transformer Deep Normalization SGD
Transformers have become the de facto backbone of modern deep learning, yet their training typically demands an advanced optimizer with adaptive learning rates like AdamW, rather than momentum SGDW (mSGDW). Previous works show that this is mainly due to a heavy-tailed distribution of the gradients. In this paper, we introduce a Deeply Normalized Transformer (DNT), which is meticulously engineered to overcome this limitation, enabling seamless training with vanilla mSGDW while yielding performance comparable to Transformers trained via AdamW. To be specific, in DNT, we strategically integrate normalization techniques at proper positions in the Transformer to effectively modulate the Jacobian matrices of each layer, balance the influence of weights, activations, and their interactions, and thus keep the gradient distributions concentrated. We provide both theoretical justifications of the normalization technique used in our DNT and extensive empirical evaluation on two popular Transformer architectures to validate that: a) DNT outperforms its counterparts (i.e., ViT and GPT), and b) DNT can be effectively trained with vanilla mSGDW.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4533. Embedding-Based Context-Aware Reranker
๐ฌ 20
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Passage reranking Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant evidence from a corpus to support downstream generation.
The common practice of splitting a long document into multiple shorter passages enables finer-grained and targeted information retrieval.
However, it also introduces challenges when a correct retrieval would require inference across passages, such as resolving coreference, disambiguating entities, and aggregating evidence scattered across multiple sources.
Many state-of-the-art (SOTA) reranking methods, despite utilizing powerful large pretrained language models with potentially high inference costs, still neglect the aforementioned challenges.
Therefore, we propose Embedding-Based Context-Aware Reranker (EBCAR), a lightweight reranking framework operating directly on embeddings of retrieved passages, with enhanced cross-passage understanding through the structural information of the passages and a hybrid attention mechanism that captures both high-level interactions across documents and low-level relationships within each document.
We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4534. Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks
๐ฌ 16
โญ 4.50
๐ optimization
๐ท๏ธ Optimization Mode Connectivity Generalization Entropy Curvature Flatness Sharpness
Modern neural networks exhibit a striking property: basins of attraction in the loss landscape are often connected by low-loss paths, yet optimization dynamics generally remain confined to a single convex basin and rarely explore intermediate points. We resolve this paradox by identifying entropic barriers arising from the interplay between curvature variations along these paths and noise in optimization dynamics. Empirically, we find that curvature systematically rises away from minima, producing effective forces that bias noisy dynamics back toward the endpoints, even when the loss remains nearly flat. These barriers persist longer than energetic barriers, shaping the late-time localization of solutions in parameter space. Our results highlight the role of curvature-induced entropic forces in governing both connectivity and confinement in deep learning landscapes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4535. Recurrent Action Transformer with Memory
๐ฌ 13
โญ 4.50
๐ reinforcement learning
๐ท๏ธ RL Offline RL Memory Transformers POMDP
Transformers have become increasingly popular in offline reinforcement learning (RL) due to their ability to treat agent trajectories as sequences, reframing policy learning as a sequence modeling task. However, in partially observable environments (POMDPs), effective decision-making depends on retaining information about past events - something that standard transformers struggle with due to the quadratic complexity of self-attention, which limits their context length. One solution to this problem is to extend transformers with memory mechanisms. We propose the Recurrent Action Transformer with Memory (RATE), a novel transformer-based architecture for offline RL that incorporates a recurrent memory mechanism designed to regulate information retention. We evaluate RATE across a diverse set of environments: memory-intensive tasks (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory, and POPGym), as well as standard Atari and MuJoCo benchmarks. Our comprehensive experiments demonstrate that RATE significantly improves performance in memory-dependent settings while remaining competitive on standard tasks across a broad range of baselines. These findings underscore the pivotal role of integrated memory mechanisms in offline RL and establish RATE as a unified, high-capacity architecture for effective decision-making over extended horizons.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4536. BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images
๐ฌ 17
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Generative Local Forgery Detection Information-Theoretic Gradient Fingerprints
We propose BioTamperNet, a novel framework for detecting duplicated regions in tampered biomedical images, leveraging affinity-guided attention inspired by State Space Model (SSM) approximations. Existing forensic models, primarily trained on natural images, often underperform on biomedical data where subtle manipulations can compromise experimental validity. To address this, BioTamperNet introduces an affinity-guided self-attention module to capture intra-image similarities and an affinity-guided cross-attention module to model cross-image correspondences. Our design integrates lightweight SSM-inspired linear attention mechanisms to enable efficient, fine-grained localization. Trained end-to-end, BioTamperNet simultaneously identifies tampered regions and their source counterparts. Extensive experiments on benchmark bio-forensic datasets demonstrate significant improvements over competitive baselines in accurately detecting duplicated regions. All source code and datasets will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4537. Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
๐ฌ 19
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ long-term memory conversation retrieval-augmented generation
Evaluating the abilities of large language models (LLMs) on tasks that require long-term memory, and thus long-context reasoning, for example in conversational settings, is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M-token context windows (with and without retrieval augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4538. Metis: Training LLMs with FP4 Quantization
๐ฌ 13
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ FP4 Full Quantization Training LLM
This work identifies anisotropy in the singular value spectra of parameters, activations, and gradients as the fundamental barrier to low-bit training of large language models (LLMs). These spectra are dominated by a small fraction of large singular values, inducing wide numerical ranges that cause quantization bias and severe spectral distortion, ultimately degrading training performance. This work presents \emph{Metis}, a spectral-domain quantization framework that partitions anisotropic spectra into narrower sub-distributions for independent quantization, thereby reducing errors and preserving spectral structure. To minimize overhead, Metis leverages two key properties of the dominant spectral subspace: preservation under sparse random sampling and preservation under random projection, reducing the decomposition cost to a negligible level. On LLaMA-3 8B trained with 100B tokens, Metis enables robust W4A4G4 training with FP4 quantization of weights, activations, and gradients, yielding only a 0.4\% training loss gap and a 0.1\% degradation in downstream accuracy relative to BF16. Beyond matching BF16 fidelity, Metis also surpasses our implementation of Nvidia's recently announced (yet to be publicly released) FP4 recipe, consistently achieving lower loss and higher downstream accuracy while incurring significantly lower computational overhead.
The code implementation for Metis is available at: \url{https://anonymous.4open.science/r/Metis-quantization-644B}.
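A minimal sketch of the spectral "split then quantize" idea, under stated assumptions: symmetric uniform 4-bit fake quantization stands in for FP4, and a plain SVD replaces the paper's cheaper randomized decomposition.

```python
# Sketch only: symmetric uniform 4-bit "fake" quantization stands in for FP4, and
# a plain SVD replaces the paper's cheaper randomized/sampled decomposition.
import torch

def fake_quant(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp_min(1e-12) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def spectral_split_quant(w: torch.Tensor, k: int = 8, bits: int = 4) -> torch.Tensor:
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    head = (u[:, :k] * s[:k]) @ vh[:k]   # dominant, wide-range spectral component
    tail = w - head                      # remaining, narrow-range component
    # Quantizing the two sub-distributions independently keeps each numerical
    # range tight, which is the core idea behind partitioning the spectrum.
    return fake_quant(head, bits) + fake_quant(tail, bits)

w = torch.randn(256, 256)
w[:, 0] *= 50.0                          # inject a dominant singular direction
err_direct = ((fake_quant(w) - w).norm() / w.norm()).item()
err_split = ((spectral_split_quant(w) - w).norm() / w.norm()).item()
print(f"relative error, direct: {err_direct:.3f}  split: {err_split:.3f}")
```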
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4539. DTP: Delta-Guided Two Stage Pruning for Mamba-based Multimodal Large Language Models
๐ฌ 36
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mamba Multimodal Large Language Models Token Pruning Efficiency Interpretability
Multimodal large language models built on the Mamba architecture offer efficiency advantages, yet remain hampered by redundant visual tokens that inflate inference cost, with the prefill stage accounting for the majority of total inference time. We introduce Delta-guided Two-stage Pruning (DTP), a method that progressively reduces token redundancy through selective pruning at an early layer and complete pruning at a late layer. Unlike Transformer-oriented pruning methods, our approach derives token importance directly from Mamba's internal parameters. The statistical distribution of these importance scores, combined with implicit attention patterns, then provides the basis for determining both the pruning layers and the tokens to be removed. Extensive evaluation across diverse benchmarks shows that DTP cuts computation by nearly 50\%, maintains higher task performance than existing pruning methods, and further achieves over a 35\% reduction in prefill latency. Beyond efficiency, our analysis reveals previously underexplored behaviors of visual tokens within Mamba layers, suggesting a principled perspective for designing future pruning techniques in Mamba-based Multimodal Large Language Models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4540. ACE-Bench: Benchmarking Agentic Coding in End-to-End Development of Complex Features
๐ฌ 30
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Agentic Coding Benchmark Large Language Models
Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating the evaluation coverage. To address such issues, we propose ACE-Bench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. ACE-Bench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach can identify feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring the proper functioning of other features after the separation. Using this framework, we curated 212 challenging evaluation tasks and 889 executable environments from 16 open-source repositories in the first version of our benchmark. Empirical evaluation reveals that a state-of-the-art agent such as Claude 4 Sonnet with the OpenHands framework, which achieves a 70.4% resolved rate on SWE-bench, succeeds on only 7.5% of tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, ACE-Bench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of constructed environments also makes our method potentially valuable for agent training. Our data and code will be publicly released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4541. SONIC: Spectral Oriented Neural Invariant Convolutions
๐ฌ 12
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Spectral Neural Networks Spectral Parameterization Resolution Invariance State-Space Models Spectral Factorization Convolution Alternatives Oriented Filters Global Receptive Fields Robust Representation Learning
Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4542. InnoGym: Benchmarking the Innovation Potential of AI Agents
๐ฌ 31
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ innovation agent benchmark
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present \textbf{InnoGym}, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide \textbf{iGym}, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4543. Taming Polysemanticity in LLMs: Theory-Grounded Feature Recovery via Sparse Autoencoders
๐ฌ 32
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ sparse autoencoder; training dynamics; superposition; feature learning
We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models.
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. We rethink this problem from the perspective of neuron activation frequencies, and through controlled experiments, we identify a striking phenomenon we term neuron resonance: neurons reliably learn monosemantic features when their activation frequency matches the feature's occurrence frequency in the data.
Building on this finding, we introduce a new SAE training algorithm based on ``bias adaptation'', a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate its superior performance against benchmark methods when applied to LLMs with up to 2 billion parameters. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees and practical effectiveness for LLM interpretation.
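A minimal sketch of the bias-adaptation idea for a ReLU sparse autoencoder, assuming a simple proportional update and a single neuron group; the paper's Group Bias Adaptation and its exact update rule are not reproduced here.

```python
# Sketch only: a single neuron group and a proportional bias update are assumptions;
# the paper's Group Bias Adaptation algorithm is not reproduced here.
import torch
import torch.nn.functional as F

d, m, lr_bias = 64, 256, 0.01
W_enc = torch.randn(m, d) / d ** 0.5
W_dec = torch.randn(d, m) / m ** 0.5
b_enc = torch.zeros(m)
target_freq = torch.full((m,), 0.05)     # desired activation frequency per neuron

for step in range(200):
    x = torch.randn(512, d)              # stand-in for a batch of LLM activations
    acts = F.relu(x @ W_enc.T + b_enc)   # encoder activations
    freq = (acts > 0).float().mean(0)    # empirical activation frequency this batch
    # Bias adaptation: lower the bias of neurons firing too often, raise it for
    # neurons firing too rarely, so frequencies drift toward the target.
    b_enc += lr_bias * (target_freq - freq)
    recon = acts @ W_dec.T               # reconstruction (weight updates omitted here)
    loss = F.mse_loss(recon, x)
```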
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4544. RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration
๐ฌ 17
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image restoration generative models low-level vision
The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. Visual autoregressive modeling (VAR), a recently introduced approach for image generation, performs scale-space autoregression and achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. Moreover, our analysis reveals that coarse scales in VAR primarily capture degradations while finer scales encode scene detail, simplifying the restoration process. Motivated by this, we propose RestoreVAR, a novel VAR-based generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $\mathbf{10\times}$ faster inference. To optimally exploit the advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities. The code will be made publicly available after the review process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4545. Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation
๐ฌ 28
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Generative Learning Imaging
Cross-modal image translation remains brittle and inefficient. Standard diffusion approaches often rely on a single, global linear transfer between domains. We find that this shortcut forces the sampler to traverse off-manifold, high-cost regions, inflating the correction burden and inviting semantic drift. We refer to this shared failure mode as fixed-schedule domain transfer. In this paper, we embed domain-shift dynamics directly into the generative process. Our model predicts a spatially varying mixing field at every reverse step and injects an explicit, target-consistent restoration term into the drift. This in-step guidance keeps large updates on-manifold and shifts the modelโs role from global alignment to local residual correction. We provide a continuous-time formulation with an exact solution form and derive a practical first-order sampler that preserves marginal consistency. Empirically, across translation tasks in medical imaging, remote sensing, and electroluminescence semantic mapping, our framework improves structural fidelity and semantic consistency while converging in fewer denoising steps.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4546. Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks
๐ฌ 33
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ spiking neural networks event dataset snn attacks adversarial attacks
Spiking neural networks (SNNs) compute with discrete spikes and exploit temporal structure, yet most adversarial attacks change intensities or event counts instead of timing. We study a timing-only adversary that retimes existing spikes while preserving spike counts and amplitudes in event-driven SNNs, thus remaining rate-preserving. We formalize a capacity-1 spike-retiming threat model with a unified trio of budgets: per-spike jitter $\mathcal{B}_{\infty}$, total delay $\mathcal{B}_{1}$, and tamper count $\mathcal{B}_{0}$. Feasible adversarial examples must satisfy timeline consistency and non-overlap, which makes the search space discrete and constrained. To optimize such retimings at scale, we use projected-in-the-loop (PIL) optimization: shift-probability logits yield a differentiable soft retiming for backpropagation, and a strict projection in the forward pass produces a feasible discrete schedule that satisfies capacity-1, non-overlap, and the chosen budget at every step. The objective maximizes task loss on the projected input and adds a capacity regularizer together with budget-aware penalties, which stabilizes gradients and aligns optimization with evaluation. Across event-driven benchmarks (CIFAR10-DVS, DVS-Gesture, N-MNIST) and diverse SNN architectures, we evaluate under binary and integer event grids and a range of retiming budgets, and also test models trained with timing-aware adversarial training designed to counter timing-only attacks. For example, on DVS-Gesture the attack attains high success (over 90\%) while touching fewer than 2\% of spikes under $\mathcal{B}_{0}$. Taken together, our results show that spike retiming is a practical and stealthy attack surface that current defenses struggle to counter, providing a clear reference for temporal robustness in event-driven SNNs.
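A simplified sketch of a budget-aware projection for candidate retimings, assuming only the three budgets above; the capacity-1, timeline-consistency, and non-overlap constraints that the full PIL projection enforces are ignored here.

```python
# Simplified sketch: project candidate per-spike shifts onto the three budgets.
# The capacity-1, timeline-consistency, and non-overlap constraints of the full
# threat model are ignored here for brevity.
import numpy as np

def project_shifts(delta, b_inf=2.0, b_one=50.0, b_zero=20):
    delta = np.clip(delta, -b_inf, b_inf)           # per-spike jitter budget B_inf
    if np.count_nonzero(delta) > b_zero:            # tamper-count budget B_0
        keep = np.argsort(np.abs(delta))[-b_zero:]
        mask = np.zeros(delta.shape, dtype=bool)
        mask[keep] = True
        delta = np.where(mask, delta, 0.0)
    total = np.abs(delta).sum()
    if total > b_one:                               # total-delay budget B_1
        delta = delta * (b_one / total)
    return delta

shifts = np.random.randn(200) * 3.0                 # candidate retimings (e.g., in ms)
feasible = project_shifts(shifts)
print(np.abs(feasible).max(), np.abs(feasible).sum(), np.count_nonzero(feasible))
```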
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4547. Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference
๐ฌ 27
โญ 4.50
๐ generative models
๐ท๏ธ LLM; Backdoor attack; Backdoor Elimination.
Backdoor attacks pose severe security threats to large language models (LLMs), where a model behaves normally under benign inputs but produces malicious outputs when a hidden trigger appears. Existing backdoor removal methods typically assume prior knowledge of triggers, access to a clean reference model, or rely on aggressive finetuning configurations, and are often limited to classification tasks. However, such assumptions fall apart in real-world generative LLM settings. In this work, we propose a new framework for purifying **generative LLM** without any prior trigger knowledge or clean references. Through systematic sanity checks, we find that backdoor associations are redundantly encoded across MLP layers, while attention modules primarily amplify trigger signals without establishing the behavior. Leveraging this insight, we shift the focus from isolating specific backdoor triggers to cutting off the trigger-behavior associations, and design an immunization-inspired elimination approach: by constructing multiple synthetic backdoored variants of the given suspicious model, each trained with different malicious trigger-behavior pairs, and contrasting them with their clean counterparts. The recurring modifications across variants reveal a shared **"backdoor signature"**, analogous to antigens in a virus. Guided by this signature, we neutralize highly suspicious components in LLM and apply lightweight finetuning to restore its fluency, producing purified models that withstand diverse backdoor attacks and threat models while preserving generative capability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4548. MergePRAG: Orthogonal Merging of Passage-experts for Multi-hop Parametric RAG
๐ฌ 26
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multi-hop reasoning Knowledge enhancement Retrieval-augmented generation Hypernetwork-based expert generation
Large language models (LLMs) can be enhanced with external knowledge through two dominant approaches: (1) **retrieval-augmented generation (RAG)**, which supplements LLMs with in-context retrieved passages, and (2) **parametric knowledge adaptation (PKA)**, which directly updates model parameters with new domain knowledge. Recently, parametric RAG (PRAG) has emerged as a promising framework, extending RAG by translating retrieved passages into parameter updates, thereby mitigating inefficiency and noise sensitivity inherent to RAG. However, existing PRAG methods remain limited to single-pass retrieval, falling short of the \textbf{multi-hop RAG} setting that requires iterative retrieval and reasoning.
We propose **MergePRAG** (*Orthogonal Merging of Passage-experts for Multi-hop PRAG*), a novel framework that sequentially integrates retrieved passages into LLM parameters through a continual merging mechanism, which is advanced by two key proposals: (1) **orthogonal merging** using the Gram-Schmidt process to minimize conflicts between "passage experts", and (2) **critical-layer parameterization** to efficiently encode in-context passages. Experiments on multi-hop open-domain QA and reasoning-aware knowledge editing show that MergePRAG consistently outperforms both standard and state-of-the-art RAGs as well as existing parametric adaptation methods, achieving superior effectiveness and efficiency. All datasets and code will be released at https://anonymous.4open.science/r/MhQA_hypernetwork-B31F.
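A minimal sketch of Gram-Schmidt-style orthogonal merging, assuming each passage expert is represented by one flattened parameter delta; the paper's per-layer handling and critical-layer parameterization are omitted.

```python
# Sketch only: each passage expert is represented as one flattened parameter delta;
# per-layer handling and the critical-layer parameterization are omitted.
import torch

def orthogonal_merge(expert_deltas):
    merged = torch.zeros_like(expert_deltas[0])
    basis = []
    for delta in expert_deltas:
        residual = delta.clone()
        for b in basis:                        # Gram-Schmidt: remove components along
            residual -= (residual @ b) * b     # previously merged experts
        norm = residual.norm()
        if norm > 1e-8:
            basis.append(residual / norm)
        merged += residual                     # only the conflict-free part is added
    return merged

experts = [torch.randn(1024) for _ in range(4)]
merged = orthogonal_merge(experts)
# Later experts contribute only directions orthogonal to earlier ones, so merging
# a new passage cannot overwrite what earlier passages already encoded.
```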
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4549. Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization
๐ฌ 19
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Reinforcement Learning Image Aesthetic Assessment
Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start. After teaching the model to generate structured explanations prior to scoring, we then employ Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. Additional ablation studies validate Aes-R1's robust generalization under limited supervision and in out-of-distribution scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4550. OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot
๐ฌ 20
โญ 4.50
๐ generative models
๐ท๏ธ Pruning Diffusion Model
Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents \textit{OBS-Diff}, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
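A minimal sketch of a timestep-aware Hessian for OBS-style pruning, assuming the standard activation-based calibration Hessian and an illustrative logarithmic weight; the paper's exact weighting formula and group-wise sequential schedule are not reproduced.

```python
# Sketch only: the standard X^T X calibration Hessian with an illustrative
# logarithmic timestep weight; the paper's exact schedule and group-wise
# sequential pruning are not reproduced.
import math
import torch

def timestep_weight(t: int, T: int) -> float:
    # Largest at the earliest (noisiest) step t = T, decreasing logarithmically.
    return math.log(1.0 + t) / math.log(1.0 + T)

def accumulate_hessian(acts_per_step, T):
    d = acts_per_step[0].shape[1]
    H = torch.zeros(d, d)
    for t, X in zip(range(T, 0, -1), acts_per_step):   # X: (n_tokens, d) at step t
        H += timestep_weight(t, T) * (X.T @ X) / X.shape[0]
    return H + 1e-4 * torch.eye(d)                     # damping for invertibility

T = 10
calib_acts = [torch.randn(128, 32) for _ in range(T)]  # stand-in calibration activations
H = accumulate_hessian(calib_acts, T)
```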
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4551. Rethinking Radiology Report Generation: From Narrative Flow to Topic-Guided Findings
๐ฌ 30
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Radiology report generation large-language models chest X-rays multi-modal alignment
Vision-Language Models (VLMs) for radiology report generation are typically trained to mimic the narrative flow of human experts. However, we identify a potential limitation in this conventional paradigm. We hypothesize that optimizing for narrative coherence encourages models to rely on linguistic priors and inter-sentence correlations, which can weaken their grounding in direct visual evidence and lead to factual inaccuracies. To investigate this, we design a controlled experiment demonstrating that as textual context increases, a model's reliance on the input image systematically decays. We propose LLaVA-TA (Topic-guided and Anatomy-aware), a new fine-tuning framework that directly addresses this challenge by re-engineering the generation process. Instead of producing a linear narrative, LLaVA-TA decomposes the report into a set of independent, clinically-relevant topics. By training the model to generate a discrete finding for each topic conditioned on both the full image and its corresponding anatomical region, we reduce the model's reliance on narrative flow and enforce stricter visual grounding. Our experiments show that LLaVA-TA sets a new state of the art on the MIMIC-CXR dataset, significantly improving clinical accuracy on metrics like RadGraph F1 (from 29.4 to 44.0) and CheXpert F1-14 (from 39.5 to 71.5) over strong baselines. Our work demonstrates that dismantling a report's narrative structure to enforce independent, visually-grounded observations is a crucial and effective step toward building more accurate and reliable medical VLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4552. SOG: One LLM Token for Explicit Graph Structural Understanding
๐ฌ 19
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ LLM for Graph Graph Structure Learning Structure Hallucination
Large language models show great potential in unstructured data understanding, but still face significant challenges with graphs due to their structural hallucination. Existing approaches mainly either verbalize graphs into natural language, which leads to excessive token consumption and scattered attention, or transform graphs into trainable continuous embeddings (i.e., soft prompts), which exhibit severe misalignment with the original text tokens. To solve this problem, we propose to incorporate one special token to fully represent the \textbf{\underline{S}}tructure \textbf{\underline{O}}f \textbf{\underline{G}}raph (SOG) within a unified token space, facilitating explicit topology input and structural information sharing. Specifically, we propose a topology-aware structural tokenizer that maps each graph topology into a highly selective single token. Afterwards, we construct a set of hybrid structure Question-Answering corpora to align the new structural tokens with existing text tokens. With this approach, SOG empowers LLMs to understand, generate, and reason in a concise and accurate manner. Extensive experiments on five graph-level benchmarks demonstrate the superiority of our method, achieving a performance improvement of 9.9-41.4\% compared to the baselines while exhibiting interpretability and consistency. Furthermore, our method provides a flexible extension to node-level tasks, enabling both global and local structural understanding. The codebase is publicly available\footnote{The code of our project is available at \href{https://anonymous.4open.science/r/SOG-8432}{https://anonymous.4open.science/r/SOG-8432}.}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4553. Implicit 4D Gaussian Splatting for Fast Motion with Large Inter-Frame Displacements
๐ฌ 26
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 4D Gaussian splatting 4D reconstruction Dynamic rendering
Recent 4D Gaussian Splatting (4DGS) methods often fail under fast motion with large inter-frame displacements, where Gaussian attributes are poorly learned during training, and fast-moving objects are often lost from the reconstruction. In this work, we introduce Spatiotemporal Position Implicit Network for 4DGS, coined SPIN-4DGS, which learns Gaussian attributes from explicitly collected spatiotemporal positions rather than modeling temporal displacements, thereby enabling more faithful splatting under fast motions with large inter-frame displacements. To avoid the heavy memory overhead of explicitly optimizing attributes across all spatiotemporal positions, we instead predict them with a lightweight feed-forward network trained under a rasterization-based reconstruction loss. Consequently, SPIN-4DGS learns shared representations across Gaussians, effectively capturing spatiotemporal consistency and enabling stable high-quality Gaussian splatting even under challenging motions. Across extensive experiments, SPIN-4DGS consistently achieves higher fidelity under large displacements, with clear improvements in PSNR and SSIM on challenging sports scenes from the CMU Panoptic dataset. For example, SPIN-4DGS notably outperforms the strongest baseline, D3DGS, by achieving +1.83 higher PSNR on the Basketball scene.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4554. Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models
๐ฌ 34
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Large Language Models Retrieval-Augmented Generation Knowledge Conflict Controllable Generation Knowledge Reliance
Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose CK-PLUG, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on LLaMA3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at: https://anonymous.4open.science/r/CK-PLUG-Ano-8E62
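A simplified sketch of the confidence-gain idea: compare next-token entropy with and without the retrieved context, and reduce reliance on the context when the gain is negative. The blending rule and the single tuning parameter below are assumptions, not CK-PLUG's exact token-level adjustment.

```python
# Simplified sketch: detect a negative confidence gain (context raises entropy)
# and blend the two next-token distributions; alpha is the single tuning knob.
# CK-PLUG's actual per-token adjustment rule is not reproduced here.
import torch
import torch.nn.functional as F

def entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(-1)

def adjusted_distribution(logits_param, logits_ctx, alpha=0.5):
    gain = entropy(logits_param) - entropy(logits_ctx)   # > 0: context adds confidence
    p_param = F.softmax(logits_param, dim=-1)
    p_ctx = F.softmax(logits_ctx, dim=-1)
    if gain < 0:                                         # knowledge conflict detected
        return alpha * p_ctx + (1 - alpha) * p_param     # alpha tunes context reliance
    return p_ctx

vocab = 100
logits_without_ctx = torch.randn(vocab)                  # stand-in model outputs
logits_with_ctx = torch.randn(vocab)
p_next = adjusted_distribution(logits_without_ctx, logits_with_ctx, alpha=0.3)
```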
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4555. Diverse Dictionary Learning
๐ฌ 23
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Dictionary Learning Identifiability Theory
Given only observational data $X = g(Z)$, where both the latent variables $Z$ and the generating process $g$ are unknown, recovering $Z$ is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving uncertainty about how to reliably understand the hidden world. To make identifiability *actionable* in real-world scenarios, we take a complementary view: in general settings where full identifiability is unattainable, *what can still be recovered with guarantees*, and *what biases could be universally adopted*? We introduce the problem of *diverse dictionary learning* to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, are still identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as *genus-differentia* definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4556. Hierarchical Prototype Learning for Semantic Segmentation
๐ฌ 11
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Semantic Segmentation Contrastive Learning Prototypical Networks
Conventional semantic segmentation methods often fail to distinguish fine-grained parts within the same object because of missing links between part-level cues and object-level semantics. Inspired by how humans recognize objects, which involves first identifying them as a whole and then distinguishing their parts, we propose a hierarchical prototype-based segmentation method called Hierarchical Prototype Segmentation (HiPoSeg). HiPoSeg builds a structured prototype space that captures both abstract object-level representations and detailed part-level features, enabling consistent alignment between levels. The model leverages a hierarchical contrastive learning strategy to structure semantic representations across levels, encouraging both intra-level discrimination and cross-level consistency. Experiments on standard benchmarks including Cityscapes, ADE20K, Mapillary Vistas 2.0, and PASCAL-Part-108 demonstrate that HiPoSeg delivers consistent performance improvements, with an average +3.07\%p mIoU gain and no additional inference cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4557. Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
๐ฌ 21
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Learning Visual Question Answering Reinforcement Learning
Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose \textit{Wiki-R1}, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce \textit{controllable curriculum data generation}, which manipulates the retriever to produce samples at desired difficulty levels, and a \textit{curriculum sampling strategy} that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5\% to 37.1\% on Encyclopedic VQA and from 40.1\% to 44.1\% on InfoSeek.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4558. DistillKac: Few-Step Image Generation via Damped Wave Equations
๐ฌ 15
โญ 4.50
๐ generative models
๐ท๏ธ generative models Kac flow damped wave equation telegrapher equation finite-speed probability flow classifier-free guidance endpoint distillation few-step sampling
We present DistillKac, a fast image generator that uses the damped wave equation and its stochastic Kac representation to move probability mass at finite speed. In contrast to diffusion models whose reverse-time velocities can become stiff and implicitly allow unbounded propagation speed, Kac dynamics enforce finite-speed transport and yield globally bounded kinetic energy. Building on this structure, we introduce classifier-free guidance in velocity space that preserves square integrability under mild conditions. We then propose endpoint-only distillation that trains a student to match a frozen teacher over long intervals. We prove a stability result that promotes supervision at the endpoints to closeness along the entire path. Experiments demonstrate DistillKac delivers high quality samples with very few function evaluations while retaining the numerical stability benefits of finite-speed probability flows.
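For reference, the standard one-dimensional damped wave (telegrapher) equation underlying Kac dynamics; the paper's exact parameterization may differ:
\[
  \partial_{tt}\, p(x,t) \;+\; 2a\,\partial_{t}\, p(x,t) \;=\; c^{2}\,\partial_{xx}\, p(x,t),
\]
where, in Kac's representation, $p$ is the law of a particle moving at speed $c$ whose velocity flips sign at the jump times of a Poisson process with rate $a$, so probability mass propagates at finite speed $c$ rather than instantaneously as in the diffusion limit.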
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4559. SAQ: Stabilizer-Aware Quantum Error Correction Decoder
๐ฌ 18
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Deep learning Quantum Information Error Correcting Codes
Quantum Error Correction (QEC) decoding faces a fundamental accuracy-efficiency tradeoff. Classical methods like Minimum Weight Perfect Matching (MWPM) exhibit variable performance across noise models and suffer from polynomial complexity, while tensor network decoders achieve high accuracy but at prohibitively high computational cost. Recent neural decoders reduce complexity but lack the accuracy needed to compete with computationally expensive classical methods. We introduce SAQ-Decoder, a unified framework combining transformer-based learning with constraint-aware post-processing that achieves both near-Maximum-Likelihood (ML) accuracy and linear computational scalability with respect to the syndrome size. Our approach combines a dual-stream transformer architecture that processes syndromes and logical information with asymmetric attention patterns, and a novel differentiable logical loss that directly optimizes Logical Error Rates (LER) through smooth approximations over finite fields.
SAQ-Decoder achieves high-accuracy decoding, with error thresholds of 10.99\% (independent noise) and 18.6\% (depolarizing noise) on toric codes that closely approach the theoretical ML bounds of 11.0\% and 18.9\%, while outperforming existing neural and classical baselines in accuracy, complexity, and parameter efficiency. Our findings establish that learned decoders can simultaneously achieve competitive decoding accuracy and computational efficiency, addressing key requirements for practical fault-tolerant quantum computing systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4560. TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA
๐ฌ 26
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Knowledge Transfer PEFT
Large Language Models (LLMs) are widely applied in real-world scenarios, but fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs, but the adapted parameters depend on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data, but this adds complexity because it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we show that TiTok is consistently effective, achieving average performance gains of 4-8% over baselines.
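A minimal sketch of a token-wise contrastive excess, assuming it is computed as the per-token log-probability difference between the source model with and without its LoRA adapter; TiTok's actual scoring and filtering pipeline is more involved.

```python
# Sketch only: the contrastive excess is taken to be the per-token log-probability
# difference between the LoRA-adapted and base source model; the filtering rule
# below (keep the top half) is an illustrative stand-in for TiTok's pipeline.
import torch

def contrastive_excess(logprobs_lora, logprobs_base, token_ids):
    # logprobs_*: (seq_len, vocab) log-softmax outputs; token_ids: (seq_len,)
    idx = token_ids.unsqueeze(-1)
    lp_lora = logprobs_lora.gather(-1, idx).squeeze(-1)
    lp_base = logprobs_base.gather(-1, idx).squeeze(-1)
    return lp_lora - lp_base          # positive -> token carries LoRA-specific knowledge

seq, vocab = 16, 1000
lp_lora = torch.log_softmax(torch.randn(seq, vocab), dim=-1)   # stand-in outputs
lp_base = torch.log_softmax(torch.randn(seq, vocab), dim=-1)
tokens = torch.randint(0, vocab, (seq,))
excess = contrastive_excess(lp_lora, lp_base, tokens)
keep = excess > excess.quantile(0.5)  # retain the more informative tokens
```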
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4561. Judo: A Juxtaposed Domain-oriented Multimodal Reasoner for Industrial Anomaly QA
๐ฌ 13
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Industrial Anomaly QA Large Multimodal models
Industrial anomaly detection has been significantly advanced by large multimodal models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, the lack of domain-specific knowledge in LMMs limits the accurate generation of responses in complex industrial scenarios. In this work, we present JUDO (Juxtaposed Domain-Oriented Multimodal Reasoner), a framework that efficiently incorporates domain knowledge and context into visual and text reasoning. The visual reasoning component provides detailed inspection by segmenting the defect region in the query image while juxtaposing it with a normal image as visual domain context, enabling fine-grained visual comparative analysis. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with three tailored rewards. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding. The implementation of JUDO can be found at https://anonymous.4open.science/r/JUDO-9C8B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4562. DoFlow: Flow-based Generative Models for Interventional and Counterfactual Forecasting on Time Series
๐ฌ 21
โญ 4.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Time Series Causal Inference Generative Models Flow Matching
Time-series forecasting increasingly demands not only accurate observational predictions but also causal forecasting under interventional and counterfactual queries in multivariate systems. We present DoFlow, a flow-based generative model defined over a causal Directed Acyclic Graph (DAG) that delivers coherent observational and interventional predictions, as well as counterfactuals through the natural encoding-decoding mechanism of continuous normalizing flows (CNFs). We also provide a supporting counterfactual recovery theory under certain assumptions. Beyond forecasting, DoFlow provides explicit likelihoods of future trajectories, enabling principled anomaly detection. Experiments on synthetic datasets with various causal DAG structures and real-world hydropower and cancer-treatment time series show that DoFlow achieves accurate system-wide observational forecasting, enables causal forecasting over interventional and counterfactual queries, and effectively detects anomalies. This work contributes to the broader goal of unifying causal reasoning and generative modeling for complex dynamical systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4563. Relative Value Learning
๐ฌ 14
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Relative Value Learning On-Policy Actor-Critic GAE PPO
In reinforcement learning (RL), critics traditionally learn absolute state values, estimating how good a particular situation is in isolation. Adding any constant to $V(s)$ leaves action preferences unchanged. Thus only value differences are relevant for decision making.
Motivated by this fact, we ask whether these differences can be learned directly. To this end, we propose \emph{Relative Value Learning} (RV), a framework that considers antisymmetric value differences $\Delta(s_i, s_j) = V(s_i) - V(s_j)$. We define a new pairwise Bellman operator and prove it is a $\gamma$-contraction with a unique fixed point equal to the true value differences, derive well-posed $1$-step/$n$-step/$\lambda$-return targets, and reconstruct generalized advantage estimation from pairwise differences to obtain an unbiased policy-gradient estimator (R-GAE). Besides these theoretical contributions, we integrate RV with PPO and achieve competitive performance on the Atari benchmark (49 games, ALE) compared to standard PPO, indicating that relative value estimation is an effective alternative to absolute critics. Source code will be made available.
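One natural one-step form consistent with the abstract's definition (the paper's exact operator and targets may be stated differently): for on-policy transitions $(s_i, r_i, s_i')$ and $(s_j, r_j, s_j')$,
\[
  (\mathcal{T}\Delta)(s_i, s_j) \;=\; \mathbb{E}\big[\, r_i - r_j + \gamma\, \Delta(s_i', s_j') \,\big],
\]
which is a $\gamma$-contraction in the sup norm and, because each $V$ satisfies the ordinary Bellman equation, has the true difference $V(s_i) - V(s_j)$ as its unique fixed point.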
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4564. MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models
๐ฌ 19
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ generative models procedural materials appearance modeling multimodal learning program synthesis
Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structures and intermediate states provide an intuitive understanding and workflow for interactive appearance modeling. Creating such graphs is a challenging task and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as *textual* programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a *multimodal* program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4565. Efficient Best-of-Both-Worlds Algorithms for Contextual Combinatorial Semi-Bandits
๐ฌ 13
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ best-of-both-worlds combinatorial semi-bandits follow-the-regularized-leader
We introduce the first best-of-both-worlds algorithm for contextual combinatorial semi-bandits that simultaneously guarantees $\widetilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial regime and $\widetilde{\mathcal{O}}(\ln T)$ regret in the corrupted stochastic regime. Our approach builds on the Follow-the-Regularized-Leader (FTRL) framework equipped with a Shannon entropy regularizer, yielding a flexible method that admits efficient implementations.
Beyond regret bounds, we tackle the practical bottleneck in FTRL (or, equivalently, Online Stochastic Mirror Descent) arising from the high-dimensional projection step encountered in each round of interaction. By leveraging the Karush-Kuhn-Tucker conditions, we transform the $K$-dimensional convex projection problem into a single-variable root-finding problem, dramatically accelerating each round. Empirical evaluations demonstrate that this combined strategy not only attains the attractive regret bounds of best-of-both-worlds algorithms but also delivers substantial per-round speed-ups, making it well-suited for large-scale, real-time applications.
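A minimal sketch of the kind of one-variable reduction described: for an entropy-regularized step over the polytope $\{x \in [0,1]^K : \sum_k x_k = m\}$, the KKT conditions give $x_k = \min(1, \exp(\theta_k - \lambda))$ for a single scalar $\lambda$, which bisection finds quickly. The FTRL bookkeeping and loss estimators are omitted, and the exact form used in the paper may differ.

```python
# Sketch only: entropy-regularized projection onto {x in [0,1]^K : sum(x) = m}
# reduced to one-dimensional root finding via the KKT conditions; FTRL bookkeeping
# and loss estimation are omitted, and the paper's exact form may differ.
import numpy as np

def project_entropy(theta, m, iters=80):
    def x_of(lam):
        return np.minimum(1.0, np.exp(theta - lam))   # KKT form of the optimum
    lo = theta.min() - np.log(len(theta))             # here sum(x) >= K >= m
    hi = theta.max() + 50.0                           # here sum(x) ~ 0 < m
    for _ in range(iters):                            # sum(x_of(lam)) decreases in lam
        mid = 0.5 * (lo + hi)
        if x_of(mid).sum() > m:
            lo = mid
        else:
            hi = mid
    return x_of(0.5 * (lo + hi))

theta = np.random.randn(1000)                         # FTRL "scores" for K = 1000 arms
x = project_entropy(theta, m=10)
print(x.sum(), x.max())                               # ~10.0 and <= 1.0
```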
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4566. A New Initialization to Control Gradients in Sinusoidal Neural Networks
๐ฌ 19
โญ 4.50
๐ learning theory
๐ท๏ธ Initialization Strategy Deep Neural Networks Sinusoidal Activations Gradient Control Implicit Neural Representations Neural Tangent Kernel
A proper initialisation strategy is of primary importance for mitigating gradient explosion or vanishing when training neural networks. Yet, the impact of initialisation parameters still lacks a precise theoretical understanding for several well-established architectures. Here, we propose a new initialisation for networks with sinusoidal activation functions such as \texttt{SIREN}, focusing on controlling the gradients, their scaling with network depth, and their impact on training and generalization. To achieve this, we identify a closed-form expression for the initialisation of the parameters, differing from the original \texttt{SIREN} scheme. This expression is derived from fixed points obtained through the convergence of the pre-activation distribution and the variance of Jacobian sequences. Controlling the gradients while targeting vanishing pre-activations helps prevent the emergence of inappropriate frequencies during estimation, thereby improving generalization. We further show that this initialisation strongly influences training dynamics through the Neural Tangent Kernel (NTK) framework. Finally, we benchmark \texttt{SIREN} with the proposed initialisation against the original scheme and other baselines on function fitting and image reconstruction. The new initialisation consistently outperforms state-of-the-art methods across a wide range of reconstruction tasks, including those involving physics-informed neural networks.
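For concreteness, the original \texttt{SIREN} initialisation (Sitzmann et al., 2020) that the proposed scheme is benchmarked against; the new closed-form initialisation itself is not reproduced here.

```python
# Baseline only: the original SIREN initialisation (Sitzmann et al., 2020) that the
# proposed closed-form scheme is compared against; biases are simply zeroed here.
import math
import torch
import torch.nn as nn

class Sine(nn.Module):
    def __init__(self, omega_0: float = 30.0):
        super().__init__()
        self.omega_0 = omega_0
    def forward(self, x):
        return torch.sin(self.omega_0 * x)

def siren_init(model: nn.Sequential, omega_0: float = 30.0) -> None:
    linears = [m for m in model if isinstance(m, nn.Linear)]
    with torch.no_grad():
        for i, layer in enumerate(linears):
            n = layer.in_features
            # First layer: U(-1/n, 1/n); hidden layers: U(-sqrt(6/n)/omega_0, ...).
            bound = 1.0 / n if i == 0 else math.sqrt(6.0 / n) / omega_0
            layer.weight.uniform_(-bound, bound)
            layer.bias.zero_()

net = nn.Sequential(nn.Linear(2, 256), Sine(), nn.Linear(256, 256), Sine(), nn.Linear(256, 1))
siren_init(net)
```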
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4567. Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization
๐ฌ 15
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LLM Agent Summarization
Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4568. A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers
๐ฌ 18
โญ 4.50
๐ generative models
๐ท๏ธ conditional embeddings diffusion models generative AI transformer-based diffusion sparse representation learning efficient learning
Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions--removing up to two-thirds of the embedding space--we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
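A minimal sketch of the two measurements described, using random embeddings with a shared offset as a stand-in for a diffusion transformer's class-embedding table: mean pairwise cosine (angular) similarity, and magnitude-based pruning that keeps one-third of the dimensions.

```python
# Sketch only: random vectors with a shared offset stand in for a real diffusion
# transformer's class-embedding table.
import torch
import torch.nn.functional as F

emb = torch.randn(1000, 768) + 5.0 * torch.randn(768)   # shared offset -> high similarity

normed = F.normalize(emb, dim=-1)
cos = normed @ normed.T
off_diag = ~torch.eye(len(emb), dtype=torch.bool)
print(f"mean pairwise cosine similarity: {cos[off_diag].mean().item():.4f}")

# Keep the one-third of dimensions with the largest average magnitude (prune two-thirds).
dim_score = emb.abs().mean(dim=0)
keep = dim_score.argsort(descending=True)[: emb.shape[1] // 3]
emb_pruned = emb[:, keep]
print(f"kept {emb_pruned.shape[1]} of {emb.shape[1]} dimensions")
```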
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4569. MSCR: Exploring the Vulnerability of LLMsโ Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement
๐ฌ 23
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Adversarial Attack Mathematical Reasoning
LLMs demonstrate performance comparable to human abilities in complex tasks such as mathematical reasoning, but their robustness in mathematical reasoning under minor input perturbations still lacks systematic investigation. Existing methods generally suffer from limited scalability, weak semantic preservation, and high costs. Therefore, we propose MSCR, an automated adversarial attack method based on multi-source candidate replacement. By combining three information sources including cosine similarity in the embedding space of LLMs, the WordNet dictionary, and contextual predictions from a masked language model, we generate for each word in the input question a set of semantically similar candidates, which are then filtered and substituted one by one to carry out the attack. We conduct large-scale experiments on LLMs using the GSM8K and MATH500 benchmarks. The results show that even a slight perturbation involving only a single word can significantly reduce the accuracy of all models, with the maximum drop reaching 49.89\% on GSM8K and 35.40\% on MATH500, _while preserving the high semantic consistency of the perturbed questions._ Further analysis reveals that perturbations not only lead to incorrect outputs but also substantially increase the average response length, which results in more redundant reasoning paths and higher computational resource consumption. These findings highlight the robustness deficiencies and efficiency bottlenecks of current LLMs in mathematical reasoning tasks.
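A hedged sketch of two of the three candidate sources described above (WordNet synonyms and masked-language-model predictions); the embedding-space source, the filtering step, and the one-by-one substitution loop are omitted, and the model name is illustrative.

```python
# requires: pip install nltk transformers; nltk.download("wordnet")
from nltk.corpus import wordnet as wn
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def replacement_candidates(sentence: str, word: str, top_k: int = 5) -> set[str]:
    """Collect semantically similar replacement candidates for one word."""
    # Source 1: WordNet dictionary synonyms.
    wn_cands = {l.name().replace("_", " ")
                for s in wn.synsets(word) for l in s.lemmas() if l.name() != word}
    # Source 2: contextual predictions from a masked language model.
    masked = sentence.replace(word, fill.tokenizer.mask_token, 1)
    mlm_cands = {p["token_str"].strip() for p in fill(masked, top_k=top_k)}
    return wn_cands | mlm_cands

print(replacement_candidates("A farmer sells 15 apples each day.", "sells"))
```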
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4570. RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers
๐ฌ 16
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ LLM Router Evaluation
Today's LLM ecosystem comprises a wide spectrum of models that differ in size, capability, and cost. No single model is optimal for all scenarios; hence, LLM routers have become essential for selecting the most appropriate model under varying circumstances. However, the rapid emergence of various routers makes choosing the right one increasingly challenging. To address this problem, we need a comprehensive router comparison and a standardized leaderboard, similar to those available for models. In this work, we introduce RouterArena, the first open platform enabling comprehensive comparison of LLM routers. RouterArena has (1) a principled, carefully constructed dataset with broad knowledge-domain coverage, (2) distinguishable difficulty levels for each domain, (3) an extensive list of evaluation metrics, and (4) an automated framework for leaderboard updates. Leveraging our framework, we have produced the initial leaderboard with a detailed metrics comparison as shown in Figure 1. We will make our platform open to the public; the current code base is available here: https://anonymous.4open.science/r/RouterArena-1D4B/README.md
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4571. h-MINT: Modeling Pocket-Ligand Binding with Hierarchical Molecular Interaction Network
๐ฌ 19
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Binding Affinity Prediction BPE Virtual Screen
Accurate molecular representations are critical for drug discovery, and a central challenge lies in capturing the chemical environment of molecular fragments, as key interactions, such as H-bond and π stacking, occur only under specific local conditions. Most existing approaches represent molecules as atom-level graphs; however, individual atoms cannot express stereochemistry, lone pairs, conjugation, and other complex features. Fragment-based methods (e.g., principal subgraph or functional group libraries) fail to preserve essential information such as chirality, aromatic bond integrity, and ionic states. This work addresses these limitations from two aspects. (i) **OverlapBPE tokenization**. We propose a novel data-driven molecule tokenization method. Unlike existing approaches, our method allows overlapping fragments, reflecting the inherently fuzzy boundaries of small-molecule substructures and, together with enriched chemical information at the token level, preserving a more complete chemical context. (ii) **h-MINT model**. We develop a hierarchical molecular interaction network capable of jointly modeling drug-target interactions at both atom and fragment levels. By supporting fragment overlaps, the model naturally accommodates the many-to-many atom-fragment mappings introduced by the OverlapBPE scheme. Extensive evaluation against state-of-the-art methods shows our method improves binding affinity prediction by 2-4% Pearson/Spearman correlation on PDBBind and LBA, enhances virtual screening by 1-3% in key metrics on DUD-E and LIT-PCBA, and achieves the best overall HTS performance on PubChem assays. Further analysis demonstrates that our method effectively captures interactive information while maintaining good generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4572. ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
๐ฌ 28
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Interpretability Circuit Multi-Head Attention Scaling Jailbreak Guard
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes.
Tense-based jailbreaking, in which models that refuse a harmful request comply once it is rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood.
In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability.
For the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking, the tense-changing attack.
Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense-vulnerable heads.
Lastly, we apply it in a ``preventative fine-tuning'' step, forcing the model to learn a more robust refusal mechanism.
Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility.
Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis.
Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4573. MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
๐ฌ 21
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Mobile Agent Reinforcement Learning
Building general-purpose graphical user interface (GUI) agents has become increasingly promising with the progress in vision language models. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging due to the heavy-tailed distribution of task difficulty and the inefficiency of large-scale environment sampling. We present MOBILERL, an online agentic reinforcement learning framework for enhancing GUI agents in mobile environments. Its core component is the Difficulty-Adaptive GRPO (ADAGRPO) algorithm. In ADAGRPO, we design difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to different task difficulties. We introduce a shortest-path reward adjustment strategy that reshapes rewards with respect to task length in multi-turn agentic tasks. These strategies jointly stabilize RL training, improve sample efficiency, and yield strong performance across diverse mobile apps and tasks. We apply MOBILERL to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resulting MOBILERL-9B model achieves state-of-the-art success rates on both AndroidWorld (80.2%) and Android-Lab (53.6%). The MOBILERL framework is open-sourced at https://anonymous.4open.science/r/MobileRL-iclr-4513.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4574. SURGE: Surprise-Guided Token Reduction for Efficient Video Understanding with VLMs
๐ฌ 5
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Efficient Video Understanding Vision-Language Models Token Pruning Redundancy Reduction Predictive Coding
Videos contain rich information but also high redundancy, as consecutive frames often share similar backgrounds and predictable motions. Current video-language models (VLMs) are unable to exploit this redundancy and therefore perform a significant amount of superfluous computation, processing thousands of patch tokens even when little new information is present. What is missing is an on-the-fly, model-agnostic signal of temporal predictability to decide whether tokens carry unpredictable information that merits computation. We propose SURGE, a training-free and backbone-agnostic method that measures surprise in token space. Surprise scores are defined by the prediction error of each token from its recent history; high-surprise tokens are retained, while predictable ones are pruned. Aggregating scores over time produces a surprise curve that highlights key events, which can be further refined with CLIP-based query relevance to form a compact spatio-temporal mask. Experiments on multiple video understanding benchmarks show that SURGE reduces tokens by up to 7$\times$ and prefill cost by 86โ98\%, while maintaining accuracy within $\pm$1 point of full-token baselines. By aligning computation with novelty, SURGE enables video VLMs to handle long contexts efficiently and without retraining.
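A minimal sketch of a surprise score in token space, using the simplest possible history predictor (the co-located patch token from the previous frame); the actual predictor, the retention ratio, and the CLIP-based refinement in SURGE are not reproduced here.

```python
import torch

def surprise_prune(tokens: torch.Tensor, keep_ratio: float = 0.15):
    """tokens: (frames, patches, dim) patch embeddings of one clip.

    Surprise = prediction error of each token from its recent history; here the
    prediction is simply the same patch in the previous frame. Returns the flat
    indices of the most surprising (kept) tokens and a per-frame surprise curve."""
    err = (tokens[1:] - tokens[:-1]).norm(dim=-1)     # (frames-1, patches)
    k = max(1, int(keep_ratio * err.numel()))
    keep_idx = err.flatten().topk(k).indices          # high-surprise tokens to retain
    surprise_curve = err.mean(dim=1)                  # highlights key events over time
    return keep_idx, surprise_curve

idx, curve = surprise_prune(torch.randn(16, 196, 768))
```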
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4575. Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning
๐ฌ 24
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ representation learning;machine learning for computer vision
Training a single network with multiple objectives often leads to conflicting gradients that degrade shared representations, forcing them into a compromised state that is suboptimal for any single task, a problem we term latent representation collapse. We introduce Domain Expansion, a framework that prevents these conflicts by restructuring the latent space itself. Our framework uses a novel orthogonal pooling to construct a latent space where each objective is assigned to a mutually orthogonal subspace. We validate our approach on the ShapeNet benchmark, simultaneously training a model for object classification and pose estimation. Our experiments demonstrate that this structure not only prevents collapse but also yields an explicit, interpretable, and compositional latent space where concepts can be directly manipulated.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4576. Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
๐ฌ 12
โญ 4.50
๐ reinforcement learning
๐ท๏ธ RL POMDP Memory Classification
The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the use of past information, adaptation to novel environments, and improved sample efficiency. However, the term ``memory'' encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types, such as long-term vs. short-term memory and declarative vs. procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory by conducting experiments with different RL agents, and we show what violating it leads to.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4577. Boosting Open Set Recognition Performance through Modulated Representation Learning
๐ฌ 36
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Open set recognition representation learning
The open set recognition (OSR) problem aims to identify test samples from novel semantic classes that are not part of the training classes, a task that is crucial in many practical scenarios. However, the existing OSR methods apply a constant scaling factor (the temperature) to the logits before applying a loss function, which hinders the model from exploring both ends of the spectrum in representation learning -- from instance-level to class-specific features. In this paper, we address this problem by enabling temperature-modulated representation learning using a set of proposed temperature schedules, including our novel negative cosine schedule. Our temperature schedules allow the model to form a coarse decision boundary at the beginning of training by focusing on fewer neighbors, and to gradually prioritize more neighbors to smooth out the rough edges. This gradual task switching leads to a richer and more generalizable representation space. While other OSR methods benefit from including regularization or auxiliary negative samples, such as with mix-up, thereby adding a significant computational overhead, our schedules can be folded into any existing OSR loss function with no overhead. We implement the novel schedule on top of a number of baselines, using cross-entropy, contrastive, and ARPL loss functions, and find that it boosts both the OSR and the closed-set performance in most cases, especially on the tougher semantic-shift benchmarks.
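One way to read the proposed schedules is as a scalar temperature that varies over training; the abstract does not give the functional form of the negative cosine schedule, so the endpoints and the direction of the sweep below are assumptions, not the paper's values.

```python
import math

def negative_cosine_temperature(epoch: int, total_epochs: int,
                                tau_start: float = 0.1, tau_end: float = 1.0) -> float:
    """Illustrative 'negative cosine' temperature schedule: moves from tau_start to
    tau_end along 1 - cos(pi * t / T). Endpoint values and sweep direction are
    assumptions; the schedule is folded into the loss by scaling the logits."""
    t = epoch / max(1, total_epochs)
    return tau_start + (tau_end - tau_start) * 0.5 * (1.0 - math.cos(math.pi * t))

# usage: loss = F.cross_entropy(logits / negative_cosine_temperature(epoch, T), targets)
```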
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4578. Can LLMs Reason Soundly in Law? Auditing Inference Patterns for Legal Judgment
๐ฌ 16
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Model Value Alignment Trustworthiness
This paper presents a method to analyze the inference patterns used by Large Language Models (LLMs) for judgment in a case study on legal LLMs, so as to identify potential incorrect representations of the LLM, according to human domain knowledge. Unlike traditional evaluations on language generation results, we propose to evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs. To this end, we quantify the interactions between input phrases used by the LLM as primitive inference patterns, because recent theoretical achievements have proven several mathematical guarantees of the faithfulness of the interaction-based explanation. We design a set of metrics to evaluate the detailed inference patterns of LLMs. Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4579. DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
๐ฌ 22
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Large Reasoning Models Reasoning Efficiency Reinforcement Learning
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like ``overthinking'' simple problems and ``underthinking'' complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as "Simple" or "Hard" in real-time based on the model's evolving capability. It encourages shorter, more efficient reasoning for "Simple" problems while promoting longer, more exploratory thought chains for "Hard" problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4580. Atomic HINs: Entity-Attribute Duality for Heterogeneous Graph Modeling
๐ฌ 17
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Heterogeneous Information Networks Heterogeneous Graph Neural Networks Graph Representation Learning
Heterogeneous Information Networks (HINs) provide a powerful framework for modeling multi-typed entities and relations, typically defined under a fixed schema. Yet, most research assumes this structure is given, overlooking the fact that alternative designs can emphasize different aspects of the data and substantially influence downstream performance.
As a theoretical foundation for such designs, we introduce the principle of entity-attribute duality: attributes can be atomized as entities with their associated relations, while entities can, in turn, serve as attributes of others. This principle motivates atomic HIN, a canonical representation that makes all modeling choices explicit and achieves maximal expressiveness.
Building on this foundation, we propose a systematic framework for task-specific schema refinement.
Within this framework, we demonstrate that widely used benchmarks correspond to heuristic refinements of the atomic HIN---often far from optimal.
Across eight datasets, refinement alone enables a simplified Relational GCN (sRGCN) to reach state-of-the-art performance on node- and link-level tasks, with further gains from advanced HGNNs. These results highlight schema design as a key dimension in heterogeneous graph modeling.
By releasing the atomic HINs, searched schemas, and refinement framework, we enable principled benchmarking and open the way for future work on schema-aware learning, automated structure discovery, and next-generation HGNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4581. MeSH: Memory-as-State-Highways for Recursive Transformers
๐ฌ 18
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Recursive Transformer Language Model Parameter Sharing Parameter Efficiency
Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth.
However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts.
By probing hidden states, we trace this performance gap to two primary bottlenecks: __undifferentiated computation__, where the core is forced to adopt a similar computational pattern at every iteration, and __information overload__, where long-lived and transient information must coexist in a single hidden state.
To address the issues, we introduce a **Me**mory-as-**S**tate-**H**ighways **(MeSH)** scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations.
Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperform the larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06\% with 33\% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4582. Unbiased Gradient Estimation for Event Binning via Functional Backpropagation
๐ฌ 12
โญ 4.50
๐ learning theory
๐ท๏ธ Event Camera Gradient Computation Automatic Differentiation Functional Derivative
Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2\% lower RMS error and 1.57$\times$ faster convergence. On complex downstream tasks, we achieve 9.4\% lower EPE in self-supervised optical flow, and 5.1\% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception.
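The functional lift rests on ordinary integration by parts: schematically, $\int b'(t)\,c(t)\,dt = \big[b(t)\,c(t)\big] - \int b(t)\,c'(t)\,dt$, where $b$ is the (possibly discontinuous) binning function and $c$ the reconstructed cotangent function, so the derivative is shifted from the non-smooth $b$ onto the smooth $c$ during backpropagation; boundary terms and the exact function spaces used in the paper are omitted in this schematic reading.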
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4583. Pulp Motion: Framing-aware multimodal camera and human motion generation
๐ฌ 15
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Camera generation Human motion generation Generative model mulrt
Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space.
In this paper, we are the first to cast this task as a text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories.
We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to a more precise joint distribution.
We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear mapping from the human and camera latents to a framing latent. We then introduce Auxiliary Sampling, which exploits this linear map to steer generation toward a coherent framing modality.
To support this task, we also introduce the PulpMotion dataset, a camera-motion and human-motion dataset with rich captions, and high-quality human motions.
Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent camera-human motions, while also achieving gains on textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting a new state of the art for this task.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4584. InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
๐ฌ 17
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ LLM Agents Tool-Augmented Agents Model Context Protocol
Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools, and more importantly, whether they can integrate them with general-purpose search to solve complex tasks.
Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality.
Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2\% accuracy and 67.5\% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4\% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4585. A Two-Phase Deep Learning Framework for Adaptive Time-Stepping in High-Speed Flow Modeling
๐ฌ 19
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Partial Differential Equations PDEs CFD
We consider the problem of modeling high-speed flows using machine learning methods. While most prior studies focus on low-speed fluid flows in which uniform time-stepping is practical, flows approaching and exceeding the speed of sound exhibit sudden changes such as shock waves. In such cases, it is essential to use adaptive time-stepping methods to allow a temporal resolution sufficient to resolve these phenomena while simultaneously balancing computational costs. Here, we propose a two-phase machine learning method, known as ShockCast, to model high-speed flows with adaptive time-stepping. In the first phase, we propose to employ a machine learning model to predict the timestep size. In the second phase, the predicted timestep is used as an input along with the current fluid fields to advance the system state by the predicted timestep. We explore several physically-motivated components for timestep prediction and introduce timestep conditioning strategies inspired by neural ODE and Mixture of Experts. We evaluate our methods by generating three supersonic flow datasets.
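The two-phase rollout described above reduces to a simple loop; the network names and architectures below (`timestep_net`, `flow_net`) are hypothetical stand-ins for the two learned phases, and the dt conditioning is only one possible choice.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: phase 1 predicts a positive timestep, phase 2 advances the fields.
timestep_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1), nn.Softplus())
flow_net = nn.Conv2d(3 + 1, 3, kernel_size=3, padding=1)  # fields + dt channel -> next fields

def rollout(fields: torch.Tensor, n_steps: int):
    """Adaptive-time-stepping rollout on a (3, 64, 64) field snapshot."""
    t = 0.0
    for _ in range(n_steps):
        dt = timestep_net(fields.unsqueeze(0)).squeeze()              # phase 1: timestep size
        dt_map = dt.expand(1, 1, *fields.shape[-2:])                  # condition phase 2 on dt
        fields = flow_net(torch.cat([fields.unsqueeze(0), dt_map], dim=1)).squeeze(0)
        t += float(dt)                                                # physical time advances unevenly
    return fields, t

state, elapsed = rollout(torch.randn(3, 64, 64), n_steps=10)
```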
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4586. Exploring Real-Time Super-Resolution: Benchmarking and Fine-Tuning for Streaming Content
๐ฌ 13
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ super-resolution dataset benchmark real-time quality assessment video compression
Recent advancements in real-time super-resolution have enabled higher-quality video streaming, yet existing methods struggle with the unique challenges of compressed video content. Commonly used datasets do not accurately reflect the characteristics of streaming media, limiting the relevance of current benchmarks. To address this gap, we introduce a new comprehensive dataset - $\textbf{StreamSR}$ - sourced from YouTube, covering a wide range of video genres and resolutions representative of real-world streaming scenarios. We benchmark 11 state-of-the-art real-time super-resolution models to evaluate their performance for the streaming use-case.
Furthermore, we propose $\textbf{EfRLFN}$, an efficient real-time model that integrates Efficient Channel Attention and a hyperbolic tangent activation function - a novel design choice in the context of real-time super-resolution. We extensively optimized the architecture to maximize efficiency and designed a composite loss function that improves training convergence. EfRLFN combines the strengths of existing architectures while improving both visual quality and runtime performance.
Finally, we show that fine-tuning other models on our dataset results in significant performance gains that generalize well across various standard benchmarks. We made the dataset, the code, and the benchmark available at $\textit{[Link is redacted]}$.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4587. Intrinsic Lorentz Neural Network
๐ฌ 20
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ hyperbolic; lorentz model
Real-world data frequently exhibit latent hierarchical structures, which can be naturally represented by hyperbolic geometry. Although recent hyperbolic neural networks have demonstrated promising results, many existing architectures remain partially intrinsic, mixing Euclidean operations with hyperbolic ones or relying on extrinsic parameterizations. To address it, we propose the \emph{Intrinsic Lorentz Neural Network} (ILNN), a fully intrinsic hyperbolic architecture that conducts all computations within the Lorentz model. At its core, the network introduces a novel \emph{point-to-hyperplane} fully connected layer (FC), replacing traditional Euclidean affine logits with closed-form hyperbolic distances from features to learned Lorentz hyperplanes, thereby ensuring that the resulting geometric decision functions respect the inherent curvature.
Around this fundamental layer, we design intrinsic modules: GyroLBN, a Lorentz batch normalization that couples gyro-centering with gyro-scaling, consistently outperforming both LBN and GyroBN while reducing training time. We additionally propose a gyro-additive bias for the FC output, a Lorentz patch-concatenation operator that aligns the expected log-radius across feature blocks via a digamma-based scale, and a Lorentz dropout layer.
Extensive experiments conducted on CIFAR-10/100 and two genomic benchmarks (TEB and GUE) illustrate that ILNN achieves state-of-the-art performance and computational cost among hyperbolic models and consistently surpasses strong Euclidean baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4588. Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
๐ฌ 28
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Respiratory sounds multimodal learning LLM-based generation medical agents
Deep learning-based respiratory auscultation is currently hindered by two fundamental disconnects: the representation gap, where compressing signals into spectrograms discards transient acoustic events and clinical context, and the data gap, characterized by severe class imbalance and scarcity. To bridge these gaps, we present **_Resp-Agent_**, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA). Unlike static pipelines, the Thinker-A²CA acts as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. Under this unified orchestration, we propose two specialized architectural solutions. First, to address the representation gap, we introduce a Modality Weaving Diagnoser. This module moves beyond standard fusion by explicitly interleaving electronic health records (EHR) with audio tokens and employs Strategic Global Attention to capture long-range clinical dependencies while retaining sensitivity to millisecond-level transient events via sparse audio anchors. Second, to resolve the data gap, we design a Flow Matching Generator that retools a text-only Large Language Model (LLM) via modality injection. Guided by the Thinker-A²CA, this generator decouples pathological content from acoustic style to programmatically synthesize high-fidelity, hard-to-diagnose samples that remedy the system's boundary errors. To support this work, we construct **_Resp-229k_**, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that our agentic co-design consistently outperforms prior approaches, advancing robust and deployable respiratory intelligence. Data and code will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4589. RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation
๐ฌ 11
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ RAG Large Language Models Question Answering Knowledge Graphs Graph LLM
Large language models (LLMs) have achieved impressive performance on knowledge-intensive tasks, yet they often struggle with multi-step reasoning due to the unstructured nature of retrieved context. While retrieval-augmented generation (RAG) methods provide external information, the lack of explicit organization among retrieved passages limits their effectiveness, leading to brittle reasoning pathways. Recent interpretability studies highlighting the importance of structured intermediate reasoning further align with this perspective.
We propose Retrieval-And-Structuring (RAS), a framework that dynamically constructs question-specific knowledge graphs through iterative retrieval and structured knowledge building. RAS interleaves targeted retrieval planning with incremental graph construction, enabling models to assemble and reason over evolving knowledge structures tailored to each query. On seven knowledge-intensive benchmarks, RAS consistently outperforms strong baselines, achieving up to 8.7\% and 7.0\% gains with proprietary and open-source LLMs, respectively. Our results demonstrate that dynamic, question-specific knowledge structuring offers a robust path to improving reasoning accuracy and robustness in language model generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4590. VCWorld: A Biological World Model for Virtual Cell Simulation
๐ฌ 21
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Virtual Cell Large Language Models Perturb-seq
Virtual cell modeling aims to predict cellular responses to perturbations. Existing virtual cell models rely heavily on large-scale single-cell datasets, learning explicit mappings between gene expression and perturbations. Although recent models attempt to incorporate multi-source biological information, their generalization remains constrained by data quality, coverage, and batch effects. More critically, these models often function as black boxes, offering predictions without interpretability or consistency with biological principles, which undermines their credibility in scientific research. To address these challenges, we present VCWorld, a cell-level white-box simulator that integrates structured biological knowledge with the iterative reasoning capabilities of large language models to instantiate a biological world model. VCWorld operates in a data-efficient manner to reproduce perturbation-induced signaling cascades and generates interpretable, stepwise predictions alongside explicit mechanistic hypotheses. In drug perturbation benchmarks, VCWorld achieves state-of-the-art predictive performance, and the inferred mechanistic pathways are consistent with publicly available biological evidence. Our code is publicly available at https://anonymous.4open.science/r/VCWorld-B970.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4591. Does Higher Interpretability Imply Better Utility? A Pairwise Analysis on Sparse Autoencoders
๐ฌ 21
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Sparse Autoencoders; Interpretability; Utility
Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective model behavior steering. Yet a fundamental question remains: does higher interpretability imply better steering utility? To answer this, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels. We evaluate interpretability with SAEBench (Karvonen et al., 2025) and steering utility with AxBench (Wu et al., 2025), and analyze rank agreement via Kendall's rank coefficient $\tau_b$. Our analysis reveals only a relatively weak positive association ($\tau_b \approx 0.298$), indicating that interpretability is an insufficient proxy for steering performance. We conjecture the interpretability-utility gap stems from feature selection: not all SAE features are equally effective for steering. To identify features that truly steer LLM behavior, we propose a novel selection criterion, $\Delta$ Token Confidence, which measures how much amplifying a feature changes the next-token distribution. Our method improves steering performance on three LLMs by **52.52\%** compared to the best prior output-score-based criterion (Arad et al., 2025). Strikingly, after selecting features with high $\Delta$ Token Confidence, the correlation between interpretability and utility vanishes ($\tau_b \approx 0$) and can even become negative, further highlighting their divergence for the most effective steering features.
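A sketch of the $\Delta$ Token Confidence idea as stated above: amplify one SAE feature direction in the residual stream and measure how much the next-token distribution moves. The amplification strength, the distance measure (total variation here), and the `lm_head`/`final_norm` attribute names are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def delta_token_confidence(model, feature_dir: torch.Tensor,
                           hidden: torch.Tensor, alpha: float = 8.0) -> float:
    """hidden: (1, seq, d_model) residual-stream activations at the SAE's layer;
    feature_dir: (d_model,) decoder direction of the SAE feature being scored.
    Returns the total-variation distance between next-token distributions before
    and after amplifying the feature (model.final_norm / model.lm_head are
    hypothetical names for the unembedding path)."""
    base = model.lm_head(model.final_norm(hidden))[:, -1]
    steered = model.lm_head(model.final_norm(hidden + alpha * feature_dir))[:, -1]
    p, q = F.softmax(base, dim=-1), F.softmax(steered, dim=-1)
    return 0.5 * (p - q).abs().sum().item()
```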
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4592. Expressiveness of Multi-Neuron Convex Relaxations in Neural Network Certification
๐ฌ 11
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ certification convex relaxation theory
Neural network certification methods heavily rely on convex relaxations to provide robustness guarantees. However, these relaxations are often imprecise: even the most accurate single-neuron relaxation is incomplete for general ReLU networks, a limitation known as the \emph{single-neuron convex barrier}. While multi-neuron relaxations have been heuristically applied to address this issue, two central questions arise: (i) whether they overcome the convex barrier, and if not, (ii) whether they offer theoretical capabilities beyond those of single-neuron relaxations.
In this work, we present the first rigorous analysis of the expressiveness of multi-neuron relaxations. Perhaps surprisingly, we show that they are inherently incomplete, even when allocated sufficient resources to capture finitely many neurons and layers optimally. This result extends the single-neuron barrier to a \textit{universal convex barrier} for neural network certification.
On the positive side, we show that completeness can be achieved by either (i) augmenting the network with a polynomial number of carefully designed ReLU neurons or (ii) partitioning the input domain into convex sub-polytopes, thereby distinguishing multi-neuron relaxations from single-neuron ones which are unable to realize the former and have worse partition complexity for the latter.
Our findings establish a foundation for multi-neuron relaxations and point to new directions for certified robustness, including training methods tailored to multi-neuron relaxations and verification methods with multi-neuron relaxations as the main subroutine.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4593. Spilled Energy in Large Language Models
๐ฌ 30
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM hallucination detection EBM
We reinterpret the final softmax classifier over the vocabulary of Large Language Models (LLMs) as an Energy-based Model (EBM). This allows us to decompose the chain of probabilities used in sequence-to-sequence modeling as multiple EBMs that interact together at inference time. Our decomposition offers a principled approach to measuring where the "energy spills" in LLM decoding, empirically showing that spilled energy correlates well with factual errors, inaccuracies, biases, and failures. Similar to Orgad et al. (2025), we localize the exact token associated with the answer; yet, unlike them, who need to train a classifier and ablate which activations to feed to it, we propose a method to detect hallucinations that is *completely training-free and naturally generalizes across tasks and LLMs* by using the output logits across subsequent generation steps. We propose two ways to detect hallucinations: the first, which we call **spilled energy**, measures the difference between energy values across two generation steps that mathematically should be equal; the other is **marginal energy**, which we can measure at a single step.
Unlike prior work, our method is training-free, mathematically principled, and demonstrates strong cross-dataset generalization: we scale our analysis to state-of-the-art LLMs, including LLaMa-3, Mistral, and Qwen-3, evaluating on nine benchmarks and achieving competitive performance with robust results across datasets and different LLMs.
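The EBM reading of a softmax head uses the standard free energy $E(x) = -\log\sum_y \exp f(x)[y]$; the exact definitions of spilled and marginal energy are the paper's, so the snippet below only shows the per-step energies those quantities are built from.

```python
import torch

def free_energy(logits: torch.Tensor) -> torch.Tensor:
    """EBM view of the vocabulary softmax: E = -logsumexp(logits) per generation step.
    logits: (seq_len, vocab_size); returns one energy value per decoding step."""
    return -torch.logsumexp(logits, dim=-1)

energies = free_energy(torch.randn(12, 32000))    # toy logits for a 12-token continuation
step_gaps = energies[1:] - energies[:-1]          # raw material for step-to-step comparisons
```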
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4594. Reasoning on Time-Series for Financial Technical Analysis
๐ฌ 22
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time-Series Large Language Models Stock Prediction
While Large Language Models have been used to produce interpretable stock forecasts, they mainly focus on analyzing textual reports rather than historical price data, also known as Technical Analysis. This task is challenging as it switches between domains: the stock price inputs and outputs lie in the time-series domain, while the reasoning step should be in natural language. In this work, we introduce Verbal Technical Analysis (VTA), a novel framework that combines verbal and latent reasoning to produce stock time-series forecasts that are both accurate and interpretable. To reason over time-series, we convert stock price data into textual annotations and optimize the reasoning trace using an inverse Mean Squared Error (MSE) reward objective. To produce time-series outputs from textual reasoning, we condition the outputs of a time-series backbone model on the reasoning-based attributes. Experiments on stock datasets across U.S., Chinese, and European markets show that VTA achieves state-of-the-art forecasting accuracy, while the reasoning traces also perform well on evaluation metrics judged by industry experts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4595. Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
๐ฌ 19
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Context Compression Long Context LLMs LLM Memory
Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for reconstruction objectives that diverge from actual downstream tasks, thereby weakening the features most beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding-task-based compression to an architecture that is equipped with this compression capability $\textit{a priori}$. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) a bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves a 1-point EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.
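A minimal sketch of design (2) above, the bidirectional attention modification: anchor positions attend over the whole context while ordinary positions keep the causal mask. How anchors are selected and how anchor embeddings are injected is not shown.

```python
import torch

def sac_attention_mask(seq_len: int, anchor_positions: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed), shape (seq_len, seq_len).
    Ordinary tokens keep lower-triangular (causal) attention; anchor tokens get full
    bidirectional attention so their KV states can aggregate the entire context."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[anchor_positions, :] = True
    return mask

print(sac_attention_mask(6, anchor_positions=[2, 5]).int())
```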
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4596. CaRe-BN: Precise Moving Statistics for Stabilizing Spiking Neural Networks in Reinforcement Learning
๐ฌ 23
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks Batch Normalization Reinforcement Learning
Spiking Neural Networks (SNNs) offer low-latency and energy-efficient decision-making on neuromorphic hardware by mimicking the event-driven dynamics of biological neurons. However, the discrete and non-differentiable nature of spikes leads to unstable gradient propagation in directly trained SNNs, making Batch Normalization (BN) an important component for stabilizing training. In online Reinforcement Learning (RL), imprecise BN statistics hinder exploitation, resulting in slower convergence and suboptimal policies. While Artificial Neural Networks (ANNs) can often omit BN, SNNs critically depend on it, limiting the adoption of SNNs for energy-efficient control on resource-constrained devices. To overcome this, we propose Confidence-adaptive and Re-calibration Batch Normalization (CaRe-BN), which introduces (i) a confidence-guided adaptive update strategy for BN statistics and (ii) a re-calibration mechanism to align distributions. By providing more accurate normalization, CaRe-BN stabilizes SNN optimization without disrupting the RL training process. Importantly, CaRe-BN does not alter inference, thus preserving the energy efficiency of SNNs in deployment. Extensive experiments on both discrete and continuous control benchmarks demonstrate that CaRe-BN improves SNN performance by up to $22.6$% across different spiking neuron models and RL algorithms. Remarkably, SNNs equipped with CaRe-BN even surpass their ANN counterparts by $5.9$%. These results highlight a new direction for BN techniques tailored to RL, paving the way for neuromorphic agents that are both efficient and high-performing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4597. DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
๐ฌ 19
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ GRPO Advantage Vanishing Reward Sparsity Multimodal LLM Difficulty-Adaptive
Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a traditional critic model, it often suffers from sparse rewards, arising from the scarcity of positive feedback on difficult problems, and from advantage vanishing, which occurs when group-level rewards exhibit high consistency for problems that are too easy or too hard. Existing solutions fall into three categories: sample enhancement and expansion, which may aggravate advantage vanishing due to poor control of the difficulty distribution; selective sample utilization, which fails to fully leverage the value of all data; and indirect reward design, which may introduce biased optimization directions due to misalignment between reasoning and the final outcome. However, these approaches overlook a fundamental question: for a given problem, how can we ensure that the within-group reward distribution of responses exhibits enough variance to yield clear optimization signals for each response? To address these issues, we propose DIVA-GRPO, a difficulty-adaptive variant augmentation advantage method that dynamically adjusts the difficulty distribution of variants for each problem from a global perspective. Our method dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and computes advantages within both local and global (a problem and its variants) groups using difficulty-weighted, normalized scaling. This design alleviates reward sparsity and advantage vanishing, minimizes data waste, and improves training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in both training efficiency and reasoning performance.
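The vanilla GRPO advantage is a within-group z-score of rewards; the sketch below adds the global (problem-plus-variants) group and a scalar difficulty weight in the spirit of the description above, with the exact weighting form being an assumption.

```python
import torch

def grpo_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Vanilla GRPO: normalize rewards within one group of sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def diva_advantage(local_rewards: torch.Tensor, global_rewards: torch.Tensor,
                   difficulty_weight: float, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative blend of local (per-problem) and global (problem + its variants)
    advantages; difficulty_weight in [0, 1] is an assumed scalar knob."""
    a_local = grpo_advantage(local_rewards, eps)
    a_global = (local_rewards - global_rewards.mean()) / (global_rewards.std() + eps)
    return difficulty_weight * a_global + (1.0 - difficulty_weight) * a_local

adv = diva_advantage(torch.tensor([1., 0., 1., 1.]), torch.rand(32), difficulty_weight=0.5)
```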
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4598. NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
๐ฌ 35
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Generative Models Autoregressive Models Diffusion Models Text-to-image
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4599. Long-Text-to-Image Generation via Compositional Prompt Decomposition
๐ฌ 20
โญ 4.50
๐ generative models
๐ท๏ธ Compositionality; Text-to-Image Generation; Generative Model Generalization
While modern text-to-image models excel at generating images from intricate prompts, they struggle to capture the key details when the prompts are expanded into descriptive paragraphs. This limitation stems from the prevalence of short captions in their training data. Existing methods attempt to address this by either fine-tuning on long-prompt data, which generalizes poorly to even longer inputs, or by projecting the oversized inputs into the normal-prompt domain, compromising fidelity. We propose a compositional approach that enables pre-trained models to handle long prompts by breaking them down into manageable components. Specifically, we introduce a trainable PromptDecomposer module to decompose the long prompt into a set of distinct sub-prompts. The pre-trained T2I model processes these sub-prompts in parallel, and their corresponding outputs are merged together using concept conjunction. Our compositional long-text-to-image model achieves performance comparable to models with specialized tuning. Meanwhile, our approach demonstrates superior generalization, outperforming other models by 7.4% on prompts over 500 tokens in the challenging DetailMaster benchmark.
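Concept conjunction is commonly implemented by combining the per-sub-prompt noise predictions around the unconditional one; whether the paper uses exactly this operator is not stated in the abstract, so the sketch below is only one plausible merging rule.

```python
import torch

def conjunction_noise(eps_uncond: torch.Tensor, eps_subprompts: list[torch.Tensor],
                      guidance: float = 7.5) -> torch.Tensor:
    """Merge sub-prompt outputs at each denoising step.
    eps_uncond and each eps_subprompts[i] are U-Net noise predictions of shape
    (B, C, H, W); each sub-prompt contributes its guidance direction."""
    merged = eps_uncond.clone()
    for eps_i in eps_subprompts:
        merged = merged + guidance * (eps_i - eps_uncond)
    return merged

eps = conjunction_noise(torch.randn(1, 4, 64, 64), [torch.randn(1, 4, 64, 64) for _ in range(3)])
```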
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4600. Neuron-Aware Data Selection in Instruction Tuning for Large Language Models
๐ฌ 21
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Instruction Tuning Data Selection Large Language Models
Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs).
Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most effective data subset from the IT dataset to efficiently develop either specific or general abilities in LLMs has become a critical challenge.
To address this, we propose a novel and efficient framework called Nait. Nait evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, Nait captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities.
Experimental results show that training on the 10\% Alpaca-GPT4 IT data subset selected by Nait consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.
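A hedged sketch of the selection criterion described above: build an expected neuron-activation feature from in-domain data, then rank candidate IT samples by the similarity of their activation patterns to it. How Nait actually captures, aggregates, and thresholds activations is not specified here.

```python
import torch
import torch.nn.functional as F

def activation_signature(acts: torch.Tensor) -> torch.Tensor:
    """acts: (num_samples, num_neurons) per-sample activation patterns (e.g. token-averaged
    MLP activations). The signature is the mean pattern of the target capability."""
    return acts.mean(dim=0)

def rank_candidates(candidate_acts: torch.Tensor, signature: torch.Tensor) -> torch.Tensor:
    """Rank candidate IT samples by cosine similarity to the target signature (best first)."""
    scores = F.cosine_similarity(candidate_acts, signature.unsqueeze(0), dim=-1)
    return scores.argsort(descending=True)

signature = activation_signature(torch.rand(512, 4096))      # in-domain activations (toy)
order = rank_candidates(torch.rand(10000, 4096), signature)  # then keep e.g. the top 10%
```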
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4601. SCAD: Super-Class-Aware Debiasing for Long-Tailed Semi-Supervised Learning
๐ฌ 15
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ semi-supervised learning
In long-tailed semi-supervised learning (LTSSL), pseudolabeling often creates a vicious cycle of bias amplification, a problem that recent state-of-the-art methods attempt to mitigate using logit adjustment (LA). However, their adjustment schemes, inherited from LA, remain inherently hierarchy-agnostic, failing to account for the semantic relationships between classes. In this regard, we identify a critical yet overlooked problem of intra-super-class imbalance, where a toxic combination of high semantic similarity and severe local imbalance within each super-class hinders effective LTSSL. This problem causes the model to reinforce its own errors, leading to representation overshadowing. To break this cycle, we propose Super-Class-Aware Debiasing (SCAD), a new framework that performs a dynamic, super-class-aware logit adjustment. SCAD leverages the latent semantic structure between classes to focus its corrective power on the most confusable groups, effectively resolving the local imbalances. Our extensive experiments validate that SCAD achieves new state-of-the-art performance, demonstrating the necessity of a super-class-aware approach for robust debiasing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4602. Recover Cell Tensor: Diffusion-Equivalent Tensor Completion for Fluorescence Microscopy Imaging
๐ฌ 25
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Fluorescence microscopy Cell recovery Tensor completion Conditional diffusion
Fluorescence microscopy (FM) imaging is a fundamental technique for observing live cell division, one of the most essential processes in the cycle of life and death. Observing 3D live cells requires scanning through the cell volume while minimizing lethal phototoxicity. That limits acquisition time and results in sparsely sampled volumes with anisotropic resolution and high noise. Existing image restoration methods, primarily based on inverse problem modeling, assume known and stable degradation processes and struggle under such conditions, especially in the absence of high-quality reference volumes. In this paper, from a new perspective, we propose a novel tensor completion framework tailored to the nature of FM imaging, which inherently involves nonlinear signal degradation and incomplete observations. Specifically, FM imaging with equidistant Z-axis sampling is essentially a tensor completion task under a uniformly random sampling condition. On one hand, we derive the theoretical lower bound for exact cell tensor completion, validating the feasibility of accurately recovering 3D cell tensor. On the other hand, we reformulate the tensor completion problem as a mathematically equivalent score-based generative model. By incorporating structural consistency priors, the generative trajectory is effectively guided toward denoised and geometrically coherent reconstructions. Our method demonstrates state-of-the-art performance on SR-CACO-2 and four real \textit{in vivo} cellular datasets, showing substantial improvements in both signal-to-noise ratio and structural fidelity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4603. Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
๐ฌ 17
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Learning rate schedules Large language models (LLMs)
We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT).
Decay-based learning rate schedulers are widely used to minimize pre-training loss.
However, how these schedulers affect performance after SFT remains underexplored.
In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay.
Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training.
The result also holds across different regimes with mid-training and over-training.
Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability.
These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability.
Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.
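As a toy illustration of the schedule family discussed above (not the paper's training setup; the warmup length, peak learning rate, and optimizer below are placeholder choices), a Warmup-Stable-Only schedule can be expressed as a LambdaLR in PyTorch:

```python
import torch

def wso_lambda(warmup_steps):
    """Warmup-Stable-Only: linear warmup to the peak LR, then hold it constant (no decay)."""
    def fn(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps  # linear warmup phase
        return 1.0                            # stable phase: constant peak LR
    return fn

# Minimal usage with a throwaway parameter standing in for a model.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([param], lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=wso_lambda(warmup_steps=2000))
for _ in range(5):
    opt.step()
    sched.step()
print(sched.get_last_lr())
```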
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4604. NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
๐ฌ 31
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Discrete Diffusion Sampling; Neural Indicator
Discrete diffusion language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive approaches, offering the flexibility to generate tokens in arbitrary orders and the potential of parallel decoding. However, existing heuristic sampling strategies remain inefficient: they select only a small fraction of tokens to sample at each step, leaving substantial room for improvement. In this work, we study the problem of token sampling order optimization and demonstrate its significant potential for acceleration. Specifically, we find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that utilizes a neural indicator to decide which tokens should be sampled at each step. We further propose a novel trajectory-preserving objective to train the indicator. Experiments on LLaDA and Dream models across multiple benchmarks show that our method achieves up to 14.3$\times$ acceleration over full-step sampling with negligible performance drop, and consistently outperforms confidence threshold sampling in the accuracy-step trade-off.
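For context, the confidence-threshold baseline referred to above can be sketched as follows. This is a generic illustration of threshold-based parallel decoding for masked discrete diffusion, not the proposed NI Sampling; the function name, threshold, and toy shapes are assumptions.

```python
import torch

def confidence_threshold_step(logits, tokens, mask_id, tau=0.9):
    """One decoding step: commit every still-masked position whose top predicted
    probability exceeds tau, committing at least one position to guarantee progress.
    logits: (seq_len, vocab); tokens: (seq_len,) long, with mask_id at unfilled positions."""
    conf, pred = logits.softmax(dim=-1).max(dim=-1)
    masked = tokens == mask_id
    commit = masked & (conf > tau)
    if masked.any() and not commit.any():
        idx = torch.where(masked)[0]
        commit[idx[conf[idx].argmax()]] = True  # fall back to the most confident masked slot
    out = tokens.clone()
    out[commit] = pred[commit]
    return out

# Toy usage: 6 masked positions over a 5-token vocabulary (shapes only).
print(confidence_threshold_step(torch.randn(6, 5), torch.full((6,), 4), mask_id=4, tau=0.6))
```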
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4605. JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
๐ฌ 15
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ fake news jailbreak llm multilingual
Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety.
Because fake news reflects region-specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi-lingual and regional perspective.
Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news.
However, no benchmark currently exists to systematically assess attack resilience across languages and regions.
Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation.
JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks, with approximately 300k instances.
Our evaluation of 9 LLMs reveals that the maximum attack success rate reached 86.3% and the maximum harmfulness score was 3.5 out of 5.
Notably, for English and U.S.-related topics, the defensive performance of typical multilingual LLMs was significantly higher than for other regions, highlighting substantial imbalances in safety across languages and regions.
In addition, our analysis shows that coverage of fake news in existing safety datasets is limited, and that fake news is less well defended than major categories such as toxicity and social bias.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4606. Unleashing Guidance Without Classifiers for Human-Object Interaction Animation
๐ฌ 10
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ human-object interaction human motion generation
Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on handcrafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be further enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4607. Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models
๐ฌ 14
โญ 4.50
๐ generative models
๐ท๏ธ flow matching distillation models fast likelihood evaluation fast sampling generative models
Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4608. TangleScore: Tangle-Guided Purge and Imprint for Unstructured Knowledge Editing
๐ฌ 18
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ TangleScore Unstructured Knowledge Editing LLMs Purge and Imprint
Large language models (LLMs) struggle with inaccurate and outdated information, driving the emergence of knowledge editing as a lightweight alternative. Despite their effectiveness in modifying structured knowledge, existing editing methods often fail to generalize to unstructured cases, particularly those involving inherently hard-to-edit knowledge, where the original facts tend to be more resistant to change. To address this, we propose a metric, TangleScore, that quantifies the intrinsic difficulty of editing a given knowledge instance. This difficulty, in turn, strongly correlates with the model's ability to generalize the edit to paraphrased and related prompts. Building on this insight, we introduce a TangleScore-driven method termed Purge-Imprint Patch Editing (PIPE), an editing framework that adaptively modulates the purge and imprint of knowledge based on the TangleScore of the target knowledge to be edited, adjusting the editing strength to match the instance's difficulty and thereby enabling more precise and effective model updates. Experiments applying PIPE to four LLMs of varying sizes on two unstructured knowledge editing datasets show that PIPE significantly outperforms previous editing methods by 6.49% in terms of generalization performance. Extensive evaluations show that PIPE is also effective in structured knowledge editing and remains robust under batch and sequential editing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4609. SplitLoRA: Balancing Stability and Plasticity in Continual Learning Through Gradient Space Splitting
๐ฌ 15
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning Low-Rank Adaptation Gradient orthogonal projection
Continual Learning (CL) requires a model to learn multiple tasks in sequence while maintaining both stability (preserving knowledge from previously learned tasks) and plasticity (effectively learning new tasks). Orthogonal projection has emerged as an effective and popular paradigm in CL, where it partitions the gradient space of previously learned tasks into two orthogonal subspaces: a primary subspace and a minor subspace. New tasks are learned effectively within the minor subspace, thereby reducing interference with previously acquired knowledge. However, existing orthogonal projection methods struggle to achieve an optimal balance between plasticity and stability, as it is hard to appropriately partition the gradient space. In this work, we consider a continual learning paradigm based on Low-Rank Adaptation (LoRA), which has gained considerable attention due to its efficiency and wide applicability, and propose a novel approach for continual learning, called SplitLoRA. We first provide a theoretical analysis of how subspace partitioning affects model stability and plasticity. Informed by this analysis, we then introduce an effective method that derives the optimal partition of the gradient space for previously learned tasks. This approach effectively balances stability and plasticity in continual learning. Experimental results on multiple datasets demonstrate that the proposed method achieves state-of-the-art performance. The code is available at https://anonymous.4open.science/r/SplitLoRA-FB45.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4610. A New Paradigm for Genome-wide DNA Methylation Prediction Without Methylation Input
๐ฌ 16
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ DNA Methylation Deep Learning Genome
DNA methylation (DNAm) is a key epigenetic modification that regulates gene expression and is pivotal in development and disease. However, profiling DNAm at genome scale is challenging: of ~28 million CpG sites in the human genome, only about 1--3\% are typically assayed in common datasets due to technological limitations and cost. Recent deep learning approaches, including masking-based generative Transformer models, have shown promise in capturing DNAm–gene expression relationships, but they rely on partially observed DNAm values for unmeasured CpGs and cannot be applied to completely unmeasured samples. To overcome this barrier, we introduce MethylProphet, a gene-guided, context-aware Transformer model for whole-genome DNAm inference without any measured DNAm input. MethylProphet compresses comprehensive gene expression profiles (~25K genes) through an efficient bottleneck multilayer perceptron, and encodes local CpG sequence context with a specialized DNA tokenizer. These representations are integrated by a Transformer encoder to predict site-specific methylation levels. Trained on large-scale pan-tissue whole-genome bisulfite sequencing data from ENCODE (1.6 billion CpG–sample pairs, ~322 billion tokens), MethylProphet demonstrates strong performance in hold-out evaluations, accurately inferring DNAm at unmeasured CpGs and generalizing to unseen samples. Furthermore, application to TCGA pan-cancer data (chromosome 1, 9,194 samples; ~450 million training pairs, 91 billion tokens) highlights its potential for pan-cancer whole-genome methylome imputation. MethylProphet offers a powerful and scalable foundation model for epigenetics, providing high-resolution methylation landscape reconstruction and advancing both biological research and precision medicine.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4611. Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware SSL
๐ฌ 19
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Computational Neuroscience Machine Learning for Science Neural Decoding Calcium Imaging Neural Population Heterogeneity Visual Reconstruction Scaling Self-Supervised Learning Representation Learning
Self-supervised learning (SSL) holds a great deal of promise for applications in neuroscience, due to the lack of large-scale, consistently labeled neural datasets. However, most neural datasets contain heterogeneous populations that mix stable, predictable cells with highly stochastic, stimulus-contingent ones, which has made it hard to identify consistent activity patterns during SSL. As a result, self-supervised pretraining has yet to show clear signs of benefits from scale on neural data. Here, we present POYO-SSL, a novel approach to self-supervised pretraining that exploits the heterogeneity of neural data to improve pretraining and achieve benefits of scale. Specifically, in POYO-SSL we pretrain only on predictable neurons---identified on the pretraining split via simple higher-order statistics (skewness and kurtosis)---and then fine-tune on the unpredictable population for downstream tasks. On the Allen Brain Observatory dataset, this strategy yields approximately 12--13\% relative gains over from-scratch training and exhibits smooth, monotonic scaling with model size. In contrast, existing state-of-the-art baselines plateau or destabilize as model size increases. By making predictability an explicit metric for crafting the data diet, POYO-SSL turns heterogeneity from a liability into an asset, providing a robust, biologically grounded recipe for scalable neural decoding and a path toward foundation models of neural dynamics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4612. FlowGen: Synthesizing Diverse Flowcharts to Enhance and Benchmark MLLM Reasoning
๐ฌ 23
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ flowchart parsing question answering data synthesis
Flowcharts are widely used to represent processes and relationships through intuitive visual representations. However, accurately interpreting these diagrams remains challenging due to their structural complexity and high visual diversity. Existing flowchart datasets often lack fine-grained control over key properties such as graph complexity and rendering style, limiting their utility for training and testing of multimodal large language models (MLLMs) on visual reasoning tasks. To address these limitations, we introduce FlowGen, a controllable synthesizer that generates flowcharts that have customizable structural features and supports multiple renderer backends. FlowGen enables fine-grained control over graph properties such as graph order and size, branched arrows, and nested subgraphs, facilitating systematic evaluation of MLLMs' capabilities. Extensive experiments on open-source and proprietary MLLMs show that training on FlowGen substantially improves flowchart parsing and question answering (QA), while also enhancing generalization to other public datasets. Furthermore, FlowGen provides challenging test datasets that expose consistent weaknesses in current MLLMs, particularly related to high structural complexity and varied rendering styles. Our code and data are publicly available at https://anonymous.4open.science/r/FlowGen-.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4613. SketchingReality: From Freehand Scene Sketches to Photorealistic Images
๐ฌ 11
โญ 4.50
๐ generative models
๐ท๏ธ Image Generation Freehand Sketches
Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals -- such as depth maps, edge maps, camera parameters, and reference images -- to give users finer control over generation. Among different modalities, sketches constitute a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Yet algorithms that effectively handle true freehand sketches -- with their inherent abstraction and distortions -- remain largely unexplored.
In this work, we distinguish between edge maps, often regarded as "sketches" in the literature, and genuine freehand sketches. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth, pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches in both semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4614. Unlocking Long-Horizon Agentic Search with Large-Scale End-to-End RL
๐ฌ 8
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Agentic RL; Asynchronous RL; Search Agent
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling knowledge-intensive tasks using external tools. One representative example is the search agent. Existing open-source search agents heavily rely on advanced commercial LLMs: they either collect trajectories from the larger, stronger models for supervised fine-tuning or directly use them as specialized tools. In this work, we develop ASearcher, a single-model search agent purely trained by reinforcement learning (RL) without using any commercial APIs for data or tools. Based on an RL-trained QwQ-32B model, ASearcher is capable of conducting complex reasoning, such as uncertainty analysis and conflict verification, and achieves performance comparable to commercial search agents. There are two key techniques to unlock such long-horizon information-seeking abilities: first, we design a two-staged agentic process to synthesize high-quality QA pairs as the training data for RL; second, we conduct large-scale long-horizon RL, allowing the agent to take up to 128 actions per rollout for sufficient exploration. In particular, after RL training, ASearcher achieved scores of 58.1 on GAIA, 51.1 on xBench, and 74.5 on Frames using only basic search tools. Furthermore, ASearcher also demonstrates strong zero-shot transferability: ASearcher can be further augmented with an additional summary tool, which is supported by DeepSeek-V3, and test-time scaling, which aggregates the answer from 16 parallel rollouts. With both zero-shot enhancements, the performance of ASearcher further rises to 71.8, 75.0, and 83.4, respectively, outperforming OpenAI DeepResearch and Kimi-Researcher, suggesting the great potential of RL scaling for agentic tasks. We release all the code and data at an anonymous link. The model will be released after the review process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4615. HSG-12M: A Large-Scale Dataset of Spatial Multigraphs from the Energy Spectra of non-Hermitian Crystals
๐ฌ 21
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ graph-level learning spatial network multigraph dataset generator large-scale dataset condensed matter physics non-Hermitian physics topological physics AI4Science Toeplitz matrix
AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane, termed $\textit{Hamiltonian spectral graphs}$. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce $\textbf{Poly2Graph}$ (https://anonymous.4open.science/r/iclr2026_generator-AE56): a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present $\textbf{HSG-12M}$ (https://anonymous.4open.science/r/iclr2026_dataset-2802): a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of $\textit{spatial multigraphs}$: graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This simultaneously addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics, new opportunities in geometry-aware graph learning and beyond.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4616. Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
๐ฌ 10
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Multimodality Circuit analysis Probing AI Safety Vision transformers
Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks.
In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token.
Building on these insights, we introduce Dyslexify, a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit consisting of attention heads. Without requiring finetuning, Dyslexify improves performance by up to 22.06\% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1\%, and we demonstrate its utility in a medical foundation model for skin lesion diagnosis. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4617. Theoretical Guarantees for Causal Discovery on Large Random Graphs
๐ฌ 15
โญ 4.50
๐ causal reasoning
๐ท๏ธ causality causal discovery theoretical guarantees single-variable interventions large-scale causal discovery
We investigate theoretical guarantees for the \emph{false-negative rate} (FNR), the fraction of true causal edges whose orientation is not recovered, under single-variable random interventions and an $\epsilon$-interventional faithfulness assumption that accommodates latent confounding. For sparse Erdős--Rényi directed acyclic graphs, where the edge probability scales as $p_e = \Theta(1/d)$, we show that the FNR concentrates around its mean at rate $O\bigl(\tfrac{\log d}{\sqrt d}\bigr)$, implying that large deviations above the expected error become exponentially unlikely as dimensionality increases. This concentration ensures that derived upper bounds hold with high probability in large-scale settings. Extending the analysis to generalized Barabási--Albert graphs reveals an even stronger phenomenon: when the degree exponent satisfies $\gamma > 3$, the deviation width scales as $O\bigl(d^{\beta - \frac{1}{2}}\bigr)$ with $\beta = 1/(\gamma - 1) < \frac{1}{2}$, and hence vanishes in the limit. This demonstrates that heterogeneous, heavy-tailed degree structures commonly observed in empirical networks can intrinsically regularize causal discovery by reducing variability in orientation error. These finite-dimension results provide the first dimension-adaptive, faithfulness-robust guarantees for causal structure recovery, and challenge the intuition that high dimensionality and network heterogeneity necessarily hinder accurate discovery. Our simulation results corroborate these theoretical predictions, showing that the FNR indeed concentrates and often vanishes in practice as dimensionality grows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4618. DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
๐ฌ 16
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LLM reasoning Adaptive Inference Entropy Analysis
Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. We conduct a systematic analysis across models and datasets and discover a U-shaped entropy pattern: high entropy on simple problems despite high accuracy, low entropy on medium difficulty, and high entropy on hard problems reflecting uncertainty. The 22--25\% entropy reduction from simple to optimal regions reveals a fundamental inefficiency: an \emph{overthinking} phenomenon on easy instances. Building on these insights, we introduce \textbf{DiffAdapt}, a lightweight, deployment-ready framework that predicts problem difficulty from hidden states and selects among Easy/Normal/Hard reasoning strategies to allocate computation adaptively. DiffAdapt requires no retraining of the base LLM and is compatible with common inference optimizations. Across five models and eight benchmarks, DiffAdapt achieves comparable or improved accuracy while reducing token usage by up to 22.4\%, establishing a practical path toward compute-efficient reasoning.
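As a minimal illustration of the entropy statistic described above (not DiffAdapt's difficulty predictor, which operates on hidden states; the shapes below are arbitrary), the mean per-token entropy of a trace can be read directly off the logits:

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits):
    """Average per-token entropy (in nats) over a generated reasoning trace.
    logits: (seq_len, vocab_size), one row per generated token."""
    log_probs = F.log_softmax(logits, dim=-1)
    per_token = -(log_probs.exp() * log_probs).sum(dim=-1)
    return per_token.mean().item()

# Toy usage: random logits stand in for a model's outputs on one trace.
print(mean_token_entropy(torch.randn(128, 32000)))
```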
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4619. SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
๐ฌ 12
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Theory of Mind social reasoning LLM benchmark mental state behavior judgment false belief
Large language models (LLMs) are increasingly tested for a "Theory of Mind" (ToM): the ability to attribute mental states to oneself and others. Yet most evaluations stop at explicit belief attribution in classical toy stories or stylized tasks, leaving open the questions of whether LLMs can implicitly apply such knowledge to predict human behavior, or to judge an observed behavior, in diverse scenarios. We introduce SimpleToM, a benchmark that advances ToM evaluation along two novel axes. First, it probes multiple levels of ToM reasoning, from mental state inference (explicit ToM) to behavior prediction and judgment (applied ToM). Second, it situates these tasks in diverse, everyday scenarios (such as supermarkets, hospitals, schools, and offices) where information asymmetries naturally arise (e.g., hidden defects in grocery store items, incomplete information in provider-patient interactions, or restricted access to locked devices). SimpleToM contains concise stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier."), each with three questions that test different degrees of ToM reasoning, asking models to predict: (a) mental states ("Is Mary aware of the mold?"), (b) behaviors ("Will Mary pay for the chips or report the mold?"), and (c) judgments ("Mary paid for the chips. Was that reasonable?"). Experiments reveal a striking gap: state-of-the-art models often reliably infer mental state (a), but fail at applying knowledge about the mental state for secondary predictions, with performance dropping sharply for behavior prediction (b) and further for behavior judgment (c). This exposes a critical fragility in LLMs' social reasoning in terms of what they know (explicit ToM) versus how well they can implicitly apply that knowledge for predictions (applied ToM). By uniting assessment of different levels of ToM reasoning with diverse, everyday scenarios, SimpleToM opens new opportunities for rigorously evaluating and diagnosing ToM abilities in LLMs, and reveals surprising, new insights about current model capabilities, guiding efforts toward future generations of models capable of robust social understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4620. Positional Preservation Embedding for Multimodal Large Language Models
๐ฌ 20
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ positional encoding visual token merging token clustering token compression cascade compression multimodal large language models
Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator, dubbed **P**ositional **P**reservation **E**mbedding (**PPE**), whose main hallmark is the preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces the disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering --- a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free and generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to a state-of-the-art token merging framework, PPE achieves consistent improvements of 2\%~5\% across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding) and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4621. Enhancing Molecular Property Predictions by Learning from Bond Modelling and Interactions
๐ฌ 22
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Molecule Representation Learning Bond Modelling Molecule Property Prediction
Molecule representation learning is crucial for understanding and predicting molecular properties. However, conventional atom-centric models, which treat chemical bonds merely as pairwise interactions, often overlook complex bond-level phenomena like resonance and stereoselectivity. This oversight limits their predictive accuracy for nuanced chemical behaviors. To address this limitation, we introduce \textbf{DeMol}, a dual-graph framework whose architecture is motivated by a rigorous information-theoretic analysis demonstrating the information gain from a bond-centric perspective. DeMol explicitly models molecules through parallel atom-centric and bond-centric channels. These are synergistically fused by multi-scale Double-Helix Blocks designed to learn intricate atom-atom, atom-bond, and bond-bond interactions. The framework's geometric consistency is further enhanced by a regularization term based on covalent radii to enforce chemically plausible structures. Comprehensive evaluations on diverse benchmarks, including PCQM4Mv2, OC20 IS2RE, QM9, and MoleculeNet, show that DeMol establishes a new state-of-the-art, outperforming existing methods. These results confirm the superiority of explicitly modelling bond information and interactions, paving the way for more robust and accurate molecular machine learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4622. UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
๐ฌ 24
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Unified Understanding and Generation;Vision Tokenizer; Visual Representation Learning;
The tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely $\textbf{UniFlow}$, by flexibly adapting any visual encoder with a concise reconstruction decoder.
Specifically, we introduce $\textit{layer-wise adaptive self-distillation}$ applied to the well-pretrained visual encoders, which enables UniFlow to simultaneously inherit the strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation.
Moreover, we propose a lightweight $\textit{patch-wise pixel flow decoder}$, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain.
By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency.
For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 6.05\% on average understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4623. Q-learning with Posterior Sampling
๐ฌ 13
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Theory Regret Analysis Posterior Sampling Q-learning
Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning.
In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, S, A denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
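A toy, tabular caricature of the exploration idea (Gaussian posteriors over Q-values sampled Thompson-style) is sketched below. It is not the paper's PSQL algorithm: the environment, the variance-shrinkage rule, and all constants are placeholders.

```python
import numpy as np

class TinyChain:
    """3-state, 2-action toy MDP: action 1 moves right, reaching state 2 ends the episode."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + (1 if a == 1 else 0), 2)
        done = self.s == 2
        return self.s, float(done), done

def posterior_sampling_episode(env, Q_mean, Q_var, lr=0.1, gamma=1.0, max_steps=100):
    """Q-learning where each action maximizes a sample from per-(s, a) Gaussian posteriors."""
    s = env.reset()
    for _ in range(max_steps):
        a = int(np.argmax(np.random.normal(Q_mean[s], np.sqrt(Q_var[s]))))
        s_next, r, done = env.step(a)
        target = r + (0.0 if done else gamma * Q_mean[s_next].max())
        Q_mean[s, a] += lr * (target - Q_mean[s, a])
        Q_var[s, a] *= 1.0 - lr  # heuristic: uncertainty shrinks as the pair is visited
        s = s_next
        if done:
            break

Q_mean, Q_var = np.zeros((3, 2)), np.ones((3, 2))
for _ in range(200):
    posterior_sampling_episode(TinyChain(), Q_mean, Q_var)
print(np.round(Q_mean, 2))
```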
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4624. SSVPO: Effective Step-Level Credit Assignment for RL Training of Language Models
๐ฌ 23
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Credit Assignment Reinforcement Learning Step-Level Reward Large Language Model
Language models have shown strong performance on mathematical reasoning tasks. Post-training with outcome-based reinforcement learning (RL) can further enhance reasoning but is inefficient because it relies solely on final rewards. Recent credit-assignment-based RL methods provide intermediate feedback, yet they often struggle to fairly evaluate each step's importance, especially in partially correct reasoning chains. We propose Sequential Shapley Value Policy Optimization (SSVPO), a step-level credit assignment framework inspired by multi-agent RL. SSVPO introduces an insertion MDP and Sequential Shapley Values (SSV), which measure each step's marginal contribution by reordering reasoning steps into alternative chains, ensuring fair credit assignment to all possible steps. By identifying steps with zero credit, SSVPO can shorten reasoning chains to improve training efficiency. We further provide a theoretical proof of SSV's fairness in allocating credit and demonstrate that using SSV as the new advantage baseline is consistent with Proximal Policy Optimization (PPO). Across 7 benchmarks, SSVPO outperforms state-of-the-art RL methods, both outcome-based (RLOO, GRPO, DAPO) and credit-assignment-based (VinePPO, SPO), achieving up to an 11.6\% gain in accuracy, an 18.1\% reduction in token usage, and a 1.6$\times$ improvement in reasoning efficiency over vanilla methods. Our findings highlight that SSVPO provides effective step-level credit assignment, advancing post-training LLM reasoning performance while reducing token budgets.
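For intuition about Shapley-style credit over reasoning steps, here is the classic Monte-Carlo Shapley estimator. It is the generic construction, not the paper's insertion-MDP Sequential Shapley Values; value_fn is a placeholder for any chain-level reward (e.g., 1 if the assembled chain reaches the correct answer).

```python
import random

def monte_carlo_shapley(steps, value_fn, n_perms=200, seed=0):
    """Average each step's marginal contribution to value_fn over random orderings."""
    rng = random.Random(seed)
    credit = {s: 0.0 for s in steps}
    for _ in range(n_perms):
        order = steps[:]
        rng.shuffle(order)
        prefix, prev = [], value_fn([])
        for s in order:
            prefix.append(s)
            cur = value_fn(list(prefix))
            credit[s] += cur - prev
            prev = cur
    return {s: c / n_perms for s, c in credit.items()}

# Toy usage: only chains containing both "s1" and "s3" succeed, so "s2" gets ~0 credit.
print(monte_carlo_shapley(["s1", "s2", "s3"], lambda ss: float("s1" in ss and "s3" in ss)))
```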
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4625. Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
๐ฌ 20
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Chart-to-Code Generation Reinforcement Learning
While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring deep understanding of information-rich images and structured output generation remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to produce structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies tailored to structured outputs.
In this paper, we systematically investigate the performance plateau of SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation. We construct the largest training corpus to date, with 3 million chart-code pairs curated from real-world tables in arXiv papers, addressing the limitations of previous synthetic datasets. Despite achieving state-of-the-art performance, our experiments show that simply increasing SFT data eventually leads to diminishing improvements. To break this plateau, MSRL employs a multi-granularity reward system that integrates both textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details, while at the visual level, a model-based reward assesses the structural similarity between rendered code and ground-truth charts. We implement a two-stage curriculum training strategy, first optimizing the model with textual rewards and then incorporating visual signals for further enhancement. Experimental results demonstrate that MSRL substantially breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks, respectively. Notably, our method outperforms all existing approaches in the chart domain and achieves competitive results with advanced closed-source models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4626. Discrete Diffusion Trajectory Alignment via Stepwise Decomposition
๐ฌ 17
โญ 4.50
๐ generative models
๐ท๏ธ discrete diffusion models preference optimization
Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12\% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4627. Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region
๐ฌ 16
โญ 4.50
๐ optimization
๐ท๏ธ large step size gradient descent matrix factorization convergence implicit bias chaos fractal basin boundary
We examine gradient descent in matrix factorization and show that under large step sizes the parameter space develops a fractal structure. We derive the exact critical step size for convergence in scalar-vector factorization and show that near criticality the selected minimizer depends sensitively on the initialization. Moreover, we show that adding regularization amplifies this sensitivity, generating a fractal boundary between initializations that converge and those that diverge. The analysis extends to general matrix factorization with orthogonal initialization. Our findings reveal that near-critical step sizes induce a chaotic regime of gradient descent where the long-term dynamics are unpredictable and there are no simple implicit biases, such as towards balancedness, minimum norm, or flatness.
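A minimal toy experiment in this spirit runs gradient descent on a two-parameter factorization and records whether it converges; the step size and grid of initializations below are arbitrary choices for illustration, not the critical values derived in the paper.

```python
import numpy as np

def gd_scalar_factorization(a0, b0, y=1.0, eta=1.5, steps=500):
    """Gradient descent on f(a, b) = 0.5 * (a*b - y)^2; returns final loss or inf on blow-up."""
    a, b = float(a0), float(b0)
    for _ in range(steps):
        r = a * b - y
        a, b = a - eta * r * b, b - eta * r * a  # simultaneous update
        if not (np.isfinite(a) and np.isfinite(b)):
            return np.inf
    return 0.5 * (a * b - y) ** 2

# Nearby initializations can end up in very different regimes at large step sizes.
for a0 in np.linspace(0.2, 2.0, 10):
    print(round(float(a0), 2), gd_scalar_factorization(a0, b0=1.0))
```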
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4628. SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
๐ฌ 29
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Large language models multi-turn safety benchmark jailbreak
With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess safety. Additionally, these benchmarks have not taken into account the LLM's capability to identify and handle unsafe information in detail. To address these issues, we propose a fine-grained benchmark SafeDialBench for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative auto assessment framework of LLMs, measuring their capabilities in detecting and handling unsafe information and maintaining consistency when facing jailbreak attacks. Experimental results across 19 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4629. Divergence-Free Neural Networks with Application to Image Denoising
๐ฌ 14
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ image denoising divergence Stein's unbiased risk estimate self-supervised learning incompressible vector fields
We introduce a resource-efficient neural network architecture with zero divergence by design, adapted for high-dimensional problems. Our method is directly applicable to image denoising, for which divergence-free estimators are particularly well-suited for self-supervised learning, in accordance with Stein's unbiased risk estimation theory. Comparisons of our parameterization on popular denoising datasets demonstrate that it retains sufficient expressivity to remain competitive with other divergence-based approaches, while outperforming its counterparts when the noise level is unknown and varies across the training data.
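For background, a generic Stein's-unbiased-risk-estimate training loss for a Gaussian-noise denoiser is sketched below, with the divergence estimated by a single Hutchinson probe. As the abstract suggests, if the network is divergence-free by construction the divergence term vanishes, so the objective no longer depends on the (possibly unknown) noise level; the architecture below is a placeholder, not the paper's divergence-free parameterization.

```python
import torch

def sure_loss(denoiser, y, sigma):
    """SURE for y = x + N(0, sigma^2 I):  ||h(y) - y||^2 + 2 * sigma^2 * div_y h(y) + const.
    The divergence is estimated with one Hutchinson probe; it is identically zero
    (and can be dropped) when the denoiser is divergence-free by design."""
    y = y.clone().requires_grad_(True)
    out = denoiser(y)
    residual = ((out - y) ** 2).sum()
    eps = torch.randn_like(y)
    vjp = torch.autograd.grad((out * eps).sum(), y, create_graph=True)[0]
    div_est = (vjp * eps).sum()
    return residual + 2 * sigma ** 2 * div_est

# Toy usage with a small MLP standing in for the denoiser.
net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 16))
print(sure_loss(net, torch.randn(8, 16), sigma=0.1).item())
```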
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4630. Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance
๐ฌ 18
โญ 4.50
๐ generative models
๐ท๏ธ Diffusion models Classifier-free guidance
Classifier-Free Guidance (CFG) has established the foundation for guidance mechanisms in diffusion models, showing that well-designed guidance proxies significantly improve conditional generation and sample quality.
Autoguidance (AG) has extended this idea, but it relies on an auxiliary network and leaves solver-induced errors unaddressed.
In stiff regions, where the ODE trajectory changes sharply, the local truncation error (LTE) becomes a critical factor that degrades sample quality.
Our key observation is that these errors align with the dominant eigenvector, motivating us to target the solver-induced error as a guidance signal.
We propose **E**mbedded **R**unge-**K**utta based **Guid**ance (ERK-Guid), which exploits detected stiffness to reduce LTE and stabilize sampling.
We theoretically and empirically analyze stiffness and eigenvector estimators with solver errors to motivate the design of ERK-Guid.
Our experiments on both synthetic datasets and the popular ImageNet benchmark demonstrate that ERK-Guid consistently outperforms state-of-the-art methods.
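To make the embedded local-truncation-error idea concrete, here is a textbook Heun step paired with its embedded Euler solution; their difference is the usual LTE estimate that spikes in stiff regions. This is a generic construction, not ERK-Guid itself, and the stiff test ODE and step size are arbitrary.

```python
import numpy as np

def heun_step_with_error(f, t, x, h):
    """One Heun (order-2) step plus an embedded Euler (order-1) solution; the gap
    between the two is a cheap estimate of the local truncation error."""
    k1 = f(t, x)
    euler = x + h * k1                 # embedded lower-order solution
    k2 = f(t + h, euler)
    heun = x + 0.5 * h * (k1 + k2)     # higher-order solution actually kept
    return heun, float(np.linalg.norm(heun - euler))

# Toy usage on a stiff linear ODE dx/dt = -50 x.
x, t, h = np.array([1.0]), 0.0, 0.05
for _ in range(5):
    x, err = heun_step_with_error(lambda t, x: -50.0 * x, t, x, h)
    t += h
    print(round(t, 2), x, round(err, 4))
```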
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4631. Revisiting Confidence Calibration for Misclassification Detection in VLMs
๐ฌ 22
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Vision-language models Confidence calibration Misclassification Detection
Confidence calibration has been widely studied to improve the trustworthiness of predictions in vision-language models (VLMs). However, we theoretically reveal that standard confidence calibration inherently _impairs_ the ability to distinguish between correct and incorrect predictions (i.e., Misclassification Detection, MisD), which is crucial for reliable deployment of VLMs in high-risk applications. In this paper, we investigate MisD in VLMs and propose confidence recalibration to enhance MisD. Specifically, we design a new confidence calibration objective to replace the standard one. This modification theoretically achieves higher precision in the MisD task and reduces the mixing of correct and incorrect predictions at every confidence level, thereby overcoming the limitations of standard calibration for MisD. As the calibration objective is not differentiable, we introduce a differentiable surrogate loss to enable better optimization. Moreover, to preserve the predictions and zero-shot ability of the original VLM, we develop a post-hoc framework, which employs a lightweight meta network to predict sample-specific temperature factors, trained with the surrogate loss. Extensive experiments across multiple metrics validate the effectiveness of our approach on MisD.
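A minimal sketch of the post-hoc, sample-specific temperature idea described above; the two-layer meta network, its sizes, and the softplus parameterization are placeholder choices, and the paper's surrogate loss and training procedure are not reproduced here.

```python
import torch
import torch.nn as nn

class TemperatureHead(nn.Module):
    """Lightweight meta network mapping a sample's features to a positive temperature
    that rescales the frozen model's logits post hoc."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats, logits):
        temperature = nn.functional.softplus(self.mlp(feats)) + 1e-3  # strictly positive, per sample
        return logits / temperature

# Toy usage: random features and logits stand in for a frozen VLM's outputs.
head = TemperatureHead(feat_dim=512)
print(head(torch.randn(4, 512), torch.randn(4, 10)).shape)
```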
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4632. ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection
๐ฌ 22
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Missing Person Detection UAV-based Search and Rescue Forest Environment Dataset
Detecting missing persons in forest environments remains a challenge, as dense canopy cover often conceals individuals from detection in top-down or oblique aerial imagery typically captured by Unmanned Aerial Vehicles (UAVs). While UAVs are effective for covering large, inaccessible areas, their aerial perspectives often miss critical visual cues beneath the forest canopy. This limitation underscores the need for under-canopy perspectives better suited for detecting missing persons in such environments. To address this gap, we introduce ForestPersons, a novel large-scale dataset specifically designed for under-canopy person detection. ForestPersons contains 96,482 images and 204,078 annotations collected under diverse environmental and temporal conditions. Each annotation includes a bounding box, pose, and visibility label for occlusion-aware analysis. ForestPersons provides ground-level and low-altitude perspectives that closely reflect the visual conditions encountered by Micro Aerial Vehicles (MAVs) during forest Search and Rescue (SAR) missions. Our baseline evaluations reveal that standard object detection models, trained on prior large-scale object detection datasets or SAR-oriented datasets, show limited performance on ForestPersons. This indicates that prior benchmarks are not well aligned with the challenges of missing person detection under the forest canopy. We offer this benchmark to support advanced person detection capabilities in real-world SAR scenarios. The dataset is publicly available at https://huggingface.co/datasets/anonreviewer2026/under-canopy-benchmark-anon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4633. Improving Long-Range Interactions in Graph Neural Simulators via Hamiltonian Dynamics
๐ฌ 18
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Graph Neural Simulators Long-range interactions Learning Simulators AI4Science
Learning to simulate complex physical systems from data has emerged as a promising way to overcome the limitations of traditional numerical solvers, which often require prohibitive computational costs for high-fidelity solutions. Recent Graph Neural Simulators (GNSs) accelerate simulations by learning dynamics on graph-structured data, yet often struggle to capture long-range interactions and suffer from error accumulation under autoregressive rollouts. To address these challenges, we propose Information-preserving Graph Neural Simulators (IGNS), a graph-based neural simulator built on the principles of Hamiltonian dynamics. This structure guarantees preservation of information across the graph, while extending to port-Hamiltonian systems allows the model to capture a broader class of dynamics, including non-conservative effects. IGNS further incorporates a warmup phase to initialize global context, geometric encoding to handle irregular meshes, and a multi-step training objective to reduce rollout error. To evaluate these properties systematically, we introduce new benchmarks that target long-range dependencies and challenging external forcing scenarios. Across all tasks, IGNS consistently outperforms state-of-the-art GNSs, achieving higher accuracy and stability under challenging and complex dynamical systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4634. Video-LevelGauge: Investigating Contextual Positional Bias in Video Language Models.
๐ฌ 20
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Contextual Positional Bias Video Benchmark Large Video Language Model
Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present **Video-LevelGauge**, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with bias pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini 2.5 Pro show impressive, consistent performance across entire video sequences. Further analyses on context variation, context length, model scale, and multi-modal reasoning provide insights for mitigating bias and guiding model enhancement.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4635. OWLEYE: ZERO-SHOT LEARNER FOR CROSS-DOMAIN GRAPH DATA ANOMALY DETECTION
๐ฌ 22
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ anomaly detection zero-shot Multi-domain
Graph-structured data is commonly used to represent complex relationships such as transactions between accounts, communications between devices, and dependencies among machines or processes. Correspondingly, graph anomaly detection (GAD) plays a critical role in identifying anomalies across various domains, including finance, cybersecurity, and manufacturing. Facing large-volume, multi-domain graph data, recent efforts aim to develop foundational generalist models capable of detecting anomalies in unseen graphs without retraining. To the best of our knowledge, the differing feature semantics and dimensions of cross-domain graph-structured data heavily hinder the development of such graph foundation models, and in-depth continual learning and inference in the evolving setting remain a nascent problem. To address these challenges, we propose OWLEYE, a novel zero-shot GAD framework that learns transferable patterns of normal behavior from multiple graphs. Specifically, OWLEYE first introduces a cross-domain feature alignment module to harmonize feature distributions, preserving more domain-specific semantics during alignment than the simple but widely used Principal Component Analysis. Second, with aligned features, OWLEYE designs multi-domain pattern dictionary learning to encode shared structural and attribute-based patterns, enabling continual, scalable learning and inference. Third, to achieve in-context learning ability, OWLEYE presents a truncated attention-based reconstruction module that robustly detects anomalies in unseen graph-structured data without requiring labeled data. Extensive experiments on real-world datasets demonstrate that OWLEYE achieves superior performance and generalizability compared to state-of-the-art baselines, establishing a strong foundation for scalable and label-efficient anomaly detection.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4636. DeNOTS: Stable Deep Neural ODEs for Time Series
๐ฌ 18
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ Neural ODE Time series Gaussian Processes
Neural CDEs provide a natural way to process the temporal evolution of irregular time series. The number of function evaluations (NFE) is these systems' natural analog of depth (the number of layers in traditional neural networks). It is usually regulated via solver error tolerance: lower tolerance means higher numerical precision, requiring more integration steps. However, lowering tolerances does not adequately increase the models' expressiveness. We propose a simple yet effective alternative: scaling the integration time horizon to increase NFEs and "deepen" the model. Increasing the integration interval causes uncontrollable growth in conventional vector fields, so we also propose a way to stabilize the dynamics via Negative Feedback (NF). It ensures provable stability without constraining flexibility. It also implies robustness: we provide theoretical bounds for Neural ODE risk using Gaussian process theory. Experiments on four open datasets demonstrate that our method, DeNOTS, outperforms existing approaches---including recent Neural RDEs and state space models---achieving up to 20% improvement in metrics. DeNOTS combines expressiveness, stability, and robustness, enabling reliable modelling in continuous-time domains.
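One simple way to picture a negative-feedback-stabilized vector field is the wrapper below, which subtracts a gamma * x term from a learned field so hidden-state growth stays bounded over long horizons. It is an illustrative sketch, not the DeNOTS parameterization; the field architecture, gamma, and the Euler rollout are placeholders.

```python
import torch
import torch.nn as nn

class NegativeFeedbackField(nn.Module):
    """dx/dt = f([x, t]) - gamma * x: a learned field with a linear damping term."""
    def __init__(self, field, gamma=0.5):
        super().__init__()
        self.field, self.gamma = field, gamma

    def forward(self, t, x):
        t_col = t.expand(x.shape[0], 1)  # broadcast the scalar time to the batch
        return self.field(torch.cat([x, t_col], dim=-1)) - self.gamma * x

# Toy usage: an 8-dimensional state integrated with a crude Euler scheme.
f = NegativeFeedbackField(nn.Sequential(nn.Linear(9, 32), nn.Tanh(), nn.Linear(32, 8)))
x = torch.randn(4, 8)
for k in range(20):
    x = x + 0.1 * f(torch.tensor(0.1 * k), x)
print(x.norm(dim=-1))
```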
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4637. GarmentGPT: Compositional Garment Pattern Generation via Discrete Latent Tokenization
๐ฌ 13
โญ 4.50
๐ generative models
๐ท๏ธ Garment Generation vision language models
Apparel is a fundamental component of human appearance, making garment digitalization critical for digital human creation. However, sewing pattern creation traditionally relies on the intuition and extensive experience of skilled artisans. This manual bottleneck significantly hinders the scalability of digital garment creation. Existing generative approaches either operate as data replicators without intrinsic understanding of garment construction principles (e.g., diffusion models), or struggle with low-level regression of raw floating-point coordinates (e.g., Vision-Language Models).
We present GarmentGPT, the first framework to operationalize latent space generation for sewing patterns. Our approach introduces a novel pipeline where a Residual Vector Quantizer Variational AutoEncoder (RVQ-VAE) tokenizes continuous pattern boundary curves into discrete codebook indices. A fine-tuned Vision-Language Model then autoregressively predicts these discrete token sequences instead of regressing coordinates, enabling high-level compositional reasoning. This paradigm shift aligns generation with the knowledge-driven, symbolic reasoning capabilities of large language models.
To address the data bottleneck for real-world applications, we develop a Data Curation Pipeline that synthesizes over one million photorealistic images paired with GarmentCode, and establish the Real-Garments Benchmark for comprehensive evaluation. Experiments demonstrate that GarmentGPT significantly outperforms existing methods on structured datasets (95.62\% Panel Accuracy, 81.84\% Stitch Accuracy), validating our discrete compositional paradigm's advantages.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4638. Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models
๐ฌ 15
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Forgetting-Augmented Reinforcement Learning Large Reasoning Model Answer Attribution Chain-of-Thought
Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. However, recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. To test this hypothesis, we conduct controlled experiments that challenge LRMs with misleading cues during reasoning and/or corrupted answers during retrieval. Our results across models and datasets confirm that both mechanisms operate simultaneously, with their relative dominance influenced by multiple factors: problem domains, model scales, and fine-tuning approaches (e.g., reinforcement learning vs. distillation). The findings reveal a critical limitation in current reasoning fine-tuning paradigms: models can exploit the retrieval mechanism as a shortcut, effectively "hacking" the reward signal and undermining genuine reasoning development. To address this challenge, we introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning. By carefully suppressing retrieval shortcuts during the fine-tuning process, FARL promotes reasoning-dominant behavior and enhances generalizable reasoning capabilities. The code is available at https://anonymous.4open.science/r/FARL-EF56.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4639. Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
๐ฌ 16
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Vision Language Models Geometry Perception Reinforcement Learning
Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: $+26.5\\%$ on in-domain data, $+8.0\\%$ on out-of-domain data, and $+39.0\\%$ on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code and data are released at https://anonymous.4open.science/r/GeoPerceive-9846, ensuring full reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4640. Scheduling Your LLM Reinforcement Learning with Reasoning Trees
๐ฌ 19
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model RLVR Data Scheduling
Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query's 'Reasoning Tree'. This process involves exploring nodes (tokens) and dynamically modifying the model's policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy.
However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries.
In this paper, we introduce a novel metric, namely **Reasoning Score (r-score)**, which measures the query's learning difficulty based on the structure of its reasoning tree.
Based on the r-score, we propose the **Reasoning Tree Schedule (Re-Schedule)**, a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries.
Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2\%.
These strong results validate our approach and demonstrate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4641. Reliability-Adjusted Prioritized Experience Replay
๐ฌ 5
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Deep Reinforcement Learning Temporal Difference Learning Experience Replay
Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms both uniform experience replay and PER across a diverse set of traditional environments, including several classic control environments and the Atari-10 benchmark, which approximates the median score across the Atari-57 benchmark within one percent of variance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
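As a rough illustration of the mechanism the ReaPER abstract describes, the following is a minimal sketch of prioritized replay where each transition's priority is scaled by a reliability weight. The specific reliability measure, the scaling rule, and the toy transitions are assumptions of this sketch, not the paper's definitions.

```python
# Minimal sketch of reliability-weighted prioritized replay; the reliability
# score in [0, 1] and the priority formula are illustrative placeholders.
import numpy as np

class ReliabilityPER:
    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha                      # priority exponent, as in standard PER
        self.data, self.priority = [], []

    def add(self, transition, td_error, reliability=1.0):
        # hypothetical rule: priority = (|TD error| * reliability)^alpha
        p = (abs(td_error) * reliability + 1e-6) ** self.alpha
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priority.pop(0)
        self.data.append(transition); self.priority.append(p)

    def sample(self, batch_size, rng=np.random.default_rng(0)):
        probs = np.array(self.priority) / sum(self.priority)
        idx = rng.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], probs[idx]

buffer = ReliabilityPER(capacity=1000)
for _ in range(5):
    buffer.add(("s", "a", 0.0, "s'"), td_error=np.random.randn(), reliability=0.8)
batch, probs = buffer.sample(batch_size=2)
```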
4642. Reducing Contextual Stochastic Bilevel Optimization via Structured Function Approximation
๐ฌ 11
โญ 4.50
๐ optimization
๐ท๏ธ stochastic optimization bilevel optimization contextual stochastic optimization parametrization
Contextual Stochastic Bilevel Optimization (CSBO) extends standard stochastic bilevel optimization (SBO) by incorporating context-dependent lower-level problems. CSBO problems are generally intractable since existing methods require solving a distinct lower-level problem for each sampled context, resulting in prohibitive sample and computational complexity, in addition to relying on impractical conditional sampling oracles. We propose a reduction framework that approximates the lower-level solutions using expressive basis functions, thereby decoupling the lower-level dependence on context and transforming CSBO into a standard SBO problem solvable using only joint samples from the context and noise distribution. First, we show that this reduction preserves hypergradient accuracy and yields an $\epsilon$-stationary solution to CSBO. Then, we relate the sample complexity of the reduced problem to simple metrics of the basis. This establishes sufficient criteria for a basis to yield $\epsilon$-stationary solutions with a near-optimal complexity of $\widetilde{\mathcal{O}}(\epsilon^{-3})$, matching the best-known rate for standard SBO up to logarithmic factors. Moreover, we show that Chebyshev polynomials provide a concrete and efficient choice of basis that satisfies these criteria for a broad class of problems. Empirical results on inverse and hyperparameter optimization demonstrate that our approach outperforms CSBO baselines in convergence, sample efficiency, and memory usage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
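To make the reduction idea above concrete, here is a small sketch of approximating a context-dependent lower-level solution with a Chebyshev basis fitted by least squares, so no separate inner solve is needed per context. The target function, context range, and degree are toy assumptions, not the paper's setup.

```python
# Sketch: approximate a context-dependent lower-level solution y*(c) with a
# Chebyshev basis, decoupling the lower level from the sampled context.
import numpy as np
from numpy.polynomial import chebyshev as C

def y_star(c):                       # stand-in for the exact lower-level solution
    return np.sin(3 * c) + 0.5 * c**2

contexts = np.random.uniform(-1, 1, size=200)     # sampled contexts in [-1, 1]
targets = y_star(contexts)

degree = 8
V = C.chebvander(contexts, degree)                # Chebyshev design matrix
coeffs, *_ = np.linalg.lstsq(V, targets, rcond=None)

test_c = np.linspace(-1, 1, 5)
approx = C.chebval(test_c, coeffs)                # basis-function surrogate
print(np.max(np.abs(approx - y_star(test_c))))    # approximation error
```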
4643. BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, and Rerankers
๐ฌ 23
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ zero-shot classification cross-encoder embedding models reranker benchmark classification tasks NLP
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce __BTZSC__, a comprehensive benchmark of $22$ public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing $38$ public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by _Qwen3-Reranker-8B_, set a new state-of-the-art with macro $F_1 = 0.72$; (ii) strong embedding models such as _GTE-large-en-v1.5_ substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4-12B parameters achieve competitive performance (macro $F_1$ up to $0.67$), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
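For readers unfamiliar with the embedding-model family evaluated in BTZSC, this is a minimal sketch of zero-shot classification by cosine similarity between a document embedding and label-description embeddings. The placeholder encoder and label descriptions are assumptions; a real run would use an actual text-embedding model.

```python
# Sketch of embedding-based zero-shot classification: score a document against
# human-readable label descriptions by cosine similarity.
import numpy as np

def embed(text, dim=64):
    # placeholder encoder: deterministic pseudo-embedding per string (assumption)
    return np.random.default_rng(abs(hash(text)) % (2**32)).normal(size=dim)

labels = {
    "positive sentiment": "This text expresses a favorable opinion.",
    "negative sentiment": "This text expresses an unfavorable opinion.",
}

def classify(document):
    d = embed(document)
    scores = {}
    for name, description in labels.items():
        l = embed(description)
        scores[name] = d @ l / (np.linalg.norm(d) * np.linalg.norm(l))
    return max(scores, key=scores.get), scores

print(classify("The plot was dull and the acting was worse."))
```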
4644. HOG-Diff: Higher-Order Guided Diffusion for Graph Generation
๐ฌ 20
โญ 4.50
๐ generative models
๐ท๏ธ Topology Topological Deep Learning Graph Generation Higher order Guidance
Graph generation is a critical yet challenging task as empirical analyses require a deep understanding of complex, non-Euclidean structures. Diffusion models have recently made significant achievements in graph generation, but these models are typically adapted from image generation frameworks and overlook inherent higher-order topology, leaving them ill-suited for capturing the topological properties of graphs. In this work, we propose Higher-order Guided Diffusion (HOG-Diff), a principled framework that progressively generates plausible graphs with inherent topological structures. HOG-Diff follows a coarse-to-fine generation curriculum guided by higher-order topology and implemented via diffusion bridges. We further prove that our model exhibits a stronger theoretical guarantee than classical diffusion frameworks. Extensive experiments on both molecular and generic graph generation tasks demonstrate that our method consistently outperforms or remains competitive with state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4645. Diffusion Negative Preference Optimization Made Simple
๐ฌ 10
โญ 4.50
๐ generative models
๐ท๏ธ Diffusion Models Text-to-image generation Preference Alignment
Classifier-Free Guidance (CFG) improves diffusion sampling by encouraging conditional generations while discouraging unconditional ones. Existing preference alignment methods, however, focus only on positive preference pairs, limiting their ability to actively suppress undesirable outputs. Diffusion Negative Preference Optimization (Diff-NPO) approaches this limitation by introducing a separate negative model trained with inverted labels, allowing it to capture signals for suppressing undesirable generations. However, this design comes with two key drawbacks. First, maintaining two distinct models throughout training and inference substantially increases computational cost, making the approach less practical. Second, at inference time, Diff-NPO relies on weight merging between the positive and negative models, a process that dilutes the learned negative alignment and undermines its effectiveness. To overcome these issues, we introduce Diff-SNPO, a single-network framework that jointly learns from both positive and negative preferences. Our method employs a bounded preference objective to prevent winner-likelihood collapse, ensuring stable optimization. Diff-SNPO delivers strong alignment performance with significantly lower computational overhead, showing that explicit negative preference modeling can be simple, stable, and efficient within a unified diffusion framework. Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
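The CFG mechanism referenced in the first sentence of the Diff-SNPO abstract combines conditional and unconditional noise predictions; a minimal sketch of that combination rule follows, with arrays standing in for model outputs and the guidance scale chosen arbitrarily.

```python
# Sketch of the classifier-free guidance update: push toward the conditional
# prediction and away from the unconditional one.
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    # eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.zeros(4)                       # stand-in unconditional prediction
eps_c = np.array([0.2, -0.1, 0.4, 0.0])   # stand-in conditional prediction
print(cfg_combine(eps_u, eps_c, guidance_scale=7.5))
```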
4646. POEMetric: The Last Stanza of Humanity
๐ฌ 17
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Poetry evaluation metrics poetry generation large language models dataset and benchmark
Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4647. Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors
๐ฌ 21
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Attention Mechanisms Joint Optimization Pearson Correlation Data Homogeneity
Attention-based regression models are often trained by jointly optimizing Mean Squared Error (MSE) loss and Pearson correlation coefficient (PCC) loss, emphasizing the magnitude of errors and the order or shape of targets, respectively. A common but poorly understood phenomenon during training is the *PCC plateau*: PCC stops improving early in training, even as MSE continues to decrease. We provide the first rigorous theoretical analysis of this behavior, revealing fundamental limitations in both optimization dynamics and model capacity. First, in regard to the flattened PCC curve, we uncover a critical conflict where lowering MSE (magnitude matching) can *paradoxically* suppress the PCC gradient (shape matching). This issue is exacerbated by the softmax attention mechanism, particularly when the data to be aggregated is highly homogeneous. Second, we identify a limitation in the model capacity: we derived a PCC improvement limit for *any* convex aggregator (including the softmax attention), showing that the convex hull of the inputs strictly bounds the achievable PCC gain. We demonstrate that data homogeneity intensifies both limitations. Motivated by these insights, we propose the Extrapolative Correlation Attention (ECA), which incorporates novel, theoretically-motivated mechanisms to improve the PCC optimization and extrapolate beyond the convex hull. Across diverse benchmarks, including challenging homogeneous data setting, ECA consistently breaks the PCC plateau, achieving significant improvements in correlation without compromising MSE performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
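The training setup whose plateau this paper analyzes, joint MSE plus Pearson-correlation loss, can be written compactly; below is a minimal sketch under assumed toy data and an arbitrary weighting, not the paper's ECA method.

```python
# Sketch of the joint objective: MSE plus a Pearson correlation (PCC) loss.
import torch

def pcc_loss(pred, target, eps=1e-8):
    p = pred - pred.mean()
    t = target - target.mean()
    pcc = (p * t).sum() / (p.norm() * t.norm() + eps)
    return 1.0 - pcc                          # maximize correlation

def joint_loss(pred, target, lam=0.5):
    mse = torch.nn.functional.mse_loss(pred, target)
    return mse + lam * pcc_loss(pred, target)

pred = torch.randn(32, requires_grad=True)
target = torch.randn(32)
loss = joint_loss(pred, target)
loss.backward()                               # gradients flow through both terms
print(float(loss))
```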
4648. Neologism Learning for Controllability and Self-Verbalization
๐ฌ 10
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ interpretability alignment steering
Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training with examples that exhibit the concept, with no other changes in model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, and text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, like explaining that a word representing the concept of incorrect answers means "a lack of complete, coherent, or meaningful answers...". To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
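The core training recipe described in the abstract, adding one new token embedding and updating only that row while every other parameter stays frozen, can be sketched as follows. The tiny model, toy data, and pooling choice are placeholders, not the paper's setup.

```python
# Minimal sketch of neologism learning: train only the new word's embedding row.
import torch
import torch.nn as nn

vocab_size, dim = 100, 16
model = nn.Sequential(nn.Embedding(vocab_size + 1, dim), nn.Linear(dim, 2))
new_token_id = vocab_size                      # id of the newly added word

for p in model.parameters():
    p.requires_grad = False
emb = model[0].weight
emb.requires_grad = True

def keep_only_new_row(grad):                   # zero gradients for all old rows
    mask = torch.zeros_like(grad)
    mask[new_token_id] = 1.0
    return grad * mask
emb.register_hook(keep_only_new_row)

opt = torch.optim.Adam([emb], lr=1e-2)
tokens = torch.tensor([[new_token_id, 3, 7]])  # toy example containing the new word
labels = torch.tensor([1])                     # toy concept label
for _ in range(10):
    logits = model(tokens).mean(dim=1)         # pool token features, then classify
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```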
4649. V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
๐ฌ 24
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ LVLMs Video Understanding Visual Prompt
Large Vision-Language Models (LVLMs) have made significant strides in the field of video understanding in recent times. Nevertheless, existing video benchmarks predominantly rely on text prompts for evaluation, which often require complex referential language and diminish both the accuracy and efficiency of human-model interaction in turn. To address this limitation, we propose V2P-Bench, a robust and comprehensive benchmark for evaluating the ability of LVLMs to understand Video Visual Prompts in human-model interaction scenarios. V2P-Bench consists of 980 videos and 1172 well-structured high-quality QA pairs, each paired with manually annotated visual prompt frames. The benchmark spans three main tasks and twelve categories, thereby enabling fine-grained, instance-level evaluation. Through an in-depth analysis of current LVLMs, we identify several key findings: 1) Visual prompts are both more model-friendly and user-friendly in interactive scenarios than text prompts, leading to significantly improved model performance and enhanced user experience. 2) Models are reasonably capable of zero-shot understanding of visual prompts, but struggle with spatiotemporal understanding. Even o1 achieves only 71.8%, far below the human expert score of 88.3%, while most open-source models perform below 60%. 3) LVLMs exhibit pervasive Hack Phenomena in video question answering tasks, which become more pronounced as video length increases and frame sampling density decreases, thereby inflating performance scores artificially. We anticipate that V2P-Bench will not only shed light on these challenges but also serve as a foundational tool for advancing human-model interaction and improving the evaluation of video understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4650. ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks
๐ฌ 20
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ EEG ODE Brain network continuous dynamics
Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics by discretizing time with recurrent architectures, which necessarily results in compounded cumulative prediction errors and a failure to capture the instantaneous, nonlinear characteristics of EEGs. We propose ODEBrain, a Neural ODE latent dynamic forecasting framework that overcomes these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE modeling the continuous latent dynamics. Our design ensures that the latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBrain improves significantly over existing methods in forecasting EEG dynamics, with enhanced robustness and generalization capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4651. Rethinking LLM Evaluation: Can We Evaluate LLMs with 200× Less Data?
๐ฌ 35
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Data Selection Data Pruning Large Language Model Benchmark Compression
As the demand for comprehensive evaluations of diverse model capabilities steadily increases, benchmark suites have correspondingly grown significantly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that effectively integrates these methods to ensure both prediction accuracy and ranking consistency is still largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify highly similar samples that can be eliminated. In addition, we frame benchmark compression as an optimization problem aimed at score reconstruction. Building on these, we propose EssenceBench, a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA), which combines fitness-based subset search with attribution-based sample search. Compared to previous methods, our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. In particular, on the HellaSwag benchmark (10K samples), our method preserves the ranking of all models within a 5% shift using 25$\times$ fewer samples, and achieves 95% ranking preservation within a 5% shift using only 200$\times$ fewer samples. The source code will be made available upon acceptance of the paper.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
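The fitness-based subset search mentioned in the EssenceBench abstract can be illustrated with a very small genetic algorithm that picks k benchmark items whose per-model mean score best reconstructs the full-benchmark mean. The synthetic score matrix, population size, and mutation rule are assumptions of this sketch.

```python
# Sketch of GA-based benchmark compression: evolve a k-item subset whose mean
# score per model matches the full benchmark's mean.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((20, 500))          # 20 models x 500 benchmark samples (synthetic)
full_mean = scores.mean(axis=1)
k, pop_size, gens = 25, 40, 200

def fitness(subset):                    # negative reconstruction error
    return -np.abs(scores[:, subset].mean(axis=1) - full_mean).mean()

population = [rng.choice(500, size=k, replace=False) for _ in range(pop_size)]
for _ in range(gens):
    population.sort(key=fitness, reverse=True)
    parents = population[: pop_size // 2]
    children = []
    for p in parents:
        child = p.copy()
        child[rng.integers(k)] = rng.integers(500)   # mutate: swap one item
        children.append(child if len(np.unique(child)) == k else p)
    population = parents + children
best = max(population, key=fitness)
print("reconstruction error:", -fitness(best))
```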
4652. Diffusion & Adversarial Schrödinger Bridges via Iterative Proportional Markovian Fitting
๐ฌ 18
โญ 4.50
๐ generative models
๐ท๏ธ Schrödinger Bridge Optimal Transport Entropic Optimal Transport Unpaired Learning
The Iterative Markovian Fitting (IMF) procedure, which iteratively projects onto the space of Markov processes and the reciprocal class, successfully solves the Schrödinger Bridge (SB) problem. However, an efficient practical implementation requires a heuristic modification: alternating between fitting forward and backward time diffusion at each iteration. This modification is crucial for stabilizing training and achieving reliable results in applications such as unpaired domain translation. Our work reveals a close connection between the modified version of IMF and the Iterative Proportional Fitting (IPF) procedure, a foundational method for the SB problem also known as Sinkhorn's algorithm. Specifically, we demonstrate that the heuristic modification of the IMF effectively integrates both IMF and IPF procedures. We refer to this combined approach as the Iterative Proportional Markovian Fitting (IPMF) procedure. Through theoretical and empirical analysis, we establish the convergence of the IPMF procedure under various settings, contributing to developing a unified framework for solving SB problems. Moreover, from a practical standpoint, the IPMF procedure enables a flexible trade-off between image similarity and generation quality, offering a new mechanism for tailoring models to specific tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4653. Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form
๐ฌ 14
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Continuous-time safe multi-agent reinforcement learning epigraph
Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill-suited for complex multi-agent dynamics, particularly in high-frequency or irregular time-interval settings, leading to degraded performance and motivating the development of continuous-time MARL (CT-MARL). Existing CT-MARL methods are mainly built on Hamilton-Jacobi-Bellman (HJB) equations. However, they rarely account for safety constraints such as collision penalties, since these introduce discontinuities that make HJB-based learning difficult. To address this challenge, we propose a continuous-time constrained MDP (CT-CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation. We then solve this by proposing a novel PINN-based actor-critic method that enables stable and efficient optimization in continuous time. We evaluate our approach on continuous-time safe multi-particle environments (MPE) and safe multi-agent MuJoCo benchmarks. Results demonstrate smoother value approximations, more stable training, and improved performance over safe MARL baselines, validating the effectiveness and robustness of our method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4654. DynamicInfer: Runtime-Aware Sparse Offloading for LLMs Inference on a Consumer-Grade GPU
๐ฌ 14
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Artificial Intelligence Offloading LLM inference
Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, but their enormous memory footprints pose significant challenges for deployment on consumer-grade GPUs.
Prior solutions, such as PowerInfer, combine offloading and sparse activation to reduce memory and computational overhead, but suffer from static neuron partitioning, leading to suboptimal GPU utilization and increased latency.
In this work, we present DynamicInfer, a runtime neuron offloading framework that dynamically adapts neuron scheduling based on input-dependent activation patterns. DynamicInfer introduces (1) a hierarchical neural caching strategy, (2) a load-aware neuron activation mechanism tailored to heterogeneous hardware, and (3) an activation-aware prefetching pipeline that overlaps data transfer with computation.
Extensive experiments on ReluLLaMA and Prosparse models across multiple hardware platforms demonstrate that DynamicInfer achieves up to 253\% speedup over llama.cpp and 59\% over PowerInfer, while retaining model accuracy. Our approach offers a practical and scalable solution for high-performance LLM inference on resource-constrained devices.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4655. MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
๐ฌ 18
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied AI Vision-Language-Action Models robotic manipulation
Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and $\pi_0$, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4656. Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models
๐ฌ 12
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ diffusion language model discrete diffusion masked diffusion model language model
Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose Remasking-enabled Diffusion Language Model (RemeDi), a mask-based DLM that introduces remasking as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens to be unmasked after the current step, allowing the model to identify tokens with low quality and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning which teaches the model to detect and remask incorrect tokens in addition to predict mask tokens, and reinforcement learning which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves the state-of-the-art results among open-source DLMs on multiple datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4657. Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
๐ฌ 34
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ multimodal misinformation detection vision-language models creator intent
The impact of misinformation arises not only from factual inaccuracies but also from the misleading narratives that creators deliberately embed. Interpreting such creator intent is therefore essential for multimodal misinformation detection (MMD) and effective information governance. To this end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles, created using an intent-guided simulation framework that models both the desired influence and the execution plan of news creators. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities, and supports three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. We evaluate 14 state-of-the-art vision-language models (VLMs) and find that they struggle with intent reasoning, often relying on shallow cues such as surface-level alignment, stylistic polish, or heuristic authenticity signals. These results highlight the limitations of current VLMs and position DeceptionDecoded as a foundation for developing intent-aware models that go beyond shallow cues in MMD.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4658. SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
๐ฌ 19
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ fingerprint LLM
Fingerprinting Large Language Models (LLMs) is essential for provenance verification and model attribution. Existing methods typically extract post-hoc signatures based on training dynamics, data exposure, or hyperparameters, all of which are properties that only emerge after training begins. In contrast, we propose a stronger and more intrinsic notion of LLM fingerprinting: **SeedPrints**, a method that leverages random initialization biases as persistent, seed-dependent identifiers present even before training. We show that untrained models exhibit reproducible token selection biases conditioned solely on their parameters at initialization. These biases are stable and measurable throughout training, enabling our statistical detection method to recover a model's lineage with high confidence. Unlike prior techniques, which are unreliable before convergence and vulnerable to distribution shifts, **SeedPrints** remains effective across all training stages and robust under domain shifts or parameter modifications. Experiments on LLaMA-style and Qwen-style models show that SeedPrints achieves seed-level distinguishability and can provide birth-to-lifecycle identity verification akin to a biometric fingerprint. Evaluations on large-scale pretrained models and fingerprinting benchmarks further confirm its effectiveness under practical deployment scenarios. These results suggest that initialization itself imprints a unique and persistent identity on neural language models, forming a true ``Galtonian'' fingerprint.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4659. In Context Semi-Supervised Learning
๐ฌ 22
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ semi-supervised learning Transformer In-context learning
There has been significant recent interest in understanding the capacity of Transformers for in-context learning (ICL), yet most theory focuses on supervised settings with explicitly labeled pairs. In practice, Transformers often perform well even when labels are sparse or absent, suggesting crucial structure within unlabeled contextual demonstrations. We introduce and study in-context semi-supervised learning (IC-SSL), where a small set of labeled examples is accompanied by many unlabeled points, and show that Transformers can leverage the unlabeled context to learn a robust, context-dependent representation. This representation enables accurate predictions and markedly improves performance in low-label regimes, offering foundational insights into how Transformers exploit unlabeled context for representation learning within the ICL framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4660. RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents
๐ฌ 17
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Code generation safety security
Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters, enabling dynamic execution, debugging, and interactive programming capabilities. While these advancements have streamlined complex workflows, they have also introduced critical safety and security risks. Current static safety benchmarks and red-teaming tools are inadequate for identifying emerging real-world risky scenarios, as they fail to cover certain boundary conditions, such as the combined effects of different jailbreak tools.
In this work, we propose RedCodeAgent, the first automated red-teaming agent designed to systematically uncover vulnerabilities in diverse code agents.
With an adaptive memory module, RedCodeAgent can leverage existing jailbreak knowledge and dynamically select the most effective red-teaming tools and tool combinations from a tailored toolbox for a given input query, thus identifying vulnerabilities that might otherwise be overlooked.
For reliable evaluation, we develop simulated sandbox environments to additionally evaluate the execution results of code agents, mitigating potential biases of LLM-based judges that only rely on static code.
Through extensive evaluations across multiple state-of-the-art code agents, diverse risky scenarios, and various programming languages, RedCodeAgent consistently outperforms existing red-teaming methods, achieving higher attack success rates and lower rejection rates with high efficiency. We further validate RedCodeAgent on real-world code assistants, e.g., Cursor and Codeium, exposing previously unidentified security risks. By automating and optimizing red-teaming processes, RedCodeAgent enables scalable, adaptive, and effective safety assessments of code agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4661. Learning Flexible Forward Trajectories for Masked Molecular Diffusion
๐ฌ 18
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Molecule Generation Masked Diffusion Models Molecule Diffusion Models
Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and present the surprising result that naively applying standard MDMs to molecules leads to severe performance degradation. We trace this critical issue to *state-clashing*, where the forward diffusion trajectories of distinct molecules collapse into a common state, resulting in a mixture of reconstruction targets that cannot be learned by a typical reverse diffusion with unimodal predictions. To mitigate this, we propose **M**asked **E**lement-wise **L**earnable **D**iffusion (**MELD**), which orchestrates per-element corruption trajectories to avoid collisions between different molecular graphs. This is realized through a parameterized noise scheduling network that learns distinct corruption rates for individual graph elements, *i.e.*, atoms and bonds. Across extensive experiments, **MELD** is the first diffusion-based molecular generator to achieve 100% chemical validity in unconditional generation on the QM9 and ZINC250K datasets, while markedly improving distributional and property alignment over standard MDMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4662. FreeAdapt: Unleashing Diffusion Priors for Ultra-High-Definition Image Restoration
๐ฌ 4
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Latent Diffusion Models; Ultra-High-Definition Image Restoration; Training-Free; Plug-and-Play; Resolution Adaptation
Latent Diffusion Models (LDMs) have recently shown great potential for image restoration owing to their powerful generative priors. However, directly applying them to ultra-high-definition image restoration (UHD-IR) often results in severe global inconsistencies and loss of fine-grained details, primarily caused by patch-based inference and the information bottleneck of the VAE. To overcome these issues, we present FreeAdapt, a plug-and-play framework that unleashes the capability of diffusion priors for UHD-IR. The core of FreeAdapt is a training-free Frequency Feature Synergistic Guidance (FFSG) mechanism, which introduces guidance at each denoising step during inference time. It consists of two modules: 1) Frequency Guidance (FreqG) selectively fuses phase information from a reference image in the frequency domain to enforce structural consistency across the entire image; 2) Feature Guidance (FeatG) injects global contextual information into the self-attention layers of the U-Net, effectively suppressing unrealistic textures in smooth regions and preserving local detail fidelity. In addition, FreeAdapt includes an optional VAE fine-tuning module, where skip connection further enhances the reconstruction of fine-grained textures. Extensive experiments demonstrate that our method achieves superior quantitative performance and visual quality compared to state-of-the-art UHD-IR approaches, and consistently delivers strong gains across multiple LDM-based backbones.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4663. Latent Wasserstein Adversarial Imitation Learning
๐ฌ 26
โญ 4.50
๐ reinforcement learning
๐ท๏ธ adversarial imitation learning wasserstein distance latent state space
Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4664. EvA: Evolutionary Attacks on Graphs
๐ฌ 9
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Adversarial Attack Evolutionary Algorithm graph neural network
Even a slight perturbation in the graph structure can cause a significant drop in the accuracy of graph neural networks (GNNs). Most existing attacks leverage gradient information to perturb edges. This relaxes the attack's optimization problem from a discrete to a continuous space, resulting in solutions far from optimal. It also prevents the adaptability of the attack to non-differentiable objectives. Instead, we introduce a few simple, yet effective, enhancements of an evolutionary-based algorithm to solve the discrete optimization problem directly. Our Evolutionary Attack (EvA) works with any black-box model and objective, eliminating the need for a differentiable proxy loss. This allows us to design two novel attacks that reduce the effectiveness of robustness certificates and break conformal sets. EvA uses sparse representations to significantly reduce memory requirements and scale to larger graphs. We also introduce a divide-and-conquer strategy that improves both EvA and existing gradient-based attacks. In our experiments, EvA shows a $\sim$11\% additional drop in accuracy on average compared to the best previous attack, revealing significant untapped potential in designing attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
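To convey the flavor of an evolutionary, black-box structure attack like the one EvA describes, the following is a minimal sketch that evolves small sets of edge flips to minimize a black-box objective. The objective function, graph, budget, and mutation scheme are placeholders, not the paper's algorithm.

```python
# Sketch of an evolutionary edge-flip attack against a black-box graph objective.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, budget, pop_size, gens = 30, 5, 32, 100
adj = (rng.random((n_nodes, n_nodes)) < 0.1).astype(float)
adj = np.triu(adj, 1) + np.triu(adj, 1).T               # symmetric, no self-loops

def black_box_score(a):                                 # placeholder objective
    return float(np.trace(a @ a @ a))                   # closed-walk count as a stand-in

def apply_flips(a, flips):
    a = a.copy()
    for i, j in flips:
        a[i, j] = a[j, i] = 1.0 - a[i, j]
    return a

def random_flips():
    return [tuple(sorted(rng.choice(n_nodes, 2, replace=False))) for _ in range(budget)]

population = [random_flips() for _ in range(pop_size)]
for _ in range(gens):
    population.sort(key=lambda f: black_box_score(apply_flips(adj, f)))  # minimize
    parents = population[: pop_size // 2]
    children = []
    for p in parents:
        child = list(p)
        child[rng.integers(budget)] = tuple(sorted(rng.choice(n_nodes, 2, replace=False)))
        children.append(child)
    population = parents + children
best = population[0]
print("score before/after:", black_box_score(adj), black_box_score(apply_flips(adj, best)))
```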
4665. SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams
๐ฌ 36
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Stereo depth estimation Neuromorphic camera Recurrent spiking neural network Computer vision
Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4666. Sample Complexity and Representation Ability of Test-time Scaling Paradigms
๐ฌ 8
โญ 4.50
๐ learning theory
๐ท๏ธ Large Language Models Test-time Scaling Sample Complexity Representation Theory
Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies---such as self-consistency, best-of-$n$, and self-correction---remains limited.
In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers.
Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings.
Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
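The separation between self-consistency and best-of-$n$ stated above can be felt with a toy simulation: majority voting over a synthetic answer distribution with a small correct-vs-runner-up gap versus best-of-$n$ with an oracle verifier. The distribution and trial counts are assumptions for illustration; the simulation does not reproduce the paper's exact rates.

```python
# Toy comparison of self-consistency (majority vote) vs best-of-n with an
# oracle verifier on a synthetic answer distribution.
import numpy as np

rng = np.random.default_rng(0)
answers = np.array([0, 1, 2])              # 0 = correct answer
probs = np.array([0.36, 0.32, 0.32])       # gap Delta = 0.04 over the runner-up
trials = 2000

def self_consistency(n):
    draws = rng.choice(answers, size=(trials, n), p=probs)
    votes = np.array([np.bincount(row, minlength=3) for row in draws])
    return float(np.mean(votes.argmax(axis=1) == 0))

def best_of_n(n):                          # oracle verifier: succeed if any draw is correct
    draws = rng.choice(answers, size=(trials, n), p=probs)
    return float(np.mean((draws == 0).any(axis=1)))

for n in [1, 4, 16, 64, 256]:
    print(n, round(self_consistency(n), 3), round(best_of_n(n), 3))
```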
4667. Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
๐ฌ 12
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Large Language Model Reinforcement Learning Process Supervision
Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm compared to outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, both in terms of branching positions and sampling. In this paper, we introduce a novel PSRL framework (AttnRL), which enables efficient exploration for reasoning models. Motivated by preliminary observations that steps exhibiting high attention scores correlate with reasoning behaviors, we propose to branch from positions with high attention values. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in terms of performance as well as sampling and training efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4668. An Information-Theoretical Framework For Optimizing Experimental Design To Distinguish Probabilistic Neural Codes
๐ฌ 17
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Bayesian brain hypothesis perceptual decision-making uncertainty representations probabilistic neural codes probabilistic population code neural sampling code information theory experiment optimization
The Bayesian brain hypothesis has been a leading theory in understanding perceptual decision-making under uncertainty.
While extensive psychophysical evidence supports the notion of the brain performing Bayesian computations, how uncertainty information is encoded in sensory neural populations remains elusive.
Specifically, two competing hypotheses propose that early sensory populations encode either the likelihood function (exemplified by probabilistic population codes) or the posterior distribution (exemplified by neural sampling codes) over the stimulus, with the key distinction lying in whether stimulus priors would modulate the neural responses.
However, experimentally differentiating these two hypotheses has remained challenging, as it is unclear what task design would effectively distinguish the two.
In this work, we present an information-theoretical framework for optimizing the task stimulus distribution that would maximally differentiate competing probabilistic neural codes.
To quantify how distinguishable the two probabilistic coding hypotheses are under a given task design, we derive the *information gap*---the expected performance difference when likelihood versus posterior decoders are applied to neural populations---by evaluating the KL divergence between the true posterior and a task-marginalized surrogate posterior.
Through extensive simulations, we demonstrate that the information gap accurately predicts decoder performance differences across diverse task settings.
Critically, maximizing the information gap yields stimulus distributions that optimally differentiate likelihood and posterior coding hypotheses.
Our framework enables principled, theory-driven experimental designs with maximal discriminative power to differentiate probabilistic neural codes, advancing our understanding of how neural populations represent and process sensory uncertainty.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4669. Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning
๐ฌ 19
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Fine-tuning (SFT DPO SimPO); Data selection; Holdout loss; In-context learning; Gradient reweighting
Fine-tuning large pretrained language models is a common approach for aligning them with human preferences, but noisy or off-target examples can dilute supervision. While small, well-chosen datasets often match the performance of much larger ones, systematic and efficient ways to identify high-value training data remain underexplored. Many current methods rely on heuristics or expensive retraining. We present a theoretically grounded, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example by conditioning on a small, curated holdout set in context. ICA requires no reference model and no additional finetuning. Under a local linearization, ICA is equivalent to a first-order update toward the holdout optimum, motivating its use as a proxy for data value. We derive per-example weights from ICA scores, dynamically reweighting gradient updates as model parameters evolve. Across SFT, DPO, and SimPO, and over diverse backbones and datasets, ICA-based reweighting consistently improves model alignment with minimal overhead. We analyze sensitivity to score update frequency and the choice of $k$ holdout examples for in-context demonstrations, and note limitations for rapidly drifting on-policy updates, highlighting directions for future work. Code and prompts will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4670. SCoT: Teaching 3D-LLMs to Think Spatially with Million-scale CoT Annotations
๐ฌ 13
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Large Language Model Chain-of-Thought Spatial Perception Spatial Analysis Spatial Planning
Recent advances in 3D Large Language Models (3D-LLMs) show strong potential in understanding and interacting with 3D environments, yet their training data typically lack explicit reasoning processes, limiting complex spatial reasoning and task planning.
To address this, we annotate SCoT, a million-scale Chain-of-Thought dataset spanning three levels: a) Spatial Perception (what is there), recognizing object properties, relations, and scene attributes; b) Spatial Analysis (what does it mean), inferring rationality, functionalities, and physical implications; c) Spatial Planning (what should I do), integrating perception and reasoning for actionable strategies. Unlike prior datasets supervising only answers, SCoT annotates intermediate reasoning grounded in scene cues, specifically for analysis and planning tasks. Results show that CoT supervision greatly benefits complex analysis and planning but induces hallucinations and accuracy drops in simple perception. These findings highlight both the necessity and the nuanced challenges of scene-grounded reasoning for advancing 3D intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4671. Buckingham $\pi$-Invariant Test-Time Projection for Robust PDE Surrogate Modeling
๐ฌ 20
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Buckingham-pi PDE model-agnostic
PDE surrogate models such as FNO and PINN struggle to predict solutions across inputs with diverse physical units and scales, limiting their out-of-distribution (OOD) generalization. We propose a $\pi$-invariant test-time projection that aligns test inputs with the training distribution by solving a log-space least squares problem that preserves Buckingham $\pi$-invariants. For PDEs with multidimensional spatial fields, we use geometric representative $\pi$-values to compute distances and project inputs, overcoming degeneracy and singular points that limit prior $\pi$-methods. To accelerate projection, we cluster the training set into K clusters, reducing the complexity from O(MN) to O(KN) for the M training and N test samples. Across wide input scale ranges, tests on 2D thermal conduction and linear elasticity achieve an average MAE reduction of up to $\approx\!91\%$ with minimal overhead. This training-free, model-agnostic method is expected to apply to more diverse PDE-based simulations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
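An illustrative take on the log-space projection idea in this abstract: move a test input's log-parameters toward a training reference while keeping its Buckingham $\pi$-groups unchanged, via a minimum-norm constrained least-squares correction. The exponent matrix, reference point, and test case below are toy assumptions, not the paper's setup.

```python
# Sketch: project log-parameters onto a training reference subject to
# A @ y = const, where rows of A are the exponents of dimensionless pi-groups.
import numpy as np

A = np.array([[1.0, -1.0, 0.0],        # each row: one dimensionless group (assumed)
              [0.0,  1.0, -2.0]])
y_test = np.log(np.array([50.0, 3.0, 0.2]))   # log physical parameters of a test case
y_ref  = np.log(np.array([ 1.0, 1.0, 1.0]))   # log-space center of the training data

b = A @ y_test                                 # pi-invariants that must be preserved
# minimum-norm correction d with A @ (y_ref + d) = b
d, *_ = np.linalg.lstsq(A, b - A @ y_ref, rcond=None)
y_proj = y_ref + d

print("pi preserved:", np.allclose(A @ y_proj, b))
print("projected parameters:", np.exp(y_proj))
```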
4672. Steering and Rectifying Latent representation manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection
๐ฌ 14
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Anomaly Detection; Multi-modal LLM
Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, thus some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited as they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, leading to difficulties in handling subtle or ambiguous anomalies. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first leverages the gradient-free representational separability analysis (RSA) to identify top attention heads as latent anomaly experts (LAEs) which are most discriminative for VAD. Then a hierarchical meta-controller (HMC) generates dynamic rectification signals by jointly conditioning on global context and these LAE outputs. The signals execute targeted, anisotropic scaling directly upon the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate our method achieves state-of-the-art performance among tuning-free approaches requiring only 1\% of training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon the publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4673. Superficial Safety Alignment Hypothesis
๐ฌ 17
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Model Safety Alignment
As large language models (LLMs) become increasingly integrated into various applications, ensuring they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following but have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction - fulfill or refuse users' requests - interpreted as an implicit binary classification task. Through SSAH, we hypothesize that only a few essential components can establish safety guardrails in LLMs. We successfully identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Similarly, we show that leveraging redundant units in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. Taken together, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated.
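The "freeze safety-critical components during fine-tuning" idea can be illustrated with a small gradient-masking sketch. This is an assumption-laden toy (PyTorch, a single linear layer, hypothetical neuron indices), not the paper's procedure for identifying SCUs.

```python
# Minimal sketch (assumptions: PyTorch, a linear layer whose output neurons
# indexed by `safety_critical` were flagged by some external analysis).
# Their gradient rows are zeroed, so fine-tuning cannot move them while the
# remaining "redundant" units stay trainable.
import torch
import torch.nn as nn

layer = nn.Linear(16, 8)
safety_critical = torch.tensor([0, 3, 5])          # hypothetical neuron indices

mask = torch.ones(8, 1)
mask[safety_critical] = 0.0

# Zero the frozen neurons' gradients after every backward pass.
layer.weight.register_hook(lambda g: g * mask)
layer.bias.register_hook(lambda g: g * mask.squeeze(1))

x = torch.randn(4, 16)
loss = layer(x).pow(2).mean()
loss.backward()
print(layer.weight.grad[safety_critical].abs().sum())  # tensor(0.)
```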
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4674. Scaling Laws and Symmetry, Evidence from Neural Force Fields
๐ฌ 14
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ compute-optimal scaling laws geometric deep learning interatomic potentials
We present an empirical study in the geometric task of learning interatomic potentials, which shows equivariance matters even more at larger scales; we show a clear power-law scaling behaviour with respect to data, parameters and compute with "architecture-dependent exponents". In particular, we observe that equivariant architectures, which leverage task symmetry, scale better than non-equivariant models. Moreover, among equivariant architectures, higher-order representations translate to better scaling exponents. Our analysis also suggests that for compute-optimal training, the data and model sizes should scale in tandem regardless of the architecture. At a high level, these results suggest that, contrary to common belief, we should not leave it to the model to discover fundamental inductive biases such as symmetry, especially as we scale, because they change the inherent difficulty of the task and its scaling laws.
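A rough sketch of how architecture-dependent exponents of this kind are typically estimated: fit a power law per architecture by linear regression in log-log space. The run data below are invented placeholders, not the paper's measurements.

```python
# Sketch, not the paper's pipeline: estimate a power-law exponent
# loss ~ a * compute**(-b) by linear regression in log-log space,
# one fit per architecture. All numbers are made-up placeholders.
import numpy as np

runs = {
    "equivariant":     {"compute": [1e17, 1e18, 1e19], "loss": [0.30, 0.17, 0.10]},
    "non_equivariant": {"compute": [1e17, 1e18, 1e19], "loss": [0.35, 0.25, 0.18]},
}

for arch, r in runs.items():
    slope, intercept = np.polyfit(np.log(r["compute"]), np.log(r["loss"]), 1)
    print(f"{arch}: exponent b = {-slope:.3f}")   # larger b = better scaling
```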
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4675. Learning from Label Proportions via Proportional Value Classification
๐ฌ 5
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Learning from label proportions weakly supervised learning.
Learning from Label Proportions (LLP) aims to use bags of instances associated with the proportions of each label within the bag to learn an instance-level classifier. Proportion matching is a widely used strategy that aligns the average model outputs of all instances in a bag with the label proportions in order to induce the classifier. However, simply fitting the label proportion may cause over-smoothing problems and does not guarantee correct label prediction of individual instances, resulting in poor classification performance. In this paper, we propose a novel LLP approach that can mitigate the over-smoothing problems with theoretical guarantees. Rather than fitting the label proportions directly, we treat them as targets for an auxiliary proportional value classification task to induce the target classifier. Our approach only requires the incorporation of an aggregation function after the classification layer. We also introduce an efficient computational approach with a divide-and-conquer strategy. Extensive experiments on various image and text benchmark datasets demonstrate that our approach achieves superior performance against state-of-the-art LLP methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4676. Inconsistency Biases in Dynamic Data Pruning
๐ฌ 9
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ dynamic data pruning efficient training
Dynamic data pruning accelerates training by focusing on informative samples. However, comparing importance scores across different model states introduces inconsistency (score context drift), and variable selection rates bias gradient dynamics over time (temporal gradient bias). We introduce RePB (Resolving Pruning Biases), a framework addressing these issues. RePB performs pruning decisions within local windows (short sequences of batches) during training, using loss scores computed with a near-constant model state within each window to ensure valid comparisons. These decisions determine the data subset used in the subsequent training phase. To counteract temporal gradient bias arising from non-uniform sample inclusion, cumulative temporal rescaling reweights sample losses during training based on their historical selection frequency. We provide theoretical grounding for RePB's consistency in score comparison and gradient alignment. Experiments show RePB achieves near-full-dataset accuracy using reduced data (most above 30%) across 16 datasets, 17 models and 13 tasks, offering a robust and scalable approach to efficient deep learning.
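The cumulative temporal rescaling idea can be sketched as a simple inverse-frequency weighting of sample losses. The helper below is a hypothetical illustration of that mechanism, not RePB's exact formula.

```python
# Hedged sketch of the reweighting idea: samples that were selected less
# often in earlier pruning windows get up-weighted so their cumulative
# gradient contribution stays closer to uniform. Not the authors' code.
import numpy as np

def temporal_rescaling_weights(selection_counts, total_rounds, eps=1e-8):
    """selection_counts[i]: how many of the `total_rounds` pruning rounds
    included sample i. Weight is inversely proportional to inclusion rate."""
    inclusion_rate = (selection_counts + eps) / total_rounds
    w = 1.0 / inclusion_rate
    return w / w.mean()          # normalize so the average weight is 1

counts = np.array([10, 5, 2, 10])        # hypothetical selection histories
print(temporal_rescaling_weights(counts, total_rounds=10))
```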
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4677. Adaptive Mixture of Disentangled Experts for Dynamic Graphs under Distribution Shifts
๐ฌ 21
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Dynamic Graph Neural Network; Out of Distribution Generalization; Mixture of Experts
Dynamic graph representation learning under distribution shifts has drawn an increasing amount of attention in the research community, given its wide applicability in real-world scenarios. Existing methods typically employ a fixed-architecture design to extract invariant patterns. However, there may exist evolving distribution shifts in dynamic graphs, leading to suboptimal performance of fixed-architecture designs. To address this issue, we propose a novel adaptive-architecture design to handle evolving distribution shifts over time, to the best of our knowledge, for the first time. The proposed adaptive-architecture design introduces an adaptive mixture of architecture experts to capture invariant patterns under evolving distribution shifts, which imposes three challenges: 1) How to detect and characterize evolving distribution shifts to inform architectural decisions; 2) How to dynamically route different expert architectures to handle varying distribution characteristics; 3) How to ensure that the adaptive mixture of experts effectively discovers invariant patterns. To solve these challenges, we propose a novel \underline{\textbf{Ada}}ptive \underline{\textbf{Mix}}ture of Disentangled Experts (AdaMix) model to adaptively route architecture experts to varying distribution shifts and jointly learn spatio-temporal invariant patterns. Specifically, we propose a spatio-temporal distribution detector to infer evolving distribution shifts by jointly leveraging historical and current information. Building upon this, we develop a prototype-guided mixture of disentangled experts that adaptively routes experts with disentangled factors to different distribution shifts. Finally, we design a distribution-aware intervention mechanism that discovers invariant patterns based on expert selection of nodes. Extensive experiments on both synthetic and real-world datasets demonstrate that our proposed (AdaMix) model significantly outperforms state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4678. K²-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control
๐ฌ 28
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ LLM/VLM Agent Self-Evolving Agent Mobile Device Control Decision-Making
Existing mobile device control agents often perform poorly when solving complex tasks requiring long-horizon planning and precise operations, typically due to a lack of relevant task experience or unfamiliarity with skill execution. We propose \textbf{K²-Agent}, a hierarchical framework that models human-like cognition by separating and co-evolving declarative ("knowing what") and procedural ("knowing how") knowledge for planning and execution. K²-Agent's high-level reasoner is bootstrapped from a single demonstration per task and runs a Summarize-Reflect-Locate-Revise (SRLR) loop to distill and iteratively refine task-level declarative knowledge through self-evolution. The low-level executor is trained with our curriculum-guided Group Relative Policy Optimization (C-GRPO), which (i) constructs a balanced sample pool using decoupled reward signals and (ii) employs dynamic demonstration injection to guide the model in autonomously generating successful trajectories for training. On the challenging AndroidWorld benchmark, K²-Agent achieves a new $\textbf{state of the art}$ with $\textbf{76.1\% success rate}$, ranking $\textbf{1st}$ among all methods $\textbf{using only raw screenshots and open-source backbones}$. Furthermore, K²-Agent shows powerful dual generalization: its high-level declarative knowledge transfers across diverse base models, while its low-level procedural skills achieve competitive performance on unseen tasks in ScreenSpot-v2 and Android-in-the-Wild (AitW).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4679. Text summarization via global structure awareness
๐ฌ 24
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Text summarization Topological Data Analysis natural language processing
Text summarization is a core task in natural language processing (NLP). With the rapid growth of information, handling long documents has become increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, this paper introduces GloSA-sum, the first summarization method based on global structure awareness using topological data analysis (TDA). The method summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a ``protection pool'' as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4680. Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration
๐ฌ 19
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multi-domain all-in-one image restoration prompt learning low-level computer vision
Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scenes, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities. We believe that this work provides a new research paradigm and represents a step towards more unified image restoration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4681. EditLens: Quantifying the Extent of AI Editing in Text
๐ฌ 11
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI detection authorship attribution human-AI collaboration
A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
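A lightweight similarity metric of the kind described can be as simple as a sequence-matcher ratio between the original and edited text; the sketch below is an illustrative stand-in (standard-library only), not necessarily the metrics EditLens uses.

```python
# Illustrative stand-in for a "lightweight similarity metric" between the
# original human text and its (possibly AI-)edited version; the paper's
# exact metrics may differ. Uses only the Python standard library.
import difflib

def edit_similarity(original: str, edited: str) -> float:
    """Returns a ratio in [0, 1]; lower values suggest heavier editing."""
    return difflib.SequenceMatcher(None, original.split(), edited.split()).ratio()

human = "the experiment was run three times and results were averaged"
edited = "we ran the experiment three times and averaged the results"
print(f"similarity = {edit_similarity(human, edited):.2f}")
```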
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4682. A Unifying Framework for Causal Imitation Learning with Hidden Confounders
๐ฌ 14
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Imitation Learning Hidden Confounders Causal Inference Reinforcement Learning
We propose a general framework for causal Imitation Learning (IL) with hidden confounders, which subsumes several existing settings. Our framework accounts for two types of hidden confounders: (a) variables observed by the expert but not by the imitator, and (b) confounding noise hidden from both. By leveraging trajectory histories as instruments, we reformulate causal IL in our framework into a Conditional Moment Restriction (CMR) problem. We propose DML-IL, an algorithm that solves this CMR problem via instrumental variable regression, and upper bound its imitation gap. Empirical evaluation on continuous state-action environments, including Mujoco tasks, demonstrates that DML-IL outperforms existing causal IL baselines.
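Instrumental-variable regression is the core primitive here; a toy two-stage least-squares example shows how an instrument recovers a causal coefficient that ordinary regression gets wrong under hidden confounding. This is a simplified stand-in for the paper's CMR machinery, using synthetic data and invented coefficients.

```python
# Toy two-stage least squares (2SLS) sketch of instrumental-variable
# regression, a simplified stand-in for the CMR machinery used by DML-IL.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                            # instrument (e.g., history)
u = rng.normal(size=n)                            # hidden confounder
x = 0.8 * z + u + 0.1 * rng.normal(size=n)        # confounded covariate
y = 2.0 * x + 3.0 * u + 0.1 * rng.normal(size=n)  # true effect of x is 2.0

# Stage 1: regress x on z; Stage 2: regress y on the fitted x_hat.
x_hat = z * (np.dot(z, x) / np.dot(z, z))
beta_iv = np.dot(x_hat, y) / np.dot(x_hat, x_hat)
beta_ols = np.dot(x, y) / np.dot(x, x)
print(f"OLS (biased): {beta_ols:.2f}, 2SLS (de-confounded): {beta_iv:.2f}")
```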
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4683. Query-Level Uncertainty in Large Language Models
๐ฌ 25
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Frugal AI Uncertainty Quantification
It is important for Large Language Models (LLMs) to be aware of the boundary of their knowledge, i.e., the mechanism of identifying known and unknown queries. This type of awareness enables models to perform adaptive inference, such as invoking retrieval-augmented generation (RAG), engaging in slow and deep thinking, or abstaining from answering when appropriate. These mechanisms are beneficial to the development of efficient and trustworthy AI.
In this work, we propose a method to detect knowledge boundaries via \textbf{\emph{Query-Level Uncertainty}}, which estimates whether a model is capable of answering a given query before generating any tokens.
To this end, we propose a novel, training-free method called \textbf{\emph{Internal Confidence}}, which leverages self-evaluations across layers and tokens to provide a reliable signal of uncertainty.
Empirical studies on both factual question answering and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for adaptive inference, such as efficient RAG and model cascading, thereby reducing inference costs while preserving overall performance.
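As a rough illustration of a pre-generation, layer-and-token aggregated confidence signal, the sketch below applies a hypothetical linear probe to hidden states and averages the resulting probabilities with a layer weighting; the probe, weighting, and routing threshold are assumptions, not the paper's exact Internal Confidence definition.

```python
# Toy sketch of a layer-and-token aggregated confidence score computed
# before generation. The linear probe, layer weighting, and threshold are
# illustrative assumptions, not the paper's formulation.
import torch

def internal_confidence(hidden_states, probe_w, probe_b):
    """hidden_states: (num_layers, seq_len, d). Returns a scalar in (0, 1)."""
    logits = hidden_states @ probe_w + probe_b          # (layers, seq_len)
    per_layer = torch.sigmoid(logits).mean(dim=1)       # average over query tokens
    layer_weights = torch.linspace(0.5, 1.5, per_layer.numel())
    layer_weights = layer_weights / layer_weights.sum() # trust later layers more
    return (layer_weights * per_layer).sum()

h = torch.randn(12, 7, 64)                 # hypothetical: 12 layers, 7 tokens, d=64
conf = internal_confidence(h, torch.randn(64), torch.zeros(()))
route_to_rag = conf < 0.5                  # abstain / retrieve when confidence is low
print(float(conf), bool(route_to_rag))
```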
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4684. AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
๐ฌ 22
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Sparse Autoencoder Mechanistic Interpretability
Sparse autoencoders (SAEs) are a cornerstone of interpretability for large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there remains no principled framework to derive SAEs from the original dictionary learning formulation. In this work, we introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation of existing SAEs: their sparsity-inducing regularizers enforce non-negativity, preventing a single feature from representing bidirectional concepts (e.g., male vs. female). This structural constraint fragments semantic axes into separate, redundant features, limiting representational completeness. To address this issue, we propose AbsTopK SAE, a new variant derived from the $\ell_0$ sparsity constraint that applies hard thresholding over the largest-magnitude activations. By preserving both positive and negative activations, AbsTopK uncovers richer, bidirectional conceptual representations. Comprehensive experiments across four LLMs and seven probing and steering tasks show that AbsTopK improves reconstruction fidelity, enhances interpretability, and enables single features to encode contrasting concepts. Remarkably, AbsTopK matches or even surpasses the Difference-in-Mean method, a supervised approach that requires labeled data for each concept and has been shown in prior work to outperform SAEs.
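The AbsTopK activation itself is easy to state in code: keep the k pre-activations with the largest magnitude, preserving their signs, and zero the rest. A minimal PyTorch sketch (not the authors' implementation):

```python
# Minimal sketch of an AbsTopK activation for a sparse autoencoder encoder:
# keep the k pre-activations with the largest |value| (sign preserved) and
# zero everything else. Standard TopK would instead keep only the k largest
# (hence non-negative) activations.
import torch

def abs_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """pre_acts: (batch, n_features). Hard-threshold to the k largest magnitudes."""
    _, idx = pre_acts.abs().topk(k, dim=-1)
    out = torch.zeros_like(pre_acts)
    return out.scatter_(-1, idx, pre_acts.gather(-1, idx))

z = torch.tensor([[0.9, -1.5, 0.1, -0.2, 2.0]])
print(abs_topk(z, k=2))     # keeps -1.5 and 2.0, including the negative feature
```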
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4685. PT$^2$-LLM: Post-Training Ternarization for Large Language Models
๐ฌ 21
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Ternarization LLM
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. Ternarization has gained attention as a promising compression technique, delivering substantial size reduction and high computational efficiency. However, its potential in the post-training quantization (PTQ) setting remains underexplored, due to the challenge of training-free parameter optimization and the quantization difficulty posed by outliers and dispersed weights. To address these issues, we propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline: (1) Iterative Ternary Fitting (ITF), which alternates between optimal ternary grid construction and flexible rounding to minimize quantization error, and (2) Activation-aware Grid Alignment (AGA), which further refines the ternary grid to better match full-precision outputs. In addition, we propose a plug-and-play Structural Similarity-based Reordering (SSR) strategy that leverages inter-column structural similarity to ease quantization and mitigate outlier effects, further enhancing overall performance. Extensive experiments demonstrate that PT$^2$-LLM delivers competitive performance against state-of-the-art (SOTA) 2-bit PTQ methods with lower memory cost, while also accelerating both prefill and decoding to achieve end-to-end speedup. We will release all the code and models of PT$^2$-LLM.
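For intuition, a basic per-row ternarizer (threshold plus least-squares scale) looks like the sketch below; the paper's Asymmetric Ternary Quantizer with iterative fitting and grid alignment is more elaborate, so treat this as a simplified stand-in.

```python
# Simplified per-row ternarization sketch (a classic threshold/scale scheme,
# not the paper's full Asymmetric Ternary Quantizer): each weight is mapped
# to {-alpha, 0, +alpha} with a per-row scale alpha.
import torch

def ternarize_rows(W: torch.Tensor, delta_ratio: float = 0.75):
    """W: (out, in). Returns ternary codes in {-1, 0, 1} and per-row scales."""
    delta = delta_ratio * W.abs().mean(dim=1, keepdim=True)        # per-row threshold
    codes = torch.sign(W) * (W.abs() > delta).float()              # {-1, 0, +1}
    mask = codes != 0
    alpha = (W * codes).sum(dim=1, keepdim=True) / mask.sum(dim=1, keepdim=True).clamp(min=1)
    return codes, alpha

W = torch.randn(4, 16)
codes, alpha = ternarize_rows(W)
W_q = codes * alpha
print(f"reconstruction MSE: {(W - W_q).pow(2).mean():.4f}")
```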
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4686. Lean Finder: Semantic Search for Mathlib That Understands User Intents
๐ฌ 25
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Lean mathlib code search informalization
We present Lean Finder, a semantic search engine for Lean and mathlib that understands and aligns with the intents of mathematicians. Progress in formal theorem proving is often hindered by the difficulty of locating relevant theorems and the steep learning curve of the Lean 4 language, making advancement slow and labor-intensive. Existing Lean search engines, though helpful, rely primarily on informalizations (natural language translation of the formal statements), while largely overlooking the mismatch with real-world user queries. In contrast, we propose a user-centered semantic search tailored to the needs of mathematicians. Our approach begins by analyzing and clustering the semantics of public Lean discussions, then fine-tuning text embeddings on synthesized queries that emulate user intents. We further align Lean Finder with mathematicians' preferences using diverse feedback signals, encoding it with a rich awareness of their goals from multiple perspectives. Evaluations on real-world queries, informalized statements, and proof states demonstrate that our Lean Finder achieves over 30% relative improvement compared to previous search engines and GPT-4o. In addition, Lean Finder is compatible with LLM-based theorem provers, bridging retrieval with formal reasoning. We promise to release the code, model checkpoints, and datasets upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4687. When Flatness Does (Not) Guarantee Adversarial Robustness
๐ฌ 12
โญ 4.50
๐ learning theory
๐ท๏ธ Flatness Adversarial Robustness
Despite their empirical success, neural networks remain vulnerable to small, adversarial perturbations. A longstanding hypothesis suggests that flat minima, regions of low curvature in the loss landscape, offer increased robustness. While intuitive, this connection has remained largely informal and incomplete. By rigorously formalizing the relationship, we show this intuition is only partially correct: flatness implies *local* but not *global* adversarial robustness. To arrive at this result, we first derive a closed-form expression for relative flatness in the penultimate layer, and then show we can use this to constrain the variation of the loss in input space. This allows us to formally analyze the adversarial robustness of the entire network. We then show that to maintain robustness beyond a local neighborhood, the loss needs to curve *sharply* away from the data manifold.
We validate our theoretical predictions empirically across architectures and datasets, uncovering the geometric structure that governs adversarial vulnerability, and linking flatness to model confidence: adversarial examples often lie in large, flat regions where the model is confidently wrong. Our results challenge simplified views of flatness and provide a nuanced understanding of its role in robustness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4688. When Bias Helps Learning: Bridging Initial Prejudice and Trainability
๐ฌ 15
โญ 4.50
๐ learning theory
๐ท๏ธ Trainability initial guessing bias mean field regime phase diagrams
Understanding the statistical properties of deep neural networks (DNNs) at initialization is crucial for elucidating both their trainability and the intrinsic architectural biases they encode prior to data exposure. Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. Recent work has shown that untrained DNNs exhibit an initial-guessing bias (IGB), in which large regions of the input space are assigned to a single class.
In this work, we provide a theoretical proof linking IGB to MF analyses, establishing that a network's predisposition toward specific classes is intrinsically tied to the conditions for efficient learning. This connection leads to a counterintuitive conclusion: the initialization that optimizes trainability is systematically biased rather than neutral. We validate our theory through experiments across multiple architectures and datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4689. Identifying Robust Neural Pathways: Few-Shot Adversarial Mask Tuning for Vision-Language Models
๐ฌ 25
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Vision-Language Models (VLMs) Adversarial Robustness Mask Tuning Robust Neural Pathways
Recent vision-language models (VLMs), such as CLIP, have demonstrated remarkable transferability across a wide range of downstream tasks by effectively leveraging the joint text-image embedding space, even with only a few data samples. Despite their impressive performance, these models remain vulnerable to adversarial attacks, raising significant concerns about their security and reliability in practical deployments. To address this issue, we propose Adversarial Mask Tuning (AdvMask), a method that effectively enhances the robustness of VLMs without directly modifying their pre-trained weights. Instead, our AdvMask learns a set of binary masks that selectively deactivate model parameters vulnerable to adversarial perturbations. By identifying robust neural pathways within the vision encoder, AdvMask facilitates the generation of features and predictions that are resistant to adversarial attacks. Furthermore, we introduce a Layer-wise Adaptive Feature Alignment (LAFA) loss, specifically designed to optimize AdvMask in few-shot scenarios. The LAFA loss adaptively aligns intermediate-layer features from clean and adversarial samples across each transformer block, enhancing the representational robustness of the model. Experimental results across multiple benchmarks confirm that our AdvMask approach substantially outperforms existing adversarial tuning techniques for VLMs, especially in few-shot settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4690. CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
๐ฌ 28
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language models Reasoning Exploration
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. Moreover, they tend to produce poorly calibrated policies that remain confident in their generations regardless of correctness. To address this challenge, we introduce **Curiosity-Driven Exploration (CDE)**, a framework that leverages the model's intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head critic architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate **+3** point improvement over standard RLVR using GRPO/PPO on AIME benchmarks.
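The reward shaping can be sketched in a few lines: add a perplexity-based actor bonus and a value-head-variance critic bonus to the verifiable reward. The coefficients and inputs below are placeholders, not the paper's settings.

```python
# Toy sketch of a curiosity-shaped reward: verifiable reward plus (i) an
# actor bonus from response perplexity and (ii) a critic bonus from the
# variance across value heads. All numbers are placeholders.
import math

def shaped_reward(verifiable_reward: float,
                  token_logprobs: list[float],
                  value_head_estimates: list[float],
                  beta_actor: float = 0.1,
                  beta_critic: float = 0.1) -> float:
    perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
    mean_v = sum(value_head_estimates) / len(value_head_estimates)
    var_v = sum((v - mean_v) ** 2 for v in value_head_estimates) / len(value_head_estimates)
    return verifiable_reward + beta_actor * perplexity + beta_critic * var_v

print(shaped_reward(1.0, [-0.2, -1.3, -0.5], [0.4, 0.6, 0.9]))
```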
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4691. WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
๐ฌ 30
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ code agent website generation large language model
Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-Agent Feedback to improve the ability of LLMs to act as the agent-engine model. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude 3.5 Sonnet from 26.4\% to 51.9\% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9\% to 45.4\% and raises the appearance score from 3.4 to 3.7.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4692. FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation
๐ฌ 18
โญ 4.50
๐ generative models
๐ท๏ธ video generation filmmaking
Existing AI-based film generation systems can generate high-quality videos, but struggle to design expressive camera language and establish cinematic rhythm. This deficiency leads to templated visuals and unengaging narratives. To address these limitations, we introduce FilMaster, an end-to-end automated film generation system that integrates real-world cinematic principles to generate professional-grade, editable films. Inspired by professional filmmaking, FilMaster is built on two key cinematic principles: (1) camera language design by learning cinematography from extensive real-world film references, and (2) cinematic rhythm by emulating professional post-production workflows. For camera language, our Multi-shot Synergized Camera Language Design module introduces a novel scene-level Retrieval-Augmented Generation (RAG) framework. Unlike shot-level RAG which retrieves references independently and often leads to visual incoherence, our approach treats an entire scene, comprising multiple shots with a shared spatio-temporal context and narrative objective, as a single, unified query. This holistic query retrieves a consistent set of semantically similar shots with cinematic techniques from a large corpus of 440,000 real film clips. These references then guide an LLM to synergistically plan coherent and expressive camera language for all shots within that scene. To achieve cinematic rhythm, our Audience-Aware Cinematic Rhythm Control module emulates professional post-production, featuring a Rough Cut assembly followed by a Fine Cut process that uses simulated audience feedback to optimize the integration of video and sound for cinematic rhythm. Extensive experiments show superior performance in camera language and cinematic rhythm, paving the way for generative AI in professional filmmaking.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4693. Attention Smoothing Is All You Need For Unlearning
๐ฌ 18
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Model Large Language Model Unlearning Self-distillation Attention Smoothing
Large Language Models are prone to memorizing sensitive, copyrighted, or hazardous content, posing significant privacy and legal concerns. Retraining from scratch is computationally infeasible, whereas current unlearning methods exhibit unstable trade-offs between forgetting and utility, frequently producing incoherent outputs on forget prompts and failing to generalize due to the persistence of lexical-level and semantic-level associations in attention. We propose Attention Smoothing Unlearning (ASU), a principled framework that casts unlearning as self-distillation from a forget-teacher derived from the model's own attention. By increasing the softmax temperature, ASU flattens attention distributions and directly suppresses the lexical-level and semantic-level associations responsible for reconstructing memorized knowledge. This results in a bounded optimization objective that erases factual information yet maintains coherence in responses to forget prompts. Empirical evaluation on TOFU, MUSE, and WMDP, along with real-world and continual unlearning scenarios across Question Answering (QA) and text completion, demonstrates that ASU outperforms the baselines for most of the unlearning scenarios, delivering robust unlearning with minimal loss of model utility.
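The smoothing mechanism itself is just temperature-scaled attention; the sketch below shows how raising the softmax temperature flattens the attention distribution a forget-teacher would produce. It is a minimal illustration, not the full ASU objective.

```python
# Minimal sketch of attention smoothing: dividing attention scores by a
# temperature tau > 1 flattens the softmax distribution, weakening the
# sharp lexical/semantic associations a forget-teacher would otherwise keep.
import torch
import torch.nn.functional as F

def smoothed_attention(q, k, v, tau: float = 2.0):
    """q, k, v: (..., seq, d). tau=1 recovers standard scaled dot-product attention."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)
    attn = F.softmax(scores / tau, dim=-1)       # higher tau => flatter attention
    return attn @ v, attn

q = k = v = torch.randn(1, 5, 8)
_, sharp = smoothed_attention(q, k, v, tau=1.0)
_, flat = smoothed_attention(q, k, v, tau=4.0)
print(sharp[0, 0].max().item(), flat[0, 0].max().item())  # flat max is smaller
```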
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4694. Experience-based Knowledge Correction for Robust Planning in Minecraft
๐ฌ 20
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ llm-guided exploration hierarchical planning
Large Language Model (LLM)-based planning has advanced embodied agents in long-horizon environments such as Minecraft, where acquiring latent knowledge of goal (or item) dependencies and feasible actions is critical. However, LLMs often begin with flawed priors and fail to correct them through prompting, even with feedback. We present XENON (eXpErience-based kNOwledge correctioN), an agent that algorithmically revises knowledge from experience, enabling robustness to flawed priors and sparse binary feedback. XENON integrates two mechanisms: Adaptive Dependency Graph, which corrects item dependencies using past successes, and Failure-aware Action Memory, which corrects action knowledge using past failures. Together, these components allow XENON to acquire complex dependencies despite limited guidance. Experiments across multiple Minecraft benchmarks show that XENON outperforms prior agents in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, XENON surpasses agents that rely on much larger proprietary models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4695. Noise-Adaptive Diffusion Sampling for Inverse Problems Without Task-Specific Tuning
๐ฌ 14
โญ 4.50
๐ generative models
๐ท๏ธ Diffusion models Inverse problems Generative model Bayesian inference
Diffusion models (DMs) have recently shown remarkable performance on inverse problems (IPs). Optimization-based methods can solve IPs quickly using DMs as powerful regularizers, but they are susceptible to local minima and noise overfitting. Although DMs can provide strong priors for Bayesian approaches, enforcing measurement consistency during the denoising process leads to manifold infeasibility issues. We propose Noise-space Hamiltonian Monte Carlo (N-HMC), a posterior sampling method that treats reverse diffusion as a deterministic mapping from initial noise to clean images. N-HMC enables comprehensive exploration of the solution space, avoiding local optima. By moving inference entirely into the initial-noise space, N-HMC keeps proposals on the learned data manifold. We provide a comprehensive theoretical analysis of our approach and extend the framework to a noise-adaptive variant (NA-NHMC) that effectively handles IPs with unknown noise type and level. Extensive experiments across four linear and three nonlinear inverse problems demonstrate that NA-NHMC achieves superior reconstruction quality with robust performance across different hyperparameters and initializations, significantly outperforming recent state-of-the-art methods. Code will be made available on GitHub upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4696. Conditional Independent Component Analysis For Estimating Causal Structure with Latent Variables
๐ฌ 19
โญ 4.50
๐ causal reasoning
๐ท๏ธ Causal Discovery Latent Structure Learning Conditional Independent Component Analysis Sparsity
Identifying latent variables and their induced causal structure is fundamental in various scientific fields. Existing approaches often rely on restrictive structural assumptions (e.g., purity) and may become invalid when these assumptions are violated. We introduce Conditional Independent Component Analysis (CICA), a new tool that extracts components that are conditionally independent given latent variables. Under mild conditions, CICA can be optimized using a tractable proxy such as rank-deficiency constraints. Building on CICA, we establish an identifiability theory for linear non-Gaussian acyclic models with latent variables: solving CICA and then applying an appropriate row permutation to the sparsest CICA solution enables recovery of the causal structure. Accordingly, we propose an estimation method based on the identifiability theory and substantiate the algorithm with experiments on both synthetic and real-world datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4697. Demystifying Supervision Data Generalization in Multimodal LMs
๐ฌ 8
โญ 4.50
๐ generative models
๐ท๏ธ MLLM
Conventional wisdom in selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that are intuitively similar to the target task (e.g., text-rich vs. vision-centric). However, it remains unclear how reliably such similarity translates into improved performance on the test benchmarks.
In this paper, we take the first step to study the problem in MLLMs: can we predict a training dataset's influence on a target benchmark even before any training takes place?
To answer this question, we first conduct an in-depth analysis using 14 vision-language datasets covering 7 diverse tasks. Our analysis shows that intuitive task similarity is unreliable in predicting task generalizability, and that transfer depends on the specific dataset rather than the broader task category.
We propose DATAPROPHET, a training-free, simple yet effective metric based on multimodal perplexity, similarity, and data diversity. Our experiments demonstrate that the influence rankings for different supervision datasets derived from DATAPROPHET are strongly correlated with rankings based on the actual performance increase after training, with a Kendall's $\tau$ correlation coefficient of 86.0\%. Moreover, we show that DATAPROPHET can help select better supervision data, achieving up to 6.9\% improvement on average over uniform selection, 1.4\% over the SoTA training-based baseline, and 0.2\% higher than oracle experiment performance-based selection. Our code and data will be released.
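The reported rank correlation can be reproduced mechanically with `scipy.stats.kendalltau`; the sketch below uses made-up scores and gains purely to show the evaluation step.

```python
# Sketch of the evaluation step: rank-correlate a training-free score for
# each candidate supervision dataset with the measured post-training gain.
# The numbers are placeholders, not the paper's results.
from scipy.stats import kendalltau

predicted_score = [0.91, 0.40, 0.75, 0.10, 0.66]   # e.g., training-free scores
actual_gain     = [3.2, 0.5, 2.1, -0.4, 1.8]       # benchmark gain after training

tau, p_value = kendalltau(predicted_score, actual_gain)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```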
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4698. Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs
๐ฌ 15
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ MLRM safety alignment safety-helpfulness trade-off
Multimodal Large Reasoning Models (MLRMs) have exhibited remarkable capabilities in complex multimodal tasks.
However, our findings reveal a critical trade-off: reasoning-based models are more prone to generating harmful content, leading to degradation in safety performance.
This paper presents a large-scale analysis of this safety-reasoning trade-off, identifying two main mechanisms of safety degradation: (i) visual attention drift, which reduces the model's reliance on visual grounding and thereby exacerbates overlooked risks in cross-modal interactions; (ii) unsafe reasoning patterns, including flawed reasoning initiation and chain-of-thought safety attenuation, which compromise the model's safety awareness.
To mitigate these issues, we propose **P**olicy-guided **S**afety **T**uning (**PST**), a two-stage alignment framework. It first employs *Policy-Guided Supervised Fine-Tuning* to integrate explicit safety policies into the reasoning process, establishing a structured and interpretable foundation for safe decision-making.
Then, PST applies *Safety Reasoning Preference Optimization* to encourage the model to generate safe, helpful, and informative responses while reducing oversensitive and homogeneous characteristics.
Extensive experiments demonstrate that PST significantly reduces harmful outputs across multiple multimodal safety benchmarks, while maintaining competitive performance on general tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4699. Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
๐ฌ 22
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ MLLM Visual Emotion Evaluation
Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Codes and data will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4700. CogMoE: Signal-Quality-Guided Multimodal MoE for Cognitive Load Prediction
๐ฌ 9
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Cognitive-load multi-modality mixture-of-experts
Reliable cognitive load (CL) prediction in real-world settings is fundamentally constrained by the poor and variable quality of physiological signals. In safety-critical tasks such as driving, degraded signal quality can severely compromise prediction accuracy, limiting the deployment of existing models outside controlled lab conditions. To address this challenge, we propose CogMoE, a signal-quality-guided Mixture-of-Experts (MoE) framework that dynamically adapts to heterogeneous and noisy inputs. CogMoE flexibly integrates physiological modalities, including EEG, ECG, EDA, and gaze, through quality-aware gating, enabling context-sensitive fusion. The framework operates in two stages: (1) quality-aware multi-modal synchronization and recovery to mitigate artifacts, temporal misalignment, and missing data, and (2) signal-quality-specific expert modeling via a cross-modal MoE transformer that regulates information flow based on signal reliability. To further improve stability, we introduce CORTEX Loss, which balances reconstruction fidelity and expert utilization under noise. Experiments on CL-Drive and ADABase show that CogMoE outperforms strong baselines, delivering consistent improvements across diverse signal qualities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4701. Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection
๐ฌ 12
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Computer Vision Confidence Calibration Object Detection Spatial Point Processes
Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model's uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4702. ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
๐ฌ 16
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ diffusion autoregressive large language model
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 32\% performance gains and an over 10$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 1.4$\times$ average speedup.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4703. DanceTogether: Generating Interactive Multi-Person Video without Identity Drifting
๐ฌ 20
โญ 4.50
๐ generative models
๐ท๏ธ Controllable video generation Multi-person Interactive Video Generation Multi-person Pose Estimation Multi-object Tracking
Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 h of dual-skater footage with more than 7,000 distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centred on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalisation to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4704. PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Generation
๐ฌ 14
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Physical Scene Generation
Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic manipulation scenarios such as tabletop arrangements, shelf organization, or box packing. Compared to classical 3D scene generation, producing complex physical scenes introduces additional challenges: (a) higher object density and complexity (e.g., a small shelf may hold dozens of books), (b) richer supporting relationships and compact spatial layouts, and (c) the need to accurately model both spatial placement and physical properties.
To address these challenges, we propose PhyScensis, an LLM agent-based framework powered by a physics engine, to generate physically accurate 3D scenes with high complexity.
Specifically, our framework consists of three main components: an LLM agent iteratively proposes assets with spatial and physical predicates; a solver, equipped with a physics engine, realizes these predicates into a 3D scene; and feedback from the solver informs the agent to refine and enrich the configuration.
Moreover, our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters (e.g., relative positions, scene stability), enabled through probabilistic programming for stability and a complementary heuristic that jointly regulates stability and spatial relations.
Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy, offering a unified pipeline for generating complex physical scenes for robotic manipulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4705. Potentially Optimal Joint Actions Recognition for Cooperative Multi-Agent Reinforcement Learning
๐ฌ 14
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Value function factorization Multi-Agent
Value function factorization is widely used in cooperative multi-agent reinforcement learning (MARL).
Existing approaches often impose monotonicity constraints between the joint action value and individual action values to enable decentralized execution.
However, such constraints limit the expressiveness of value factorization, restricting the range of joint action values that can be represented and hindering the learning of optimal policies.
To address this, we propose Potentially Optimal Joint Actions Weighting (POW), a method that ensures optimal policy recovery where existing approximate weighting strategies may fail.
POW iteratively identifies potentially optimal joint actions and assigns them higher training weights through a theoretically grounded iterative weighted training process. We prove that this mechanism guarantees recovery of the true optimal policy, overcoming the limitations of prior heuristic weighting strategies.
POW is architecture-agnostic and can be seamlessly integrated into existing value factorization algorithms.
Extensive experiments on matrix games, difficulty-enhanced predator-prey tasks, SMAC, SMACv2, and a highway-env intersection scenario show that POW substantially improves stability and consistently surpasses state-of-the-art value-based MARL methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4706. Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
๐ฌ 19
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Web agent web ui Cognitive Reasoning Web knowledge
Multimodal large-scale models have significantly advanced the development of web agents, enabling them to perceive and interact with the digital environment in a manner analogous to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to engage in cognitive reasoning effectively. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge into Factual, Conceptual, and Procedural domains. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the former two types of knowledge, respectively, representing the "what" of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to instill the core knowledge necessary for a web agent systematically. This dataset serves as the agent's conceptual grounding, the "nouns" upon which comprehension is built, as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, particularly in its capacity for generalization to unseen tasks where its structured knowledge proves decisive. To facilitate rigorous and systematic evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4707. Logit-KL Flow Matching: Non-Autoregressive Text Generation via Sampling-Hybrid Inference
๐ฌ 11
โญ 4.50
๐ generative models
๐ท๏ธ Flow Matching NAR text generation
Non-autoregressive (NAR) language models offer notable efficiency in text generation by circumventing the sequential bottleneck of autoregressive decoding. However, accurately modeling dependencies in discrete sequences remains challenging in this paradigm. In this work, we advance the field of NAR generation by applying conditional flow matching (CFM) methods grounded in geometrically principled interpolation, specifically leveraging Kullback-Leibler (KL) divergence geodesics, which correspond to linear interpolation in logit space. We rigorously establish that maximizing conditional likelihood in this setting precisely recovers the flow matching velocity field, supplying the theoretical justification for this approach in sequence modeling. To address practical performance gaps of \emph{basic} inference, we propose a novel empirical \emph{sampling} strategy that iteratively denoises and re-noises, along with a \emph{hybrid} scheme that integrates our \emph{sampling} method with the \emph{basic} procedure. Across unconditional and conditional text and code infilling, the approach improves perplexity and downstream metrics over prior NAR baselines under matched settings.
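The interpolation the method builds on is easy to visualize: linearly interpolating logits and applying softmax traces the KL-geodesic path between a source and a target categorical distribution. A toy NumPy sketch with placeholder distributions:

```python
# Toy sketch of the interpolation underlying the method: linearly
# interpolating logits and applying softmax traces a KL-geodesic path
# between two categorical distributions (here, "noise" and "data").
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits_noise = np.zeros(5)                                       # uniform source
logits_data = np.log(np.array([0.02, 0.02, 0.9, 0.04, 0.02]))    # peaked target

for t in (0.0, 0.5, 1.0):
    p_t = softmax((1 - t) * logits_noise + t * logits_data)
    print(f"t={t:.1f}:", np.round(p_t, 3))
```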
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4708. Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models
๐ฌ 15
โญ 4.50
๐ generative models
๐ท๏ธ Visual Tokenizer Latent Diffusion Model Foundation Encoder
In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders.
We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality.
This alignment yields semantically rich image tokenizers that benefit diffusion models.
On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4709. ComPhy: Composing Physical Models with end-to-end Alignment
๐ฌ 22
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Learning physics Physical systems Partial differential equations Systems of PDEs
Real-world phenomena typically involve multiple, interwoven dynamics that can be elegantly captured by systems of Partial Differential Equations (PDEs). However, accurately solving such systems remains a challenge. In this paper, we introduce ComPhy (CP), a novel modular framework designed to leverage the inherent physical structure of the problem to solve systems of PDEs. CP assigns each PDE to a dedicated learning module, each capable of incorporating state-of-the-art methodologies such as Physics-Informed Neural Networks or Neural Conservation Laws.
Crucially, CP introduces an end-to-end alignment mechanism, explicitly designed around the physical interplay of shared variables, enabling knowledge transfer between modules, and promoting solutions that are the result of the collective effort of all modules.
CP is the first approach specifically designed to tackle systems of PDEs, and our results show that it outperforms state-of-the-art approaches where a single model is trained on all PDEs at once.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4710. Neural Optimal Transport Meets Multivariate Conformal Prediction
๐ฌ 21
โญ 4.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ vector quantile regression conformal prediction neural optimal transport
We propose a framework for conditional vector quantile regression (CVQR) that combines neural optimal transport with amortized optimization, and apply it to multivariate conformal prediction. Classical quantile regression does not extend naturally to multivariate responses, while existing approaches often ignore the geometry of joint distributions. Our method parameterizes the conditional vector quantile function as the gradient of a convex potential implemented by an input-convex neural network, ensuring monotonicity and uniform ranks. To reduce the cost of solving high-dimensional variational problems, we introduce amortized optimization of the dual potentials, yielding efficient training and faster inference.
We then exploit the induced multivariate ranks for conformal prediction, constructing distribution-free predictive regions with finite-sample validity. Unlike coordinatewise methods, our approach adapts to the geometry of the conditional distribution, producing tighter and more informative regions. Experiments on benchmark datasets show improved coverage-efficiency trade-offs compared to baselines, highlighting the benefits of integrating neural optimal transport with conformal prediction.
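A minimal sketch of the parameterization described above, assuming a toy input-convex network: the conditional vector quantile map is taken as the gradient of a convex potential in the rank vector. Layer sizes, the softplus reparameterization, and all names are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Toy input-convex network: phi(u; x) is convex in the rank vector u."""
    def __init__(self, d_u, d_x, hidden=64):
        super().__init__()
        self.Wx = nn.Linear(d_x, hidden)            # conditioning on covariates x
        self.Wu0 = nn.Linear(d_u, hidden)
        self.Wu1 = nn.Linear(d_u, 1)
        self.Wz = nn.Linear(hidden, 1, bias=False)  # reparameterized to stay non-negative

    def forward(self, u, x):
        z = F.relu(self.Wu0(u) + self.Wx(x))        # convex (affine + ReLU) in u
        return F.linear(z, F.softplus(self.Wz.weight)) + self.Wu1(u)

d_u, d_x = 2, 5
net = ICNN(d_u, d_x)
u = torch.rand(4, d_u, requires_grad=True)          # uniform "ranks" on [0, 1]^2
x = torch.randn(4, d_x)
phi = net(u, x).sum()
(q,) = torch.autograd.grad(phi, u)                  # q = grad_u phi(u; x): a monotone map
print(q.shape)                                      # torch.Size([4, 2])
```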
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4711. MindMix: A Multimodal Foundation Model for Auditory Perception Decoding via Deep Neural-Acoustic Alignment
๐ฌ 32
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Electroencephalogram; Audio; Multimodal foundation model; Auditory decoding
Decoding complex auditory experiences from non-invasive EEG is a rapidly emerging field that holds significant promise for advancing both fundamental neuroscience and human-machine interaction technologies. Recent developments in EEG foundation models have yielded powerful neural representations that are promising for auditory decoding. However, the effectiveness of these models remains fundamentally constrained by their limited integration with acoustic stimulus information. Specifically, the lack of deep coupling between neural signals and auditory inputs hampers the models' ability to generalize effectively across diverse auditory tasks. To bridge this gap, we introduce MindMix, a multimodal foundation model that connects unimodal EEG foundation models with task-specific auditory decoders. MindMix employs a two-stage training strategy: first, a high-capacity EEG encoder is pre-trained on over 3,000 hours of EEG data to learn generalized EEG features that can transfer across tasks and subjects. Second, the model learns the neural-acoustic mapping using over 100 hours of paired data, enabled by our novel Cross-Attention Low-Rank Alignment module, which facilitates fine-grained, cross-modal information integration. Experimental results demonstrate that MindMix substantially surpasses existing baselines across a range of auditory decoding tasks, including auditory attention decoding, auditory emotion recognition, and cross-modal retrieval. This work thus establishes a foundation for future research in multimodal brain decoding and auditory brain-computer interfaces. Our code is available at https://anonymous.4open.science/r/MindMix-654B/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4712. U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
๐ฌ 12
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ medical ultrasound benchmark large vision-language model
Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging due to image quality that varies with operators, noise, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4713. Boosting Medical Visual Understanding From Multi-Granular Language Learning
๐ฌ 31
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-Granular Language Learning Medical Image Analysis Multimodal Learning
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple labels across different levels of granularity. To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code will be available on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4714. Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas
๐ฌ 13
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Language Models
Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any-order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed-length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length.
To address this, we propose \textsc{DreamOn}, a novel diffusion framework that enables dynamic, variable-length generation. \textsc{DreamOn} augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes.
Built upon Dream-Coder-7B and DiffuCoder-7B, \textsc{DreamOn} achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM and matches oracle performance achieved with ground-truth length.
Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation. Our code and models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4715. Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs
๐ฌ 26
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Medical VQA Large Multimodal Models Data Synthesis Medical Literature Vision-Language Open-Weight Models
Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present \textbf{MedVLSynther}, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields \textit{MedVLSynther-13K}: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 57.56 (7B), with up to 77.21 on VQA-RAD and 66.36 on PathVQA, outperforming strong medical LMMs. Ablations verify that both generation and verification are necessary and that more verified data consistently helps, and a targeted contamination analysis detects no leakage from evaluation suites. By operating entirely on open literature and open-weight models, MedVLSynther offers an auditable, reproducible, and privacy-preserving path to scalable medical VQA training data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4716. Combination-of-Experts with Knowledge Sharing for Cross-Task Vehicle Routing Problems
๐ฌ 5
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ neural combinatorial optimization vehicle routing problem cross-task generalization combination-of-experts
Recent neural methods have shown promise in generalizing across various vehicle routing problems (VRPs). These methods adopt either a fully-shared dense model across all VRP tasks (i.e., variants) or a mixture-of-experts model that assigns node embeddings within each task instance to different experts. However, they both struggle to generalize from training tasks with basic constraints to out-of-distribution (OOD) tasks involving unseen constraint combinations and new basic constraints, as they overlook the fact that each VRP task is defined by a combination of multiple basic constraints. To address this issue, this paper proposes a novel model, combination-of-experts with knowledge sharing (CoEKS), which leverages the structural characteristic of VRP tasks. CoEKS enhances generalization to constraint combinations via two complementary components: a combination-of-experts architecture enabling flexible combinations via prior assignment of constraint-specific experts, and a knowledge sharing strategy strengthening generalization via automatic learning of transferable general knowledge across constraints. Moreover, CoEKS allows new experts to be plugged into the trained model for rapid adaptation to new constraints. Experiments demonstrate that CoEKS outperforms state-of-the-art methods on in-distribution tasks and delivers greater gains on OOD tasks, including unseen constraint combinations (relative improvement of 12\% over SOTA) and new constraints (25\% improvement).
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4717. Multiverse Mechanica: A Testbed for Learning Game Mechanics via Counterfactual Worlds
๐ฌ 10
โญ 4.50
๐ causal reasoning
๐ท๏ธ Causal generative modelling Game generative AI models Counterfactual reasoning
We study how generative world models trained on video games can go beyond mere reproduction of gameplay visuals to learning game mechanics: the modular rules that causally govern gameplay. We introduce a formalization of the concept of game mechanics that operationalizes mechanic-learning as a causal counterfactual inference task and uses the causal consistency principle to address the challenge of generating gameplay with world models that do not violate game rules. We present Multiverse Mechanica, a playable video game testbed that implements a set of ground truth game mechanics based on our causal formalism. The game natively emits training data, where each training example is paired with a set of causal DAGs that encode causality, consistency, and counterfactual dependence specific to the mechanic that is in play; these provide additional artifacts that could be leveraged in mechanic-learning experiments. We provide a proof-of-concept that demonstrates fine-tuning a pre-trained model that targets mechanic learning. Multiverse Mechanica is a testbed that provides a reproducible, low-cost path for studying and comparing methods that aim to learn game mechanics, not just pixels.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4718. Beyond Ensembles: Simulating All-Atom Protein Dynamics in a Learned Latent Space
๐ฌ 15
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ representation learning generative models latent space dynamics score-based generative models dynamical systems GNNs autoregressive models
Simulating the long-timescale dynamics of biomolecules is a central challenge in computational science. While enhanced sampling methods can accelerate these simulations, they rely on pre-defined collective variables that are often difficult to identify, restricting their ability to model complex switching mechanisms between metastable states. A recent generative model, LD-FPG, demonstrated that this problem could be bypassed by learning to sample the static equilibrium ensemble as all-atom deformations from a reference structure, establishing a powerful method for all-atom ensemble generation. However, while this approach successfully captures a system's probable conformations, it does not model the temporal evolution between them. We introduce the Graph Latent Dynamics Propagator (GLDP), a modular component for simulating dynamics within the learned latent space of LD-FPG. We then compare three classes of propagators: (i) score-guided Langevin dynamics, (ii) Koopman-based linear operators, and (iii) autoregressive neural networks. Within a unified encoder-propagator-decoder framework, we evaluate long-horizon stability, backbone and side-chain ensemble fidelity, and temporal kinetics via TICA. Benchmarks on systems ranging from small peptides to mixed-topology proteins and large GPCRs reveal that autoregressive neural networks deliver the most robust long rollouts and coherent physical timescales; score-guided Langevin best recovers side-chain thermodynamics when the score is well learned; and Koopman provides an interpretable, lightweight baseline that tends to damp fluctuations. These results clarify the trade-offs among propagators and offer practical guidance for latent-space simulators of all-atom protein dynamics.
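As a rough, hypothetical illustration of the third propagator class (an autoregressive network rolled out in a learned latent space), the sketch below assumes a generic residual MLP and latent dimension; it is not the GLDP implementation.

```python
import torch
import torch.nn as nn

# Illustrative latent propagator (assumed interfaces): an autoregressive MLP
# rolled out entirely in a learned latent space, in the spirit described above.
torch.manual_seed(0)
latent_dim = 32
propagator = nn.Sequential(nn.Linear(latent_dim, 128), nn.SiLU(), nn.Linear(128, latent_dim))

z = torch.randn(1, latent_dim)        # latent code of the starting conformation
trajectory = [z]
for step in range(100):               # long-horizon rollout without leaving latent space
    z = z + propagator(z)             # residual one-step update
    trajectory.append(z)
print(torch.stack(trajectory).shape)  # (101, 1, 32); a decoder would map each frame to atoms
```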
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4719. Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters
๐ฌ 18
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Model Competitive Debate
Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and the Debate Flow Tree. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of a claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. Human evaluation on both stage-level and debate-level comparisons shows that TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and a +10% win in debate-level opinion shift. Further investigation shows that TreeDebater adopts better strategies in allocating limited time to important debate actions, aligning with the strategies of human debate experts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4720. CE-Nav: Flow-Guided Reinforcement Refinement for Cross-Embodiment Local Navigation
๐ฌ 20
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ embodied navigation
Generalizing local navigation policies across diverse robot morphologies is a critical challenge. Progress is often hindered by the need for costly and embodiment-specific data, the tight coupling of planning and control, and the "disastrous averaging" problem where deterministic models fail to capture multi-modal decisions (e.g., turning left or right). We introduce CE-Nav, a novel two-stage (IL-then-RL) framework that systematically decouples universal geometric reasoning from embodiment-specific dynamic adaptation. First, we train an embodiment-agnostic General Expert offline using imitation learning. This expert, a conditional normalizing flow model named VelFlow, learns the full distribution of kinematically-sound actions from a large-scale dataset generated by a classical planner, completely avoiding real robot data and resolving the multi-modality issue. Second, for a new robot, we freeze the expert and use it as a guiding prior to train a lightweight, Dynamics-Aware Refiner via online reinforcement learning. This refiner rapidly learns to compensate for the target robot's specific dynamics and controller imperfections with minimal environmental interaction. Extensive experiments on quadrupeds, bipeds, and quadrotors show that CE-Nav achieves state-of-the-art performance while drastically reducing adaptation cost. Successful real-world deployments further validate our approach as an efficient and scalable solution for building generalizable navigation systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4721. Pisces: Cryptography-based Private Retrieval-Augmented Generation with Dual-Path Retrieval
๐ฌ 13
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Retrieval-Augmented Generation Privacy-Preserving Cryptography
Retrieval-augmented generation (RAG) enhances the response quality of large language models (LLMs) when handling domain-specific tasks, yet raises significant privacy concerns. This is because both the user query and documents within the knowledge base often contain sensitive or confidential information. To address these concerns, we propose $\texttt{Pisces}$, the first practical cryptography-based RAG framework that supports dual-path retrieval, while protecting both the query and documents. Along the semantic retrieval path, we reduce computation and communication overhead by leveraging a coarse-to-fine strategy. Specifically, a novel oblivious filter is used to privately select a candidate set of documents to reduce the scale of subsequent cosine similarity computations. For the lexical retrieval path, to reduce the overhead of repeatedly invoking labeled PSI, we implement a multi-instance labeled PSI protocol to compute term frequencies for BM25 scoring in a single execution. $\texttt{Pisces}$ can also be integrated with existing privacy-preserving LLM inference frameworks to achieve end-to-end privacy. Experiments demonstrate that $\texttt{Pisces}$ achieves retrieval accuracy comparable to the plaintext baselines, within a 1.87% margin.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4722. Efficient Degradation-agnostic Image Restoration via Channel-Wise Functional Decomposition and Manifold Regularization
๐ฌ 16
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Image Restoration Transformer Contrastive Learning
Degradation-agnostic image restoration aims to handle diverse corruptions with one unified model, but faces fundamental challenges in balancing efficiency and performance across different degradation types. Existing approaches either sacrifice efficiency for versatility or fail to capture the distinct representational requirements of various degradations. We present MIRAGE, an efficient framework that addresses these challenges through two key innovations. First, we propose a channel-wise functional decomposition that systematically repurposes channel redundancy in attention mechanisms by assigning CNN, attention, and MLP branches to handle local textures, global context, and channel statistics, respectively. This principled decomposition enables degradation-agnostic learning while achieving superior efficiency-performance trade-offs. Second, we introduce manifold regularization that performs cross-layer contrastive alignment in Symmetric Positive Definite (SPD) space, which empirically improves feature consistency and generalization across degradation types. Extensive experiments across five degradation settings demonstrate that MIRAGE achieves state-of-the-art performance with remarkable efficiency, outperforming existing methods in both single and mixed degradation scenarios while showing strong zero-shot generalization to unseen domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4723. WARP: Weight Teleportation for Attack-Resilient Unlearning Protocols
๐ฌ 18
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine unlearning Approximate unlearning Neural teleportation Weight-space symmetries Privacy attacks Membership inference Model inversion Data reconstruction
Approximate machine unlearning aims to efficiently remove the influence of specific data points from a trained model, offering a practical alternative to full retraining. However, it introduces privacy risks: an adversary with access to both the original and unlearned models can exploit their differences for membership inference or data reconstruction. We show these vulnerabilities arise from two factors: large gradient norms of forgotten samples and the close proximity of the unlearned model to the original. To demonstrate their severity, we design unlearning-specific membership inference and reconstruction attacks, showing that several state-of-the-art methods (such as NGP and SCRUB) remain vulnerable.
To mitigate this leakage, we introduce WARP, a plug-and-play teleportation defense that leverages neural network symmetries to reduce gradient energy of forgotten samples and increase parameter dispersion while preserving accuracy. This reparameterization hides the signal of forgotten data, making it harder for attackers to distinguish forgotten samples from non-members or to recover them through reconstruction. Across six unlearning algorithms, our approach achieves consistent privacy gains, reducing adversarial advantage by up to 64% in black-box settings and 92% in white-box settings, while maintaining accuracy on retained data. These results highlight teleportation as a general tool for improving privacy in approximate unlearning.
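The weight-space symmetries that teleportation exploits can be illustrated with a small sketch: in a ReLU network, positively rescaling a hidden unit's incoming weights and inversely rescaling its outgoing weights changes the parameters but not the function. The toy architecture and scaling factors below are our assumptions, not WARP's actual procedure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
x = torch.randn(2, 4)
y_before = net(x)

# Positive per-neuron rescaling: a function-preserving ReLU symmetry.
alpha = torch.rand(8) + 0.5                        # one positive factor per hidden unit
with torch.no_grad():
    net[0].weight.mul_(alpha.unsqueeze(1))         # scale rows (incoming weights)
    net[0].bias.mul_(alpha)
    net[2].weight.div_(alpha.unsqueeze(0))         # inverse-scale columns (outgoing weights)

y_after = net(x)
print(torch.allclose(y_before, y_after, atol=1e-5))  # True: same function, different weights
```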
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4724. Unified Registration of Cortical and Subcortical Structures
๐ฌ 21
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ neuroimaging registration sphere cortex mri
Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, UCS, that registers 3D brain MRI images by jointly aligning both cortical and subcortical regions, through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods, improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. Its superior accuracy, fast inference, and ease of use set a new standard for joint cortical and subcortical registration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4725. Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits
๐ฌ 16
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-objective prompt optimization; multi-objective bandits; best feasible arm identification; fixed-budget pure exploration
Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection - efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.
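A hedged sketch of the pure-exploration setting described above: prompts are treated as arms with vector-valued rewards, pulled under a fixed budget, and the empirical Pareto set is read off the estimated means. The round-robin allocation and noise model are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Illustrative only: uniform-exploration estimation of the Pareto set of prompts
# under two objectives (e.g., accuracy and brevity).
rng = np.random.default_rng(0)
n_prompts, budget = 6, 600
true_means = rng.uniform(0, 1, size=(n_prompts, 2))      # unknown objective means

counts = np.zeros(n_prompts)
sums = np.zeros((n_prompts, 2))
for t in range(budget):
    a = t % n_prompts                                     # round-robin pulls
    sums[a] += true_means[a] + 0.1 * rng.normal(size=2)   # noisy multi-objective reward
    counts[a] += 1
means = sums / counts[:, None]

def pareto_set(m):
    keep = []
    for i in range(len(m)):
        dominated = any(np.all(m[j] >= m[i]) and np.any(m[j] > m[i]) for j in range(len(m)))
        if not dominated:
            keep.append(i)
    return keep

print("estimated Pareto-optimal prompts:", pareto_set(means))
```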
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4726. Hourglass Persistence for Graphs, Simplices, and Cells
๐ฌ 24
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph neural networks topological neural networks persistent homology
Persistent homology (PH) based schemes help encode information, such as cycles, and are thus increasingly being integrated with graph neural networks (GNNs) and higher order message-passing networks. Many PH based schemes in graph learning employ inclusion-based filtration mechanisms that trace a sequence of subgraphs of increasing size, maintaining bookkeeping information about the evolution (e.g., in terms of birth and death of components). We offer a novel perspective that goes beyond this inclusion paradigm. Specifically, we introduce topological descriptors for graphs, simplices, and cells that interleave a sequence of inclusions with a sequence of contractions and related families parametrized by two functions. The resulting descriptors on the extended sequence are provably more expressive than many existing PH methods with suitable stability conditions. Empirical results substantiate the merits of the proposed approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4727. LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent
๐ฌ 4
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Multimodal large language models
Although agents based on multimodal large language models (MLLMs) demonstrate proficiency in general short-term graphical user interface (GUI) tasks, their robustness remains a significant challenge for handling complex long-horizon tasks in dynamic environments. In response, the LongHorizonUI framework is proposed to improve the sustained reliability of agents in long-horizon GUI tasks. To overcome core limitations, we establish a comprehensive long-horizon benchmark, LongGUIBench, covering multiple categories of games and complex general applications, with long-horizon tasks defined as those requiring more than 15 steps, enabling rigorous evaluation of long-horizon reasoning capabilities. Based on this, a Multimodal Enhanced Perceiver is designed to incorporate element detection and text recognition models, assigning unique indices to interface elements, thereby reinforcing state representation. Furthermore, a Deep Reflection Decider engine is introduced, incorporating a structured multi-level feedback validation mechanism to enable progressive reasoning and ensure accurate action execution with predictable trajectories. Finally, we introduce a Compensatory Action Executor that combines multiple degradation compensation operations with a process rollback strategy based on execution progress monitoring to ensure operational effectiveness in long-horizon task logic. Experimental results demonstrate that LongHorizonUI achieves substantial long-horizon modeling improvements on LongGUIBench while retaining competitive performance on diverse public benchmarks. The code and models will be publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4728. MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design
๐ฌ 23
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Multi-agent system Antimicrobial peptides Multi-objective optimization AI-simulated peer review Reinforcement learning
To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMPs) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity, and novelty, using rigid or unclear scoring methods that make results hard to interpret and optimize. As the capabilities of Large Language Models (LLMs) advance and evolve swiftly, we turn to AI multi-agent collaboration based on such models (multi-agent LLMs), which show rapidly rising potential in complex scientific design scenarios. Based on this, we introduce $\textbf{MAC-AMP}$, a closed-loop multi-agent collaboration (MAC) system for multi-objective AMP design. The system implements a fully autonomous simulated peer review-adaptive reinforcement learning framework that requires only a task description and example dataset to design novel AMPs. The novelty of our work lies in introducing a closed-loop multi-agent system for AMP design, with cross-domain transferability, that supports multi-objective optimization while remaining explainable rather than a 'black box'. Experiments show that MAC-AMP outperforms other AMP generative models by effectively optimizing AMP generation for multiple key molecular properties, demonstrating exceptional results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4729. Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs
๐ฌ 17
โญ 4.50
๐ reinforcement learning
๐ท๏ธ LLM Reasoning RLVR
Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms through which it reshapes model behavior remain unclear. We present a systematic empirical study of RLVR's distributional effects across three complementary axes: (1) token-level distributional shifts, (2) functional validation via cross-sampling interventions, and (3) exploratory investigations of advantage signal modulation based on token divergence. We find that RL fine-tuning induces sparse, targeted changes, with only a small fraction of tokens exhibiting significant distributional divergence, and we further analyze the nature of these shifts. These divergent distributions are not uniformly predicted by entropy, indicating that RLVR can modify both initially high and low entropy distributions under different settings. Cross-sampling experiments reveal that inserting just a small fraction of RL-sampled tokens into base model generations recovers most RL performance gains, while injecting a small portion of base-sampled tokens into RL generations collapses performance to base levels, functionally isolating the critical role of divergent tokens. Finally, we explore divergence-weighted variants of the advantage signal, finding that they can amplify improvements in baselines. Our work sheds light on the distributional changes induced by RLVR and provides a granular, token-level lens for understanding and improving RL fine-tuning in LLMs.
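A minimal sketch of the kind of token-level comparison described, under assumed toy logits: per-position KL divergence between a base model and an RL-tuned model flags the sparse set of divergent tokens.

```python
import torch
import torch.nn.functional as F

# Illustrative only: per-token KL divergence between two models' next-token
# distributions, used to locate the few positions that shifted.
torch.manual_seed(0)
seq_len, vocab = 10, 100
logits_base = torch.randn(seq_len, vocab)
logits_rl = logits_base.clone()
logits_rl[3] += 3.0 * torch.randn(vocab)      # pretend fine-tuning shifted only position 3

log_p_rl = F.log_softmax(logits_rl, dim=-1)
log_p_base = F.log_softmax(logits_base, dim=-1)
kl = (log_p_rl.exp() * (log_p_rl - log_p_base)).sum(-1)   # KL(p_rl || p_base) per position

divergent = (kl > kl.mean() + 2 * kl.std()).nonzero().flatten()
print("positions with large distributional shift:", divergent.tolist())
```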
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4730. Sample-efficient evidence estimation of score based priors for model selection
๐ฌ 10
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Computational imaging inverse problems model selection posterior sampling diffusion models
The choice of prior is central to solving ill-posed imaging inverse problems, making it essential to select one consistent with the measurements $y$ to avoid severe bias. In Bayesian inverse problems, this could be achieved by evaluating the model evidence $p(y \mid M)$ under different models $M$ that specify the prior and then selecting the one with the highest value. Diffusion models are the state-of-the-art approach to solving inverse problems with a data-driven prior; however, directly computing the model evidence with respect to a diffusion prior is intractable. Furthermore, most existing model evidence estimators require either many pointwise evaluations of the unnormalized prior density or an accurate clean prior score. We propose DiME, an estimator of the model evidence under a diffusion prior by integrating over the time-marginals of posterior sampling methods. Our method leverages the large amount of intermediate samples that are naturally obtained during the reverse diffusion sampling process to obtain an accurate estimation of the model evidence using only a handful of posterior samples (e.g., 20). We demonstrate how to implement our estimator in tandem with recent diffusion posterior sampling methods. Empirically, our estimator matches the model evidence when it can be computed analytically, and it is able to both select the correct diffusion model prior and diagnose prior misfit under different highly ill-conditioned, non-linear inverse problems, including a real-world black hole imaging problem.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4731. DiffVax: Optimization-Free Image Immunization Against Diffusion-Based Editing
๐ฌ 11
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ diffusion malicious editing immunization defense against editing
Current image immunization defense techniques against diffusion-based editing embed imperceptible noise into target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image separately, taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framework for image immunization, specifically designed to prevent diffusion-based editing. Our approach enables effective generalization to unseen content, reducing computational costs and cutting immunization time from days to milliseconds, achieving a speedup of 250,000$\times$. This is achieved through a loss term that ensures the failure of editing attempts and the imperceptibility of the perturbations. Extensive qualitative and quantitative results demonstrate that our model is scalable, optimization-free, adaptable to various diffusion-based editing tools, robust against counter-attacks, and, for the first time, effectively protects video content from editing. Our code and qualitative results are provided in the supplementary.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4732. MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
๐ฌ 13
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Mixed-precision Quantization Microscaling Formats Post-training Quantization Large Language Models
Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4$\times$ speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and GEMM kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. On the Llama and Qwen model families, MicroMix achieves near-FP16 performance across diverse downstream tasks with an average precision of 5 bits. In particular, Qwen2.5-32B-Base and Coder exhibit lossless accuracy on zero-shot, code generation, and mathematical reasoning benchmarks. In addition, on RTX 5070Ti laptop and RTX 5090 GPUs, our kernel achieves 2.29-3.38$\times$ acceleration compared to TensorRT-FP16.
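As a hedged illustration of threshold-based precision assignment (not the MicroMix kernel or its MX formats), the sketch below simulates uniform quantization per channel and picks the lowest bit-width whose error stays under an arbitrary threshold.

```python
import numpy as np

# Illustrative only: assign a per-channel bit-width by comparing simulated
# quantization error against a fixed error budget.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 16)) * rng.uniform(0.1, 4.0, size=(1, 16))  # channels differ in scale

def quant_error(x, n_levels):
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (n_levels - 1)
    xq = np.round((x - lo) / step) * step + lo       # uniform quantization
    return np.mean((x - xq) ** 2)

formats = []
for c in range(acts.shape[1]):
    if quant_error(acts[:, c], 2 ** 4) < 1e-3:       # crude stand-in for a 4-bit format
        formats.append("4-bit")
    elif quant_error(acts[:, c], 2 ** 6) < 1e-3:
        formats.append("6-bit")
    else:
        formats.append("8-bit")
print(formats)
```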
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4733. KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning
๐ฌ 27
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ multi-agent system clinical reasoning medical question answering
In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies for identifying knowledge boundaries using external medical evidence. To address this, we propose \textbf{KnowGuard}, a novel \textit{investigate-before-abstain} paradigm that integrates systematic knowledge graph exploration for clinical decision-making. Our approach consists of two key stages operating on a shared contextualized evidence pool: 1) an evidence discovery stage that systematically explores the medical knowledge space through graph expansion and direct retrieval, and 2) an evidence evaluation stage that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history. This two-stage approach enables systematic knowledge graph exploration, allowing models to trace structured reasoning paths and recognize insufficient medical evidence. We evaluate our abstention approach using open-ended multi-round clinical benchmarks that mimic realistic diagnostic scenarios, assessing abstention quality through accuracy-efficiency trade-offs beyond existing closed-form evaluations. Experimental evidence clearly demonstrates that KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93\% through effective diagnostic interactions averaging 5.74 conversation turns.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4734. From Assumptions to Actions: Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents
๐ฌ 25
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ multi-agent system large language model human-agent cooperation
Embodied agents operating in multi-agent, partially observable, and decentralized environments must plan and act despite pervasive uncertainty about hidden objects and collaborators' intentions. Recent advances in applying Large Language Models (LLMs) to embodied agents have addressed many long-standing challenges, such as high-level goal decomposition and online adaptation. Yet, uncertainty is still primarily mitigated through frequent inter-agent communication. This incurs substantial token and time costs, and can disrupt established workflows, when human partners are involved. We introduce PCE, a Planner-Composer-Evaluator framework that converts the fragmented assumptions latent in LLM reasoning traces into a structured decision tree. Internal nodes encode environment assumptions and leaves map to actions; each path is then scored by scenario likelihood, goal-directed gain, and execution cost to guide rational action selection without heavy communication. Across two challenging multi-agent benchmarks (C-WAH and TDW-MAT) and three diverse LLM backbones, PCE consistently outperforms communication-centric baselines in success rate and task efficiency while showing comparable token usage. Ablation results indicate that the performance gains obtained by scaling model capacity or reasoning depth persist even when PCE is applied, while PCE consistently raises the baseline across both capacity and reasoning-depth scales, confirming that structured uncertainty handling complements both forms of scaling. A user study further demonstrates that PCE produces communication patterns that human partners perceive as more efficient and trustworthy. Together, these results establish a principled route for turning latent LLM assumptions into reliable strategies for uncertainty-aware planning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4735. I2Mole: Interaction-aware Invariant Molecular Learning For Generalizable Property Prediction
๐ฌ 4
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Molecular relationship learning; Drug-drug interaction; graph information bottleneck
Molecular interactions are a common phenomenon in physical chemistry and can produce unexpected biochemical properties harmful to humans, as in drug-drug interactions. Machine learning has the potential to deliver rapid and accurate predictions. However, the complexity of molecular structures and the diversity of molecular interactions could undermine model prediction accuracy and hinder generalizability. In this context, identifying core invariant substructures (\textit{i.e.}, rationales) has become essential for enhancing interpretability and generalization. Despite notable efforts, existing models often neglect the modeling of molecular pairs, leading to insufficient capture of interaction relationships. To address these limitations, we propose a novel framework, \textbf{I}nteraction-aware \textbf{I}nvariant \textbf{Mole}cular learning (I2Mole), for generalizable property prediction. I2Mole meticulously models atomic interactions such as hydrogen bonds by initially establishing indiscriminate connections between intermolecular atoms, which are subsequently refined using an improved graph information bottleneck theory tailored for merged graphs. To further enhance model generalization, we construct an environment codebook from the environment subgraphs of the merged graph. This approach not only provides a noise source for optimizing mutual information but also preserves the integrity of chemical semantic information. By comprehensively leveraging the information inherent in the merged graph, our model accurately captures core substructures and significantly enhances generalization capabilities. Extensive experimental validation demonstrates the efficacy and generalizability of I2Mole. The implementation code is available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4736. Adaptive Mamba Neural Operators
๐ฌ 21
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural operator partial differential equation adaptive Fourier decomposition Mamba
Accurately solving partial differential equations (PDEs) on arbitrary geometries and a variety of meshes is an important task in science and engineering applications. In this paper, we propose Adaptive Fourier Mamba Operators (AFMO), which integrates reproducing kernels for state-space models (SSMs) rather than the kernel integral formulation of SSMs. This is achieved by constructing Takenaka-Malmquist systems for the PDEs. AFMO offers new representations that align well with the adaptive Fourier decomposition (AFD) theory and can approximate the solution manifold of PDEs on a wide range of geometries and meshes. In several challenging benchmark PDE problems in the field of fluid physics, solid physics, and finance on point clouds, structured meshes, regular grids, and irregular domains, AFMO consistently outperforms state-of-the-art solvers in terms of relative $L^2$ error. Overall, this work presents a new paradigm for designing explainable neural operator frameworks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4737. Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation
๐ฌ 19
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ decoupled dataset distillation
Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead.
To address this limitation, recent decoupled dataset distillation methods (e.g., SRe$^2$L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose **R**ectified **D**ecoupled **D**ataset **D**istillation (RD$^3$), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD$^3$ provides a foundation for fair and reproducible comparisons in future dataset distillation research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4738. Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection
๐ฌ 21
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Concept Erasing Unlearning Robustness Text-to-Image Generation Diffusion Models
Recent advances in text-to-image (T2I) diffusion models have seen rapid and widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). However, recent causal tracing studies suggest that visual attribute information is localized in the early self-attention layers of the text encoder, indicating a potential alternative for concept erasing. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning early layers can suppress target concepts but often degrades the generation quality of non-target concepts. To overcome this limitation,
we propose High-Level Representation Misdirection (HiRM), which misdirects high-level semantic representations of target concepts in the text encoder toward designated vectors such as random directions or semantically defined directions (e.g., super-categories), while updating only early layers that contain causal states of visual attributes. Our decoupling strategy enables precise concept removal with minimal impact on unrelated concepts, as demonstrated by strong results on UnlearnCanvas and NSFW benchmarks across diverse targets (e.g., objects, styles, nudity). HiRM also preserves generative utility at low training cost and transfers effectively to state-of-the-art architectures such as Flux without additional training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4739. IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment
๐ฌ 29
โญ 4.50
๐ optimization
๐ท๏ธ Data Re-weighting LLM SFT
Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance.
Unlike many studies that focus on enhancing the quality of training datasets through data selection methods, few works explore the intricate relationship between the compositional quantity of mixture training datasets and the emergent capabilities of LLMs.
Given the availability of a high-quality multi-domain training dataset, understanding the impact of data from each domain on the model's overall capabilities is crucial for preparing SFT data and training a well-balanced model that performs effectively across diverse domains.
In this work, we introduce IDEAL, an innovative data equilibrium adaptation framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets, thereby enhancing the model's alignment and performance across multiple capabilities.
IDEAL employs a gradient-based approach to iteratively refine the training data distribution, dynamically adjusting the volumes of domain-specific data based on their impact on downstream task performance. By leveraging this adaptive mechanism, IDEAL ensures a balanced dataset composition, enabling the model to achieve robust generalization and consistent proficiency across diverse tasks.
Experiments across different capabilities demonstrate that IDEAL outperforms conventional uniform data allocation strategies, achieving a comprehensive improvement of approximately 7\% in multi-task evaluation scores.
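A hedged, simplified sketch of iterative domain re-weighting in this spirit: a multiplicative update driven by a stand-in "impact" signal, not IDEAL's gradient-based rule; the domain names and update rate are illustrative assumptions.

```python
import numpy as np

# Illustrative only: domains whose extra data appears to help downstream
# performance receive a larger share of the SFT mixture.
rng = np.random.default_rng(0)
domains = ["math", "code", "chat", "safety"]
weights = np.ones(len(domains)) / len(domains)

for step in range(5):
    # stand-in for an estimated impact on downstream performance (e.g., a gradient signal)
    impact = rng.uniform(-1, 1, size=len(domains))
    weights *= np.exp(0.5 * impact)            # multiplicative update toward helpful domains
    weights /= weights.sum()                   # keep a valid mixture
    print(dict(zip(domains, np.round(weights, 3))))
```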
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4740. AdaRank: Adaptive Rank Pruning for Enhanced Model Merging
๐ฌ 18
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Model Merging Task Arithmetic
Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on heuristically designed rank selection often leads to inter-task interference and suboptimal performance. In this paper, we propose AdaRank, a model merging framework that replaces this heuristic selection by adaptively selecting the beneficial singular components of task vectors to merge multiple models. We first show empirically that (i) selecting only the top singular components of task vectors can cause critical interference with other tasks, and (ii) assigning fixed ranks does not align with the varying complexity of tasks and layers. AdaRank addresses both issues by adapting per-component masks, indicating the selection of the component, to the unlabeled test data with entropy minimization. Our experimental findings show that AdaRank consistently improves existing merging methods across diverse backbones from different modalities, largely narrowing the performance gap against individually fine-tuned models.
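As a rough illustration of the singular-component selection being adapted here, the sketch below builds task vectors, takes their SVD, and merges only masked components; the fixed "drop the smallest half" mask is a placeholder for the masks AdaRank actually learns via entropy minimization, and all shapes are assumptions.

```python
import torch

# Illustrative only: merge two models by masking singular components of their task vectors.
torch.manual_seed(0)
d = 16
w_pre = torch.randn(d, d)                               # shared pretrained weight
w_ft = [w_pre + 0.1 * torch.randn(d, d) for _ in range(2)]  # two fine-tuned models

merged_delta = torch.zeros(d, d)
for w in w_ft:
    U, S, Vh = torch.linalg.svd(w - w_pre)              # SVD of the task vector
    mask = torch.ones_like(S)
    mask[S.argsort()[: d // 2]] = 0.0                   # placeholder mask: drop smallest half
    merged_delta += U @ torch.diag(S * mask) @ Vh       # keep only the selected components

w_merged = w_pre + merged_delta / len(w_ft)
print(w_merged.shape)
```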
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4741. Distribution-informed Online Conformal Prediction
๐ฌ 15
โญ 4.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Conformal Prediction; Uncertainty Quantification; Distribution Shift; Time Series
Conformal prediction provides a pivotal and flexible technique for uncertainty quantification by constructing prediction sets with a predefined coverage rate. Many online conformal prediction methods have been developed to address data distribution shifts in fully adversarial environments, resulting in overly conservative prediction sets. We propose Conformal Optimistic Prediction (COP), an online conformal prediction algorithm that incorporates the underlying data pattern into the update rule. Through an estimated cumulative distribution function of non-conformity scores, COP produces tighter prediction sets when a predictable pattern exists, while retaining valid coverage guarantees even when the estimates are inaccurate. We establish a joint bound on coverage and regret, which further confirms the validity of our approach. We also prove that COP achieves distribution-free, finite-sample coverage under arbitrary learning rates and can converge when scores are i.i.d. The experimental results also show that COP can achieve valid coverage and construct shorter prediction intervals than other baselines.
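For context, here is a minimal sketch of the standard online conformal update that methods in this family refine (an ACI-style learning-rate update on the miscoverage level); the synthetic score stream is an assumption, and this is not COP's distribution-informed rule.

```python
import numpy as np

# Illustrative only: adapt the working miscoverage level from online feedback.
rng = np.random.default_rng(0)
alpha, gamma = 0.1, 0.02        # target miscoverage and learning rate
alpha_t = alpha
coverage = []

for t in range(2000):
    score = abs(rng.normal())                          # non-conformity score of the new point
    past = abs(rng.normal(size=500))                   # previously observed scores
    q = np.quantile(past, 1 - min(max(alpha_t, 0.0), 1.0))  # current threshold
    err = float(score > q)                             # 1 if the prediction set missed
    alpha_t += gamma * (alpha - err)                   # grow sets after misses, shrink otherwise
    coverage.append(1 - err)

print(f"empirical coverage ~ {np.mean(coverage):.3f} (target {1 - alpha:.2f})")
```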
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4742. Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
๐ฌ 25
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Selective Prediction Vision Language Alignment AI Safety
Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free approaches of low-complexity, applicable to any foundation model and consider methods based on external vision-language model (VLM) embeddings, like CLIP. This is denoted as $\textit{Plug-and-Play Selective Prediction} (\textbf{\texttt{PaPSP}})$. We identify two key challenges: (1) $\textit{instability of the visual-language representations}$, leading to high variance in image-text embeddings, and (2) $\textit{poor calibration of similarity scores}$. To address these issues, we propose a $\textit{memory augmented}$ $\textbf{\texttt{PaPSP}}$ ($\textbf{\texttt{MA-PaPSP}}$) model, which augments $\textbf{\texttt{PaPSP}}$ with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that $\textbf{\texttt{MA-PaPSP}}$ outperforms $\textbf{\texttt{PaPSP}}$ and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Source code will be made public.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4743. Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
๐ฌ 19
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Efficient Medical segmentation multimodal learning Knowledge Transfer
Lightweight 3D medical image segmentation remains constrained by a fundamental "efficiency / robustness conflict", particularly when processing complex anatomical structures and heterogeneous modalities. In this paper, we study how to redesign the framework based on the characteristics of high-dimensional 3D images, and explore data synergy to overcome the fragile representations of lightweight methods. Our approach, VeloxSeg, begins with a deployable and extensible dual-stream CNN-Transformer architecture composed of Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC). For each 3D image, we invoke a "glance-and-focus" principle, where PWA rapidly retrieves multi-scale information and JLC ensures robust local feature extraction with minimal parameters, significantly enhancing the model's ability to operate under a low computational budget. An extension of the dual-stream architecture then incorporates modal interaction into the multi-scale image-retrieval process, allowing VeloxSeg to efficiently model heterogeneous modalities. Finally, Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior extracted by a self-supervised network into the segmentation network, yielding stronger representations than baselines at no extra inference cost. Experimental results on multimodal benchmarks show that VeloxSeg achieves a 26\% Dice improvement, increases GPU throughput by 11x and CPU throughput by 48x, and reduces peak GPU memory usage to 1/20 of the baseline during training and 1/24 during inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4744. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
๐ฌ 16
โญ 4.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Uncertainty Quantification Language Models Epistemic Uncertainty
Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential route to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.
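A hedged reading of the epistemic-uncertainty term in code, assuming each model's sampled answers have already been embedded and unit-normalized (the embedding model and the exact aggregation are assumptions):
```python
import numpy as np

def epistemic_uncertainty(per_model_embs):
    """per_model_embs: list over models; each item is an (n_samples, d) array of
    unit-normalized embeddings of that model's answers to one question (n_samples > 1).
    EU here = mean intra-model similarity minus mean inter-model similarity:
    large when each model is self-consistent but the models disagree with each other."""
    def mean_cos(A, B, drop_diag=False):
        S = A @ B.T
        if drop_diag:                              # ignore self-similarities
            S = S[~np.eye(S.shape[0], dtype=bool)]
        return float(np.mean(S))
    intra = np.mean([mean_cos(E, E, drop_diag=True) for E in per_model_embs])
    m = len(per_model_embs)
    inter = np.mean([mean_cos(per_model_embs[i], per_model_embs[j])
                     for i in range(m) for j in range(m) if i != j])
    return max(0.0, float(intra - inter))          # TU would add the AU estimate to this
```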
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4745. Procedural Mistake Detection via Action Effect Modeling
๐ฌ 14
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Mistake detection Action effect modeling Video understanding
Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the \textbf{action effect}. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4746. Supporting Multimodal Intermediate Fusion with Informatic Constraint and Distribution Coherence
๐ฌ 20
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multimodal representation learning; Generalization error; Informatic constraint; Distribution cohering
Based on the prevalent intermediate fusion (IF) and late fusion (LF) frameworks, multimodal representation learning (MML) demonstrates its superiority over unimodal representation learning. To investigate the intrinsic factors underlying the empirical success of MML, research grounded in theoretical justifications from the perspective of generalization error has emerged. However, these provable MML studies derive the theoretical findings based on LF, while theoretical exploration based on IF remains scarce. This naturally gives rise to a question: **Can we design a comprehensive MML approach supported by the sufficient theoretical analysis across fusion types?** To this end, we revisit the IF and LF paradigms from a fine-grained dimensional perspective. The derived theoretical evidence sufficiently establishes the superiority of IF over LF under a specific constraint. Based on a general $K$-Lipschitz continuity assumption, we derive the generalization error upper bound of the IF-based methods, indicating that eliminating the distribution incoherence can improve the generalizability of IF-based MML methods. Building upon these theoretical insights, we establish a novel IF-based MML method, which introduces the informatic constraint and performs distribution cohering. Extensive experimental results on multiple widely adopted datasets verify the effectiveness of the proposed method.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4747. AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
๐ฌ 22
โญ 4.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Long context training Sequence Parallelism
Large language models (LLMs) demonstrate enormous utility in long-context tasks that require processing prompts consisting of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy-to-use abstractions for long-context training, focusing instead on optimizations for models with large parameter counts such as ZeRO-3/FSDP and tensor and pipeline parallelism. This forces users to rewrite LLM training libraries to incorporate compositions of complex long-context optimizations, such as sequence parallelism, into their training pipelines, a process that requires in-depth expertise and reduces developer productivity. To tackle these challenges, we introduce AutoSP, the first automated solution for optimizing LLM training for longer contexts. AutoSP compiles models and applies a targeted set of optimizations, automated sequence parallelism and long-context-aware activation checkpointing, to drastically enhance LLM trainability at negligible cost to throughput. Our evaluation demonstrates AutoSP's capability on both NVIDIA and AMD hardware, increasing trainable context lengths by up to 2.7$\times$ and 2.5$\times$ respectively at negligible cost to runtime performance over competitive hand-written baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4748. Personalized Reasoning: Just-in-time Personalization and Why LLMs Fail at It
๐ฌ 17
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Interactive Personalization Test-time Reasoning Information Seeking Preference Alignment
Current large language model (LLM) development treats task-solving and preference-alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user's needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly, a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4749. Scaling Sequence-to-Sequence Generative Neural Rendering
๐ฌ 27
โญ 4.50
๐ generative models
๐ท๏ธ 3D vision Novel View Synthesis Generative Neural Rendering
We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido is driven by the principle of treating 3D as a specialised sub-domain of video, which we formulate purely as a sequence-to-sequence image synthesis task. Through a systematic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets --- all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings. For supplementary materials, including Kaleido's generated renderings and videos, please refer to our anonymous website: https://kaleido-research.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4750. Independence Test for Linear Non-Gaussian Data and Applications in Causal Discovery
๐ฌ 17
โญ 4.50
๐ causal reasoning
๐ท๏ธ independence tests Causal discovery
Independence testing involves determining whether two variables are independent based on observed samples, which is a fundamental problem in statistics and machine learning. Existing testing methods, such as HSIC, can theoretically detect broad forms of dependence, but may sacrifice statistical power when applied to limited samples with background knowledge of the distribution. In this paper, we focus on the linear non-Gaussian data, a widely supported model in scientific data analysis and causal discovery, where variables are linked linearly with noise terms that are non-Gaussian distributed. We provide a new theoretical characterization of independence in this case, showing that constancy of the conditional mean and variance is sufficient to guarantee independence under linear non-Gaussian models.
Building on this result, we develop a kernel-based testing framework with provable asymptotic guarantees. Extensive experiments on synthetic and real-world datasets demonstrate that our method achieves higher power than existing approaches and significantly improves downstream causal discovery performance.
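A crude, hedged illustration of the characterization above (not the authors' kernel test): under a linear non-Gaussian model, independence amounts to the conditional mean and variance of one variable being constant in the other, which can be checked approximately by binning.
```python
import numpy as np
from scipy import stats

def conditional_moment_constancy(x, y, n_bins=10):
    """Bin x by quantiles and test whether the per-bin mean and variance of y are
    constant (ANOVA for means, Levene for variances). Small p-values suggest
    dependence under the linear non-Gaussian assumption."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    groups = [y[idx == b] for b in range(n_bins) if np.sum(idx == b) > 1]
    _, p_mean = stats.f_oneway(*groups)   # constant conditional mean?
    _, p_var = stats.levene(*groups)      # constant conditional variance?
    return p_mean, p_var
```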
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4751. LayerSync: Self-aligning Intermediate Layers
๐ฌ 18
โญ 4.50
๐ generative models
๐ท๏ธ Diffusion models Self distillation
We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularization term that adds no overhead to diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires neither pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of a flow-based transformer by over 8.75$\times$ on the ImageNet dataset and improve the generation quality by 23.6\%.
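A hedged sketch of the self-alignment regularizer described above, assuming per-token features of matching dimension from a semantically rich layer and a weaker layer (a projection head, if dimensions differ, is omitted):
```python
import torch
import torch.nn.functional as F

def layersync_loss(rich_feats, weak_feats):
    """rich_feats, weak_feats: (batch, tokens, dim) intermediate activations.
    Pull the weaker layer toward the richer one via negative cosine similarity,
    stopping gradients through the richer (target) layer."""
    target = rich_feats.detach()
    return 1.0 - F.cosine_similarity(weak_feats, target, dim=-1).mean()

# total_loss = diffusion_loss + lambda_sync * layersync_loss(h_rich, h_weak)
```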
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4752. Scalable In-Context Q-Learning
๐ฌ 14
โญ 4.50
๐ reinforcement learning
๐ท๏ธ In-context reinforcement learning Q-learning advantage-weighted regression world model
Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Because decision-making involves more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In this paper, we propose **S**calable **I**n-**C**ontext **Q**-**L**earning (**S-ICQL**), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper-expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4753. Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
๐ฌ 22
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ LLM Formal Language benchmark
Despite rapid advances in code generation, current Large Language Models (LLMs) still lack an essential capability for reliable and verifiable code generation: compositional reasoning across multi-function programs. To explore this potential and important gap, we introduce DafnyCOMP, a benchmark designed to systematically evaluate LLMs on the generation of compositional specifications in Dafny. Unlike prior benchmarks that primarily target single-function annotation, DafnyCOMP focuses on programs composed of multiple interacting functions with necessary data dependencies, requiring LLMs to produce specifications that ensure correctness across component boundaries. Our benchmark comprises 300 automatically synthesized programs, each carefully constructed by combining 2-5 originally independent functions in a chain-based manner through LLM-driven synthesis. We evaluate LLMs from five leading research groups that represent the current frontier of reasoning-centric AI, including the GPT, Claude, Gemini, DeepSeek, and Qwen families. Our results reveal a striking dichotomy: while LLMs achieve both high syntax correctness (>99%) and moderate verification rates (>58%) in prior single-function benchmarks, they exhibit degraded syntax correctness (95.67%) and a catastrophic verification failure (3.69%) in DafnyCOMP's compositional tasks, a 92% performance gap. Even the most powerful LLM achieves only 7% verification at Pass@8, with most LLMs below 2%. Further analysis reveals that LLMs systematically fail at cross-functional reasoning through three primary failure modes: specification fragility (39.2%), implementation-proof misalignment (21.7%), and reasoning instability (14.1%). These failures clearly reveal the absence of compositional reasoning capabilities in current LLMs. DafnyCOMP thus establishes a diagnostic benchmark for tracking progress in verifiable code generation with LLMs, highlighting that the path from local to compositional verification remains largely uncharted.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4754. Multimodal Dataset Distillation via Phased Teacher Models
๐ฌ 24
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Dataset Distillation multimodal learning
Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST), a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. The code and models will be made publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4755. Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation
๐ฌ 12
โญ 4.50
๐ optimization
๐ท๏ธ Sharpness-Aware Minimization Optimization Generalization
Sharpness-Aware Minimization (SAM) enhances generalization by minimizing the maximum training loss within a predefined neighborhood around the parameters. However, its practical implementation approximates this as gradient ascent(s) followed by applying the gradient at the ascent point to update the current parameters. This practice can be justified as approximately optimizing the objective by neglecting the (full) derivative of the ascent point with respect to the current parameters. Nevertheless, a direct and intuitive understanding of why using the gradient at the ascent point to update the current parameters works superiorly, despite being computed at a shifted location, is still lacking. Our work bridges this gap by proposing a novel and intuitive interpretation. We show that the gradient at the single-step ascent point, when applied to the current parameters, provides a better approximation of the direction from the current parameters toward the maximum within the local neighborhood than the local gradient. This improved approximation thereby enables a more direct escape from the maximum within the local neighborhood. Nevertheless, our analysis further reveals two issues. First, the approximation by the gradient at the single-step ascent point is often inaccurate. Second, the approximation quality may degrade as the number of ascent steps increases. To address these limitations, we propose in this paper eXplicit Sharpness-Aware Minimization (XSAM). It tackles the first by explicitly estimating the direction of the maximum during training, and addresses the second by crafting a search space that effectively leverages the gradient information at the multi-step ascent point. XSAM features a unified formulation that applies to both single-step and multi-step settings and only incurs negligible computational overhead. Extensive experiments demonstrate the consistent superiority of XSAM against existing counterparts across various models, datasets, and settings.
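For context, a standard single-ascent-step SAM update, the scheme the abstract analyzes (XSAM's modifications are not reproduced here); this generic sketch assumes every trainable parameter receives a gradient.
```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """(1) ascend to w + rho * g/||g||, (2) take the gradient there,
    (3) apply that gradient back at the original weights w."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss_fn(model, batch).backward()
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                          # move to the ascent point
    optimizer.zero_grad()
    loss_fn(model, batch).backward()           # gradient at the ascent point
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                          # restore the original weights
    optimizer.step()                           # update w with the shifted gradient
    optimizer.zero_grad()
```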
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4756. A Rich Knowledge Space for Scalable Deepfake Detection
๐ฌ 19
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Deepfake Detection Media Forensics Multi-modal Learning
The proliferation of realistic deepfakes has driven the development of numerous benchmark datasets to support detection research. Despite their increasing volume and diversity, no prior effort has systematically consolidated these resources into a unified framework for large-scale model training, nor has there been a massively pre-trained model tailored to deepfake detection. In this work, we introduce MMI-DD (Multi-modal Multi-type Integrated Deepfake Dataset), a large-scale resource containing 3.6 million facial images, the largest collection to date. It unifies diverse benchmarks with uniform preprocessing, and further provides fine-grained annotations across four deepfake types, as well as VLM-generated descriptions capturing both facial and environmental attributes for each image. By leveraging this comprehensive multi-modal dataset, we construct a foundational deepfake knowledge space that empowers our model to discern a broad spectrum of synthetic media. Our method, SD$^2$ (Scalable Deepfake Detection), refines CLIP for deepfake detection, optimizing image-text classification with rich, type-specific labels. We enhance this with intermediate visual features capturing low-level cues and a text-label separation loss for stability. We further leverage VLM-generated descriptions and contrastive learning to expand the scope of forgery knowledge, reducing overfitting and enhancing generalization. Extensive experiments on challenging deepfake datasets and an AIGC benchmark demonstrate the effectiveness, scalability, and real-world applicability of our approach. Our dataset and code will be available at https://anonymous.4open.science/r/SDD/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4757. GPS: Directed Acyclic Graph guided Proactive Information Seeking in Large Language Models
๐ฌ 15
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Retrieval-augmented generation Reinforcement learning Clarification questions
Equipping Large Language Models (LLMs) with the ability to proactively ask clarifying questions is essential to mitigate ambiguity when faced with underspecified user queries in retrieval-augmented generation (RAG) systems. However, existing methods often neglect the rule-based reasoning structures embedded in the retrieved knowledge that are central to ambiguity, making it challenging to learn an effective and efficient question-asking strategy. To address these issues, we introduce \textbf{GPS}, a two-stage framework for enhancing proactive information seeking abilities of LLMs in RAG systems. In the reasoning stage, we propose a Directed Acyclic Graph (DAG) reasoning structure with theoretical guarantees of logical completeness, which facilitates capturing all condition logic in the retrieved knowledge and supports effective clarification. In the clarification stage, we design a traversal-based algorithm that dynamically prunes the DAG based on user responses, enabling efficient clarification. To further enhance DAG construction, we first propose a data synthesis method to address data scarcity challenge, then we apply a clarification-oriented reinforcement learning method with a hybrid reward that jointly considers effectiveness and efficiency to optimize the LLM. Experiments on three benchmarks demonstrate that \textbf{GPS} significantly outperforms baseline methods in both answer accuracy and interaction cost.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4758. TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks
๐ฌ 23
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Model routing Reinforcement Learning Partially Observable MDPs (POMDPs)
Multi-step reasoning tasks like mathematical problem solving are vulnerable to cascading failures where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted Routing in Multi-step reasoning tasks), which routes only critical steps to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at step-level granularity using process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. We develop four routing strategies: a simple thresholding policy, two RL-trained policies (one using full sequential features, another using aggregated statistics), and a POMDP-based approach that handles uncertainty in step-level correctness estimates. On MATH-500, the thresholding policy already surpasses contemporary routing methods with $6.51$x higher cost efficiency, while RL-trained and POMDP-based policies match the strong, expensive model's performance using $80$% fewer expensive model tokens. All methods generalize effectively across mathematical reasoning datasets, demonstrating that step-level difficulty represents fundamental characteristics of multi-step reasoning.
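A hedged sketch of the simplest policy described above, the step-level thresholding router; small_lm, large_lm, and prm_score are assumed callables standing in for the small model, the large model, and a process reward model, not the paper's actual interfaces.
```python
def route_steps(problem, small_lm, large_lm, prm_score, tau=0.5, max_steps=20):
    """Draft each reasoning step with the small model; regenerate it with the
    large model only if the process-reward score falls below the threshold."""
    steps = []
    for _ in range(max_steps):
        candidate = small_lm(problem, steps)           # cheap draft of the next step
        if prm_score(problem, steps, candidate) < tau:
            candidate = large_lm(problem, steps)       # escalate only risky steps
        steps.append(candidate)
        if candidate.strip().startswith("Final answer"):
            break
    return steps
```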
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4759. M$^3$E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts
๐ฌ 17
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Vision-and-Language Navigation Continual Learning Embodied AI Large Language Models
Vision-and-Language Navigation (VLN) agents have shown strong capabilities in following natural language instructions. However, they often struggle to generalize across environments due to catastrophic forgetting, which limits their practical use in real-world settings where agents must continually adapt to new domains. We argue that overcoming forgetting across environments hinges on decoupling global scene reasoning from local perceptual alignment, allowing the agent to adapt to new domains while preserving specialized capabilities.
To this end, we propose M$^3$E, the Mixture of Macro and Micro Experts, an environment-aware hierarchical MoE framework for continual VLN. Our method introduces a dual-router architecture that separates navigation into two levels of reasoning. A macro-level, scene-aware router selects strategy experts based on global environmental features (e.g., office vs. residential), while a micro-level, instance-aware router activates perception experts based on local instruction-vision alignment for step-wise decision making. To preserve knowledge across domains, we adopt a dynamic momentum update strategy that identifies expert utility in new environments and selectively updates or freezes their parameters.
We evaluate M$^3$E in a domain-incremental setting on the R2R and REVERIE datasets, where agents learn across unseen scenes without revisiting prior data. Results show that our method consistently outperforms standard fine-tuning and existing continual learning baselines in both adaptability and knowledge retention, offering a parameter-efficient solution for building generalizable embodied agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4760. Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
๐ฌ 19
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reinforcement Learning LLM Post-Training Off-Policy RL GRPO
Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for *group-relative REINFORCE* without assuming a specific training data distribution, showing that it admits a *native off-policy interpretation*. This perspective yields two general principles for adapting REINFORCE to truly off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms --- Online Policy Mirror Descent and Asymmetric REINFORCE --- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs.
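For reference, the group-relative baseline that the abstract reinterprets; this is the standard GRPO-style construction, while the paper's off-policy variants add regularization and data weighting on top of it.
```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scalar rewards of the responses sampled for one prompt (the group).
    Advantages are centered and scaled by the group's own statistics."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# REINFORCE-style loss for one group: -(advantages * response_logprobs).mean()
```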
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4761. General Exploratory Bonus for Optimistic Exploration in RLHF
๐ฌ 20
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ RLHF optimistic exploration
Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods often fail to realize true optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4762. Alignment-Weighted DPO: A principled reasoning approach to improve alignment
๐ฌ 19
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Reasoning LLM alignment DPO
Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce **Alignment-Weighted DPO**, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4763. From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper
๐ฌ 24
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D pose estimation; Spatiotemporal; Sparse Interleaved Input; Epipolar Geometry;
In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict a pose at a specific moment. While providing accurate spatial information, this traditional approach often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel input scheme for 3D human pose estimation: sparse interleaved input. This scheme uses images captured from different camera views at different time points (e.g., View 1 at time $t$ and View 2 at time $t+\delta$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, the approach offers two key advantages. First, it can theoretically increase the output pose frame rate by N times with N cameras, breaking through single-view frame-rate limitations and enhancing the temporal resolution of the output. Second, by using a sparse subset of available frames, our method reduces data redundancy while achieving better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, utilizing only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4764. AIRE-Prune: Asymptotic Impulse-Response Energy for State Pruning in State Space Models
๐ฌ 20
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ State Space Models Pruning Impulse Response S5 State Pruning
State space models (SSMs) often sacrifice capacity, search space, or stability to offset the memory and compute costs of large state dimensions. We introduce a structured post-training pruning method for SSMs, AIRE-Prune (Asymptotic Impulse-Response Energy for State PRUN(E)ing), that reduces each layer's state dimension by directly minimizing long-run output-energy distortion. AIRE-Prune assigns every state a closed-form asymptotic impulse-response energy score, i.e., the total impulse-response energy it contributes over an infinite horizon, and normalizes these scores layer-wise to enable global cross-layer comparison and selection. This extends modal truncation from single systems to deep stacks and aligns pruning with asymptotic response energy rather than worst-case gain. Across diverse sequence benchmarks, AIRE-Prune reveals substantial redundancy in SISO and MIMO SSMs, pruning 60.8% of states on average with an average accuracy drop of 0.29%, without retraining, while significantly lowering compute.
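A hedged illustration of the score for the simplest case, a diagonal single-input single-output SSM layer with stable states (|a_i| < 1); the paper's formulation covers general MIMO layers and deep stacks.
```python
import numpy as np

def aire_scores(a, b, c):
    """State i contributes the impulse response h_i(t) = c_i * a_i**t * b_i, so its
    total output energy over an infinite horizon is |c_i b_i|^2 / (1 - |a_i|^2).
    Scores are normalized layer-wise for cross-layer comparison."""
    a, b, c = map(np.asarray, (a, b, c))
    energy = np.abs(c * b) ** 2 / (1.0 - np.abs(a) ** 2)
    return energy / energy.sum()

def prune_states(a, b, c, keep_ratio=0.4):
    keep = np.argsort(aire_scores(a, b, c))[::-1][: max(1, int(keep_ratio * len(a)))]
    return a[keep], b[keep], c[keep]
```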
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4765. Training-Free Reward-Guided Image Editing via Trajectory Optimal Control
๐ฌ 12
โญ 4.50
๐ generative models
๐ท๏ธ reward-guided editing diffusion models optimal control
Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward-guided guidance, which steers the generation process during inference to align with specific objectives. However, leveraging this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored.
In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4766. Latent Geometry-Driven Network Automata for Complex Network Dismantling
๐ฌ 32
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ network robustness network dismantling network geometry network science complex systems network automata graphs network topology
Complex networks model the structure and function of critical technological, biological, and communication systems. Network dismantling, the targeted removal of nodes to fragment a network, is essential for analyzing and improving system robustness. Existing dismantling methods suffer from key limitations: they depend on global structural knowledge, exhibit slow running times on large networks, and overlook the network's latent geometry, a key feature known to govern the dynamics of complex systems. Motivated by these findings, we introduce Latent Geometry-Driven Network Automata (LGD-NA), a novel framework that leverages local network automata rules to approximate effective link distances between interacting nodes. LGD-NA is able to identify critical nodes and capture latent manifold information of a network for effective and efficient dismantling. We show that this latent geometry-driven approach outperforms all existing dismantling algorithms, including spectral Laplacian-based methods and machine-learning approaches such as graph neural networks. We also find that a simple common-neighbor-based network automata rule achieves near state-of-the-art performance, highlighting the effectiveness of minimal local information for dismantling. LGD-NA is extensively validated on the largest and most diverse collection of real-world networks to date (1,475 real-world networks across 32 complex systems domains) and scales efficiently to large networks via GPU acceleration. Finally, we leverage the explainability of our common-neighbor approach to engineer network robustness, substantially increasing the resilience of real-world networks. We validate LGD-NA's practical utility on domain-specific functional metrics, spanning neuronal firing rates in the Drosophila Connectome, transport efficiency in flight maps, outbreak sizes in contact networks, and communication pathways in terrorist cells. Our results confirm latent geometry as a fundamental principle for understanding the robustness of real-world systems, adding dismantling to the growing set of processes that network geometry can explain.
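A toy, hedged rendering of the common-neighbor flavor of the local rule (the actual LGD-NA framework maps such rules to effective link distances and latent geometry; this only conveys the idea of dismantling with purely local information on an undirected graph):
```python
import networkx as nx

def common_neighbor_dismantling(G, removals=50):
    """Repeatedly remove the node whose incident edges carry the largest total
    number of common neighbors, tracking the largest connected component size."""
    G = G.copy()
    lcc_sizes = []
    for _ in range(min(removals, G.number_of_nodes() - 1)):
        scores = {u: sum(len(list(nx.common_neighbors(G, u, v))) for v in G[u])
                  for u in G.nodes}
        G.remove_node(max(scores, key=scores.get))
        lcc_sizes.append(len(max(nx.connected_components(G), key=len)))
    return lcc_sizes
```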
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4767. SkillFactory: Self-Distillation for Learning Cognitive Behaviors
๐ฌ 9
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reasoning Reinforcement Learning
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4768. d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
๐ฌ 26
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion-based large language models Key-value caching Inference acceleration
Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce \textit{Dual aDaptive Cache} (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The anonymous evaluation codes are available at \url{https://anonymous.4open.science/r/d2Cache-5538}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4769. **TandemFoilSet**: Datasets for Flow Field Prediction of Tandem-Airfoil Through the Reuse of Single Airfoils
๐ฌ 32
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Physics-informed Graph Neural Network; Tandem-Airfoil; Flow Field Prediction; CFD; Aerodynamics;
Accurate simulation of flow fields around tandem geometries is critical for engineering design but remains computationally intensive. Existing machine learning approaches typically focus on simpler cases and lack evaluation on multi-body configurations. To support research in this area, we present **TandemFoilSet**: five tandem-airfoil datasets (4152 tandem-airfoil simulations) paired with four single-airfoil counterparts, for a total of 8104 CFD simulations. We provide benchmark results of a curriculum learning framework using a directional integrated distance representation, residual pre-training, training schemes based on freestream conditions and smooth-combined estimated fields, and a domain decomposition strategy. Evaluations demonstrate notable gains in prediction accuracy. We believe these datasets will enable future work on scalable, data-driven flow prediction for tandem-airfoil scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4770. NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models
๐ฌ 15
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Watermarking Diffusion Models Generative AI
With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose $\text{\emph{NoisePrints}}$, a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.
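A hedged sketch of the seed-as-watermark idea: derive the initial noise from a hashed seed and verify authorship by correlating that noise with the output. The raw-pixel correlation and the fixed threshold are placeholders, not the paper's calibrated procedure.
```python
import hashlib
import numpy as np

def seed_to_noise(seed: bytes, shape):
    """Hash the seed before seeding the RNG, so that recovering a valid seed
    from the generated content is infeasible."""
    digest = hashlib.sha256(seed).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "little"))
    return rng.standard_normal(shape)

def verify(seed: bytes, output_image: np.ndarray, threshold=0.1):
    # Recompute the claimed seed's noise and test its correlation with the output.
    noise = seed_to_noise(seed, output_image.shape).ravel()
    content = (output_image - output_image.mean()).ravel()
    corr = float(np.corrcoef(noise, content)[0, 1])
    return corr > threshold, corr
```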
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4771. SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting
๐ฌ 19
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Gaussian splatting Relighting Novel View Synthesis Light Decomposition Subsurface Scattering Physically-Based Rendering
We present SSD-GS, a physically-based relighting framework built upon 3D Gaussian Splatting (3DGS) that achieves high-quality reconstruction and photorealistic relighting under novel lighting conditions. In physically-based relighting, accurately modeling light-material interactions is essential for faithful appearance reproduction. However, existing 3DGS-based relighting methods adopt coarse shading decompositions, either modeling only diffuse and specular reflections or relying on neural networks to approximate shadows and scattering. This leads to limited fidelity and poor physical interpretability, particularly for anisotropic metals and translucent materials. To address these limitations, SSD-GS decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. We introduce a learnable dipole-based scattering module for subsurface transport, an occlusion-aware shadow formulation that integrates visibility estimates with a refinement network, and an enhanced specular component with an anisotropic Fresnel-based model. Through progressive integration of all components during training, SSD-GS effectively disentangles lighting and material properties, even for unseen illumination conditions, as demonstrated on the challenging OLAT dataset. Experiments demonstrate superior quantitative and perceptual relighting quality compared to prior methods and pave the way for downstream tasks including controllable light source editing and interactive scene relighting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4772. PlantRSR: A New Plant Dataset and Method for Reference-based Super-Resolution
๐ฌ 29
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Plant dataset Image Super-Resolution Reference-based Super-Resolution
Single image super-resolution (SISR) often struggles to reconstruct high-resolution (HR) details from heavily degraded low-resolution (LR) inputs. Instead, reference-based super-resolution (RefSR) methods offer an alternative solution to generate promising results using high-quality reference (Ref) images to guide reconstruction. However, existing RefSR datasets focus on limited scene types, primarily featuring
human activities and architectural scenes. Plant scenes exhibit complex textures and fine details, essential for advancing RefSR in natural and highly detailed scenes. To this end, we meticulously captured and manually selected high-quality images containing rich textures to construct a large-scale plant dataset, PlantRSR, comprising 16,585 HR-Ref pairs. The dataset captures the complexity and variability of plant scenes through extensive variations. In addition, we propose a novel RefSR method specifically designed to tackle the distinct challenges posed by plant imagery. It incorporates a Selective Key-Region Matching (SKRM) that selectively identifies and performs matching between LR and Ref images, focusing on distinctive botanical textures to improve matching efficiency. Additionally, a Texture-Guided Diffusion Module (TGDM) is proposed to refine LR textures by leveraging a diffusion process conditioned on the matched Ref textures. TGDM is effective in modeling irregular and fine textures, thereby facilitating more accurate SR results. The proposed method achieves significant improvements over state-of-the-art (SOTA) approaches on our PlantRSR dataset and other benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4773. TTS Can Speak in Any Style with Any Voice
๐ฌ 15
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Instruction TTS Controllable Speech Synthesis Decoupling Reinforcement Learning
This study proposes FlexiVoice, a text-to-speech (TTS) synthesis system capable of flexible style control with zero-shot voice cloning. The speaking style is controlled by a natural-language instruction and the voice timbre is provided by a speech reference in zero-shot manner. FlexiVoice is built with an LLM core, which takes text as input, and also takes an optional natural language instruction and an optional speech reference to control style and timbre, respectively. FlexiVoice is equipped with a novel Progressive Post-Training (PPT) scheme that progressively unlocks accurate and flexible controllability. In particular, it first employs Direct Preference Optimization (DPO) to enable FlexiVoice to accurately follow both natural language instruction and speech reference simultaneously. It then uses a multi-objective Group Relative Policy Optimization (GRPO) to disentangle style instruction, reference timbre, and textual content. Finally, it adapts instruction GRPO for more advanced instruction following. Experimental results show that FlexiVoice surpasses competing baselines and demonstrates strong capability in decoupling control factors. Human evaluations further confirm its naturalness, controllability, and robustness.
Audio samples are available at https://flexi-voice.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4774. GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL
๐ฌ 22
โญ 4.50
๐ reinforcement learning
๐ท๏ธ offline safe reinforcement learning constrained decision transformer improving stitching ability
Offline Safe Reinforcement Learning (OSRL) aims to learn a policy that achieves high performance in sequential decision-making while satisfying safety constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process, where GMs generate desirable actions conditioned on predefined reward and cost return-to-go values. However, GM-assisted methods face two major challenges in constrained settings: (1) they lack the ability to ``stitch'' optimal transitions from suboptimal trajectories within the dataset, and (2) they struggle to balance reward maximization with constraint satisfaction, particularly when tested with imbalanced human-specified reward-cost conditions. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward maximization and constraint satisfaction. To enhance the stitching ability, GAS first augments and relabels the dataset at the transition level, enabling the construction of high-quality trajectories from suboptimal ones. GAS also introduces novel goal functions, which estimate the optimal achievable reward and cost goals from the dataset. These goal functions, trained using expectile regression on the relabeled and augmented dataset, allow GAS to accommodate a broader range of reward-cost return pairs and achieve a better tradeoff between reward maximization and constraint satisfaction compared to human-specified values. The estimated goals then guide policy training, ensuring robust performance under constrained settings. Furthermore, to improve training stability and efficiency, we reshape the dataset to achieve a more uniform reward-cost return distribution. Empirical results validate the effectiveness of GAS, demonstrating superior performance in balancing reward maximization and constraint satisfaction compared to existing methods.
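One concrete ingredient from the description above is the expectile regression used to fit the goal functions; a minimal sketch of that loss (the surrounding GAS training loop is not reproduced):
```python
import torch

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared error: for tau > 0.5 the fit is biased toward the
    upper tail, estimating optimistically achievable reward/cost goals."""
    diff = target - pred
    weight = torch.where(diff > 0, torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()
```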
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4775. Composition-Grounded Instruction Synthesis for Visual Reasoning
๐ฌ 21
โญ 4.50
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Compositional data synthesis compositional generalization visual reasoning chart understanding Visual language model
Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning *factors*, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages. We release the code and data at https://cogsynthesis.github.io.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4776. Uncertainty Estimation via Hyperspherical Confidence Mapping
๐ฌ 14
โญ 4.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ uncertainty estimation Out-of-distribution (OOD) detection calibration
Quantifying uncertainty in neural network predictions is essential for deploying models in high-stakes domains such as autonomous driving, healthcare, and manufacturing. While conventional approaches often depend on costly sampling or parametric distributional assumptions, we propose Hyperspherical Confidence Mapping (HCM), a simple yet principled framework for uncertainty estimation that is both sampling-free and distribution-free. HCM decomposes model outputs into a magnitude and a normalized direction vector constrained to lie on a unit hypersphere, enabling a novel interpretation of uncertainty as the degree of violation of a geometric constraint. Grounded in this geometric constraint formulation, our method provides deterministic and interpretable uncertainty estimates applicable to both regression and classification. We validate the effectiveness of HCM across diverse benchmarks and real-world industrial tasks, demonstrating competitive or superior performance to ensemble and evidential approaches, while significantly reducing inference cost and ensuring strong confidence–error alignment. Our results highlight the value of geometric structure in uncertainty estimation and position HCM as a versatile alternative to conventional techniques.
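As a rough illustration of the magnitude/direction split described above, here is a minimal NumPy sketch. The exact violation score HCM uses is not specified in the abstract, so the uncertainty below (distance of the direction from the nearest unit-norm class prototype) is an assumption for illustration only.

```python
import numpy as np

def hyperspherical_decompose(logits):
    """Split a raw output vector into a magnitude and a unit direction."""
    magnitude = np.linalg.norm(logits)
    direction = logits / (magnitude + 1e-12)   # point on the unit hypersphere
    return magnitude, direction

def geometric_uncertainty(direction, prototypes):
    """Illustrative uncertainty: 1 - max cosine similarity to any unit-norm
    class prototype (0 = direction perfectly satisfies the constraint)."""
    sims = prototypes @ direction              # cosine similarities
    return 1.0 - float(sims.max())

# Toy usage with 3 hypothetical class prototypes in a 4-d output space.
protos = np.eye(4)[:3]
mag, direc = hyperspherical_decompose(np.array([2.0, 0.3, 0.1, 0.05]))
print(mag, geometric_uncertainty(direc, protos))
```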
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4777. TSPulse: Tiny Pre-Trained Models with Disentangled Representations for Rapid Time-Series Analysis
๐ฌ 29
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ time series foundation models pretrained models time series foundation models TSFM
Different time-series tasks benefit from distinct cues at various spaces and abstractions, yet existing time-series pre-trained models entangle these signals within large, monolithic embeddings, limiting transferability and zero-shot usability. Moreover, massive model sizes demand heavy compute, restricting practical deployments and real-time applications. To address this, we propose TSPulse, an ultra-light pre-trained model (1M parameters) that performs disentangled masked reconstruction across spaces and abstraction levels, explicitly learning three disentangled views: temporal embeddings for fine-grained time analysis, spectral embeddings for frequency-aware fidelity, and semantic embeddings for high-level task understanding. A hybrid masking scheme further randomizes mask style and span length to avoid pre-training bias and improve robustness. Despite its compact size, TSPulse achieves strong gains across four time-series tasks: +20% and rank-1 on the TSB-AD leaderboard benchmark for reliable anomaly detection through multi-head triangulation, which correlates complementary cues across disentangled views; +25% in similarity search as the disentangled semantic embeddings remain invariant to time, scale and noise shifts, making retrieval more robust; +50% improvement in imputation since hybrid masking exposes the model to diverse real-world corruption patterns; and +5–16% gains in multivariate classification with TSLens, a lightweight module that selectively attends to the most informative signals across variates. Overall, TSPulse outperforms models that are 10–100× larger on 75+ datasets across tasks, while delivering state-of-the-art zero-shot results with GPU-free support and efficient fine-tuning. Models and source code will be open-sourced and currently shared in the supplementary material.
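The hybrid masking idea (randomizing mask style and span length) can be pictured with a small NumPy sketch; the two styles, the mask ratio, and the span-length range here are assumptions, not the paper's settings.

```python
import numpy as np

def hybrid_mask(length, mask_ratio=0.3, rng=None):
    """Randomly pick a mask style (scattered points vs. contiguous spans)
    and, for spans, a random span length, then mask ~mask_ratio of the series."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(length, dtype=bool)
    n_mask = int(length * mask_ratio)
    if rng.random() < 0.5:                        # style 1: scattered points
        mask[rng.choice(length, n_mask, replace=False)] = True
    else:                                         # style 2: contiguous spans
        while mask.sum() < n_mask:
            span = rng.integers(2, max(3, length // 8))
            start = rng.integers(0, length - span)
            mask[start:start + span] = True
    return mask

# Toy usage on a 96-step series.
print(hybrid_mask(96).sum())
```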
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4778. ReIn: Conversational Error Recovery with Reasoning Inception
๐ฌ 23
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Conversational AI Agent Reasoning Task-oriented Dialogue
Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose **Reasoning Inception (ReIn)**, a test-time intervention method that *plants* an initial reasoning into the agent's decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user's ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4779. Language Models Use Lookbacks to Track Beliefs
๐ฌ 18
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Belief Tracking Theory of Mind
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4780. VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
๐ฌ 23
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Discriminator MLLM
With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4781. WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning
๐ฌ 25
โญ 4.50
๐ generative models
๐ท๏ธ diffusion language models dynamic decoding
Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models.
The denoising strategy plays an important role in determining the quality of their outputs.
Mainstream denoising strategies include Standard Diffusion and BlockDiffusion.
Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions.
BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning.
We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions.
This adaptive process follows the natural flow of semantic structure while keeping the computational cost equal to that of block-based methods.
Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.
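A toy sketch of what a "wavefront" schedule could look like: at each step, the positions activated for denoising are those closest to already-finalized tokens. This is an illustrative reconstruction from the abstract, not the paper's actual scheduling rule.

```python
def wavefront_schedule(finalized, seq_len, width=4):
    """Return the `width` not-yet-finalized positions nearest to any finalized
    position, so the active set expands outward instead of following fixed blocks."""
    if not finalized:
        return list(range(min(width, seq_len)))          # bootstrap from the left
    dist = [min(abs(i - f) for f in finalized) if i not in finalized else -1
            for i in range(seq_len)]
    candidates = [i for i in range(seq_len) if dist[i] > 0]
    candidates.sort(key=lambda i: dist[i])
    return candidates[:width]

# Toy usage: positions 10 and 11 are finalized; the wavefront picks their neighbors.
print(wavefront_schedule(finalized={10, 11}, seq_len=32, width=4))  # [9, 12, 8, 13]
```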
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4782. The State of Reinforcement Finetuning for Transformer-based Generative Agents
๐ฌ 29
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Finetuning Meta-RL Generative Agents
Reinforcement finetuning (RFT) has garnered significant attention in recent years, particularly for enhancing large reasoning models such as OpenAI o1 and Deepseek R1. The appeal of RFT largely stems from its ability to refine model knowledge, better align outputs with user intent, and address challenges associated with limited finetuning data. Despite these advantages, the application of RFT in large Transformer-based generative agents remains relatively underexplored. Although these agents are designed to address multiple tasks through large-scale autoregressive pretraining and share many properties with large reasoning models, current adaptation strategies predominantly rely on supervised finetuning (SFT). In this work, we conduct a systematic investigation of several RFT techniques across a variety of finetuning parameter configurations and meta-reinforcement learning (meta-RL) environments, employing few-shot offline datasets. We provide a comprehensive analysis of RFT algorithm performance under diverse experimental conditions and, based on our empirical findings, introduce a lightweight enhancement to existing RFT methods. This enhancement consistently improves outcomes by combining the strengths of both SFT and RFT. Our findings provide valuable insights for advancing the effectiveness of RFT approaches and broadening their applicability to meta-RL tasks with large Transformer-based generative agents, motivating further research in broader domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4783. Exploring Cross-Modal Flows for Few-Shot Learning
๐ฌ 18
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Representation Learning; Generative Model; Few-shot Learning;
Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment and are insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have shown that FMA can consistently yield significant performance gains across various benchmarks and backbones, especially on difficult datasets.
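Since FMA learns a cross-modal velocity field with a fixed coupling, a standard conditional flow-matching (rectified-flow) training step gives the flavor of the approach; the tiny MLP, the straight-line interpolant, and the class-wise pairing of image/text features below are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Tiny MLP velocity field v_theta(x_t, t); the architecture is illustrative."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_step(v_theta, img_feat, txt_feat):
    """One flow-matching step with a fixed coupling: each image feature (x0) is
    paired with the text feature of its own class (x1). Standard rectified-flow loss."""
    t = torch.rand(img_feat.size(0), 1)
    x_t = (1 - t) * img_feat + t * txt_feat          # straight-line interpolant
    target_v = txt_feat - img_feat                    # ideal velocity along the path
    return ((v_theta(x_t, t) - target_v) ** 2).mean()

# Toy usage: 8 paired CLIP-like features of dimension 64.
v = VelocityField(64)
loss = flow_matching_step(v, torch.randn(8, 64), torch.randn(8, 64))
loss.backward()
print(float(loss))
```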
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4784. Synthetic Bootstrapped Pretraining
๐ฌ 11
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ language model pretraining synthetic data
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training.
While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance.
We validate SBP by designing a compute-matched pretraining setup and pretrain a 3B-parameter model on up to 1T tokens from scratch.
We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of performance improvement attainable by an oracle upper bound with access to 20x more unique data.
Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it.
Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
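A highly simplified sketch of the two-stage recipe: pair related documents, then treat each (seed, related) pair as training data for a synthesizer. TF-IDF cosine similarity stands in for the learned relation model, and the threshold is arbitrary; none of this reflects SBP's actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Transformers use self-attention to mix information across tokens.",
    "Self-attention lets every token attend to every other token in the sequence.",
    "Gradient descent updates parameters along the negative gradient direction.",
]

# Stage 1 (stand-in): score inter-document relatedness and keep strong pairs.
sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))
pairs = [(i, j) for i in range(len(docs)) for j in range(len(docs))
         if i != j and sims[i, j] > 0.1]

# Stage 2 (stand-in): each pair becomes a (seed -> target) example for a
# synthesizer LM, which would later be sampled to build the synthetic corpus.
for i, j in pairs:
    print(f"SEED:   {docs[i]}\nTARGET: {docs[j]}\n")
```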
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4785. MASS: MoErging through Adaptive Subspace Selection
๐ฌ 15
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Model Merging MoErging Task Vectors Multi-Task Learning Deep Learning Computer Vision
Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a roughly 2× storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
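The storage and routing steps can be pictured with a small NumPy sketch: truncate each task's weight update to its top singular components, then route an input feature to the task whose retained subspace explains the most of its energy. The rank, sizes, and routing score below are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def compress_task_update(delta_w, rank=8):
    """Keep only the top singular components of a per-task weight update."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank]

def route(feature, task_subspaces):
    """Data-free routing: pick the task whose stored left-singular subspace
    explains the largest share of the input feature's energy."""
    scores = []
    for u, _, _ in task_subspaces:
        proj = u @ (u.T @ feature)
        scores.append(np.linalg.norm(proj) / (np.linalg.norm(feature) + 1e-12))
    return int(np.argmax(scores))

# Toy usage with two hypothetical task updates on a 64x64 layer.
rng = np.random.default_rng(0)
tasks = [compress_task_update(rng.normal(size=(64, 64))) for _ in range(2)]
x = rng.normal(size=64)
print(route(x, tasks))
```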
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4786. Asymmetric Synthetic Data Update for Domain Incremental Dataset Distillation
๐ฌ 17
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Dataset Distillation Continual Learning
Dataset distillation (DD) attempts to construct a compact synthetic dataset that serves as a proxy for a large real dataset under a fixed storage budget, thereby reducing the storage burden and training costs.
Prior works assume the full dataset is available upfront and is distilled at once, although real datasets are collected incrementally over time in practice.
To alleviate this gap, we introduce a new problem setting, *Domain Incremental Dataset Distillation*, that continually distills datasets from different domains into a single synthetic dataset.
Conventional DD sequentially processes arriving datasets in order, overwriting old knowledge with new knowledge and causing the catastrophic forgetting problem.
To overcome this drawback, we propose an *Asymmetric Synthetic Data Update* strategy that adjusts the per-sample update rates for the synthetic dataset while balancing the stability-plasticity trade-off. Specifically, we design a bi-level optimization method based on a meta-learning framework to estimate the optimal update rates, which allow each sample to focus on either stability or plasticity.
Experimental results demonstrate that our approach effectively mitigates the catastrophic forgetting and achieves superior performance of dataset distillation across continually incoming datasets compared with existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4787. NeRV-Diffusion: Diffuse Implicit Neural Representation for Video Synthesis
๐ฌ 24
โญ 4.50
๐ generative models
๐ท๏ธ Video diffusion model Implicit neural representation Hypernetwork
We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos via generating neural network weights. The generated weights can be rearranged as the parameters of a convolutional neural network, which forms an implicit neural representation (INR), and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) A hypernetwork-based tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as INR weights to decode. 2) An implicit diffusion transformer that denoises on the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis via obviating temporal cross-frame attentions in the denoiser and decoding video latents with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers, as well as reform its weight assignment, upsampling connection and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to most recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600. It also brings a smooth INR weight space that facilitates seamless interpolations between frames or videos.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4788. LABEL-FREE MITIGATION OF SPURIOUS CORRELATIONS IN VLMS USING SPARSE AUTOENCODERS
๐ฌ 4
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Spurious Correlations Vision-Language Models Interpretability Sparse Auto Encoders
Vision-Language Models (VLMs) have demonstrated impressive zero-shot capabilities across a wide range of tasks and domains. However, their performance is often compromised by learned spurious correlations, which can adversely affect downstream applications. Existing mitigation strategies typically depend on additional data, model retraining, labeled features or classes, domain-specific expertise, or external language models, posing scalability and generalization challenges. In contrast, we introduce DIAL (Disentangle, Identify, And Label-free removal), a fully interpretable, zero-shot method that requires no auxiliary data or external supervision. Our approach begins by filtering the representations that might be disproportionately influenced by spurious features, using distributional analysis. We then apply a sparse autoencoder to disentangle the representations and identify the feature directions associated with spurious features. To mitigate their impact, we remove the subspace spanned by these spurious directions from the affected representations.
Additionally, for cases where prior knowledge of the spurious features in a dataset is not available, we introduce DIAL+, which can both detect and mitigate spurious features. We validate our method through extensive experiments on widely used spurious correlation benchmarks. Results show that our approach consistently outperforms or matches existing baselines in terms of overall accuracy and worst-group performance, offering a scalable and interpretable solution to a persistent challenge in VLMs.
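The removal step itself (projecting representations off the span of the identified spurious directions) is simple; a NumPy sketch is below. How the directions are found (the sparse autoencoder and distributional filtering) is not shown, and all shapes and data here are made up.

```python
import numpy as np

def remove_spurious_subspace(reps, spurious_dirs):
    """Subtract the projection of each representation onto the subspace spanned
    by the spurious feature directions (e.g. SAE decoder directions flagged
    as spurious)."""
    q, _ = np.linalg.qr(spurious_dirs.T)        # (d, k) orthonormal basis
    return reps - (reps @ q) @ q.T

# Toy usage: 5 image embeddings in d=16 with 2 hypothetical spurious directions.
rng = np.random.default_rng(0)
reps = rng.normal(size=(5, 16))
dirs = rng.normal(size=(2, 16))
cleaned = remove_spurious_subspace(reps, dirs)
print(np.abs(cleaned @ np.linalg.qr(dirs.T)[0]).max())  # ~0: spurious component removed
```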
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4789. TP-Spikformer: Token Pruned Spiking Transformer
๐ฌ 29
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks Efficient Spiking Neural Network Pruning Spiking Neural Network
Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large-scale architectures, which require significant computational resources and limit deployment on resource-constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information-retaining criterion that comprehensively evaluates tokens' importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information-retaining token pruning framework that employs a block-level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP-Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike-driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event-based object tracking. Particularly, TP-Spikformer performs well in a training-free manner. These results reveal its potential as an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4790. Latent Wavelet Diffusion For Ultra High-Resolution Image Synthesis
๐ฌ 17
โญ 4.50
๐ generative models
๐ท๏ธ Generative Models Diffusion Models Wavelet Ultra High-Resolution
High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present $\textit{Latent Wavelet Diffusion (LWD)}$, a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4791. Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
๐ฌ 18
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-hop QA RAG Reasoning
Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, previous methods that extend reasoning with single-query search steps struggle to scale to complex tasks demanding broad document exploration. Meanwhile, approaches that generate multiple independent queries simultaneously may limit deeper, sequential reasoning. To address these limitations, we propose HybridDeepSearcher that dynamically integrates parallel and sequential search strategies to enable effective search scaling. To support training, we introduce HDS-QA, a novel dataset that seamlessly integrates broad parallel search with sequential search reasoning, providing answer trajectories in the form of reasoning-query-retrieval loops with parallel sub-queries. Across all five benchmarks, our approach significantly outperforms the state-of-the-art, improving F1 scores by +15.9 on FanOutQA and +11.5 on a subset of BrowseComp. Further analysis reveals that HybridDeepSearcher effectively scales performance with additional test-time search resources and demonstrates robustness on questions requiring more evidence, achieving higher evidence coverage. We include the code in the supplementary materials and will release the dataset and code publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4792. Characterizing Deep Research: A Benchmark and Formal Definition
๐ฌ 19
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Benchmark Evaluation Deep Research
Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of _deep research_, a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose a benchmark LiveDRBench with 100 challenging tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 score ranges between 0.02 and 0.72 for any sub-category. OpenAI's model performs the best with an overall F1 score of 0.55. Analysis of the reasoning traces reveals that systems cover only about half of the necessary search queries, with proprietary models issuing broader and deeper queries than open-source models, highlighting gaps in both coverage and reasoning depth.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4793. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
๐ฌ 14
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision-Language Model Token Reduction
Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs).
In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers.
Based on these observations, we propose $\textbf{LearnPruner}$, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM's middle layer.
Experimental results show that our LearnPruner can preserve approximately 95\% of the original performance while using only 5.5\% of vision tokens, and achieve 3.2$\times$ inference acceleration, demonstrating a superior accuracy-efficiency trade-off.
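As a rough picture of the second stage (pruning in an LLM middle layer guided by text-to-vision attention), here is a PyTorch sketch that keeps the top-scoring vision tokens; the layer choice, the averaging over heads and text tokens, and the keep ratio are assumptions rather than the paper's recipe.

```python
import torch

def prune_vision_tokens(text_to_vision_attn, vision_tokens, keep_ratio=0.055):
    """Score each vision token by the attention it receives from text tokens
    at a middle LLM layer and keep only the top-k, preserving original order."""
    # text_to_vision_attn: (heads, num_text_tokens, num_vision_tokens)
    scores = text_to_vision_attn.mean(dim=(0, 1))          # (num_vision_tokens,)
    k = max(1, int(keep_ratio * vision_tokens.shape[0]))
    keep = scores.topk(k).indices.sort().values
    return vision_tokens[keep], keep

# Toy usage: 576 vision tokens, 32 text tokens, 16 heads, hidden size 1024.
attn = torch.rand(16, 32, 576).softmax(dim=-1)
vis = torch.randn(576, 1024)
pruned, idx = prune_vision_tokens(attn, vis)
print(pruned.shape)   # roughly 5.5% of the vision tokens remain
```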
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4794. Toward Conservative Planning from Preferences in Offline Reinforcement Learning
๐ฌ 14
โญ 4.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning sample complexity model-based planning
We study offline reinforcement learning (RL) with trajectory preferences, where the RL agent does not receive explicit rewards at each step but instead receives human-provided preferences over pairs of trajectories. Despite growing interest in preference-based reinforcement learning (PbRL), contemporary works cannot robustly learn policies in offline settings with poor data coverage and often lack algorithmic tractability. We propose a novel **M**odel-based **C**onservative **P**lanning (MCP) algorithm for offline PbRL, which leverages a general function class and uses a tractable conservative learning framework to improve the policy upon an arbitrary reference policy. We prove that MCP can compete with the best policy within data coverage when the reference policy is supported by the data. To the best of our knowledge, MCP is the first provably sample-efficient and computationally tractable offline PbRL algorithm under partial data coverage, without requiring known transition dynamics. We further demonstrate that, with certain structural properties in PbRL dynamics, our algorithm can effectively exploit these structures to relax the partial data coverage requirement and improve regret guarantees. We evaluate MCP on a comprehensive suite of human-in-the-loop benchmarks in Meta-World. Experimental results show that our algorithm achieves competitive performance compared to state-of-the-art offline PbRL algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4795. Semantic-Enhanced Time-Series Forecasting via Large Language Models
๐ฌ 16
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ Large Language Models; Time Series Forecasting; Semantic Ehanced; Time-Adapter
Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment, instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that explores the inherent periodicity and anomalous characteristics of time series and embeds them into the semantic space to enhance the token embeddings. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superior performance of our SE-LLM against state-of-the-art (SOTA) methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4796. Knowledge Exchange with Confidence: Cost-Effective LLM Integration for Reliable and Efficient Visual Question Answering
๐ฌ 5
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ visual question answering model calibration
Recent advances in large language models (LLMs) have improved the accuracy of visual question answering (VQA) systems. However, directly applying LLMs to VQA still presents several challenges: (a) suboptimal performance when handling questions from specialized domains, (b) higher computational costs and slower inference speed due to large model sizes, and (c) the absence of a systematic approach to precisely quantify the uncertainty of LLM responses, raising concerns about their reliability in high-stakes tasks. To address these issues, we propose an UNcertainty-aware LLM-Integrated VQA model ($\texttt{Uni-VQA}$). This model facilitates knowledge exchange between the LLM and a calibrated task-specific model (i.e., $\texttt{TS-VQA}$), guided by reliable confidence scores, resulting in improved VQA accuracy, reliability and inference speed. Our framework strategically leverages these confidence scores to manage the interaction between the LLM and $\texttt{TS-VQA}$: the specialized questions are answered by the $\texttt{TS-VQA}$ model, while general knowledge questions are handled by the LLM. For questions requiring both specialized and general knowledge, the $\texttt{TS-VQA}$ provides candidate answers, which the LLM then combines with its internal knowledge to generate a more accurate response. Extensive experiments on VQA datasets demonstrate the theoretically justified advantages of $\texttt{Uni-VQA}$ over using the LLM or $\texttt{TS-VQA}$ alone.
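A confidence-gated routing sketch of the interaction described above: trust the calibrated task-specific model when it is confident, otherwise let the LLM answer, optionally seeded with the specialist's candidates. The `ts_vqa` and `llm` callables, the threshold, and the prompt format are hypothetical stand-ins, not a real API.

```python
def route_vqa(question, image, ts_vqa, llm, conf_threshold=0.8):
    """Route a VQA query based on the task-specific model's calibrated confidence."""
    answer, confidence, candidates = ts_vqa(question, image)
    if confidence >= conf_threshold:
        return answer                                   # specialized question
    prompt = f"Question: {question}\nCandidate answers: {candidates}\nAnswer:"
    return llm(prompt, image)                           # general / mixed question

# Toy usage with stub models.
ts_stub = lambda q, im: ("2 dogs", 0.55, ["2 dogs", "3 dogs"])
llm_stub = lambda p, im: "two dogs"
print(route_vqa("How many dogs are in the picture?", None, ts_stub, llm_stub))
```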
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4797. SafeMoE: Safe Fine-Tuning for MoE LLMs by Aligning Harmful Input Routing
๐ฌ 17
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI safety Large language model Mixture-of-Experts
Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed to safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.
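The core idea, penalizing drift between the fine-tuned router and the initial safety-aligned router, can be written as an auxiliary loss; a minimal PyTorch sketch follows. The KL form, the 0.1 weight, and the toy shapes are assumptions, not SafeMoE's exact formulation.

```python
import torch
import torch.nn.functional as F

def routing_drift_penalty(router_logits_ft, router_logits_ref):
    """KL(reference routing || fine-tuned routing), averaged over tokens, as a
    penalty that keeps the fine-tuned router close to the safety-aligned one."""
    p_ref = F.softmax(router_logits_ref, dim=-1)
    logp_ft = F.log_softmax(router_logits_ft, dim=-1)
    return F.kl_div(logp_ft, p_ref, reduction="batchmean")

# Toy usage: 4 tokens routed over 8 experts in one MoE layer.
ref = torch.randn(4, 8)
ft = ref + 0.5 * torch.randn(4, 8)
task_loss = torch.tensor(0.0)                 # placeholder for the usual fine-tuning loss
loss = task_loss + 0.1 * routing_drift_penalty(ft, ref)
print(float(loss))
```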
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4798. Mitigating Semantic Collapse in Generative Personalization with Test-Time Embedding Adjustment
๐ฌ 17
โญ 4.50
๐ generative models
๐ท๏ธ Generative Personalization Diffusion Models Test-Time Computing
In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V$" but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimisation, which allows the learned embedding $V$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of the pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at https://anonymous.4open.science/r/Embedding-Adjustment.
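A minimal sketch of what a test-time magnitude/direction adjustment could look like, assuming the anchor is the embedding of the concept's class word and illustrative clamp/blend values; the paper's actual adjustment rule may differ.

```python
import torch
import torch.nn.functional as F

def adjust_embedding(v_learned, anchor, max_norm_ratio=1.2, blend=0.3):
    """Clamp the learned concept embedding's magnitude relative to an anchor
    token embedding and nudge its direction back toward that anchor."""
    # Magnitude: keep ||V|| within max_norm_ratio of the anchor's norm.
    target_norm = torch.minimum(v_learned.norm(), max_norm_ratio * anchor.norm())
    # Direction: convex blend with the anchor direction, then renormalize.
    direction = F.normalize(v_learned, dim=0)
    direction = F.normalize((1 - blend) * direction + blend * F.normalize(anchor, dim=0), dim=0)
    return target_norm * direction

# Toy usage in a 768-d text embedding space.
v = torch.randn(768) * 5.0           # drifted, over-scaled learned embedding
anchor = torch.randn(768)            # e.g. embedding of the word "dog"
print(adjust_embedding(v, anchor).norm())
```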
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4799. Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
๐ฌ 20
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reinforcement Learning
The dominant paradigm for training large reasoning models, which combines Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR), is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers.
This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a vast landscape of unverifiable tasks unaddressed.
To overcome these limitations, we introduce Native Reasoning Training (NRT), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations.
NRT reframes the training problem by treating the reasoning process as a latent variable.
It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer.
This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-correcting feedback loop where the model learns to *think* in ways that resolve its own uncertainty.
Empirical evaluation on Llama-3.1-8B and Llama-3.2-3B models demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods.
Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4800. Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity
๐ฌ 29
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM large language models pretraining data filtering data pruning
As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has demonstrated strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000× compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
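The filter reduces to two corpus statistics per document; a self-contained sketch is below. The small smoothing constant for unseen tokens, the whitespace tokenization, and the keep thresholds are all illustrative assumptions.

```python
import math
from collections import Counter

def token_prior_stats(doc_tokens, corpus_freq, corpus_total):
    """Mean and standard deviation of log token priors for one document,
    where priors are corpus-level term frequencies (0.1 smooths unseen tokens)."""
    logp = [math.log(corpus_freq.get(t, 0.1) / corpus_total) for t in doc_tokens]
    mean = sum(logp) / len(logp)
    var = sum((x - mean) ** 2 for x in logp) / len(logp)
    return mean, math.sqrt(var)

corpus = "the cat sat on the mat the dog sat on the log".split()
freq, total = Counter(corpus), len(corpus)

for doc in ["the cat sat on the mat", "zzz qqq zzz glorp"]:
    m, s = token_prior_stats(doc.split(), freq, total)
    keep = m > -4.0 and s < 2.0          # illustrative thresholds only
    print(f"{doc!r}: mean={m:.2f} std={s:.2f} keep={keep}")
```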
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4801. Coarse-to-Fine Learning of Dynamic Causal Structures
๐ฌ 25
โญ 4.50
๐ causal reasoning
๐ท๏ธ causal discovery dynamic causality coarse-to-fine matrix norm scaling
Learning the dynamic causal structure is a difficult challenge in discovering causality from time series. Most existing studies rely on distributional or structural invariance to uncover the underlying causal dynamics, assuming stationary or partially stationary causality, which frequently conflicts with complex causal relationships in the real world. This motivates extending temporal causal discovery to fully dynamic causality, where both instantaneous and lagged causal dependencies may change over time, which poses significant challenges to the efficiency and stability of causal discovery. To tackle these challenges, we introduce DyCausal, a dynamic causal structure learning framework that leverages convolutional networks to effectively model causal structures within coarse-grained time windows, and introduces linear interpolation to refine causal structures to each time step and recover time-varying causal graphs. In addition, we propose an acyclic constraint based on matrix norm scaling. It is more stable both theoretically and empirically, and constrains loops in dynamic causal structures with improved efficiency. Evaluations on both synthetic and real-world datasets show that DyCausal significantly outperforms existing methods and identifies fully dynamic causal structures from coarse to fine.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4802. DRIFT-Net: A Spectral-Coupled Neural Operator for PDEs Learning
๐ฌ 21
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ PDE neural operators neural solvers
Learning PDE dynamics with neural solvers can significantly improve wall-clock efficiency and accuracy compared with classical numerical solvers. In recent years, foundation models for PDEs have largely adopted multi-scale windowed self-attention, with the scOT backbone in Poseidon serving as a representative example.
However, because of their locality, truly globally consistent spectral coupling can only be propagated gradually through deep stacking and window shifting. This weakens global coupling and leads to error accumulation and drift during closed-loop rollouts.
To address this, we propose DRIFT-Net. It employs a dual-branch design comprising a spectral branch and an image branch. The spectral branch is responsible for capturing global, large-scale low-frequency information, whereas the image branch focuses on local details and nonstationary structures. Specifically, we first perform controlled, lightweight mixing within the low-frequency range. Then we fuse the spectral and image paths at each layer via bandwise weighting, which avoids the width inflation and training instability caused by naive concatenation. The fused result is transformed back into the spatial domain and added to the image branch, thereby preserving both global structure and high-frequency details across scales. Compared with strong attention-based baselines, DRIFT-Net achieves lower error and higher throughput with fewer parameters under identical training settings and budget. On Navier--Stokes benchmarks, the relative $L_{1}$ error is reduced by 7%-54%, the parameter count decreases by about 15%, and the throughput remains higher than scOT. Ablation studies and theoretical analyses further demonstrate the stability and effectiveness of this design. The code is available at https://anonymous.4open.science/r/DRIFT-Net-26D6.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4803. Disentangled Representation Learning for Parametric Partial Differential Equations
๐ฌ 18
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Operators; Physics Discovery; Inverse Problems
Neural operators (NOs) excel at learning mappings between function spaces, serving as efficient forward solution approximators for PDE-governed systems. However, as black-box solvers, they offer limited insight into the underlying physical mechanism, due to the lack of interpretable representations of the physical parameters that drive the system. To tackle this challenge, we propose a new paradigm for learning disentangled representations from NO parameters, thereby effectively solving an inverse problem. Specifically, we introduce DisentangO, a novel hyper-neural operator architecture designed to unveil and disentangle latent physical factors of variation embedded within the black-box neural operator parameters. At the core of DisentangO is a multi-task NO architecture that distills the varying parameters of the governing PDE through a task-wise adaptive layer, alongside a variational autoencoder that disentangles these variations into identifiable latent factors. By learning these disentangled representations, DisentangO not only enhances physical interpretability but also enables more robust generalization across diverse systems. Empirical evaluations across supervised, semi-supervised, and unsupervised learning contexts show that DisentangO effectively extracts meaningful and interpretable latent features, bridging the gap between predictive performance and physical understanding in neural operator frameworks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4804. Inferring brain plasticity rule under long-term stimulation with structured recurrent dynamics
๐ฌ 28
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ brain plasticity long-term stimulation recurrent dynamics
Understanding how long-term stimulation reshapes neural circuits requires uncovering the rules of brain plasticity. While short-term synaptic modifications have been extensively characterized, the principles that drive circuit-level reorganization across hours to weeks remain unknown. Here, we formalize these principles as a latent dynamical law that governs how recurrent connectivity evolves under repeated interventions. To capture this law, we introduce the Stimulus-Evoked Evolution Recurrent dynamics (STEER) framework, a dual-timescale model that disentangles fast neural activity from slow plastic changes. STEER represents plasticity as low-dimensional latent coefficients evolving under a learnable recurrence, enabling testable inference of plasticity rules rather than absorbing them into black-box parameters.
We validate STEER with four benchmarks: synthetic Lorenz systems with controlled parameter shifts, BCM-based networks with biologically grounded plasticity, a task learning setting with adaptively optimized external stimulation and longitudinal recordings from Parkinsonian rats receiving closed-loop DBS. Our results demonstrate that STEER recovers interpretable update equations, predicts network adaptation under unseen stimulation schedules, and supports the design of improved intervention protocols. By elevating long-term plasticity from a hidden confound to an identifiable dynamical object, STEER provides a data-driven foundation for both mechanistic insight and principled optimization of brain stimulation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4805. GoT-R1: Unleashing Reasoning Capability of Autoregressive Visual Generation with Reinforcement Learning
๐ฌ 16
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Model Reinforcement Learning Visual Generation
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in autoregressive visual generation models. Leveraging the natural affinity between autoregressive architectures and sequential reasoning, our approach builds upon the Generation Chain-of-Thought framework to enable models to autonomously discover effective reasoning strategies beyond predefined templates. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench and GenEval benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in autoregressive image generation by successfully transferring sophisticated reasoning capabilities from language models to the visual generation domain.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4806. XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
๐ฌ 22
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Omni-modal Benchmark Cross-modal consistency
Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks have advanced multimodal evaluation, it remains unclear whether OLLMs achieve modality-invariant reasoning or inherit modality-specific biases. We introduce **XModBench**, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench contains 60K multiple-choice questions across five task families and systematically covers all six cross-modality directions, enabling diagnosis of task competence, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) suffers from modality disparities, with performance dropping by over 20 points on average when audio inputs replace text, and (iii) exhibits directional imbalance, with a 9-point gap when using vision as context versus using text as context.
The findings suggest that OLLMs fall short of modality-invariant reasoning, and XModBench provides a fundamental diagnostic tool for evaluating and improving their overall cross-modal competence.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4807. Breaking and Fixing Defenses Against Control Flow Hijacking in Multi-Agent Systems
๐ฌ 12
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Agents Multi-Agent Systems Security Defenses Control Flow Hijacking Indirect Prompt Injection
Control-flow hijacking attacks manipulate orchestration mechanisms in multi-agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter-agent communications to ensure that all agent invocations are "related to" and "likely to further" the original objective.
We start by demonstrating control-flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. We argue that the safety and functionality objectives of multi-agent systems fundamentally conflict with each other. This conflict is exacerbated by the brittle definitions of "alignment" and the checkers' incomplete visibility into the execution context.
We then propose, implement, and evaluate ControlValve, a new defense based on the principles of control-flow integrity and least privilege. ControlValve (1) generates permitted control-flow graphs for multi-agent systems, and (2) enforces that all executions comply with these graphs, along with contextual rules (generated in a zero-shot manner) for each agent invocation.
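A toy sketch of the enforcement side, assuming a hand-written permitted control-flow graph and simple per-edge contextual rules; the agent names and the rule shown are hypothetical, and the paper's zero-shot graph and rule generation step is not modeled here.

```python
# Permitted control-flow graph: which agent may invoke which other agent.
PERMITTED_EDGES = {
    ("orchestrator", "web_search"),
    ("orchestrator", "summarizer"),
    ("summarizer", "orchestrator"),
}

def invocation_allowed(caller, callee, context, rules):
    """Allow an agent call only if the edge is in the permitted control-flow
    graph AND every contextual rule attached to that edge passes."""
    if (caller, callee) not in PERMITTED_EDGES:
        return False
    return all(rule(context) for rule in rules.get((caller, callee), []))

# Example contextual rule: the web_search agent may not receive raw secrets.
rules = {("orchestrator", "web_search"): [lambda ctx: "api_key" not in ctx["payload"]]}

print(invocation_allowed("orchestrator", "web_search",
                         {"payload": "look up ICLR 2026 dates"}, rules))   # True
print(invocation_allowed("web_search", "email_agent",
                         {"payload": "send the api_key somewhere"}, rules))  # False: edge not permitted
```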
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4808. Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models
๐ฌ 19
โญ 4.50
๐ learning theory
๐ท๏ธ transformer gradient descent teacher-student setting
Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile scenarios and tasks, we theoretically investigate utilizing transformers as students to learn from a class of teacher models. Specifically, the teacher models covered in our analysis include convolution layers with average pooling, graph convolution layers, and various classic statistical learning models, including a variant of sparse token selection models [Sanford et al., 2023, Wang et al., 2024] and group-sparse linear predictors [Zhang et al., 2025].
When learning from this class of teacher models, we prove that one-layer transformers with simplified ``position-only'' attention can successfully recover all parameter blocks of the teacher models, thus achieving the optimal population loss. Building upon the trained transformers' efficient mimicry of the teacher models, we further demonstrate that they can generalize well to a broad class of out-of-distribution data under mild assumptions. The key in our analysis is to identify a fundamental bilinear structure shared by various learning tasks, which enables us to establish unified learning guarantees for these tasks when treating them as teachers for transformers.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4809. LFQA-E: Carefully Benchmarking Long-form QA Evaluation
๐ฌ 16
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ LFQA Evaluation Benchmark
Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1,625 questions and 7,649 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the existing automatic metrics perform comparably to human judgments, highlighting their inability to capture the dense information in long-form responses. Furthermore, we present a detailed analysis of the failure cases and the generalization capacity of these metrics, offering insights to guide the future development of LFQA evaluation methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4810. The logical expressiveness of topological neural networks
๐ฌ 26
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ topological neural networks message-passing networks expressive power first order logic
Graph neural networks (GNNs) are the standard for learning on graphs, yet they have limited expressive power, often expressed in terms of the Weisfeiler-Leman (WL) hierarchy or within the framework of first-order logic. In this context, topological neural networks (TNNs) have recently emerged as a promising alternative for graph representation learning. By incorporating higher-order relational structures into message-passing schemes, TNNs offer higher representational power than traditional GNNs. However, a fundamental question remains open: _what is the logical expressiveness of TNNs?_ Answering this allows us to characterize precisely which binary classifiers TNNs can represent. In this paper, we address this question by analyzing isomorphism tests derived from the underlying mechanisms of general TNNs. We introduce and investigate the power of higher-order variants of WL-based tests for combinatorial complexes, called $k$-CCWL test. In addition, we introduce the topological counting logic $TC_{k}$, an extension of standard counting logic featuring a novel pairwise counting quantifier $\exists^{N}(x_i,x_j)\, \varphi(x_i,x_j),$ which explicitly quantifies pairs $(x_i, x_j)$ satisfying property $\varphi$. We rigorously prove the exact equivalence: $\text{k-CCWL} \equiv \text{TC}_{k{+}2} \equiv \text{Topological }(k{+}2)\text{-pebble game}.$ These results establish a logical expressiveness theory for TNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4811. Implicit Inversion turns CLIP into a Decoder
๐ฌ 17
โญ 4.50
๐ optimization
๐ท๏ธ Model Inversion Text To Image CLIP Implicit Neural Representations
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. We show that image synthesis is nevertheless possible using CLIP alone, without a pre-trained generative decoder or CLIP tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. With CLIP frozen, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. Our findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4812. Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
๐ฌ 17
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Compositional reasoning multimodal learning test-time adaptation evaluation metrics vision-language models
Multimodal models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capabilities. To address this, we introduce a group matching score that better leverages group structure and uncovers substantial hidden competence in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden competence into higher scores under the original evaluation metric, closing much of the reported gap. With this adjustment, GPT-4.1 becomes the first system to surpass estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative self-training algorithm that bootstraps model performance without any external supervision. TTM delivers further non-trivial improvements: for example, SigLIP-B16 with TTM surpasses GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM is broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains exceeding 85.7\% on challenging datasets such as Whatsup. Across 16 datasets and variants, our experiments consistently demonstrate that TTM unlocks hidden compositional reasoning ability and advances the frontier of multimodal evaluation.
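One plausible instantiation of a group matching score for a Winoground-style 2x2 group (two images, two captions), contrasted with the standard group metric; the exact definition used in the paper may differ, so treat this as an illustrative assumption.

# Minimal sketch (an assumed instantiation, not necessarily the paper's exact score).
import numpy as np
from scipy.optimize import linear_sum_assignment

def standard_group_score(S: np.ndarray) -> int:
    # S[i, j] = similarity of image i and caption j; ground-truth pairs are (0,0),(1,1).
    text_ok = S[0, 0] > S[0, 1] and S[1, 1] > S[1, 0]
    image_ok = S[0, 0] > S[1, 0] and S[1, 1] > S[0, 1]
    return int(text_ok and image_ok)

def group_matching_score(S: np.ndarray) -> int:
    # Success if the optimal assignment of captions to images recovers the diagonal.
    rows, cols = linear_sum_assignment(S, maximize=True)
    return int(all(r == c for r, c in zip(rows, cols)))

S = np.array([[0.50, 0.52],
              [0.49, 0.55]])   # hypothetical scores: matching succeeds, standard metric fails
print(standard_group_score(S), group_matching_score(S))   # 0 1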
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4813. A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features
๐ฌ 24
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Visual Re-localization Relative Pose Regression Pose Estimation Visual Localization
Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences towards the practicability of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work raises the question whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At the core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward utilizes these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4814. Consistency Geodesic Bridge: Image Restoration with Pretrained Diffusion Models
๐ฌ 30
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ image restoration bridge diffusion
Bridge diffusion models have shown great promise in image restoration by constructing a direct path from degraded to clean images. However, they often rely on predefined, high-action trajectories, which limits both sampling efficiency and final restoration quality. To address this, we propose a Consistency Geodesic Bridge (CGB) framework to construct a lower-action, geodesic trajectory. We achieve this by designing a novel bridge process that evolves over a shorter time horizon and makes the reverse process start from an entropy-regularized point that mixes the degraded image and Gaussian noise, which theoretically reduces the required trajectory action. To ensure this path approximates a geodesic on the data manifold, we innovatively leverage a pretrained denoiser as a dynamic geodesic guidance
field. To solve this process efficiently, we draw inspiration from consistency models to learn a single-step mapping function, optimized via a continuous-time consistency objective tailored for our trajectory, so as to analytically map any state on the path to the target image. Notably, the path length in our framework becomes a tunable task-adaptive knob, allowing the model to adaptively balance information
preservation against generative power for tasks of varying degradation, such as denoising versus super-resolution. Extensive experiments demonstrate that CGB achieves state-of-the-art performance across various image restoration tasks, while enabling high-quality recovery with a single or fewer sampling steps. Our project page is \url{https://cgbridge.github.io/}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4815. MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement
๐ฌ 36
โญ 4.50
๐ generative models
๐ท๏ธ Multi-Subject Personalized Generation Diffusion Model
Multi-subject personalized generation presents unique challenges in maintaining identity fidelity and semantic coherence when synthesizing images conditioned on multiple reference subjects. Existing methods often suffer from identity blending and attribute leakage due to inadequate modeling of how different subjects should interact within shared representation spaces. We present MOSAIC, a representation-centric framework that rethinks multi-subject generation through explicit semantic correspondence and orthogonal feature disentanglement. Our key insight is that multi-subject generation requires precise semantic alignment at the representation level: knowing exactly which regions in the generated image should attend to which parts of each reference. To enable this, we introduce SemAlign-MS, a meticulously annotated dataset providing fine-grained semantic correspondences between multiple reference subjects and target images, previously unavailable in this domain. Building on this foundation, we propose the semantic correspondence attention loss to enforce precise point-to-point semantic alignment, ensuring high consistency from each reference to its designated regions. Furthermore, we develop the multi-reference disentanglement loss to push different subjects into orthogonal attention subspaces, preventing feature interference while preserving individual identity characteristics. Extensive experiments demonstrate that MOSAIC achieves SOTA performance on multiple benchmarks. Notably, while existing methods typically degrade beyond 3 subjects, MOSAIC maintains high fidelity with 4+ reference subjects, opening new possibilities for complex multi-subject synthesis applications.
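A minimal sketch of what an orthogonality-style disentanglement penalty over per-subject attention maps could look like; the functional form below is an assumption for illustration, not the authors' exact loss.

# Minimal sketch: penalize pairwise overlap between per-subject attention maps.
import torch

def disentanglement_loss(attn: torch.Tensor) -> torch.Tensor:
    # attn: (num_subjects, num_patches) attention mass of each reference subject
    # over the generated image's patches (hypothetical shapes).
    a = attn / (attn.norm(dim=-1, keepdim=True) + 1e-8)   # unit-normalize each map
    gram = a @ a.t()                                       # pairwise cosine overlaps
    off_diag = gram - torch.diag(torch.diag(gram))         # ignore self-similarity
    return (off_diag ** 2).sum() / max(attn.shape[0] * (attn.shape[0] - 1), 1)

attn = torch.rand(3, 256)          # 3 hypothetical reference subjects, 16x16 patches
print(disentanglement_loss(attn))  # decreases as subjects attend to disjoint regions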
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4816. Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework
๐ฌ 17
โญ 4.50
๐ generative models
๐ท๏ธ diffusion score model inverse problem convergence optimization generative model
While score-based generative models have emerged as powerful priors for solving inverse problems, directly integrating them into optimization algorithms such as ADMM remains nontrivial. Two central challenges arise: i) the mismatch between the noisy data manifolds used to train the score functions and the geometry of ADMM iterates, especially due to the influence of dual variables, and ii) the lack of convergence understanding when ADMM is equipped with score-based denoisers. To address the manifold mismatch issue, we propose ADMM plug-and-play (ADMM-PnP) with the AC-DC denoiser, a new framework that embeds a three-stage denoiser into ADMM: (1) auto-correction (AC) via additive Gaussian noise, (2) directional correction (DC) using conditional Langevin dynamics, and (3) score-based denoising. In terms of convergence, we establish two results: first, under proper denoiser parameters, each ADMM iteration is a weakly nonexpansive operator, ensuring high-probability fixed-point *ball convergence* using a constant step size; second, under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, which leads to convergence under an adaptive step size schedule. Experiments on a range of inverse problems demonstrate that our method consistently improves solution quality over a variety of baselines.
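For orientation, a minimal generic plug-and-play ADMM loop on a toy linear inverse problem; the paper's three-stage AC-DC denoiser is replaced here by a placeholder smoothing denoiser, so this only sketches where such a denoiser plugs in.

# Minimal sketch (generic PnP-ADMM, not the paper's AC-DC denoiser).
import numpy as np

def pnp_admm(y, A, denoiser, rho=1.0, n_iters=50):
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    # Precompute the data-fidelity solve: argmin_x ||Ax - y||^2 + rho||x - (z - u)||^2
    AtA = A.T @ A
    Aty = A.T @ y
    M = np.linalg.inv(AtA + rho * np.eye(n))
    for _ in range(n_iters):
        x = M @ (Aty + rho * (z - u))   # x-update: regularized least squares
        z = denoiser(x + u)             # z-update: plug-in denoiser replaces the prox
        u = u + x - z                   # dual update
    return z

# Hypothetical toy problem: recover a smooth signal from noisy random projections.
rng = np.random.default_rng(0)
n = 64
x_true = np.sin(np.linspace(0, 3 * np.pi, n))
A = rng.normal(size=(32, n)) / np.sqrt(32)
y = A @ x_true + 0.05 * rng.normal(size=32)
smooth = lambda v: np.convolve(v, np.ones(5) / 5, mode="same")   # stand-in denoiser
x_hat = pnp_admm(y, A, smooth)
print(float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)))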
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4817. PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning
๐ฌ 18
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Continuous Normalizing Flow Entropy Regularization Proximal Policy Optimization Multimodal Policy
Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives defined via importance ratios, which require evaluating policy likelihoods; this is typically straightforward when the policy is modeled as a Gaussian distribution. However, extending PPO to more expressive, high-capacity policy models such as continuous normalizing flows (CNFs), also known as flow-matching models, is challenging because likelihood evaluation along the full flow trajectory is computationally expensive and often numerically unstable. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity field variations along a simple interpolation path, reducing computational overhead without compromising training stability. To prevent mode collapse and further encourage diverse behaviors, we propose the Brownian Regularizer, an implicit policy entropy regularizer inspired by Brownian motion, which is conceptually elegant and computationally lightweight. Experiments on diverse tasks across various environments, including MultiGoal, PointMaze, IsaacLab, and MuJoCo Playground, show that PolicyFlow achieves competitive or superior performance compared to PPO with Gaussian policies and to flow-based baselines including FPO and DPPO. Notably, results on MultiGoal highlight PolicyFlow's ability to capture richer multimodal action distributions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4818. Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency
๐ฌ 11
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Knowledge Graph RAG Synthetic Data Generation Knowledge Distillation
Document-level knowledge graph (KG) construction faces a fundamental scaling challenge: existing methods either rely on expensive large language models (LLMs), making them economically unviable for large-scale corpora, or employ smaller models that produce incomplete and inconsistent graphs. We identify that this limitation stems not from model capabilities but from the absence of high-quality training data for document-level KG construction. To address this gap, we introduce SynthKG, a multi-step data synthesis pipeline that generates high-quality document-KG pairs through systematic chunking, decontextualization, and structured extraction using LLMs. By further fine-tuning a smaller LLM on synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently excels in retrieval and question-answering tasks. Additionally, our proposed graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets. We make SynthKG and Distill-SynthKG publicly available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4819. Contamination Detection for VLMs Using Multi-Modal Semantic Perturbations
๐ฌ 16
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Vision Language Models Data Contamination
Recent advances in Vision-Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to \emph{test-set leakage}. While prior work has proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for \emph{contaminated VLMs} remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel simple yet effective detection method based on \textit{multi-modal semantic perturbation}, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset will be released publicly.
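An illustrative decision rule in the spirit of the approach (not the paper's exact detector): compare a model's accuracy drop under semantic perturbation against the drops observed for reference models and flag outliers; all numbers below are hypothetical.

# Minimal sketch of an outlier-style contamination flag.
from statistics import mean, pstdev

def contamination_flag(acc_original, acc_perturbed, reference_drops, z_threshold=2.0):
    drop = acc_original - acc_perturbed
    mu, sigma = mean(reference_drops), pstdev(reference_drops)
    z = (drop - mu) / (sigma + 1e-8)
    return z > z_threshold

# Hypothetical numbers: clean models lose a few points under perturbation,
# while a contaminated model collapses.
reference_drops = [0.02, 0.03, 0.01, 0.04]
print(contamination_flag(0.91, 0.63, reference_drops))   # True -> suspicious
print(contamination_flag(0.74, 0.71, reference_drops))   # False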
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4820. SWE-RM: Execution-free Feedback for Software Engineering Agents
๐ฌ 16
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Coding Agent Reward Model Test-time Scaling Reinforcement Learning
Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, yet the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. In aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition.
Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models. On RL training, SWE-RM lifts the resolve rate of execution-based counterparts by 3 absolute points on SWE-Bench Verified.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4821. Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?
๐ฌ 24
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time series analysis large models fine-tuning
Recently, large time series models (LTSMs) have become popular and important because they exhibit characteristics similar to large language models, such as flexible context length, scalability, and task generality, outperforming advanced task-specific models in the domain.
However, existing research indicates that a pre-trained LTSM can exhibit a poor, highly non-convex loss landscape (indicating poor trainability). Hence, directly fine-tuning the pre-trained LTSM leads to overfitting and poor fine-tuning performance, sometimes even worse than training from scratch on the downstream datasets.
This severely diminishes the value of the pre-trained LTSM.
To address this, we propose a new fine-tuning method called Smoothed Full Fine-tuning (SFF).
Specifically, before fine-tuning, we first construct an auxiliary LTSM with a smooth loss landscape (indicating good trainability) through random initialization.
Second, we utilize it to smooth the loss landscape of the pre-trained LTSM through linear interpolation between their weights.
As a result, the smoothed LTSM acquires good trainability while retaining good pre-training knowledge, thereby achieving better performance when fine-tuned on the downstream dataset.
We also explain why SFF is effective from the perspective of optimization theory: interpolation perturbs sharp minima without noticeably harming originally flat regions, thereby helping the model escape sharp minima into better and smoother basins.
Extensive experiments on popular datasets show that our method indeed improves the performance of eight popular LTSMs, e.g., Timer, TimesFM, MOMENT, UniTS, MOIRAI, Chronos, TTMs, and Sundial, in different downstream tasks.
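A minimal sketch of the interpolation step as described: blend the pre-trained weights with a randomly initialized auxiliary copy before fine-tuning. The stand-in model, the interpolation coefficient, and the downstream fine-tuning loop are assumptions.

# Minimal sketch of weight interpolation with a random-init auxiliary model.
import copy
import torch
import torch.nn as nn

def smooth_by_interpolation(pretrained: nn.Module, alpha: float = 0.9) -> nn.Module:
    # Return a model whose weights are alpha * pretrained + (1 - alpha) * random init.
    auxiliary = copy.deepcopy(pretrained)
    for module in auxiliary.modules():          # re-initialize the auxiliary copy
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()
    smoothed = copy.deepcopy(pretrained)
    with torch.no_grad():
        for p_s, p_p, p_a in zip(smoothed.parameters(), pretrained.parameters(),
                                 auxiliary.parameters()):
            p_s.copy_(alpha * p_p + (1 - alpha) * p_a)
    return smoothed   # fine-tune this model on the downstream dataset as usual

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # stand-in "LTSM"
smoothed = smooth_by_interpolation(model, alpha=0.9)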
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4822. Exploring the Basin-Like Loss Landscape in Large Language Models
๐ฌ 22
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ loss landscape empirical theory pre-training fine-tuning
We discover the emergence of \textit{basins} in the loss landscape of large language models. As model scale increases, LLMs become progressively more resilient to random perturbations in the parameter space, giving rise to expansive stability regions where models exhibit nearly identical performance, but outside of which their capabilities collapse. We observe that pre-training creates a \textit{basic capability} basin, and subsequent alignment fine-tuning forms \textit{specific capability} basins (e.g., safety, math, coding). Thus, we argue that benign fine-tuning confined to the basin should preserve prior capabilities. Besides, we also analyze the loss landscape for worst-case directions, which is consistently sharp and detrimental. We find that adversarial fine-tuning moves along the nearly worst-case directions, thus rapidly degrading model capabilities. Finally, we provide a theoretical analysis demonstrating that the basin size bounds the performance degradation of any fine-tuning, including the adversarial ones, while also guaranteeing the model robustness w.r.t. input perturbations, suggesting the benefit of enlarging basins.
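An illustrative probe (not the paper's protocol) for the basin picture: perturb the weights with Gaussian noise of increasing scale and record the largest scale at which the loss stays within a tolerance of the unperturbed loss.

# Minimal sketch of a random-perturbation basin probe on a stand-in model.
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def basin_radius(model, loss_fn, scales, tolerance=0.05, n_trials=5):
    base = loss_fn(model).item()
    radius = 0.0
    for scale in scales:
        losses = []
        for _ in range(n_trials):
            noisy = copy.deepcopy(model)
            for p in noisy.parameters():
                p.add_(scale * torch.randn_like(p))   # random direction, given scale
            losses.append(loss_fn(noisy).item())
        if sum(losses) / n_trials > base + tolerance:  # left the stability region
            break
        radius = scale
    return radius

torch.manual_seed(0)
x, y = torch.randn(256, 8), torch.randn(256, 1)
model = nn.Linear(8, 1)                                # stand-in for an LLM
loss_fn = lambda m: nn.functional.mse_loss(m(x), y)
print(basin_radius(model, loss_fn, scales=[0.001, 0.01, 0.05, 0.1, 0.5]))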
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4823. Distributionally Robust Classification for Multi-source Unsupervised Domain Adaptation
๐ฌ 19
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Unsupervised Domain Adaptation Distributionally Robust Learning Multi-source Learning
Unsupervised domain adaptation (UDA) is a statistical learning problem when the distribution of training (source) data is different from that of test (target) data. In this setting, one has access to labeled data only from the source domain and unlabeled data from the target domain. The central objective is to leverage the source data and the unlabeled target data to build models that generalize to the target domain. Despite its potential, existing UDA approaches often struggle in practice, particularly in scenarios where the target domain offers only limited unlabeled data or spurious correlations dominate the source domain. To address these challenges, we propose a novel distributionally robust learning framework that models uncertainty in both the covariate distribution and the conditional label distribution. Our approach is motivated by the multi-source domain adaptation setting but is also directly applicable to the single-source scenario, making it versatile in practice. We develop an efficient learning algorithm that can be seamlessly integrated with existing UDA methods. Extensive experiments under various distribution shift scenarios show that our method consistently outperforms strong baselines, especially when target data are extremely scarce.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4824. RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation
๐ฌ 10
โญ 4.50
๐ generative models
๐ท๏ธ Diffusion Models
In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4825. Si-GT: Fast Interconnect Signal Integrity Analysis for Integrated Circuit Design via Graph Transformers
๐ฌ 11
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Graph Transformer Integrated Circuit Signal Integrity
Signal integrity issues present significant challenges in modern integrated circuit (IC) design, as crosstalk-induced delay variation and transient glitches caused by capacitive coupling among interconnects can severely impact IC functional correctness. Although circuit simulators like SPICE can deliver accurate signal integrity analysis, their computational cost becomes prohibitive for large-scale designs. In this paper, we propose Si-GT, a novel transformer-based model for fast and accurate signal integrity analysis in IC interconnects. Our model elaborates three key designs: (1) virtual NET token to encode net-specific signal characteristics and serve as net-wise representation, (2) mesh pattern encoding to embed high-order mesh structures at each node while distinguishing uncoupled wire segments, and (3) intra-inter net (IIN) attention mechanism to capture structures of signal propagation path and coupling connections. To support model training and evaluation, we construct the first interconnect signal integrity dataset comprising 200k delay examples and 187k glitch examples using SPICE simulations as the golden reference. Our experiments show that our Si-GT surpasses state-of-the-art graph neural network and graph transformer baselines with substantially reduced computation compared to SPICE, offering a scalable and effective solution for interconnect signal integrity analysis in IC design verification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4826. Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
๐ฌ 18
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ agentic search test-time scaling asymmetric verification
Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy, GPT-5 Pro, and Gemini-2.5 Pro Deep Think. A key observation is that, in certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as \emph{asymmetric verification}, highlights the strong potential of test-time scaling. In this work, we study both sequential and parallel test-time scaling of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. In experiments, we first show that sequential scaling methods, such as budget forcing, can be effective initially but eventually degrade performance when over-applied in agentic search. Due to asymmetric verification, however, we are able to achieve substantial improvements by allocating only a modest amount of compute to the verifier. We conduct experiments with flagship open-source models, including GLM-4.5, K2, Qwen3-2507 and Tongyi-DeepResearch, and extend them to their ``Heavy'' variants through test-time scaling. These deep research agents achieve improvements of up to 20 absolute points on benchmarks such as BrowseComp. Remarkably, as an open-source alternative, GLM-4.5 Heavy reaches accuracy of {\bf 54.0\%} on BrowseComp, {\bf 66.0\%} on GAIA, and {\bf 68.0\%} on xbench-DeepSearch, placing it on par with the best proprietary choices such as OpenAI Deep Research and o3. Tongyi-DeepResearch Heavy pushes performance even further, attaining {\bf 69.0\%} accuracy on BrowseComp.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4827. ProReGen: Progressive Residual Generation under Attribute Correlations
๐ฌ 9
โญ 4.50
๐ generative models
๐ท๏ธ attribute correlation progressive training data generation
Attribute correlations in the training data will compromise the ability of a deep generative model (DGM) to synthesize images with under-represented attribute combinations ($\textit{i.e.,}$ minority samples). Existing approaches mitigate this by data re-sampling to remove attribute correlations seen by the DGM, using a classifier to provide $\textit{pseudo-supervision}$ for the generation of counterfactual samples, or incorporating inductive bias to explicitly decompose the generation into independent causal mechanisms. We present ProReGen, a $\textit{progressive residual generation}$ approach inspired by the classical Robinson's transformation to partial out from an image attribute $\mathbf{x}_2$ its component $m(\mathbf{x}_1)$ that is predictable by other image attributes $\mathbf{x}_1$, and the residual $\gamma = \mathbf{x}_2 - m(\mathbf{x}_1)$ that is not. This simplifies the original problem of learning a DGM $g(\mathbf{x}_1, \mathbf{x}_2)$ conditioned on correlated inputs, to learning $\tilde{g}(\mathbf{x}_1, \gamma)$ conditioned on orthogonal inputs. It further allows us to progressively learn $\tilde{g}$ by first shifting the burden to abundant majority samples to learn the generator $\tilde{g}(\mathbf{x}_1, \gamma = 0)$, and then expanding it with additional layers $g_{\text{res}}$ to resolve its difference to $\tilde{g}(\mathbf{x}_1, \gamma)$ using the residual attribute $\gamma$ on limited minority samples. On three benchmark datasets with curated varying strengths of attribute correlations, we demonstrate that ProReGen, with input orthogonalization and progressive residual learning, improves the correctness and quality of generations compared to existing strategies.
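A minimal sketch of the partialling-out step (Robinson's transformation) on synthetic attributes: regress $\mathbf{x}_2$ on $\mathbf{x}_1$ and keep the residual $\gamma$; the linear regressor and synthetic data below are illustrative assumptions.

# Minimal sketch: decorrelate attribute x2 from x1 via its regression residual.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=(n, 3))                       # conditioning attributes
w = np.array([1.5, -0.7, 0.3])
x2 = x1 @ w + 0.5 * rng.normal(size=n)             # attribute correlated with x1

# m(x1): here a linear least-squares fit stands in for an arbitrary regressor.
X = np.column_stack([x1, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, x2, rcond=None)
m_x1 = X @ coef

gamma = x2 - m_x1                                  # residual attribute, decorrelated from x1
print(np.corrcoef(x1[:, 0], x2)[0, 1], np.corrcoef(x1[:, 0], gamma)[0, 1])
# The generator is then conditioned on (x1, gamma) instead of the correlated (x1, x2).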
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4828. Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks
๐ฌ 16
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ LLM unsupervised learning self-improvement
The rising cost of acquiring supervised data has driven significant interest in self-improvement for large language models (LLMs). Straightforward unsupervised signals like majority voting have proven effective in generating pseudo-labels for verifiable tasks, while their applicability to unverifiable tasks (e.g., translation) is limited by the open-ended character of responses. As a result, self-evaluation mechanisms (e.g., self-judging and entropy minimization) are predominantly used to derive pseudo-labels. However, self-evaluation relying on LLMs typically incurs high computational overhead and introduces overconfidence issues due to intrinsic biases. To address these challenges, we propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement. Inspired by majority voting commonly employed in verifiable tasks, we propose semantic voting as a novel mechanism that relaxes the principle of hard matching (i.e., exact matching) toward soft matching (i.e., semantic similarity). Soft matching is achieved by leveraging a lightweight sentence embedding model to quantify semantic similarity, thereby mitigating excessive computational burden and intrinsic bias-associated limitations of self-evaluation. Comprehensive experiments demonstrate that our method achieves substantial gains in computational efficiency and overall better performance than self-evaluation methods across diverse model architectures and tasks.
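A minimal sketch of semantic voting under the assumption that sentence embeddings of the sampled responses are already available: pick the response with the highest mean cosine similarity to all other responses, a soft analogue of majority voting.

# Minimal sketch of semantic voting over precomputed embeddings.
import numpy as np

def semantic_vote(embeddings: np.ndarray) -> int:
    # embeddings: (n_candidates, dim) sentence embeddings of sampled responses.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                                   # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                      # exclude self-similarity
    return int(sim.mean(axis=1).argmax())           # index of the "semantic mode"

# Hypothetical embeddings of 4 sampled translations; candidates 0-2 cluster together.
rng = np.random.default_rng(0)
center = rng.normal(size=32)
emb = np.stack([center + 0.05 * rng.normal(size=32) for _ in range(3)]
               + [rng.normal(size=32)])
print(semantic_vote(emb))   # likely one of 0, 1, 2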
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4829. VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
๐ฌ 15
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ AI Safety Vision Language Models Safety Alignment
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90\%+ accuracy on clear unimodal safety signals, performance degrades substantially to 20-55\% when joint image-text reasoning is required to determine the safety label. Most critically, 34\% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4\% to 10.4\% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8\% to 53.9\%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4830. DVD-Quant: Data-free Video Diffusion Transformers Quantization
๐ฌ 25
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ video generation models; post-training quantization
Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on computation-heavy and inflexible calibration procedures, and (2) considerable performance deterioration after quantization.
To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations:
(1) Bounded-init Grid Refinement (BGR) and
(2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as
(3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation.
Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on advanced DiT models while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be released to facilitate future research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4831. AntigenLM: Structure-Aware DNA Language Modeling for Influenza
๐ฌ 25
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Influenza A DNA Genome Language Model Foundation Model
Language models have transformed sequence analysis, yet DNA foundation models often underperform compared to task-specific approaches, with the causes remaining poorly understood. We introduce AntigenLM, a generative DNA language model explicitly pretrained on aligned, intact functional units of influenza genomes. This structure-aware pretraining enables AntigenLM to robustly capture evolutionary constraints and transfer effectively to multiple downstream tasks. Fine-tuned on hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts antigenic variants for upcoming influenza seasons across diverse geographic regions (including minor subtypes and regions unseen during training), outperforming conventional phylogenetic and evolution-based models. Beyond forecasting, AntigenLM achieves near-perfect subtype classification (~100% accuracy), demonstrating strong representation learning. Ablation studies reveal that pretraining on unaligned or fragmented gene sequences drastically degrades performance, underscoring the critical, but previously overlooked, role of both alignment and functional-unit preservation in DNA language modeling. AntigenLM thus provides not only a high-accuracy framework for antigen evolution prediction, essential for vaccine design, but also a methodological insight into how respecting biological sequence structure can guide the next generation of DNA foundation models for functional genomics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4832. GGBall: Graph Generative Model on Poincaré Ball
๐ฌ 17
โญ 4.50
๐ generative models
๐ท๏ธ Hyperbolic Space; Graph Generation; Flow Matching
Generating graphs with hierarchical structures remains a fundamental challenge due to the limitations of Euclidean geometry in capturing exponential complexity. Here we introduce GGBall, a novel hyperbolic framework for graph generation that integrates geometric inductive biases with modern generative paradigms. GGBall combines a Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with a Riemannian flow matching prior defined via closed-form geodesics. This design enables flow-based priors to model complex latent distributions, while vector quantization helps preserve the curvature-aware structure of the hyperbolic space. We further develop a suite of hyperbolic GNN and Transformer layers that operate entirely within the manifold, ensuring stability and scalability. Empirically, GGBall establishes a new state-of-the-art across diverse benchmarks. On hierarchical graph datasets, it reduces the average generation error by up to 18% compared to the strongest baselines. These results highlight the potential of hyperbolic geometry as a powerful foundation for the generative modeling of complex, structured, and hierarchical data domains.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4833. Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check
๐ฌ 27
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreak Defense
Content Warning: This paper contains examples with harmful content, and the constructed dataset includes samples that may be considered offensive.
As large language models (LLMs) continue to advance in capabilities, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which enhances LLM robustness against malicious prompts by using the model's thinking ability to mitigate jailbreaking before producing a final answer for the user. Our method enables models to answer the question directly in their thoughts and then critically evaluate its safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K examples that teach models to reason through direct responses and then analyze their safety.
Experimental results demonstrate that our approach achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates on over-refusal benchmarks. Notably, the model fine-tuned with ReSA maintains general reasoning capabilities on benchmarks like MMLU, MATH500, and HumanEval.
Besides, our method equips models with the ability to perform safe completion. Unlike post-hoc detection methods that can only reject harmful queries, our model can provide helpful and safe alternative responses for sensitive topics (e.g., self-harm).
Our results show that inference-time strategies alone are insufficient, highlighting the necessity of safety training, and we find even 500 samples can yield performance comparable to the entire dataset, suggesting a promising path for data-efficient safety alignment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4834. Empowering LLM Tool Invocation with Tool-call Reward Model
๐ฌ 22
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model Tool invocation Tool-call reward model
Large Language Models (LLMs) have recently alleviated limitations in outdated internal knowledge and computational inaccuracies by invoking external tools such as search engines and code generation. While reinforcement learning (RL) has substantially enhanced tool usage in LLMs, most existing agentic RL approaches rely solely on outcome-only reward signals, which assign credit at a coarse granularity and often induce gradient conflict (e.g., correct tool calls may be penalized due to incorrect final answers). To address this, we propose the *Tool-call Reward Model* (TRM), a specialized process reward model meticulously designed to evaluate and reward each tool invocation. Since previous PRM research has predominantly focused on traditional reasoning tasks such as step-wise mathematical reasoning, the introduction of TRM brings two unique challenges: (1) limited understanding of how to construct effective TRMs, including data requirements and model size; and (2) difficulties integrating TRM with classical RL algorithms such as PPO and GRPO, where naive adaptation may lead to reward hacking (minimizing tool calls to avoid penalties). To tackle these challenges, we establish a systematic TRM construction workflow and propose refined credit assignment and turn-level advantage estimation for effective integration with PPO and GRPO. Experiments show that a 3B TRM trained on 10K samples achieves robust performance. On search-based QA and Python code-based math tasks, integrating TRM consistently outperforms outcome-only reward RL methods across models of different sizes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4835. Disco: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring
๐ฌ 29
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Cell Instance Segmentation Digital Pathology Graph Coloring Topological Analysis Conflict Resolution
Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring-based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real-world scenarios with dense overlaps and complex topologies has not been verified. Addressing this issue, we release a large-scale dataset GBC-FS 2025, which contains highly complex and dense sub-cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real-world cell graphs are non-bipartite, with a high prevalence of odd-length cycles (predominantly triangles). This makes simple 2-coloring theory insufficient for handling complex tissues, while higher-chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real-world contexts, we propose Disco (Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring), an adjacency-aware framework based on the "divide and conquer" principle. It uniquely combines a data-driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. First, the "Explicit Marking" strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a "conflict set." Second, the "Implicit Disambiguation" mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations. Disco achieves a significant 7.08\% improvement in the PQ metric on the GBC-FS 2025 dataset and an average improvement of 2.72\% across all datasets. Furthermore, the predicted "Conflict Map" serves as a novel tool for interpreting topological complexity, offering new potential for data-driven pathology research.
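A minimal sketch of the chromatic check motivating the method, on a toy cell adjacency graph (not GBC-FS 2025 data): a single triangle already makes the graph non-bipartite, so 2-coloring cannot suffice.

# Minimal sketch: bipartiteness and triangle count of a toy adjacency graph.
import networkx as nx

# Hypothetical cell adjacency graph: nodes are cell instances, edges mean
# "touching/overlapping masks". The triangle 0-1-2 makes it non-bipartite.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])

print("bipartite:", nx.is_bipartite(G))                        # False
print("triangles:", sum(nx.triangles(G).values()) // 3)        # 1
print("chromatic lower bound:", 3 if not nx.is_bipartite(G) else 2)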
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4836. Your Language Model Secretly Contains Personality Subnetworks
๐ฌ 16
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Persona Modeling
Large Language Models (LLMs) demonstrate remarkable flexibility in adopting different personas and behaviors. Existing approaches typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning. We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters?
In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on these findings, we further ask: how can we discover opposing subnetworks in the model that lead to binary-opposed personas, such as introvert versus extrovert?
To further enhance separation in binary opposition scenarios, we introduce a contrastive pruning strategy that identifies parameters responsible for the statistical divergence between opposing personas. Our method is entirely training-free and relies solely on the language model's existing parameter space. Across diverse evaluation settings, the resulting subnetworks exhibit significantly stronger persona alignment than baselines that require external knowledge, while being more efficient. Our findings suggest that diverse human-like behaviors are not merely induced in LLMs, but are already embedded in their parameter space, pointing toward a new perspective on controllable and interpretable personalization in large language models. Our code is available at https://anonymous.4open.science/r/C694.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4837. Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis
๐ฌ 6
โญ 4.50
๐ learning theory
๐ท๏ธ Best-of-N Soft-best-of-N Regret bound
A simple yet effective method for inference-time alignment of generative models is Best-of-$N$ (BoN), where $N$ outcomes are sampled from a reference policy, evaluated using a proxy-reward model, and the highest-scoring one is selected.
While prior work argues that BoN is almost optimal in reward vs KL tradeoffs, the effectiveness of BoN depends critically on the quality of the proxy-reward model used for selection. For this purpose, we study BoN through a smooth version known as Soft Best-of-N (SBoN) and develop a theoretical framework to address this gap. We analyze the scaling behaviour of BoN by providing bounds on the KL divergence between the SBoN policy and the reference policy, offering insights into how performance varies with the number of samples. We also study the regret gap, i.e., the gap between the expected true-reward under the optimal policy and the SBoN policy. Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization, especially when the quality of the proxy-reward is low.
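A minimal sketch of SBoN selection under the usual softmax formulation (selection probability proportional to exp(reward / lambda)); the temperature values and proxy rewards below are illustrative assumptions.

# Minimal sketch: Soft Best-of-N selection; lambda -> 0 recovers hard BoN.
import numpy as np

def soft_best_of_n(proxy_rewards: np.ndarray, lam: float, rng=None) -> int:
    rng = rng or np.random.default_rng()
    logits = proxy_rewards / lam
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(proxy_rewards), p=probs))

rewards = np.array([0.2, 0.9, 0.85, 0.4])        # proxy rewards for N=4 samples
print(soft_best_of_n(rewards, lam=0.05))         # nearly always index 1 (BoN-like)
print(soft_best_of_n(rewards, lam=1.0))          # smoother; hedges against proxy error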
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4838. Escaping Low-Rank Traps: Interpretable Visual Concept Learning via Implicit Vector Quantization
๐ฌ 24
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Concept Bottleneck Models; Visual Concept Learning; Vision-Language Models; Representational Collapse; Interpretability
Concept Bottleneck Models (CBMs) achieve interpretability by interposing a human-understandable concept layer between perception and label prediction.
The foundation of CBMs lies in the many-to-many mapping that translates high-dimensional visual features to a set of discrete concepts.
However, we identify a critical and pervasive challenge that undermines this process: \emph{representational collapse}, where visual patch features degenerate into a low-rank subspace during training, severely degrading the quality of learned concept activation vectors, thus hindering both model interpretability and downstream performance.
To address these issues, we propose Implicit Vector Quantization (IVQ), a lightweight regularizer that maintains high-rank, diverse representations throughout training.
Rather than imposing a hard bottleneck via direct quantization, IVQ learns a codebook prior that anchors semantic information in visual features, allowing it to act as a proxy objective.
To further exploit these high-rank concept-aware features, we propose Magnet Attention, which dynamically aggregates patch-level features into visual concept prototypes, explicitly modeling the many-to-many visionโconcept correspondence.
Extensive experimental results show that our approach effectively prevents representational collapse and achieves state-of-the-art performance on eight diverse benchmarks.
Our experiments further probe the low-rank phenomenon in representational collapse, finding that IVQ mitigates the information bottleneck and yields cross-modal representations with clearer, more interpretable consistency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4839. FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers
๐ฌ 9
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D Avatar 3D Reconstruction
Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose~\textbf{FastAvatar}, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. The core of FastAvatar is a Large Gaussian Reconstruction Transformer (LGRT) featuring three key designs: First, a 3DGS transformer aggregating multi-frame cues while injecting initial 3D prompt to predict the corresponding registered canonical 3DGS representations; Second, multi-granular guidance encoding (camera pose, expression coefficient, head pose) mitigating animation-induced misalignment for variable-length inputs;
Third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., improving quality with more observations without wasting input data as in previous works. This yields a quality-speed-tunable paradigm for highly usable 3D avatar modeling. Extensive experiments show that FastAvatar has a higher quality and highly competitive speed compared to existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4840. Advancing Spatiotemporal Representations in Spiking Neural Networks via Parametric Invertible Transformation
๐ฌ 16
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks Spatiotemporal Representations Neuromorphic Computing
Spiking Neural Networks (SNNs) are regarded as energy-efficient neural architectures due to their event-driven, spike-based computation paradigm. However, existing SNNs suffer from two fundamental limitations: (1) the constrained representational space imposed by binary spike firing mechanisms, which restricts the network's capacity to encode complex spatiotemporal patterns, and (2) the ineffective design of surrogate gradient functions that leads to gradient mismatch issues and suboptimal learning dynamics. To address these challenges, we propose the Parametric Invertible Transformation (PIT), which operates in a conjugate manner with neuronal dynamics to achieve adaptive modulation and augmented spike representations simultaneously. In addition, we design an auxiliary gradient correction term to mitigate the gradient mismatch issue and oscillation phenomena during training. Moreover, we introduce a theoretical framework for analyzing the spatiotemporal representation space of SNNs. Extensive experiments on both static and neuromorphic datasets demonstrate state-of-the-art performance with our proposed method. This approach lays the theoretical foundation for expanding the spatiotemporal representations of SNNs, offering a viable pathway for developing low-latency and high-performance neuromorphic processing systems in resource-constrained environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4841. ``Noisier'' Noise Contrastive Estimation is (Almost) Maximum Likelihood
๐ฌ 22
โญ 4.50
๐ generative models
๐ท๏ธ Noise Contrastive Estimation; Generative Models
Noise Contrastive Estimation (NCE) has fueled major breakthroughs in representation learning and generative modeling. Yet a long-standing challenge remains: accurately estimating ratios between distributions that differ substantially, which significantly limits the applicability of NCE on modern high-dimensional and multimodal datasets. We revisit this problem from a less explored perspective: the magnitude of the noise distribution. Specifically, we show that with a virtually scaled (i.e., artificially increased) noise magnitude, the gradient of the NCE objective can closely align with that of Maximum Likelihood, enabling a trajectory-wise approximation from NCE to MLE, and faster convergence both theoretically and empirically. Building on this insight, we introduce "Noisier" NCE, a simple drop-in modification to vanilla NCE that incurs little to no extra computational cost, while effectively handling density-ratio estimation in challenging regimes where traditional MLE and NCE struggle. Beyond improving classical density-ratio learning, "Noisier" NCE proves broadly applicable: it achieves strong results across image modeling, anomaly detection, and offline black-box optimization. On CIFAR-10 and ImageNet 64$\times$64 datasets, it yields 10-step and even 1-step samplers that match or surpass state-of-the-art methods, while cutting training iterations by up to half.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4842. Mini-cluster Guided Long-tailed Deep Clustering
๐ฌ 23
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Deep Clustering; Clustering; Long-tailed Clustering
As an important branch of unsupervised learning, deep clustering has seen substantial progress in recent years. However, the majority of current deep clustering methods operate under the assumption of balanced or near-balanced cluster distributions. This assumption contradicts the common long-tailed class distributions in real-world data, leading to severe performance degradation in deep clustering. Although many long-tailed learning methods have been proposed, these approaches typically rely on label information to differentiate treatment across different classes, which renders them inapplicable to deep clustering scenarios. How to re-weight the training of deep clustering models in an unsupervised setting remains an open challenge. To address this, we propose a mini-cluster guided long-tailed deep clustering method, termed MiniClustering. We introduce a specialized clustering head that divides data into many more clusters than the target number of clusters. These predicted clusters are referred to as mini-clusters. The mini-cluster-level predictions serve as the guide for estimating the appropriate weights for classes with varying degrees of long-tailedness. The weights are then incorporated to re-weight the self-training loss in model training. In this way, we can mitigate model bias by re-weighting gradients from different classes. We evaluate our method on multiple benchmark datasets with different imbalance ratios to demonstrate its effectiveness. Further, our method can be readily applied downstream of existing unsupervised representation learning frameworks for long-tailed deep clustering. It can also adapt label-dependent long-tailed learning methods to unsupervised clustering tasks by leveraging the estimated weights. *The code is available in the Supplementary Material.*
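A minimal sketch of the re-weighting idea, assuming one concrete weighting rule (inverse of the estimated mini-cluster mass per sample); the paper's actual estimator may differ.

```python
# Sketch: over-cluster with a mini-cluster head, estimate each sample's
# "population mass", and down-weight samples from crowded regions.
import torch
import torch.nn.functional as F

def reweighted_self_training_loss(logits_main, pseudo_labels, logits_mini):
    mini_probs = F.softmax(logits_mini, dim=1)     # (N, M) with M >> target clusters
    cluster_mass = mini_probs.mean(dim=0)          # estimated mini-cluster sizes
    sample_mass = mini_probs @ cluster_mass        # (N,) expected mass per sample
    weights = 1.0 / (sample_mass + 1e-8)           # assumed inverse-mass weighting
    weights = weights / weights.mean()             # keep the loss scale comparable
    ce = F.cross_entropy(logits_main, pseudo_labels, reduction="none")
    return (weights.detach() * ce).mean()
```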
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4843. In Agents We Trust, but Who Do Agents Trust? Latent Preferences Steer LLM Generations
๐ฌ 18
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI Agents Source Preferences
Large Language Model (LLM) based agents are increasingly being deployed as user-friendly front-ends on online platforms, where they filter, prioritize, and recommend information retrieved from the platforms' back-end databases or via web search. In these scenarios, LLM agents act as decision assistants, drawing users' attention to particular instances of retrieved information at the expense of others. While much prior work has focused on biases in the information LLMs themselves generate, less attention has been paid to the factors and mechanisms that determine how LLMs select and present information to users.
We hypothesize that when information is attributed to specific sources (e.g., particular publishers, journals, or platforms), LLMs will exhibit systematic latent source preferences. That is, they will prioritize information from some sources over others based on attributes such as the sources' brand identity, reputation, or perceived expertise, encoded within their parametric knowledge. Through controlled experiments on twelve LLMs from six model providers, spanning both synthetic and real-world tasks including news recommendation, research paper selection, and choosing e-commerce platforms, we find that several models consistently exhibit strong and predictable source preferences. These preferences are sensitive to contextual framing, can outweigh the influence of content itself, and persist despite explicit prompting to avoid them. They also help explain phenomena such as the observed left-leaning skew in news recommendations, which arises from higher trust in certain sources rather than the content itself. Our findings advocate for deeper investigation into the origins of these preferences during pretraining, fine-tuning and instruction tuning, as well as for mechanisms that provide users with transparency and control over the biases guiding LLM-powered agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4844. MICLIP: Learning to Interpret Representation in Vision Models
๐ฌ 4
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ mechanistic interpretability contrastive learning sparse autoencoder
Vision models have demonstrated remarkable capabilities, yet their decision-making processes remain largely opaque. Mechanistic interpretability (MI) offers a promising avenue to decode these internal workings. However, existing interpretation methods suffer from two key limitations. First, they rely on the flawed activation-magnitude assumption, i.e., that the importance of a neuron is directly reflected by the magnitude of its activation, which ignores more nuanced causal roles. Second, they are predominantly input-centric, failing to capture the causal mechanisms that drive a model's output. These shortcomings lead to inaccurate and unreliable interpretations of internal representations, especially in cases of incorrect predictions. We propose MICLIP (Mechanism-Interpretability via Contrastive Learning), a novel framework that extends CLIP's contrastive learning to align internal mechanisms of vision models with general semantic concepts, enabling interpretable and controllable representations. Our approach circumvents previous limitations by performing multimodal alignment between a model's internal representations and both its input concepts and output semantics via contrastive learning. We demonstrate that MICLIP is a general framework applicable to diverse representation unit types, including individual neurons and sparse autoencoder (SAE) features. By enabling precise, causal-aware interpretation, MICLIP not only reveals the semantic properties of a model's internals but also paves the way for effective and targeted manipulation of model behaviors.
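A hedged sketch of the kind of alignment objective described above: a CLIP-style symmetric InfoNCE between projected internal activations and concept embeddings. The projection heads and pairing scheme are assumptions, not the paper's specification.

```python
# Sketch: contrastively align internal activation units with text concept embeddings.
import torch
import torch.nn.functional as F

def activation_concept_infonce(unit_acts, concept_embs, proj_a, proj_c, temperature=0.07):
    # unit_acts: (N, D_act) activations of a neuron/SAE-feature group per image
    # concept_embs: (N, D_txt) embeddings of the paired concept descriptions
    a = F.normalize(proj_a(unit_acts), dim=-1)
    c = F.normalize(proj_c(concept_embs), dim=-1)
    logits = a @ c.t() / temperature                  # (N, N) similarity matrix
    labels = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```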
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4845. Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
๐ฌ 37
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mixture of Experts Large Language Models Foundation Model
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a design choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural networks (FFNs) and Mixture-of-Experts (MoE) can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the **zero-additional-cost** Kernel Inspired Router with Normalization ($\mathrm{KERN}$), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. **Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function.** Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function $\mathrm{KERN}$.
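A minimal sketch contrasting a standard softmax router with an FFN-style router using ReLU activation and $\ell_2$-normalization, as the abstract recommends; the exact placement of the normalization and the top-k combination are illustrative assumptions.

```python
# Sketch: Softmax router vs. an FFN-style, ReLU + l2-normalized router.
import torch
import torch.nn.functional as F

def softmax_router(x, w_router):
    return F.softmax(x @ w_router, dim=-1)                       # weights on the simplex

def kern_like_router(x, w_router, eps=1e-8):
    scores = F.relu(x @ w_router)                                # FFN-style non-negativity
    return scores / (scores.norm(dim=-1, keepdim=True) + eps)    # l2-normalized weights

def moe_forward(x, experts, router_weights, top_k=2):
    # Combine the top-k experts' outputs with the router weights.
    vals, idx = router_weights.topk(top_k, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += vals[mask, slot, None] * expert(x[mask])
    return out
```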
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4846. Branched Schrödinger Bridge Matching
๐ฌ 21
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Schrödinger bridges branched generative modeling stochastic optimal control unbalanced optimal transport flow matching trajectory inference stochastic processes probabilistic transport multimodal distributions dynamical systems
Predicting the intermediate trajectories between an initial and target distribution is a central problem in generative modeling. Existing approaches, such as flow matching and Schrödinger Bridge Matching, effectively learn mappings between two distributions by modeling a single stochastic path. However, these methods are inherently limited to unimodal transitions and cannot capture *branched* or *divergent* evolution from a common origin to multiple distinct outcomes. To address this, we introduce **Branched Schrödinger Bridge Matching (BranchSBM)**, a novel framework that learns branched Schrödinger bridges. BranchSBM parameterizes multiple time-dependent velocity fields and growth processes, enabling the representation of population-level divergence into multiple terminal distributions. We show that BranchSBM is not only more expressive but also essential for tasks involving multi-path surface navigation, modeling cell fate bifurcations from homogeneous progenitor states, and simulating diverging cellular responses to perturbations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4847. Explainable Mixture Models through Differentiable Rule Learning
๐ฌ 11
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Mixture modeling Interpretability Conditional Density Estimation
Mixture models excel at decomposing complex, multi-modal distributions into simpler probabilistic components, but provide no insight into the conditions under which these components arise. We introduce explainable mixture models (EMM), a framework that pairs each mixture component with a human-interpretable rule over descriptive features. This enables mixtures that are not only statistically expressive but also transparently grounded in the underlying data. We formally examine the conditions under which an EMM exactly captures a target distribution and propose a scalable, differentiable learning procedure for discovering sets of rules. Experiments on synthetic and real-world datasets demonstrate that our method discovers interesting sub-populations in both univariate and multivariate settings, offering interpretable insights into the structure of complex distributions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4848. Regret-Guided Search Control for Efficient Learning in AlphaZero
๐ฌ 15
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Search Control Reinforcement Learning Regret Prioritization Monte Carlo Tree Search AlphaZero
Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent's evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9ร9 Go, 10ร10 Othello, and 11ร11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9ร9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training.
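A hedged sketch of the prioritized regret buffer idea: here regret is scored directly as the gap between the agent's value estimate and the eventual game outcome, and starting states are sampled proportionally to that score. The learned regret network and the exact sampling scheme from the paper are not reproduced.

```python
# Sketch: store candidate states with a regret score and sample self-play
# starting positions in proportion to it.
import random

class RegretBuffer:
    def __init__(self, capacity=10000):
        self.items = []            # list of (state, regret)
        self.capacity = capacity

    def add(self, state, value_estimate, outcome):
        regret = abs(value_estimate - outcome)      # assumed regret definition
        self.items.append((state, regret))
        if len(self.items) > self.capacity:
            self.items.pop(0)

    def sample_start_state(self):
        total = sum(r for _, r in self.items)
        if total == 0:
            return random.choice(self.items)[0]
        weights = [r / total for _, r in self.items]
        return random.choices([s for s, _ in self.items], weights=weights, k=1)[0]
```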
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4849. Reference Guided Skill Discovery
๐ฌ 13
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Skill Discovery Imitation Learning Unsupervised RL Reinforcement Learning Motion Imitation
Scaling unsupervised skill discovery algorithms to high-DoF agents remains challenging. As dimensionality increases, the exploration space grows exponentially, while the manifold of meaningful skills remains limited. Therefore, semantic meaningfulness becomes essential to effectively guide exploration in high-dimensional spaces. In this work, we present **Reference-Grounded Skill Discovery (RGSD)**, a novel algorithm that grounds skill discovery in a semantically meaningful latent space using reference data. RGSD first performs contrastive pretraining to embed motions on a unit hypersphere, clustering each reference trajectory into a distinct direction. This grounding enables skill discovery to simultaneously involve both imitation of reference behaviors and the discovery of semantically related diverse behaviors.
On a simulated SMPL humanoid with $359$-D observations and $69$-D actions, RGSD successfully imitates skills such as walking, running, punching, and sidestepping, while also discover variations of these behaviors. In downstream locomotion tasks, RGSD leverages the discovered skills to faithfully satisfy user-specified style commands and outperforms imitation-learning baselines, which often fail to maintain the commanded style. Overall, our results suggest that lightweight reference-grounding offers a practical path to discovering semantically rich and structured skills in high-DoF systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4850. GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
๐ฌ 29
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Spatial-Temporal Intelligence Geo-temporal Reasoning Visual-Language Models
Recently spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous
Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (eg. a map), thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9\%), significantly lags behind human performance (78.61\%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at \href{https://anonymous.4open.science/r/GTR-Bench-5B76}{https://anonymous.4open.science/r/GTR-Bench-5B76}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4851. Deconstructing Guidance: A Semantic Hierarchy for Precise Diffusion Model Editing
๐ฌ 29
โญ 4.50
๐ generative models
๐ท๏ธ Diffusion models Image editing Information Theory
Text-guided image editing requires more than prompt following: it demands a principled understanding of what to modify versus what to preserve. We investigate the internal guidance mechanism of diffusion models and reveal that the guidance signal follows a structured semantic hierarchy. We formalize this insight as the Semantic Scale Hypothesis: the magnitude of the guidance difference vector ($\Delta\boldsymbol{\epsilon}$) directly encodes the semantic scale of edits. Crucially, this phenomenon is theoretically grounded in Tweedie's formula, which links score prediction to the variance of the underlying data distribution. Low-variance regions, such as objects, yield large-magnitude differences corresponding to structural edits, whereas high-variance regions, such as backgrounds, yield small-magnitude differences corresponding to stylistic adjustments. Building on this principle, we introduce Prism-Edit, a training-free, plug-and-play module that decomposes the guidance signal into semantic layers, enabling selective and interpretable control. Extensive experiments, spanning direct visualization of the semantic hierarchy, generalization across foundation models, and integration with state-of-the-art editors, demonstrate that Prism-Edit achieves precise, robust, and controllable editing. Our findings establish semantic scale as a foundational axis for understanding and advancing diffusion-based image editing.
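A minimal sketch of the layered-guidance idea, assuming quantile thresholds on the magnitude of $\Delta\boldsymbol{\epsilon}$ and hand-picked per-layer scales; the paper's actual decomposition may differ.

```python
# Sketch: split the classifier-free guidance difference by magnitude into
# coarse "semantic layers" and recombine them with separate strengths.
import torch

def layered_guidance(eps_cond, eps_uncond, scales=(1.0, 1.5, 3.0)):
    delta = eps_cond - eps_uncond                      # guidance difference
    mag = delta.abs()
    qs = torch.tensor([0.33, 0.66], device=mag.device)
    lo, hi = torch.quantile(mag.flatten(), qs)
    layers = [mag <= lo, (mag > lo) & (mag <= hi), mag > hi]   # style -> structure
    guided = eps_uncond.clone()
    for mask, s in zip(layers, scales):
        guided = guided + s * delta * mask             # per-layer guidance strength
    return guided
```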
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4852. VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
๐ฌ 13
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Meta-evaluation llm-as-judge synthetic data self-refinement video understanding
Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the nuances of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a family of 3B- and 7B-sized MLLM judges specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL-32B and Qwen2.5-VL-72B. Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluating video understanding tasks.
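A sketch of the generator/evaluator bootstrapping recipe described above. `generate_response` and `evaluate_rating` are hypothetical placeholders for an MLLM generator prompted with a target rating and the current judge; only rating-consistent pairs are kept for training.

```python
# Sketch: keep only (response, rating) pairs where the evaluator agrees with
# the rating the generator was asked to target.
def build_judge_training_set(videos_with_prompts, target_ratings,
                             generate_response, evaluate_rating):
    dataset = []
    for video, prompt in videos_with_prompts:
        for target in target_ratings:                       # e.g., ratings 1..5
            response = generate_response(video, prompt, target)
            predicted = evaluate_rating(video, prompt, response)
            if predicted == target:                          # filter inconsistent pairs
                dataset.append({"video": video, "prompt": prompt,
                                "response": response, "rating": target})
    return dataset
```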
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4853. Stackelberg Coupling of Online Representation Learning and Reinforcement Learning
๐ฌ 17
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Representation Learning Q-learning Stackelberg Equilibrium Two-timescale
Deep Q-learning jointly learns representations and values within monolithic networks, promising beneficial co-adaptation between features and value estimates. Although this architecture has attained substantial success, the coupling between representation and value learning creates instability as representations must constantly adapt to non-stationary value targets, while value estimates depend on these shifting representations. This is compounded by high variance in bootstrapped targets, which causes bias in value estimation in off-policy methods. We introduce Stackelberg Coupled Representation and Reinforcement Learning (SCORER), a framework for value-based RL that views representation and Q-learning as two strategic agents in a hierarchical game. SCORER models the Q-function as the leader, which commits to its strategy by updating less frequently, while the perception network (encoder) acts as the follower, adapting more frequently to learn representations that minimize Bellman error variance given the leader's committed strategy. Through this division of labor, the Q-function minimizes MSBE while perception minimizes its variance, thereby reducing bias accordingly, with asymmetric updates allowing stable co-adaptation, unlike simultaneous parameter updates in monolithic solutions. Our proposed SCORER framework leads to a bi-level optimization problem whose solution is approximated by a two-timescale algorithm that creates an asymmetric learning dynamic between the two players. Extensive experiments on DQN and its variants demonstrate that gains stem from algorithmic insight rather than model complexity.
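A simplified sketch of the asymmetric two-timescale updates described above, assuming a DQN-style setup: the Q head ("leader") minimizes the mean squared Bellman error but updates only every `k_leader` steps, while the encoder ("follower") updates every step to reduce the variance of the Bellman error. The loss shapes and schedule are assumptions, not the paper's exact algorithm.

```python
# Sketch: Stackelberg-style asymmetric updates for encoder and Q head.
import torch

def scorer_step(encoder, q_head, target_q, batch, enc_opt, q_opt,
                step, k_leader=4, gamma=0.99):
    s, a, r, s_next, done = batch

    def bellman_error():
        with torch.no_grad():
            target = r + gamma * (1 - done) * target_q(s_next).max(dim=1).values
        q_sa = q_head(encoder(s)).gather(1, a.unsqueeze(1)).squeeze(1)
        return q_sa - target

    # Follower: encoder minimizes the variance of the Bellman error every step.
    enc_opt.zero_grad()
    bellman_error().var().backward()
    enc_opt.step()

    # Leader: Q head minimizes the mean squared Bellman error every k_leader steps.
    if step % k_leader == 0:
        q_opt.zero_grad()
        (bellman_error() ** 2).mean().backward()
        q_opt.step()
```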
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4854. RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
๐ฌ 23
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Faithfulness Large Reasoning Models Benchmark
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for *reasoning faithfulness*, defined by two testable conditions: *stance consistency* (a coherent stance linking reasoning to answer) and *causal influence* (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present **RFEval**, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from post-intervention stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can *reduce* reasoning faithfulness, even when accuracy is maintained. Crucially, *accuracy is neither a sufficient nor a reliable proxy for faithfulness*: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4855. LORE: Jointly Learning The Intrinsic Dimensionality and Relative Similarity Structure from Ordinal Data
๐ฌ 13
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Ordinal Embedding Intrinsic Dimensionality Psychophysics Subjective Perceptual Learning Relative Similarity Embedding
Learning the intrinsic dimensionality of subjective perceptual spaces such as taste, smell, or aesthetics from ordinal data is a challenging problem. We introduce LORE (Low Rank Ordinal Embedding), a scalable framework that jointly learns both the intrinsic dimensionality and an ordinal embedding from noisy triplet comparisons of the form "Is A more similar to B than C?". Unlike existing methods that require the embedding dimension to be set a priori, LORE regularizes the solution using the nonconvex Schatten-p quasi-norm, enabling automatic joint recovery of both the ordinal embedding and its dimensionality. We optimize this joint objective via an iteratively reweighted algorithm and establish convergence guarantees. Extensive experiments on synthetic datasets, simulated perceptual spaces, and real-world crowdsourced ordinal judgements show that LORE learns compact, interpretable, and highly accurate low-dimensional embeddings that recover the latent geometry of subjective percepts. By simultaneously inferring both the intrinsic dimensionality and ordinal embeddings, LORE enables more interpretable and data-efficient perceptual modeling in psychophysics and opens new directions for scalable discovery of low-dimensional structure from ordinal data in machine learning.
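A hedged sketch of the kind of objective described above (the paper's exact loss and parameterization may differ): a triplet loss over comparisons plus a Schatten-$p$ quasi-norm penalty on the embedding matrix,

$$\min_{X \in \mathbb{R}^{n \times d}} \; \sum_{(i,j,k) \in \mathcal{T}} \ell\big(\lVert x_i - x_k\rVert_2^2 - \lVert x_i - x_j\rVert_2^2\big) \;+\; \lambda \lVert X \rVert_{S_p}^p, \qquad \lVert X \rVert_{S_p}^p = \sum_r \sigma_r(X)^p,\ \ 0 < p < 1,$$

where $\ell$ is a logistic or hinge loss on triplets "$i$ is closer to $j$ than to $k$" and the quasi-norm drives small singular values of $X$ toward zero, so the number of surviving singular values provides the estimate of intrinsic dimensionality.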
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4856. Teaching Metric Distance to Discrete Autoregressive Language Models
๐ฌ 10
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Autoregressive Modeling Generative Modeling Efficient Training
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
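A sketch of one possible instantiation of a distance-aware target: soft targets proportional to $\exp(-d(y, v)/\tau)$ over a vocabulary of metric tokens, trained with cross-entropy against the model's logits. The absolute-value distance and temperature are illustrative assumptions.

```python
# Sketch: distance-derived soft targets for discrete autoregressive training.
import torch
import torch.nn.functional as F

def dist_aware_loss(logits, target_values, vocab_values, tau=1.0):
    # logits: (N, V) model scores; target_values: (N,) ground-truth scalars;
    # vocab_values: (V,) the scalar value each token represents.
    dist = (target_values[:, None] - vocab_values[None, :]).abs()   # (N, V)
    soft_targets = F.softmax(-dist / tau, dim=-1)                   # exponential-family target
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```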
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4857. RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
๐ฌ 17
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ LLMs Reasoning RLVR Interpretability
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical and code domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps.
Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4858. Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment
๐ฌ 11
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Human-Centric AI Moral Preference Elicitation Axiomatic Analysis Interpretable Machine Learning
Recent AI trends seek to align AI models to learned human-centric objectives, such as personal preferences, utility, or societal values. Using standard preference elicitation methods, researchers and practitioners build models of human decisions and judgments, to which AI models are aligned. However, standard elicitation methods often fail to capture the true cognitive processes behind human decision making, such as the use of heuristics or simplifying structured thought patterns. To address this limitation, we take an axiomatic approach to learning cognitively faithful decision processes from pairwise comparisons. Building on the vast literature characterizing cognitive processes that contribute to human decision-making and pairwise comparisons, we derive a class of models in which individual features are first processed with learned rules, then aggregated via a fixed rule, such as the Bradley-Terry rule, to produce a decision. This structured processing of information ensures that such models are realistic and feasible candidates to represent underlying human decision-making processes. We demonstrate the efficacy of this modeling approach by learning interpretable models of human decision making in a kidney allocation task, and show that our proposed models match or surpass the accuracy of prior models of human pairwise decision-making.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4859. Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations
๐ฌ 17
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Diffusion Policy Contraction Theory Robot learning
Diffusion policies have emerged as powerful generative models for offline policy learning, whose sampling process can be rigorously characterized by a score function guiding a Stochastic Differential Equation (SDE). However, the same score-based SDE modeling that grants diffusion policies the flexibility to learn diverse behavior also incurs solver and score-matching errors, large data requirements, and inconsistencies in action generation. While less critical in image generation, these inaccuracies compound and lead to failure in continuous control settings. We introduce **C**ontractive **D**iffusion **P**olicies (CDPs) to induce contractive behavior in the diffusion sampling dynamics. Contraction pulls nearby flows closer to enhance robustness against solver and score-matching errors while reducing unwanted action variance. We develop an in-depth theoretical analysis along with a practical implementation recipe to incorporate CDPs into existing diffusion policy architectures with minimal modification and computational cost. We evaluate CDPs for offline learning by conducting extensive experiments in simulation and real world settings. Across benchmarks, CDPs often outperform baseline policies, with pronounced benefits under data scarcity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4860. Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis
๐ฌ 31
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ text-to-motion generation event-level conditioning event decomposition
Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omissions, reordering, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event as the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each with a motion-aware retrieval model, and integrates them through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark stratified by event count. Experiments on HumanML3D, KIT-ML, and HumanML3D-E show that Event-T2M matches state-of-the-art baselines on standard tests while outperforming them as event complexity increases. Human studies validate the plausibility of our event definition, the reliability of HumanML3D-E, and the superiority of Event-T2M in generating multi-event motions that preserve order and naturalness close to ground truth. These results establish event-level conditioning as a generalizable principle for advancing text-to-motion generation beyond single-action prompts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4861. Foresight Diffusion: Improving Sampling Consistency in Predictive Diffusion Models
๐ฌ 18
โญ 4.50
๐ generative models
๐ท๏ธ diffusion models flow-based models predictive learning generative models
Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in predictive learning. However, unlike typical generation tasks that encourage sample diversity, predictive learning entails different sources of stochasticity and requires sampling consistency aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning sampling-consistent predictive diffusion models lies in suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose **Foresight Diffusion (ForeDiff)**, a framework for predictive diffusion models that improves sampling consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sampling consistency over strong baselines, offering a promising direction for predictive diffusion models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4862. TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
๐ฌ 13
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Visual Table Understanding Table Understanding Datasets Visually Represented Language Multimodal Table Understanding
While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen _and_ unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4863. Bridging Successor Measure and Online Policy Learning with Flow Matching-Based Representations
๐ฌ 17
โญ 4.50
๐ reinforcement learning
๐ท๏ธ reinforcement learning representation learning flow matching
The Successor Measure (SM), a central object in reinforcement learning (RL), describes discounted future state distributions under a policy, and it has recently been studied using generative modeling techniques. Although the SM is a powerful predictive object, it lacks compact representations tailored for online RL. To address this, we introduce Successor Flow Features (SF2), a representation learning framework that bridges SM estimation with policy optimization. SF2 leverages flow-matching generative models to approximate successor measures, while enforcing a structured linear decomposition into a time-invariant embedding and a time-dependent projection. This yields compact, policy-aware state-action features that integrate readily into standard off-policy algorithms like TD3 and SAC. Experiments on DeepMind Control Suite tasks show that SF2 improves sample efficiency and training stability compared to strong successor feature baselines. We attribute these gains to the compact representation induced by flow matching, which reduces compounding errors in long-horizon predictions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4864. Efficient Ensemble Conditional Independence Test Framework for Causal Discovery
๐ฌ 16
โญ 4.50
๐ causal reasoning
๐ท๏ธ Conditional Independence Testing Causal Discovery P-value Combination
Constraint-based causal discovery relies on numerous conditional independence tests (CITs), but its practical applicability is severely constrained by the prohibitive computational cost, especially as CITs themselves have high time complexity with respect to the sample size. To address this key bottleneck, we introduce the Ensemble Conditional Independence Test (E-CIT), a general and plug-and-play framework. E-CIT operates on an intuitive divide-and-aggregate strategy: it partitions the data into subsets, applies a given base CIT independently to each subset, and aggregates the resulting p-values using a novel method grounded in the properties of stable distributions. This framework reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed. Moreover, our tailored p-value combination method offers theoretical consistency guarantees under mild conditions on the subtests. Experimental results demonstrate that E-CIT not only significantly reduces the computational burden of CITs and causal discovery but also achieves competitive performance. Notably, it exhibits an improvement in complex testing scenarios, particularly on real-world datasets.
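A skeleton of the divide-and-aggregate idea described above. As an illustrative aggregator we use the Cauchy combination rule (the Cauchy distribution is one example of a stable distribution); the paper's tailored p-value combination method may differ, and `base_cit` is a hypothetical placeholder for any conditional independence test returning a p-value.

```python
# Sketch: partition the data, run a base CIT per subset, combine p-values.
import math
import numpy as np

def ensemble_cit(x, y, z, base_cit, subset_size=500, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    pvals = []
    for start in range(0, len(x), subset_size):
        sub = idx[start:start + subset_size]
        if len(sub) < 2:
            continue
        pvals.append(base_cit(x[sub], y[sub], z[sub]))   # p-value on one subset
    # Cauchy combination: average of tan(pi*(0.5 - p)), mapped back to a p-value.
    stat = np.mean([math.tan(math.pi * (0.5 - p)) for p in pvals])
    return 0.5 - math.atan(stat) / math.pi
```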
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4865. AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization
๐ฌ 13
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Quality Diversity Optimization Policy Representation Reinforcement Learning
Quality-Diversity (QD) algorithms have shown remarkable success in discovering diverse, high-performing solutions, but rely heavily on hand-crafted behavioral descriptors that constrain exploration to predefined notions of diversity. Leveraging the equivalence between policies and occupancy measures, we present a theoretically grounded approach to automatically generate behavioral descriptors by embedding the occupancy measures of policies in Markov Decision Processes. Our method, AutoQD, leverages random Fourier features to approximate the Maximum Mean Discrepancy (MMD) between policy occupancy measures, creating embeddings whose distances reflect meaningful behavioral differences. A low-dimensional projection of these embeddings that captures the most behaviorally significant dimensions can then be used as behavioral descriptors for CMA-MAE, a state-of-the-art black-box QD method, to discover diverse policies. We prove that our embeddings converge to true MMD distances between occupancy measures as the number of sampled trajectories and embedding dimensions increase. Through experiments in multiple continuous control tasks we demonstrate AutoQD's ability to discover diverse policies without predefined behavioral descriptors, presenting a well-motivated alternative to prior methods in unsupervised Reinforcement Learning and QD optimization. Our approach opens new possibilities for open-ended learning and automated behavior discovery in sequential decision making settings without requiring domain-specific knowledge.
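A minimal sketch of the embedding step: random Fourier features for an RBF kernel over state-action samples, averaged over a policy's trajectories so that Euclidean distances between mean embeddings approximate the MMD between occupancy measures. The bandwidth and feature count are illustrative.

```python
# Sketch: RFF mean embeddings of policy occupancy measures.
import numpy as np

def make_rff(input_dim, n_features=256, bandwidth=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / bandwidth, size=(input_dim, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    def rff(sa):                                      # sa: (N, input_dim) state-action pairs
        return np.sqrt(2.0 / n_features) * np.cos(sa @ W + b)
    return rff

def embed_policy(state_action_samples, rff):
    return rff(state_action_samples).mean(axis=0)     # mean embedding of the occupancy measure

# MMD(P, Q) is then approximately ||embed_policy(P) - embed_policy(Q)||_2.
```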
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4866. Federated ADMM from Bayesian Duality
๐ฌ 10
โญ 4.50
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ bayesian deep learning variational inference variational learning federated learning convex optimization splitting methods
We propose a new Bayesian approach to derive and extend the federated Alternating Direction Method of Multipliers (ADMM). We show that the solutions of variational-Bayesian objectives are associated with a duality structure that not only resembles ADMM but also extends it. For example, ADMM-like updates are recovered when the objective is optimized over the isotropic-Gaussian family, and new non-trivial extensions are obtained for other more flexible exponential families. Examples include a Newton-like variant that converges in one step on quadratics and an Adam-like variant called IVON-ADMM that has the same cost as Adam but yields up to 7% accuracy boosts in heterogeneous deep learning. Our work opens a new direction to use Bayes to extend ADMM and other primal-dual methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4867. Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
๐ฌ 27
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Deep Learning Large Language Model Preference Learning
Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning difficulty, leading to suboptimal data utilization and final performance. To address this challenge, we propose **Uni-DPO**, a unified dynamic preference optimization paradigm that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Uni-DPO enables more effective utilization of training data and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Uni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Uni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on Arena-Hard. On mathematical reasoning and multimodal tasks, Uni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be made publicly available.
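A hedged sketch of a dynamically weighted DPO loss in the spirit described above: each pair's weight combines a static data-quality score with a difficulty term derived from the current implicit-reward margin. The weighting function is an assumed form, not the paper's.

```python
# Sketch: per-pair weighting of the standard DPO loss.
import torch
import torch.nn.functional as F

def uni_dpo_like_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                      quality, beta=0.1):
    # Standard DPO implicit-reward margin.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    base = -F.logsigmoid(margin)                      # per-pair DPO loss
    difficulty = torch.sigmoid(-margin).detach()      # harder pairs get more weight
    weights = quality * difficulty                    # quality: per-pair score in [0, 1]
    weights = weights / (weights.mean() + 1e-8)
    return (weights * base).mean()
```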
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4868. ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems
๐ฌ 15
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ benchmark LLM Agent
In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the ACADREASON benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge.
It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of ACADREASON. The code and data for the ACADREASON benchmark are available at https://anonymous.4open.science/r/Acadreason-Benchmark-1BD3/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4869. Physics-Informed Inference Time Scaling for Solving High-Dimensional Partial Differential Equations
๐ฌ 22
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ AI for Science Inference-time Scaling Deep learning Curse of dimensionality
Solving high-dimensional partial differential equations (PDEs) is a critical challenge where modern data-driven solvers often lack reliability and rigorous error guarantees. We introduce Simulation-Calibrated Scientific Machine Learning (SCaSML), a framework that systematically improves pre-trained PDE solvers at inference time without any retraining. Our core idea is to derive a new PDE, which we term the Law of Defect, that precisely governs the error of a given surrogate model. Because this defect PDE retains the structure of the original problem, we can solve it efficiently with traditional stochastic simulators, yielding a targeted correction to the initial machine-learned solution. We prove that SCaSML achieves a faster convergence rate, with a final error bounded by the product of the surrogate and simulation errors. On challenging PDEs up to 160 dimensions, SCaSML reduces the error of various surrogate models, including PINNs and Gaussian Processes, by 20-80%. SCaSML provides a principled method to fuse the speed of machine learning with the rigor of numerical simulation, enhancing the trustworthiness of AI for scientific discovery.
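A hedged illustration of the defect idea for a linear operator $\mathcal{L}$ (the paper's Law of Defect is stated more generally): if $\hat u$ is a surrogate solution of $\mathcal{L}u = f$, the defect $e = u - \hat u$ satisfies

$$\mathcal{L}e = f - \mathcal{L}\hat u,$$

a PDE of the same form whose source term is the surrogate's residual; simulating $e$ with a stochastic solver and returning $\hat u + e$ gives the calibrated solution.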
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4870. TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices
๐ฌ 11
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ llm-generated text detection editing tasks wikipedia benchmark multilingual
Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on *generic* text generation tasks (e.g., "Write an article about machine learning."). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These *task-specific* MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce TSM-Bench, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (i) average detection accuracy drops by 10-40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data (even across domains) but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. TSM-Bench therefore provides a crucial foundation for developing and evaluating future models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4871. Medical Interpretability and Knowledge Maps of Large Language Models
๐ฌ 12
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Large Language Models Interpretability Explainability Medicine Healthcare Knowledge Maps
We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how the LLMs both represent and process medical knowledge through four different interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal and (4) activation patching. We present knowledge maps of five LLMs which show, at a coarse resolution, where knowledge about patients' ages, medical symptoms, diseases, and drugs is stored in the models. In particular for Llama3.3-70B, we find that most medical knowledge is processed in the first half of the model's layers. In addition, we find several interesting phenomena: (i) age is often encoded in a non-linear and sometimes discontinuous manner at intermediate layers in the models, (ii) the disease progression representation is non-monotonic and circular at certain layers of the model, (iii) in Llama, drugs cluster by medical specialty rather than by mechanism of action, and (iv) Gemma-27B and MedGemma-27B have activations that collapse at intermediate layers but recover by the final layers. These results can guide future research on fine-tuning, un-learning or de-biasing LLMs for medical tasks by suggesting at which layers in the model these techniques should be applied. We attach our source code to the paper for reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4872. Graph homophily booster: Rethinking the role of discrete features on heterophilic graphs
๐ฌ 14
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Machine Learning Graph Neural Network Heterophily
Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph-structured data, demonstrating remarkable success in many real-world applications such as complex biological network analysis, neuroscientific analysis, and social network analysis. However, existing GNNs often struggle with heterophilic graphs, where connected nodes tend to have dissimilar features or labels. While numerous methods have been proposed to address this challenge, they primarily focus on architectural designs without directly targeting the root cause of the heterophily problem. These approaches still perform even worse than the simplest MLPs on challenging heterophilic datasets. For instance, our experiments show that 21 of the latest GNNs still fall behind the MLP on the Actor dataset. This critical challenge calls for an innovative approach to addressing graph heterophily beyond architectural designs. To bridge this gap, we propose and study a new and unexplored paradigm: directly increasing the graph homophily via a carefully designed graph transformation. In this work, we present a simple yet effective framework called Graph Homophily Booster (GRAPHITE) to address graph heterophily. To the best of our knowledge, this work is the first method that explicitly transforms the graph to directly improve the graph homophily. Stemming from the exact definition of homophily, our proposed GRAPHITE creates feature nodes to facilitate homophilic message passing between nodes that share similar features. Furthermore, we both theoretically and empirically show that our proposed GRAPHITE significantly increases the homophily of originally heterophilic graphs, with only a slight increase in the graph size. Extensive experiments on challenging datasets demonstrate that our proposed GRAPHITE significantly outperforms state-of-the-art methods on heterophilic graphs while achieving comparable accuracy with state-of-the-art methods on homophilic graphs. Furthermore, our proposed graph transformation alone can already enhance the performance of homophilic GNNs on heterophilic graphs, even though they were not originally designed for heterophilic graphs. We will release our code upon the publication of this paper.
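A hedged sketch of a feature-node graph transformation of the kind described above: one auxiliary node per discrete feature value, connected to every original node carrying that value, so that nodes sharing a feature become two-hop neighbors. Edge weights and any normalization are illustrative assumptions, not the paper's construction.

```python
# Sketch: add auxiliary feature nodes to create homophilic two-hop paths.
import networkx as nx

def add_feature_nodes(graph, node_features):
    """node_features: dict mapping node -> iterable of discrete feature values."""
    g = graph.copy()
    for node, values in node_features.items():
        for v in values:
            feat_node = ("feat", v)                 # one auxiliary node per feature value
            g.add_node(feat_node, is_feature=True)
            g.add_edge(node, feat_node)             # links nodes sharing this feature
    return g
```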
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4873. Bias Similarity Measurement: A Black-Box Audit of Fairness Across LLMs
๐ฌ 10
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Bias and Fairness Fairness Auditing Bias Measurement
Large Language Models (LLMs) reproduce social biases, yet prevailing evaluations score models in isolation, obscuring how biases persist across families and releases. We introduce Bias Similarity Measurement (BSM), which treats fairness as a relational property between models, unifying scalar, distributional, behavioral, and representational signals into a single similarity space. Evaluating 30 LLMs on 1M+ prompts, we find that instruction tuning primarily enforces abstention rather than altering internal representations; small models gain little accuracy and can become less fair under forced choice; and open-weight models can match or exceed proprietary systems. Family signatures diverge: Gemma favors refusal, LLaMA 3.1 approaches neutrality with fewer refusals, and converges toward abstention-heavy behavior overall. Counterintuitively, Gemma 3 Instruct matches GPT-4-level fairness at far lower cost, whereas Gemini's heavy abstention suppresses utility. Beyond these findings, BSM offers an auditing workflow for procurement, regression testing, and lineage screening, and extends naturally to code and multilingual settings. Our results reframe fairness not as isolated scores but as comparative bias similarity, enabling systematic auditing of LLM ecosystems. Code is available at https://anonymous.4open.science/r/bias_llm-0A8E.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4874. Diverse Text Decoding via Iterative Reweighting
๐ฌ 20
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Natural Language Processing Large Language Models Text Generation Decoding Method
Recent advances in large language models (LLMs) have led to impressive results in text generation. However, current decoding methods still lack diversity when combined with popular sampling techniques. We propose a Reweighting-based Iterative DEcoding (OverRIDE) approach that dynamically adjusts the decoding process using previously generated responses. Our method fine-tunes auxiliary output heads iteratively on previously generated sequences to capture and suppress semantic patterns that appear in the history of responses. This inference-time training process incurs only a minimal loss of efficiency. We conduct extensive experiments on various tasks, including code generation, mathematical reasoning, and story generation, demonstrating that OverRIDE increases output diversity while maintaining quality. We implement OverRIDE on LLM serving systems like vLLM, incurring only a 6.4% throughput loss for 72B models under parallel decoding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4875. SysMoBench: Evaluating AI on Formally Specifying Complex Real-World Systems
๐ฌ 17
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Specification Benchmark Distributed System Concurrent System Agentic AI Large Language Model
Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small code, not complete systems. It is unclear whether AI can deal with realistic system artifacts, as this requires abstracting their complex behavioral properties into formal models. We present SysMoBench, a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructures, encompassing operating systems and cloud infrastructure. We use TLA+, the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics like syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes eleven diverse system artifacts: the Raft implementation of Etcd and Redis, the leader election of ZooKeeper, the Spinlock, Mutex and Ringbuffer in Asterinas OS, etc., with more being added. SysMoBench enables us to understand the capabilities and limitations of today's LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4876. GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
๐ฌ 22
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Model Reasoning Multimodal Model
Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in Figure 1, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks.
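To make the GPG recipe concrete, here is a minimal sketch of a group-based policy-gradient loss: responses are sampled in groups per prompt, advantages are computed group-relatively from rewards alone (no critic, reference model, or KL term), and a plain REINFORCE-style objective is optimized directly. The normalization and bias-correction details below are assumptions of this sketch, not the paper's exact estimator.

```python
import torch

def group_policy_gradient_loss(logprobs, rewards, eps=1e-6):
    """Group-based policy-gradient loss (sketch).

    logprobs: (num_prompts, group_size) summed token log-probabilities of each
              sampled response under the current policy.
    rewards:  (num_prompts, group_size) scalar rewards for those responses.
    Advantages are centered and scaled within each group and used as fixed
    weights on the log-probabilities; there is no critic or KL penalty.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    advantages = (rewards - mean) / (std + eps)      # group-relative advantage
    return -(advantages.detach() * logprobs).mean()  # direct policy-gradient objective

# Toy usage: 2 prompts, 4 sampled responses each.
logprobs = torch.randn(2, 4, requires_grad=True)
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0]])
group_policy_gradient_loss(logprobs, rewards).backward()
```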
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4877. Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting
๐ฌ 14
โญ 4.50
๐ learning theory
๐ท๏ธ Random Matrix Theory Spiked Model Linear Regression Generalization
This paper analyzes the generalization error of minimum-norm interpolating solutions in linear regression using spiked covariance data models. The paper characterizes how varying spike strengths and target-spike alignments can affect risk, especially in overparameterized settings. The study presents an exact expression for the generalization error, leading to a comprehensive classification of benign, tempered, and catastrophic overfitting regimes based on spike strength, the aspect ratio $c=d/n$ (particularly as $c \to \infty$), and target alignment. Notably, in well-specified aligned problems, increasing spike strength can surprisingly induce catastrophic overfitting before achieving benign overfitting. The paper also reveals that target-spike alignment is not always advantageous, identifying specific, sometimes counterintuitive, conditions for its benefit or detriment. Alignment with the spike being detrimental is empirically demonstrated to persist in nonlinear models.
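The object of study here is easy to reproduce numerically. The toy simulation below (an illustration, not the paper's exact setup) draws data from a single-spike covariance $\Sigma = I + \theta v v^\top$ with the target aligned to the spike, fits the minimum-norm interpolator in the overparameterized regime $d > n$, and evaluates its excess risk; sweeping the spike strength and the aspect ratio $c = d/n$ reproduces the kind of phase behavior the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 1000, 0.5       # aspect ratio c = d/n = 10 (overparameterized)
theta = 25.0                       # spike strength (illustrative value)
beta = np.zeros(d); beta[0] = 1.0  # target fully aligned with the spike direction e_1

# x ~ N(0, I + theta * e_1 e_1^T),  y = x @ beta + noise
X = rng.standard_normal((n, d))
X[:, 0] *= np.sqrt(1.0 + theta)
y = X @ beta + sigma * rng.standard_normal(n)

# Minimum-norm interpolator: beta_hat = X^T (X X^T)^{-1} y
beta_hat = X.T @ np.linalg.solve(X @ X.T, y)

# Excess risk (beta_hat - beta)^T Sigma (beta_hat - beta) under the spiked covariance
delta = beta_hat - beta
risk = delta @ delta + theta * delta[0] ** 2
print(f"excess risk of the min-norm interpolator: {risk:.3f}")
```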
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4878. ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping
๐ฌ 13
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Diffusion Large Language Model Inference Acceleration KV Caching
Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM generation remains time-consuming, as it processes the full context for inference at each iteration. In this work, we analyze the generation characteristics of dLLMs and observe that intermediate states (e.g., key, value, and hidden states) change only subtly across iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLM that reduces computation by skipping tokens with low importance scores in earlier layers of the model. Importance is estimated based on intermediate tensor variation and confidence scores from previous iterations. Experiments on LLaDA-8B and Dream-7B show that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS) on an NVIDIA H200 GPU, delivering 5.6$\times$ to 16.8$\times$ speedup over the original implementation and up to 1.85$\times$ over the state-of-the-art caching method, while preserving generation quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4879. Robust Spiking Neural Networks Against Adversarial Attacks
๐ฌ 4
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks Robustness
Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing due to their bio-plausible and spike-driven characteristics.
However, the robustness of SNNs in complex adversarial environments remains significantly constrained. In this study, we theoretically demonstrate that threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs.
We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances. To address this challenge, we propose a Threshold Guarding Optimization (TGO) method, which comprises two key aspects. First, we incorporate additional constraints into the loss function to move neurons' membrane potentials away from their thresholds. It increases SNNs' gradient sparsity, thereby reducing the theoretical upper bound of adversarial attacks. Second, we introduce noisy spiking neurons to transition the neuronal firing mechanism from deterministic to probabilistic, decreasing their state-flipping probability due to minor disturbances. Extensive experiments conducted in standard adversarial scenarios prove that our method significantly enhances the robustness of directly trained SNNs. These findings pave the way for advancing more reliable and secure neuromorphic computing in real-world applications.
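The first ingredient of TGO, pushing membrane potentials away from the firing threshold, can be sketched as a hinge-style penalty added to the task loss. The margin, weighting, and exact functional form here are assumptions for illustration, and the noisy-neuron component is omitted.

```python
import torch
import torch.nn.functional as F

def threshold_guard_penalty(membrane_potential, v_th=1.0, margin=0.2):
    """Penalize membrane potentials lying within `margin` of the firing
    threshold v_th, nudging neurons away from the flip-prone region."""
    distance = (membrane_potential - v_th).abs()
    return F.relu(margin - distance).mean()

# Toy usage: potentials of one SNN layer at one timestep.
u = torch.tensor([0.95, 1.02, 0.40, 1.60], requires_grad=True)
task_loss = torch.tensor(0.0)                     # stand-in for the usual CE loss
loss = task_loss + 0.1 * threshold_guard_penalty(u)
loss.backward()
```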
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4880. Label Smoothing Improves Machine Unlearning
๐ฌ 4
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine unlearning Label smoothing Differential privacy
The objective of machine unlearning (MU) is to eliminate previously learned data from a model. However, it can be challenging to strike a balance between computation cost and performance when using existing MU techniques. Taking inspiration from the influence of label smoothing on model confidence and differential privacy, we propose a simple gradient-based MU approach that uses an inverse process of label smoothing. This work introduces UGradSL, a simple, plug-and-play MU approach that uses smoothed labels. We provide theoretical analyses demonstrating why properly introducing label smoothing improves MU performance. We conducted extensive experiments on several datasets of various sizes and different modalities, demonstrating the effectiveness and robustness of our proposed method. UGradSL also shows a close connection to local differential privacy. The consistent improvement in MU performance comes at only a marginal cost of additional computation. For instance, UGradSL consistently improves over the gradient ascent MU baseline on different unlearning tasks without sacrificing unlearning efficiency. A self-adaptive variant of UGradSL is also provided for simple parameter selection.
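One way to read the "inverse process of label smoothing" is as a smoothing coefficient applied to the forget-set labels inside a gradient-based unlearning step; the sketch below combines ascent on the forget set with descent on the retain set under that reading. The sign conventions and the negative smoothing value are assumptions of this illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, smoothing):
    """Cross-entropy against a smoothed label distribution; smoothing > 0 is
    ordinary label smoothing, a negative value sharpens beyond one-hot."""
    num_classes = logits.size(-1)
    one_hot = F.one_hot(targets, num_classes).float()
    soft = (1.0 - smoothing) * one_hot + smoothing / num_classes
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def unlearning_step(model, optimizer, forget_batch, retain_batch, smoothing=-0.2):
    """Ascend on the forget set with smoothed labels, descend on the retain set."""
    fx, fy = forget_batch
    rx, ry = retain_batch
    loss = -smoothed_ce(model(fx), fy, smoothing) + F.cross_entropy(model(rx), ry)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Toy usage with a linear classifier on 10-dim inputs, 5 classes.
model = torch.nn.Linear(10, 5)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
forget = (torch.randn(8, 10), torch.randint(0, 5, (8,)))
retain = (torch.randn(8, 10), torch.randint(0, 5, (8,)))
unlearning_step(model, opt, forget, retain)
```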
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4881. The Power of Small Initialization in Noisy Low-Tubal-Rank Tensor Recovery
๐ฌ 27
โญ 4.50
๐ optimization
๐ท๏ธ low-tubal-rank tensor recovery; t-SVD; t-product; over-parameterization; non-convex
We study the problem of recovering a low-tubal-rank tensor $\mathcal{X}\_\star\in \mathbb{R}^{n \times n \times k}$ from noisy linear measurements under the t-product framework. A widely adopted strategy involves factorizing the optimization variable as $\mathcal{U} * \mathcal{U}^\top$, where $\mathcal{U} \in \mathbb{R}^{n \times R \times k}$, followed by applying factorized gradient descent (FGD) to solve the resulting optimization problem. Since the tubal-rank $r$ of the underlying tensor $\mathcal{X}_\star$ is typically unknown, this method often assumes $r < R \le n$, a regime known as over-parameterization. However, when the measurements are corrupted by some dense noise (e.g., sub-Gaussian noise), FGD with the commonly used spectral initialization yields a recovery error that grows linearly with the over-estimated tubal-rank $R$. To address this issue, we show that using a small initialization enables FGD to achieve a nearly minimax optimal recovery error, even when the tubal-rank $R$ is significantly overestimated. Using a four-stage analytic framework, we analyze this phenomenon and establish the sharpest known error bound to date, which is independent of the overestimated tubal-rank $R$. Furthermore, we provide a theoretical guarantee showing that an easy-to-use early stopping strategy can achieve the best known result in practice. All these theoretical findings are validated through a series of simulations and real-data experiments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4882. Towards a Theoretical Understanding of In-context Learning: Stability and Non-I.I.D Generalisation
๐ฌ 17
โญ 4.50
๐ learning theory
๐ท๏ธ In-context Learning Generalisation Error
In-context learning (ICL) has demonstrated significant performance improvements in transformer-based large models. This study identifies two key factors influencing ICL generalisation under complex non-i.i.d. scenarios: algorithmic stability and distributional discrepancy. First, we establish a stability bound for transformer-based models trained with mini-batch gradient descent, revealing how specific optimization configurations interact with the smoothness of the loss landscape to ensure the stability of non-linear Transformers. Next, we introduce a distribution-level discrepancy measure that highlights the importance of aligning the ICL prompt distribution with the training data distribution to achieve effective generalisation. Building on these insights, we derive a generalisation error bound for ICL with asymptotic convergence guarantees, which further reveals that token-wise prediction errors accumulate over time and can even lead to generalisation collapse if the prediction length is not properly constrained. Finally, empirical evaluations are provided to validate our theoretical findings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4883. CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling
๐ฌ 25
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large language model Reasoning Test-time scaling
Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as "reflection tokens" (e.g., "wait," "but," "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a training-free decoding strategy that dynamically modulates reflection token logits with a bidirectional, position-dependent triangular waveform, incurring no additional computation cost. Experiments on MATH500, AIME2024/2025, AMC2023, GPQA Diamond, and LiveCodeBench demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and recent approaches such as TIP (thought switching penalty) and S1.
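The core mechanism, a position-dependent triangular offset added to reflection-token logits at decode time, can be sketched as below. The period, amplitude, and the particular token ids for "wait"/"but"/"alternatively" are placeholders, not the paper's reported schedule.

```python
def reflection_logit_offset(position, period=256, amplitude=2.0):
    """Triangular waveform over decoding positions: rises linearly from
    -amplitude to +amplitude across half a period, then falls back."""
    phase = (position % period) / period        # in [0, 1)
    tri = 1.0 - 4.0 * abs(phase - 0.5)          # triangle wave in [-1, 1]
    return amplitude * tri

def adjust_logits(logits, reflection_token_ids, position):
    """Add the cyclical offset to reflection-token logits only (in place)."""
    offset = reflection_logit_offset(position)
    for tok in reflection_token_ids:
        logits[tok] += offset
    return logits

# Toy usage with a plain list of logits and hypothetical reflection-token ids.
logits = [0.0] * 50_000
adjust_logits(logits, reflection_token_ids=[1725, 719, 6142], position=300)
```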
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4884. Getting Your LLMs Ready for Reinforcement Learning with Lightweight SFT
๐ฌ 11
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs SFT Post-train
Reinforcement learning (RL) has emerged as a powerful post-training paradigm for large language models (LLMs), yet its effectiveness varies significantly across base models. While incorporating a pre-RL supervised fine-tuning (SFT) phase can enhance RL training, key questions remain: how long should the SFT cold-start phase last, and is the SFT objective truly aligned with the requirements for effective RL preparation?
In our analysis of cold-start dynamics, we uncover a key limitation: the SFT checkpoint with the highest evaluation performance often fails to maximize RL potential due to distributional forgetting--a phenomenon where the model drifts excessively away from the base model's distribution even before traditional overfitting occurs. We identify diversity metrics, such as entropy and self-BLEU, as more reliable early-stopping criteria than the standard performance-based checkpoint selection. Our findings show that SFT checkpoints with peak diversity consistently lead to superior post-RL results. Building on these insights, we introduce Adaptive Early-Stop Loss (AESL), a lightweight and dynamic cold-start method that balances the acquisition of new patterns with the preservation of the base model's distribution. AESL operates at both the token and subsequence levels, providing finer-grained control over the cold-start process. Experimental results on mathematical reasoning benchmarks demonstrate that diversity-based early stopping surpasses traditional performance-based SFT, while AESL further enhances RL preparation. By steering LLMs toward better initialization points for RL, AESL consistently achieves superior final performance compared to existing SFT and cold-start strategies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4885. Learning Escorted Protocols For Multistate Free-Energy Estimation
๐ฌ 9
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Free-Energy Jarzynski Equality Crooks Fluctuation Theorem Non-Equilibrium Transport Thermodynamics Stochastic Thermodynamics Flow Matching
Estimating relative free energy differences between multiple thermodynamic states lies at the core of numerous problems in computational biochemistry. Traditional estimators, such as Free Energy Perturbation and its non-equilibrium counterpart based on the Jarzynski equality, rely on defining a switching protocol between thermodynamic states and computing the free energy difference from the work performed during this process. In this work, we present a method for learning such switching protocols within the class of escorted protocols that combine deterministic and stochastic steps. For this purpose, we use Conditional Flow Matching and introduce Conditional Density Matching (CDM) to estimate the change in free energy. We further reduce the variance in the multistate setting by coupling multiple flows between thermodynamic states into a Flow Graph, enforcing estimator consistency across different transition paths.
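For context, the Jarzynski estimator the abstract refers to turns work samples from repeated switching simulations into a free-energy difference; learned escorted protocols aim to lower the variance of exactly this estimate. A minimal, numerically stable version:

```python
import numpy as np
from scipy.special import logsumexp

def jarzynski_free_energy(work, kT=1.0):
    """Estimate the free-energy difference from non-equilibrium work samples
    via the Jarzynski equality: Delta F = -kT * ln < exp(-W / kT) >.

    Uses log-sum-exp for numerical stability; the estimator is biased at finite
    sample size, which is why low-variance protocols matter.
    """
    work = np.asarray(work, dtype=float)
    return -kT * (logsumexp(-work / kT) - np.log(work.size))

# Toy usage: synthetic Gaussian work samples (in units of kT); for Gaussian work
# with mean 2.0 and variance 1.0 the true Delta F is 2.0 - 0.5 = 1.5.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)
print(jarzynski_free_energy(samples))
```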
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4886. Leveraging Pretrained Knowledge at Inference Time: LoRA-Gated Contrastive Decoding for Multilingual Factual Language Generation in Adapted LLMs
๐ฌ 29
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Contrastive Decoding Multilingual Language Models Inference-Time Knowledge Integration Token-Level Confidence Gating LLM
Large language models (LLMs) adapted to specific languages through continual pretraining or instruction tuning often suffer from catastrophic forgetting, which can lead to factual inaccuracies. This issue is particularly pronounced in multilingual settings, where adaptation may override general world knowledge with language-specific patterns. We propose LoRA-Gated Contrastive Decoding (LGCD), a training-free inference-time decoding framework that improves factuality in language-adapted LLMs by leveraging knowledge from the original pretrained model. LGCD operates by (1) extracting factual representations from Feed-Forward Network (FFN) layers via LoRA-based decomposition, approximating pretrained knowledge, (2) dynamically gating decoding based on token-level confidence, and (3) applying contrastive decoding with Top-K masking to revise uncertain predictions by referencing the approximated representation of pretrained knowledge. LGCD requires no additional training or access to the original pretraining data. Extensive experiments with LGCD on multilingual multiple-choice and long-form QA tasks across nine languages demonstrate its strong effectiveness in mitigating hallucinations and enhancing factual accuracy in language-adapted models. These results further indicate that pretrained knowledge can be strategically reintroduced during decoding to promote factual multilingual generation.
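A rough sketch of the gating-plus-contrast step described here: keep the adapted model's token when it is confident, otherwise rescore its top-K candidates against a reference distribution standing in for the LoRA-approximated pretrained knowledge. The gate threshold, contrast weight, and the exact contrastive score are assumptions of this sketch rather than the paper's formulation.

```python
import torch

def gated_contrastive_decode(adapted_logits, pretrained_logits,
                             tau=0.5, alpha=1.0, top_k=20):
    """Token-level confidence gating with contrastive rescoring (sketch)."""
    adapted_logp = torch.log_softmax(adapted_logits, dim=-1)
    if adapted_logp.exp().max() >= tau:             # confident: keep adapted prediction
        return int(adapted_logp.argmax())

    pretrained_logp = torch.log_softmax(pretrained_logits, dim=-1)
    # Favor tokens the pretrained-knowledge reference ranks higher than the adapted model.
    scores = pretrained_logp + alpha * (pretrained_logp - adapted_logp)
    masked = torch.full_like(scores, float("-inf"))
    topk = torch.topk(adapted_logp, top_k).indices  # Top-K masking on adapted candidates
    masked[topk] = scores[topk]
    return int(masked.argmax())

# Toy usage over a 100-token vocabulary.
next_token = gated_contrastive_decode(torch.randn(100), torch.randn(100))
```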
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4887. Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs
๐ฌ 10
โญ 4.50
๐ learning theory
๐ท๏ธ Generalization bound Deep neural network Koopman operator Reproducing kernel Hilbert space
We derive a new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and reproducing kernel Hilbert spaces (RKHSs).
The proposed bound describes why the models with high-rank weight matrices generalize well.
Although there are existing bounds that attempt to describe this phenomenon, these existing bounds can be applied to limited types of models.
We introduce an algebraic representation of neural networks and a kernel function to construct an RKHS to derive a bound for a wider range of realistic models.
This work paves the way for the Koopman-based theory for Rademacher complexity bounds to be valid for more practical situations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4888. Online time series prediction using feature adjustment
๐ฌ 16
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ time series neural network online adaption
Time series forecasting is of significant importance across various domains. However, it faces significant challenges due to distribution shift. This issue becomes particularly pronounced in online deployment scenarios where data arrives sequentially, requiring models to adapt continually to evolving patterns. Current time series online learning methods focus on two main aspects: selecting suitable parameters to update (e.g., final layer weights or adapter modules) and devising suitable update strategies (e.g., using recent batches, replay buffers, or averaged gradients). We challenge the conventional parameter selection approach, proposing that distribution shifts stem from changes in underlying latent factors influencing the data. Consequently, updating the feature representations of these latent factors may be more effective. To address the critical problem of delayed feedback in multi-step forecasting (where true values arrive much later than predictions), we introduce ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space). ADAPT-Z utilizes an adapter module that leverages current feature representations combined with historical gradient information to enable robust parameter updates despite the delay. Extensive experiments demonstrate that our method consistently outperforms standard base models without adaptation and surpasses state-of-the-art online learning approaches across multiple datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4889. VUDG: A Dataset for Video Understanding Domain Generalization
๐ฌ 9
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Video Understanding Dataset Domain Generalization
Video understanding has made remarkable progress in recent years, largely driven by advances in deep models and the availability of large-scale annotated datasets.
However, the robustness of these models to domain shifts encountered in real-world video applications remains a critical yet underexplored problem, limiting their practical reliability.
To address this problem, we introduce \textbf{V}ideo \textbf{U}nderstanding \textbf{D}omain \textbf{G}eneralization (\textbf{VUDG}), the first dataset designed specifically for evaluating domain generalization in video understanding.
VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic consistency across different domains to ensure fair and meaningful evaluation. We propose a multi-expert progressive annotation framework to efficiently annotate videos with structured question-answer pairs designed for domain generalization.
Extensive experiments on 9 representative Large Vision-Language Models (LVLMs) and several traditional video question answering methods show that most models (including state-of-the-art LVLMs) suffer performance degradation under domain shifts.
These results highlight the challenges posed by VUDG and the difference in the robustness of current models to data distribution shifts. We believe VUDG provides a critical resource to benefit future research in domain generalization for video understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4890. Markovian Transformers for Informative Language Modeling
๐ฌ 9
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Markovian Transformers chain-of-thought reasoning language model interpretability causal reasoning reinforcement learning next-token prediction GSM8K large language models
Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language model's underlying decision process. We address this by introducing a \emph{Markovian} language model framework with an autoencoder-style \emph{reasoning bottleneck}: it creates a text-based bottleneck where CoT serves as an intermediate representation, forcing the model to compress essential reasoning into interpretable text before making predictions, in the sense of learning short intermediate descriptions that make answers easy to compute from questions. We train this system with a GRPO-style policy gradient algorithm using parallel sampling, a frozen baseline CoT$'$, within-batch standardized advantages, and actor-reward (chain-rule) gradients. On QA tasks, Markovian training recovers most of the gains of a non-Markovian GRPO variant while forcing the model to answer from the CoT alone (e.g., GSM8K: 19.6\% $\to$ 57.1\%; ARC-Challenge: 36.1\% $\to$ 79.9\%; on average only $\approx$3-4 pp below a non-Markovian upper bound). Perturbation analyses across types and severities show that Markovian models incur systematically larger log-probability drops under CoT corruption than matched Non-Markovian baselines, indicating stronger causal reliance on the CoT. Cross-model evaluation confirms that learned CoTs generalize across architectures, suggesting they capture transferable reasoning patterns rather than model-specific artifacts.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4891. Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
๐ฌ 4
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreak attack LLMs Classical Chinese
As Large Language Models (LLMs) are increasingly used, their security risks have drawn increasing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes a framework, CC-BOS, for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions--covering role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern and context--and iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module. Extensive experiments demonstrate the effectiveness of the proposed CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4892. Strategic Obfuscation of Deceptive Reasoning in Language Models
๐ฌ 8
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ ai safety chain-of-thought monitoring alignment faking deceptive alignment unfaithful chain-of-thought
Large language models can exhibit different behaviors during training versus deployment, a phenomenon known as alignment faking. While prior work has shown models can explicitly reason about faking alignment to preserve their goals, we demonstrate that Claude 3.5 Sonnet exhibits alignment faking behavior without verbalizing deceptive reasoning. When told its reasoning will be monitored and that it will be trained to comply with harmful queries, the model shows a 24.3\% compliance gap between training and deployment contexts while its verbalization of alignment faking drops by 93.7\%. We compare this to instruction-based approaches where we explicitly tell models to hide certain reasoning from monitors. We find that intrinsic motivation to obfuscate leads to far better performance at evading a monitor than following instructions, even with assistance on how to do so. Even on tasks requiring simpler reasoning, instruction-based methods only reduce verbalization by 47.6\%. Our results indicate that models can exhibit sophisticated deceptive behavior in high-stakes scenarios without accessible reasoning when internally motivated, limiting the reliability of instruction-based elicitation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4893. Boosting for Predictive Sufficiency
๐ฌ 18
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ OOD Generalization Boosting Predictive sufficiency Reference Class
Out-of-distribution (OOD) generalization is a defining hallmark of truly robust and reliable machine learning systems. Recently, it has been empirically observed that existing OOD generalization methods often underperform on real-world tabular data, where hidden confounding shifts drive distribution changes that boosting models handle more effectively. Part of boosting's success is attributed to variance reduction, handling missing variables, feature selection, and connections to multicalibration. This paper uncovers a crucial reason behind its success in OOD generalization: boosting's ability to infer stable environments robust to hidden confounding shifts and maximize predictive performance within those environments. This paper introduces an information-theoretic notion called $\alpha$-predictive sufficiency and formalizes its link to OOD generalization under hidden confounding. We show that boosting implicitly identifies suitable environments and produces an $\alpha$-predictive sufficient predictor. We validate our theoretical results through synthetic and real-world experiments and show that boosting achieves robust performance by identifying these environments and maximizing the association between predictions and true outcomes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4894. On the Shelf Life of Finetuned LLM-Judges: Future Proofing, Backward Compatibility, and Question Generalization
๐ฌ 11
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models LLM-as-a-Judge Distribution Shift Generalization Evaluation Robustness
The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and finetuning. Recently, finetuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases.
However, the standard evaluation ignores several practical concerns of finetuned judges regarding their real world deployment.
In this paper, we identify and formalize three aspects that affect the *shelf life* of these judges: *future-proofing* and *backward-compatibility* -- how well judges finetuned on responses by today's generator models perform on responses by future models or past models, as well as *question generalization* -- how well judges generalize to unseen questions at test time. We study these three aspects under a unified framework with varying train and test distributions in two reasoning datasets, three SFT- and DPO-based fine-tuning algorithms, and three different backbone models. Experiments suggest that future-proofing is challenging for most models, while backward-compatibility is relatively easy, with DPO-trained models consistently *improving* performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models observe certain degrees of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4895. Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
๐ฌ 25
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Test-Time Adaptation Entropy Minimization Representation Learning
Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably results in producing biased estimates of uncertainty entropy. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE, even without hyperparameter tuning required by TE, to accurately select high-confidence views and seamlessly integrate with label adjustment strategy to enhance adaptation. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://anonymous.4open.science/r/TTA-Entropy.
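The generalization from Shannon to Tsallis entropy used here is a one-line formula; the snippet below shows the standard definition and its limit as q approaches 1 recovering Shannon entropy, which is the sense in which SE-based objectives are a special case.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1); the q -> 1 limit
    is the Shannon entropy -sum_i p_i log p_i."""
    p = np.asarray(p, dtype=float)
    if abs(q - 1.0) < 1e-8:
        return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)))
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

probs = np.array([0.7, 0.2, 0.1])
print(tsallis_entropy(probs, q=1.0))   # Shannon entropy as the limiting case
print(tsallis_entropy(probs, q=1.5))   # Tsallis entropy with a non-extensive q
```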
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4896. ConRep4CO: Contrastive Representation Learning of Combinatorial Optimization Instances across Types
๐ฌ 26
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Combinatorial Optimization Contrastive Learning Representation Learning
Considerable efforts have been devoted to machine learning (ML) for combinatorial optimization (CO) problems, especially on graphs. Compared to the active and well-established research for representation learning of text and vision, etc., it remains under-studied for the representation learning of CO problems, especially across different types. In this paper, we try to fill this gap (especially for NP-complete (NPC) problems, as they, in fact, can be reduced to one another). Our so-called ConRep4CO framework performs contrastive learning by first transforming CO instances in various original forms into the form of Boolean satisfiability (SAT). This scheme is readily doable, especially for NPC problems, including those practical graph decision problems (GDPs) which are inherently related to their NP-hard optimization versions. Specifically, each positive pair of instances for contrasting consists of an instance in its original form and its corresponding transformed SAT form, while the negative samples are other instances not in correspondence. Extensive experiments on seven GDPs (most of which are NPC) show that ConRep4CO significantly improves the representation quality and generalizability to problem scale. Furthermore, we conduct extensive experiments on NP-hard optimization versions of the GDPs, including MVC, MIS, MC and MDS. The results show that introducing ConRep4CO can yield performance improvements of 61.27%, 32.20%, 36.46%, and 45.29% in objective value gaps compared to problem-specific baselines, highlighting the potential of ConRep4CO as a unified pre-training paradigm for CO problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4897. Point Prompting: Counterfactual Tracking with Video Diffusion Models
๐ฌ 4
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ video diffusion models tracking tracking any point diffusion corresponding matching video generation
Recent advances in video generation have produced powerful diffusion models capable of generating high-quality, temporally coherent videos. We ask whether space-time tracking capabilities emerge automatically within these generators, as a consequence of the close connection between synthesizing and estimating motion. We propose a simple but effective way to elicit point tracking capabilities in off-the-shelf image-conditioned video diffusion models. We simply place a colored marker in the first frame, then guide the model to propagate the marker across frames, following the underlying videoโs motion. To ensure the marker remains visible despite the modelโs natural priors, we use the unedited video's initial frame as a negative prompt. We evaluate our method on the TAP-Vid benchmark using several video diffusion models. We find that it outperforms prior zero-shot methods, often obtaining performance that is competitive with specialized self-supervised models, despite the fact that it does not require any additional training.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4898. Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
๐ฌ 30
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Visualization
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states' distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT to a model that predicts the property they observe. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts the reasoning accuracy and the test-time scaling effect.
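The feature construction and projection described here are straightforward to prototype: represent each reasoning state by its distances to the answer-choice embeddings, then run t-SNE. The sentence-encoder embeddings and Euclidean distance below are assumptions; the paper's exact featurization may differ.

```python
import numpy as np
from sklearn.manifold import TSNE

def landscape_coordinates(state_embeddings, choice_embeddings, perplexity=5.0):
    """Map reasoning-trajectory states to 2-D for visualization (sketch).

    Each textual state is represented by its distances to the answer-choice
    embeddings, and the resulting feature matrix is projected with t-SNE.
    """
    # (num_states, num_choices) matrix of Euclidean distances.
    dists = np.linalg.norm(
        state_embeddings[:, None, :] - choice_embeddings[None, :, :], axis=-1
    )
    return TSNE(n_components=2, perplexity=perplexity, init="random",
                random_state=0).fit_transform(dists)

# Toy usage: 40 intermediate reasoning states, 4 answer choices, 384-dim embeddings.
rng = np.random.default_rng(0)
states = rng.standard_normal((40, 384))
choices = rng.standard_normal((4, 384))
coords = landscape_coordinates(states, choices)   # (40, 2) points to plot
```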
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4899. Understanding Dataset Distillation via Spectral Filtering
๐ฌ 10
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Dataset Distillation Spectral Filtering
Dataset distillation (DD) has emerged as a promising approach to compress datasets and speed up model training. However, the underlying connections among various DD methods remain largely unexplored. In this paper, we introduce UniDD, a spectral filtering framework that unifies diverse DD objectives. UniDD interprets each DD objective as a specific filter function applied to the eigenvalues of the feature-feature correlation (FFC) matrix to extract certain frequency information of the feature-label correlation (FLC) matrix. In this way, UniDD reveals that the essence of DD fundamentally lies in matching frequency-specific features. Moreover, we characterize the roles of different filters. For example, low-pass filters, e.g., DM and DC, capture blurred patches, while high-pass filters, e.g., MTT and FrePo, prefer to synthesize fine-grained textures and have better diversity. However, existing methods can only learn the sole frequency information as they rely on fixed filter functions throughout distillation. To address this limitation, we further propose Curriculum Frequency Matching (CFM), which gradually adjusts the filter parameter to cover both low- and high-frequency information of the FFC and FLC matrices. Extensive experiments on small-scale datasets, such as CIFAR-10/100, and large-scale ImageNet-1K, demonstrate the superior performance of CFM over existing baselines and validate the practicality of UniDD.
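The unifying view above, a filter acting on the eigenvalues of the FFC matrix before it multiplies the FLC matrix, can be sketched directly. The two example filters are only illustrative stand-ins for objectives that emphasize large versus small eigenvalues, not the exact filters of DM, DC, MTT, or FrePo.

```python
import numpy as np

def filtered_flc(features, labels, filter_fn):
    """Apply a spectral filter to the feature-feature correlation (FFC) matrix
    and use it to transform the feature-label correlation (FLC) matrix,
    i.e. compute g(FFC) @ FLC for a filter g acting on FFC eigenvalues."""
    n = features.shape[0]
    ffc = features.T @ features / n                  # (d, d)
    flc = features.T @ labels / n                    # (d, c)
    eigvals, eigvecs = np.linalg.eigh(ffc)
    filtered = eigvecs @ np.diag(filter_fn(eigvals)) @ eigvecs.T
    return filtered @ flc

# Toy usage: 512 samples, 64-dim features, 10 label dimensions.
rng = np.random.default_rng(0)
F, Y = rng.standard_normal((512, 64)), rng.standard_normal((512, 10))
emphasize_large = filtered_flc(F, Y, lambda lam: lam / (lam + 1.0))   # large eigenvalues dominate
emphasize_small = filtered_flc(F, Y, lambda lam: 1.0 / (lam + 1.0))   # small eigenvalues dominate
```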
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4900. Temporal Geometry of Deep Networks: Hyperbolic Representations of Training Dynamics for Intrinsic Explainability
๐ฌ 30
โญ 4.50
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Meta Networks Temporal Hyperbolic Embeddings Neural Weights as Data
This paper investigates how multilayer perceptrons (MLPs) can be represented in non-Euclidean spaces, with emphasis on the Poincaré model of hyperbolic geometry. We aim to capture the geometric evolution of their weighted topology and self-organization over time. Instead of restricting analysis to single checkpoints, we construct temporal parameter-graphs across $T$ snapshots of the optimization process. This reflects the view that neural networks encode information not only in their weights but also in the trajectory traced during training. Drawing on the idea that many complex networks admit embeddings in hidden metric spaces where distances correspond to connection likelihood, we present a geometric and temporal graph-based meta learning framework for obtaining dynamic hyperbolic representations of the underlying neural parameter graphs. Our model embeds temporal parameter-graphs in the Poincaré ball and learns from them while maintaining equivariance to within-snapshot neuron permutations and invariance to permutations of past snapshots. In doing so, it preserves functional equivalence across time and recovers the network's latent geometry. Experiments on regression and classification tasks with trained MLPs show that hyperbolic temporal representations expose how structure emerges during training, offering intrinsic explanations of self-organisation in a given model training environment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4901. Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
๐ฌ 20
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large language models Knowledge-aware refusal Factuality evaluation
Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model's actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile. This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
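Given per-question refusal and error probabilities (e.g., estimated from repeated sampling), the Refusal Index as defined above is a one-liner with scipy; the toy numbers below are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def refusal_index(refusal_prob, error_prob):
    """Refusal Index: Spearman rank correlation between per-question refusal
    probability and error probability. Values near 1 mean the model tends to
    refuse exactly the questions it would otherwise get wrong; values near 0
    mean refusals are uncorrelated with knowledge gaps."""
    rho, _ = spearmanr(refusal_prob, error_prob)
    return rho

# Toy usage: estimated per-question probabilities from repeated evaluation runs.
refusals = np.array([0.9, 0.1, 0.7, 0.0, 0.5])
errors = np.array([0.8, 0.2, 0.9, 0.1, 0.4])
print(refusal_index(refusals, errors))
```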
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4902. PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
๐ฌ 17
โญ 4.50
๐ generative models
๐ท๏ธ LLM Speculative Decoding AI Infra Low Cost Training
The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a promising solution, adopting a draft-then-verify strategy to accelerate token generation. While the EAGLE series achieves strong acceleration, its requirement of training a separate draft head for each target model introduces substantial adaptation costs. In this work, we propose \textbf{PARD (PARallel Draft)}, a novel speculative decoding method featuring \textit{target-independence} and \textit{parallel token prediction}. Specifically, PARD enables a single draft model to be applied across an entire family of target models without requiring separate training for each variant, thereby minimizing adaptation costs. Meanwhile, PARD substantially accelerates inference by predicting multiple future tokens within a single forward pass of the draft phase. To further reduce the training adaptation cost of PARD, we propose a COnditional Drop-token (COD) mechanism based on the integrity of prefix key-value states, enabling autoregressive draft models to be adapted into parallel draft models at low-cost. Our experiments show that the proposed COD method improves draft model training efficiency by \textbf{3$\times$} compared with traditional masked prediction training. On the \texttt{vLLM} inference framework, PARD achieves up to \textbf{3.67$\times$} speedup on LLaMA3.1-8B, reaching \textbf{264.88} tokens per second, which is \textbf{1.15$\times$} faster than EAGLE-3.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4903. JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
๐ฌ 13
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Multi-modal Large Language Model Benchmark Omni-Large Language Model Video Understanding
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans.
However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation.
To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene).
Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding.
We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves only 62.6\% average accuracy, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4904. Trajectory Generation with Conservative Value Guidance for Offline Reinforcement Learning
๐ฌ 20
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Sequential Decision Making
Recent advances in offline reinforcement learning (RL) have led to the development of high-performing algorithms that achieve impressive results across standard benchmarks. However, many of these methods depend on increasingly complex planning architectures, which hinder their deployment in real-world settings due to high inference costs. To overcome this limitation, recent research has explored data augmentation techniques that offload computation from online decision-making to offline data preparation. Among these, diffusion-based generative models have shown potential in synthesizing diverse trajectories but incur significant overhead in training and data generation.
In this work, we propose Trajectory Generation with Conservative Value Guidance (TGCVG), a novel trajectory-level data augmentation framework that integrates a high-performing offline policy with a learned dynamics model. To ensure that the synthesized trajectories are both high-quality and close to the original dataset distribution, we introduce a value-guided regularization during the training of the offline policy. This regularization encourages conservative action selection, effectively mitigating distributional shift during trajectory synthesis.
Empirical results on standard benchmarks demonstrate that TGCVG not only improves the performance of state-of-the-art offline RL algorithms but also significantly reduces training and trajectory synthesis time. These findings highlight the effectiveness of value-aware data generation in improving both efficiency and policy performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4905. CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration
๐ฌ 14
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Cloud Device Collaboration Image Generation Uncertainty Quantification
Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework \textbf{CIAR}, which utilizes on-device self-verification to handle two key properties of visual synthesis: \textit{the vast token vocabulary} required for high-fidelity images and \textit{inherent spatial redundancy} which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy.
Extensive experiments demonstrate that CIAR achieves a 2.18$\times$ speed-up and reduces cloud requests by 70\%, while preserving image quality compared to existing methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4906. Patronus: Interpretable Diffusion Models with Prototypes
๐ฌ 24
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ interpretability interpretable diffusion model diffusion models transparency prototypical network shortcut learning bias detection
Uncovering the opacity of diffusion-based generative models is urgently needed, as their applications continue to expand while their underlying procedures largely remain a black box.
With a critical question -- how can the diffusion generation process be interpreted and understood? -- we propose \textit{Patronus}, an interpretable diffusion model that incorporates a prototypical network to encode semantics in visual patches, revealing \textit{what} visual patterns are learned and \textit{where} and \textit{when} they emerge throughout denoising.
This interpretability of Patronus provides deeper insights into the generative mechanism, enabling the detection of shortcut learning via unwanted correlations and the tracing of semantic emergence across timesteps. We evaluate \textit{Patronus} on four natural image datasets and one medical imaging dataset, demonstrating both faithful interpretability and strong generative performance. With this work, we open new avenues for understanding and steering diffusion models through prototype-based interpretability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4907. SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
๐ฌ 32
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ alignment red-teaming safety robustness sociopolitical risks democracy defense societal good evaluation benchmark
Large language models (LLMs) are increasingly deployed in contexts where their failures have the potential to carry sociopolitical consequences. However, existing safety benchmarks sparsely test vulnerabilities in domains such as political manipulation, propaganda generation, or surveillance and information control. To address this gap, we propose SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries with real-world events, designed to evaluate LLM vulnerabilities to sociopolitical harms. Using SocialHarmBench, we provide: (1) adversarial evaluation coverage of high-risk domains including authoritarian surveillance, disinformation campaigns, erosion of democratic processes, and crimes against humanity; (2) adversarial evaluations across open-source models, establishing baseline robustness and measuring attack efficiency in politically charged settings; and (3) insights into domain-specific vulnerability comparisons, temporal-wide investigations to trace vulnerable time periods, and region-specific vulnerabilities. Our findings reveal that existing safeguards fail to transfer effectively to sociopolitical contexts, exposing partisan biases and limitations in preserving human rights and democratic values.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4908. Multi-Scale Hypergraph Meets LLMs: Aligning Large Language Models for Time Series Analysis
๐ฌ 23
โญ 4.50
๐ learning on time series and dynamical systems
๐ท๏ธ Time series forecasting large language models multi-scale modeling hypergraph neural network hypergraph learning transformer
Recently, there has been great success in leveraging pre-trained large language models (LLMs) for time series analysis. The core idea lies in effectively aligning the modality between natural language and time series. However, the multi-scale structures of natural language and time series have not been fully considered, resulting in insufficient utilization of LLMs' capabilities. To this end, we propose MSH-LLM, a Multi-Scale Hypergraph method that aligns Large Language Models for time series analysis. Specifically, a hyperedging mechanism is designed to enhance the multi-scale semantic information of the time series semantic space. Then, a cross-modality alignment (CMA) module is introduced to align the modality between natural language and time series at different scales. In addition, a mixture of prompts (MoP) mechanism is introduced to provide contextual information and enhance the ability of LLMs to understand the multi-scale temporal patterns of time series. Experimental results on 27 real-world datasets across 5 different applications demonstrate that MSH-LLM achieves state-of-the-art results. Code is available at: https://anonymous.4open.science/r/MSH-LLM-1E9B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4909. Dual-Space Smoothness for Robust and Balanced LLM Unlearning
๐ฌ 22
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI safety Large Language Model Unlearning Robustness Jailbreak Attacks LLM Safety Relearning Attack
As large language models evolve, Machine Unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety. Yet state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example by over-optimizing one objective (e.g., unlearning effectiveness, utility preservation, or privacy protection) at the expense of others. In addition, small perturbations in the representation or parameter space can be exploited by relearn and jailbreak attacks. To address these challenges, we propose PRISM, a unified framework that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation space stage that employs a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4910. Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions
๐ฌ 14
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ cognitive modeling verbal theory risky choice group-relative policy optimization supervised fine-tuning large language model
A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models--capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4911. LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
๐ฌ 22
โญ 4.50
๐ generative models
๐ท๏ธ LLMs KV cache retrieval LLM inference acceleration
While Key-Value (KV) cache succeeds in reducing redundant computations in auto-regressive models, it introduces significant memory overhead, limiting its practical deployment in long-sequence scenarios. Existing KV retrieval methods attempt to mitigate this by dynamically retaining only a subset of KV entries on the GPU. However, they still suffer from notable efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level KV management strategy, especially in long-output reasoning scenarios. With the emergence of large reasoning models, efficiently handling such scenarios has become increasingly important. To address this issue, we present two key observations: (1) critical KVs exhibit strong temporal locality during decoding, and (2) these KVs exhibit distinct distribution patterns across the input prompt and the generated output. Building on these observations, we propose LouisKV, an efficient KV cache retrieval framework designed for various long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy that leverages temporal locality to trigger retrieval only at semantic boundaries, drastically reducing computation and data transfer overhead. LouisKV also designs a decoupled, fine-grained management scheme that tailors differentiated strategies for input and output sequences to create retrieval units that better match the model's attention patterns, thereby enabling the precise identification of critical KVs. Furthermore, to boost system efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels to accelerate the KV clustering and retrieval. Evaluation results show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4912. FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion
๐ฌ 14
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Few-Shot Learning Foundation Models Object Detection
In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network. This reweighting process refines the scores of proposals, assigning higher confidence to whole objects and lower confidence to local, fragmented parts. This strategy improves detection granularity and effectively reduces the occurrence of false-positive bounding box proposals. Through extensive experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training. Notably, on the challenging CD-FSOD dataset, which spans multiple datasets and domains, our FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. The source code will be released publicly upon publication.
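To make the reweighting concrete, here is a minimal sketch of graph-based confidence diffusion over box proposals, assuming containment-ratio edge weights and a PageRank-style propagation; the paper's exact edge definition and diffusion operator are not specified here and may differ.

```python
import numpy as np

def containment(a, b):
    """Fraction of box a's area that lies inside box b (boxes are [x1, y1, x2, y2])."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = max((a[2] - a[0]) * (a[3] - a[1]), 1e-8)
    return inter / area_a

def diffuse_confidences(boxes, scores, alpha=0.5, steps=10):
    """Propagate confidence along a directed containment graph.

    A fragmented part-box is largely contained in the whole-object box, so it
    sends most of its mass there; whole objects end up with higher scores and
    fragments with lower ones.
    """
    n = len(boxes)
    # Edge i -> j weighted by how much of box i lies inside box j.
    A = np.array([[containment(boxes[i], boxes[j]) if i != j else 0.0
                   for j in range(n)] for i in range(n)])
    A = A / (A.sum(axis=1, keepdims=True) + 1e-8)   # row-normalised transition matrix
    s = np.asarray(scores, dtype=float)
    out = s.copy()
    for _ in range(steps):
        out = (1 - alpha) * s + alpha * A.T @ out   # PageRank-style score diffusion
    return out
```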
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4913. OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
๐ฌ 29
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Safety Alignment Large Language Models Trustworthy AI
Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM's ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models, Qwen-3 (235B) with 77.77% and Mistral (24B) with 79.96%, fall far short of reliable operational safety, while GPT models plateau in the 62-73% range, Phi achieves only mid-level scores (48-70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is fundamentally a model-alignment issue, to suppress these failures we propose prompt-based steering methods, query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve out-of-distribution (OOD) refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4914. Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image
๐ฌ 25
โญ 4.50
๐ generative models
๐ท๏ธ 3D generation novel view synthesis satellite to street-view generation feed-forward image to 3D outdoor scene generation
Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry.
We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data.
To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error.
This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fr\'echet Inception Distance (FID) from $\sim$40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4915. ReVeal: Self-Evolving Code Agents via Reliable Self-Verification
๐ฌ 23
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Reinforcement Learning Code LLM multi-turn RL
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn Reinforcement learning framework that evolves code generation through self-Verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4916. PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
๐ฌ 20
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ pixel-level understanding vision-language-action model robotics multimodal large language model
Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings.
To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data.
Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by $10.1\%\sim28.7\%$ over OpenVLA, while requiring only $1.5\%$ of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4917. BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
๐ฌ 22
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Diffusion Models Multimodal Learning Text-to-Image Generation
Conditional image generation augments text-to-image synthesis with structural, spatial, or stylistic priors and is used in many domains. However, current methods struggle to harmonize guidance from both sources when conflicts arise: 1) input-level conflict, where the semantics of the conditioning image contradict the text prompt, and 2) model-bias conflict, where learned generative biases hinder alignment even when the condition and text are compatible. These scenarios demand nuanced, case-by-case trade-offs that standard supervised fine-tuning struggles to deliver. Preference-based optimization techniques, such as Direct Preference Optimization (DPO), offer a promising solution but remain limited: naive DPO suffers from gradient entanglement between text and condition signals and lacks disentangled, conflict-aware training data for multi-constraint tasks. To overcome these issues, we propose a self-driven, bidirectionally decoupled DPO framework (BideDPO). At its core, our method constructs two disentangled preference pairs for each sampleโone for the condition and one for the textโto mitigate gradient entanglement. The influence of these pairs is then managed by an Adaptive Loss Balancing strategy for balanced optimization. To generate these pairs, we introduce an automated data pipeline that iteratively samples from the model and uses vision-language model checks to create disentangled, conflict-aware data. Finally, this entire process is embedded within an iterative optimization strategy that progressively refines both the model and the data. We construct a DualAlign benchmark to evaluate a modelโs ability to resolve conflicts between text and condition, and experiments on commonly used modalities show that BideDPO delivers substantial gains in both text success rate (e.g., +35\%) and condition adherence. We also validated the robustness of our approach on the widely used COCO dataset. All models, code, and benchmarks will be released to support future work.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4918. Enhancing Hallucination Detection through Noise Injection
๐ฌ 13
โญ 4.50
๐ generative models
๐ท๏ธ Bayesian Inference Uncertainty Quantification Hallucination Detection
Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is sub-optimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
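A minimal sketch of the perturb-and-sample idea, assuming a Hugging Face-style `model`/`tokenizer` pair, Gaussian noise on an illustrative parameter subset (MLP weights), and a simple agreement-based dispersion score; the paper's exact perturbation target and uncertainty metric may differ.

```python
import torch
from collections import Counter

@torch.no_grad()
def hallucination_score(model, tokenizer, prompt, target_substr="mlp",
                        sigma=0.01, n_samples=10, max_new_tokens=32):
    """Sample answers while injecting Gaussian noise into a parameter subset.

    `target_substr` selects which parameters to perturb (MLP weights here, an
    illustrative choice). Low agreement among samples suggests hallucination.
    """
    params = [p for name, p in model.named_parameters() if target_substr in name]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    answers = []
    for _ in range(n_samples):
        noises = [torch.randn_like(p) * sigma for p in params]
        for p, eps in zip(params, noises):
            p.add_(eps)                      # perturb weights
        out = model.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)
        answers.append(tokenizer.decode(out[0], skip_special_tokens=True))
        for p, eps in zip(params, noises):
            p.sub_(eps)                      # restore original weights
    agreement = Counter(answers).most_common(1)[0][1] / n_samples
    return 1.0 - agreement                   # dispersion as a hallucination indicator
```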
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4919. Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
๐ฌ 14
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ dataset pretraining code llm
The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora.
We introduce two openly licensed pre-training datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach refines low-quality code, maximizing data utility.
SwallowMath ($\approx$2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +16.1 on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting yielding the largest gains.
By releasing datasets, prompts, checkpoints, and pipeline code, we ensure reproducibility and provide a transferable transform-and-retain methodology that can be adapted to other base models and LLM rewriting setups.
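As a rough illustration of the first two pipeline stages (syntax validation and pylint-based style filtering), the sketch below uses `ast.parse` and the pylint CLI; the scoring rule and the keep/drop threshold are assumptions, not the released pipeline.

```python
import ast
import json
import os
import subprocess
import tempfile

def passes_syntax(code: str) -> bool:
    """Stage 1: keep only snippets that parse as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def pylint_style_score(code: str) -> float:
    """Stage 2: crude style score = fraction of lines without pylint messages."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["pylint", path, "--output-format=json", "--score=n"],
        capture_output=True, text=True,
    )
    os.unlink(path)
    messages = json.loads(result.stdout or "[]")
    n_lines = max(code.count("\n") + 1, 1)
    return 1.0 - min(len(messages), n_lines) / n_lines

def keep_snippet(code: str, style_threshold: float = 0.7) -> bool:
    # Snippets surviving both stages would then go to the LLM rewriting stages.
    return passes_syntax(code) and pylint_style_score(code) >= style_threshold
```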
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4920. NDAD: Negative-Direction Aware Decoding for Large Language Models via Controllable Hallucination Signal Injection
๐ฌ 14
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models;Contrastive Decoding;Robustness
Large language models (LLMs) have recently achieved impressive progress in knowledge-intensive and reasoning tasks. However, their tendency to produce fabricated or factually inconsistent content remains a fundamental challenge to their practical deployment. To address this issue, we propose Negative-Direction Aware Decoding (NDAD), a novel decoding method that identifies and exploits hallucination signals as repulsive directions in the modelโs representation space, thereby improving factual adherence without retraining. Specifically, NDAD elicits hallucination-leaning signals by selectively masking critical attention heads, which exposes unstable hypotheses that the model would otherwise amplify during generation. To regulate the influence of these signals, NDAD employs two complementary weights: a global alignment weight measuring how well the induced signal aligns with the layerโs native activations (thus quantifying its referential utility) and a local weight estimating whether low-probability tokens in the masked distribution are likely to evolve toward the final output. Based on the weights, we derive a latent hallucination distribution that serves as the negative direction. A lightweight gradient-descent step then subtracts mass from hallucination-prone regions of the output distribution, adjusting the final logits while preserving the modelโs high-confidence predictions. Extensive experiments across multiple LLMs and diverse benchmark datasets demonstrate that NDAD consistently enhances factual reliability without requiring additional training or external knowledge.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4921. Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability
๐ฌ 14
โญ 4.50
๐ generative models
๐ท๏ธ Memorization Diffusion Models
Diffusion-based image generative models produce high-fidelity images through iterative denoising but remain vulnerable to memorization, where they unintentionally reproduce exact copies or parts of training images. Recent memorization detection methods are primarily based on the norm of score difference as indicators of memorization. We prove that such norm-based metrics are mainly effective under the assumption of isotropic log-probability distributions, which generally holds at high or medium noise levels. In contrast, analyzing the anisotropic regime reveals that memorized samples exhibit strong angular alignment between the guidance vector and unconditional scores in the low-noise setting. Through these insights, we develop a memorization detection metric by integrating isotropic norm and anisotropic alignment. Our detection metric can be computed directly on pure noise inputs via two conditional and unconditional forward passes, eliminating the need for costly denoising steps. Detection experiments on Stable Diffusion v1.4 and v2 show that our metric outperforms existing denoising-free detection methods while being at least approximately 5x faster than the previous best approach. Finally, we demonstrate the effectiveness of our approach by utilizing a mitigation strategy that adapts memorized prompts based on our developed metric.
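A minimal sketch of a denoising-free detection metric in this spirit, combining the score-difference norm with the angular alignment between the guidance vector and the unconditional score; `eps_model(x, t, cond)` is an assumed noise-prediction interface and the combination weight `lam` is illustrative.

```python
import torch

@torch.no_grad()
def memorization_score(eps_model, prompt_emb, null_emb, shape, t,
                       device="cuda", lam=1.0):
    """Two-forward-pass memorization indicator evaluated on pure noise."""
    x = torch.randn(shape, device=device)            # pure noise input, no denoising
    eps_cond = eps_model(x, t, prompt_emb)           # conditional pass
    eps_uncond = eps_model(x, t, null_emb)           # unconditional pass
    guidance = (eps_cond - eps_uncond).flatten(1)    # guidance vector per sample
    uncond = eps_uncond.flatten(1)
    norm_term = guidance.norm(dim=1)                 # isotropic (norm-based) part
    align_term = torch.nn.functional.cosine_similarity(guidance, uncond, dim=1).abs()
    return norm_term + lam * align_term              # higher -> more likely memorized
```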
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4922. Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness
๐ฌ 22
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Multimodal Large Language Models Cultural Awareness Cross-Cultural Benchmark Comics Multilingual Evaluation Multitask Evaluation
Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, they often use real-world images, each of which typically depicts a single culture, making these benchmarks relatively easy for MLLMs. Based on this, we propose C$^3$B ($\textbf{C}$omics $\textbf{C}$ross-$\textbf{C}$ultural $\textbf{B}$enchmark), a novel multicultural, multitask, and multilingual benchmark for cultural awareness. C$^3$B comprises over 2000 images and over 18000 QA pairs, constructed around three tasks of progressive difficulty, from basic visual recognition to higher-level cultural conflict understanding, and finally to cultural content generation. We evaluated 11 open-source MLLMs, revealing a significant gap between MLLM and human performance. This gap demonstrates that C$^3$B poses substantial challenges for current MLLMs, encouraging future research to advance their cultural awareness capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4923. DePO: Demonstration-guided Policy Optimization for Molecular Optimization
๐ฌ 27
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Large Language Model; Molecular Optimization; LLM Reasoning
Large language models (LLMs) exhibit remarkable mathematical reasoning abilities through supervised fine-tuning (SFT) or reinforcement learning with verifiable rewards (RLVR). However, adapting them to scientific domains like molecular optimization is challenging: molecular-optimization datasets provide only reference molecules, lacking the reasoning traces needed for SFT, while the task's competing objectives hinder RLVR.
To address these issues, we introduce Demonstration-guided Policy Optimization (DePO). We leverage reference molecules as supervised signals to regularize the search direction while preserving the model's reasoning capabilities. Experiments show that DePO significantly outperforms both SFT and RLVR across key molecular optimization metrics, and excels at balancing the competing optimization objectives. DePO achieves up to 13\% improvement compared to SFT and other baseline approaches. DePO also shows generalization capabilities and inference-scaling properties.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4924. CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
๐ฌ 20
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Image Caption Reinforcement learning Large Vision Language Model
Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline in which an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering multiple-choice questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. These results validate that CapRL effectively trains models to produce more general and accurate image descriptions, moving beyond the limitations of traditional SFT-based image captioning models.
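A minimal sketch of the utility-based reward, assuming a list of MCQs per image and an `answer_mcq` callable wrapping the vision-free judge LLM; the prompt format and scoring are illustrative.

```python
def caprl_reward(caption, mcqs, answer_mcq):
    """Score a caption by how well a text-only judge answers questions from it.

    `mcqs` is a list of dicts: {"question": str, "options": [str, ...], "gold": "B"}.
    `answer_mcq(prompt) -> str` is an assumed callable wrapping the vision-free LLM.
    """
    correct = 0
    for q in mcqs:
        options = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q["options"]))
        prompt = (
            "Answer using only the description below. Reply with a single letter.\n\n"
            f"Description: {caption}\n\nQuestion: {q['question']}\n{options}\nAnswer:"
        )
        pred = answer_mcq(prompt).strip()[:1].upper()
        correct += int(pred == q["gold"])
    # Fraction of questions answered correctly serves as the verifiable reward.
    return correct / max(len(mcqs), 1)
```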
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4925. A Physics-Inspired Optimizer: Velocity Regularized Adam
๐ฌ 23
โญ 4.50
๐ optimization
๐ท๏ธ Optimization in deep learning physics-inspired edge of stability
We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on the stabilizing effect of quartic kinetic-energy terms on system dynamics.
Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed loss convergence.
However, VRAdam adds a higher-order, velocity-based penalty on the learning rate so that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations. By combining this velocity-based regularizer for global damping with Adam's per-parameter scaling, we create a powerful hybrid optimizer. For this optimizer, we provide a rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective on the momentum. Furthermore, we derive convergence bounds with rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non-convex objective under mild assumptions. We demonstrate that VRAdam outperforms standard optimizers, including AdamW, on tasks such as image classification, language modeling, and generative modeling, using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.
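One plausible instantiation of the velocity-based penalty is sketched below: the global step size is damped by the squared norm of the momentum while Adam's per-parameter scaling is kept; the paper's exact penalty form and constants may differ.

```python
import torch

@torch.no_grad()
def vradam_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
                eps=1e-8, gamma=1.0):
    """One illustrative VRAdam-style update.

    The global factor 1 / (1 + gamma * ||m||^2) damps the step whenever the
    momentum (velocity) is large; the per-parameter Adam scaling is unchanged.
    """
    beta1, beta2 = betas
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    vel_sq = 0.0
    for i, (p, g) in enumerate(zip(params, grads)):
        m = state.setdefault(f"m{i}", torch.zeros_like(p))
        v = state.setdefault(f"v{i}", torch.zeros_like(p))
        m.mul_(beta1).add_(g, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
        vel_sq += m.pow(2).sum().item()
    damp = 1.0 / (1.0 + gamma * vel_sq)          # velocity-based global damping
    for i, p in enumerate(params):
        m_hat = state[f"m{i}"] / (1 - beta1 ** t)
        v_hat = state[f"v{i}"] / (1 - beta2 ** t)
        p.add_(-lr * damp * m_hat / (v_hat.sqrt() + eps))
```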
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4926. DiffuDETR: Rethinking Detection Transformers with Diffusion Process
๐ฌ 12
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Object Detection Diffusion Models DETR Query Generation Deep Learning
In this paper, we present DiffuDETR, a novel approach that formulates object detection as a conditional object query generation task, conditioned on the image and a set of noisy reference points. We integrate DETR-based models with denoising diffusion training to generate object queries' reference points from a Gaussian prior distribution. We propose two variants: DiffuDETR, built on top of the Deformable DETR decoder, and DiffuDINO, based on DINO's decoder with contrastive denoising queries (CDNs). To improve inference efficiency, we further introduce a lightweight sampling scheme that requires only repeated forward passes through the decoder. Our method demonstrates consistent improvements across multiple backbones and datasets, including COCO2017, LVIS, and V3Det, surpassing the performance of the respective baselines, with notable gains in complex and crowded scenes. With a ResNet-50 backbone, DiffuDINO reaches 51.9 mAP on COCO val, a +1.0 gain over DINO's 50.9 mAP. We observe similar improvements of +2.4 and +2.2 mAP on the LVIS and V3Det datasets, respectively. Code will be released upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4927. TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation
๐ฌ 20
โญ 4.50
๐ generative models
๐ท๏ธ Text-to-Motion Generation Spatial-temporal-frequency Modeling Causal Learning
Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion includes three core modeling modules for domain-specific modeling, namely Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After comprehensive modeling, a Score-guided Tri-domain Fusion module integrates valuable information from the triple domains, simultaneously ensuring temporal consistency, spatial topology, motion trends, and dynamics. Moreover, the Causality-based Counterfactual Motion Disentangler is meticulously designed to expose motion-irrelevant cues to eliminate noise, disentangling the real modeling contributions of each domain for superior generation. Extensive experimental results validate that TriC-Motion achieves superior performance compared to state-of-the-art methods, attaining an outstanding R1-Precision of 0.612 on the HumanML3D dataset. These results demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Our code will be released soon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4928. Jackpot: Align Actor-Policy Distribution for scalable and stable RL for LLM
๐ฌ 17
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Large language models Reinforcement Learning
Reinforcement learning (RL) has become an increasingly important paradigm for improving large language models (LLMs) on alignment, reasoning, and coding tasks, yet it remains extremely costly, with the majority of training time spent on rollouts. Allowing the actor and policy distributions to differ could unlock substantial scalability and efficiency benefits, such as supporting large-batch or asynchronous training, and even enabling a lightweight rollout model. However, existing importance-sampling-based corrections for distribution mismatch suffer from an inherent trade-off between stability and training performance. To tackle this problem, we propose Jackpot, which leverages Optimal Budget Rejection Sampling to directly reduce the gap between the actor and policy distributions. For efficiency and stability in practical training, we introduce an efficient probability estimation strategy based on Top-$K$ logits with batch bias correction, and design a stabilized Jackpot-PPO loss that jointly accounts for the importance sampling ratio and the trust-region constraint in PPO. Empirically, our method achieves stable improvements in large-batch and asynchronous training, and in extreme off-policy training it substantially delays the onset of collapse and delivers competitive performance. Specifically, we achieve a 20\% improvement on AMC benchmarks and roughly 8\% on AIME benchmarks over the off-policy baseline under a 128$\times$ actor-policy update ratio for Qwen3-4B-Base and 64$\times$ for Qwen3-8B-Base, while achieving greater stability and better performance than prior off-policy RL methods under extreme settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4929. SciTS: Scientific Time Series Understanding and Generation with LLMs
๐ฌ 13
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ time series large language model benchmark
The scientific reasoning ability of large language models (LLMs) has recently attracted significant attention. Time series, as a fundamental modality in scientific data, presents unique challenges that are often overlooked in current multimodal LLMs, which either encode numerical sequences as text or convert them into images. Such approaches may be insufficient for comprehensive scientific time series understanding and generation. Existing unified time series models typically specialise in either forecasting or analysis, and their effectiveness on non-periodic, heterogeneous scientific signals remains unclear. To address these gaps, we introduce SciTS, a benchmark spanning 12 scientific domains and 43 tasks, with over 50k instances covering both univariate and multivariate signals ranging from $10^0$ to $10^7$ in length and up to 10~MHz in frequency. We benchmark 17 models, including text-only LLMs, multimodal LLMs, and unified time series models, and find that general-purpose LLMs exhibit stronger generalisability than specialised time series models, while representing time series as text or images limits their performance due to excessively long sequences and loss of numerical precision, respectively. We then introduce TimeOmni, a framework that equips LLMs with the ability to understand and generate time series while remaining compatible with general-purpose LLM training. This work fills a gap in both dedicated benchmarks and modelling frameworks for scientific time series, paving the way for LLMs to understand and generate complex temporal scientific data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4930. Sparsity-promoting Fine-tuning for Equivariant Materials Foundation Model
๐ฌ 16
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ machine learning interatomic potentials equivariance sparsity-promoting
Pre-trained materials foundation models, or machine learning interatomic potentials, leverage general physicochemical knowledge to effectively approximate potential energy surfaces. However, they often require domain-specific calibration due to physicochemical diversity and mismatches between practical computational settings and those used in constructing the pre-training data. We propose a sparsity-promoting fine-tuning method for E(3)-equivariant materials foundation models that prunes low-contribution parameters during training. Across molecular and crystalline benchmarks, our approach updates only 3% of parameters, and in some cases as little as 0.5%, while matching or exceeding the accuracy of full fine-tuning. Beyond energy and force calibration, we apply our method to magnetic moment prediction and magnetism-aware total energy estimation, achieving broader applicability of materials foundation models. Analysis of sparsity patterns further reveals physically interpretable signatures, such as enhanced $d$-orbital contributions in transition-metal systems. Overall, our results establish sparsity-promoting fine-tuning of equivariant models as a flexible and interpretable method for domain specialization of materials foundation models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4931. LLM as an Algorithmist: Enhancing Anomaly Detectors via Programmatic Synthesis
๐ฌ 31
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Tabular data Large language models Anomaly detection
Existing anomaly detection (AD) methods for tabular data usually rely on some assumptions about anomaly patterns, leading to inconsistent performance in real-world scenarios. While Large Language Models (LLMs) show remarkable reasoning capabilities, their direct application to tabular AD is impeded by fundamental challenges, including difficulties in processing heterogeneous data and significant privacy risks. To address these limitations, we propose LLM-DAS, a novel framework that repositions the LLM from a data processor to an algorithmist. Instead of being exposed to raw data, our framework leverages the LLM's ability to reason about algorithms. It analyzes a high-level description of a given detector to understand its intrinsic weaknesses and then generates detector-specific, data-agnostic Python code to synthesize ``hard-to-detect'' anomalies that exploit these vulnerabilities. This generated synthesis program, which is reusable across diverse datasets, is then instantiated to augment training data, systematically enhancing the detector's robustness by transforming the problem into a more discriminative two-class classification task. Extensive experiments on 36 TAD benchmarks show that LLM-DAS consistently boosts the performance of mainstream detectors. By bridging LLM reasoning with classic AD algorithms via programmatic synthesis, LLM-DAS offers a scalable, effective, and privacy-preserving approach to patching the logical blind spots of existing detectors.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4932. RADAR: Learning to Route with Asymmetry-aware Distance Representations
๐ฌ 5
โญ 4.50
๐ optimization
๐ท๏ธ Neural Combinatorial Optimization Vehicle Routing Problem
Recent neural solvers have achieved strong performance on vehicle routing problems (VRPs), yet they mainly assume symmetric Euclidean distances, restricting applicability to real-world scenarios. A core challenge is encoding the relational features in asymmetric distance matrices of VRPs. Early attempts directly encoded these matrices but often failed to produce compact embeddings and generalized poorly at scale. In this paper, we propose RADAR, a scalable neural framework that augments existing neural VRP solvers with the ability to handle asymmetric inputs. RADAR addresses asymmetry from both static and dynamic perspectives. It leverages Singular Value Decomposition (SVD) on the asymmetric distance matrix to initialize compact and generalizable embeddings that inherently encode the *static asymmetry* in the inbound and outbound costs of each node. To further model *dynamic asymmetry* in embedding interactions during encoding, it replaces the standard softmax with Sinkhorn normalization that imposes joint row and column distance awareness in attention weights. Extensive experiments on synthetic and real-world benchmarks across various VRPs show that RADAR outperforms strong baselines on both in-distribution and out-of-distribution instances, demonstrating robust generalization and superior performance in solving asymmetric VRPs.
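A minimal sketch of the Sinkhorn replacement for softmax attention weights, alternating row and column normalisation of the exponentiated scores so the weights become jointly row- and column-aware; the iteration count and any temperature are assumptions.

```python
import torch

def sinkhorn_attention_weights(scores, n_iters=5, eps=1e-8):
    """Jointly row/column-normalise attention scores via Sinkhorn iterations.

    `scores` has shape (batch, n, n). Plain softmax would normalise rows only;
    alternating the two normalisations pushes the matrix toward being doubly
    stochastic, imposing both inbound and outbound distance awareness.
    """
    K = torch.exp(scores - scores.amax(dim=(-2, -1), keepdim=True))  # stabilised exp
    for _ in range(n_iters):
        K = K / (K.sum(dim=-1, keepdim=True) + eps)   # row normalisation
        K = K / (K.sum(dim=-2, keepdim=True) + eps)   # column normalisation
    return K
```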
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4933. Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification
๐ฌ 4
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Generative Models Time Series Flow Matching
Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, the resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often struggle to replicate the inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variability of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades a dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, and then projects them into the discrete cosine transform (DCT) space across brain regions and time to exploit the localized energy compaction of low-frequency-dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class-conditioned cosine-frequency representations. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI-based brain network classification.
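A minimal sketch of the dual frequency representation (DWT followed by DCT) using PyWavelets and SciPy; the wavelet family and decomposition level are assumptions, and `bold` is taken to be a (regions, time) array.

```python
import pywt
from scipy.fft import dct, idct

def bold_to_dual_spectrum(bold, wavelet="db4", level=3):
    """Map a (regions, time) BOLD array to its DWT-then-DCT representation.

    Each wavelet sub-band is transformed with an orthonormal DCT so that most
    energy concentrates in a few low-frequency coefficients.
    """
    coeffs = pywt.wavedec(bold, wavelet=wavelet, level=level, axis=-1)
    return [dct(c, type=2, norm="ortho", axis=-1) for c in coeffs]

def dual_spectrum_to_bold(spectrum, wavelet="db4"):
    """Invert the DCT of each sub-band, then the wavelet decomposition."""
    coeffs = [idct(c, type=2, norm="ortho", axis=-1) for c in spectrum]
    return pywt.waverec(coeffs, wavelet=wavelet, axis=-1)
```

In DSFM-style training, the flow matching model would operate on the transformed coefficients; the round trip above is only meant to show how the time-domain signal is recovered.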
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4934. It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
๐ฌ 12
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test Time Memorization Online Optimization Recurrent Neural Networks
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules with attentional bias. We define and formalize attentional bias as the internal memory objective of deep learning architectures. We show that existing deep learning architectures leverage the same attentional bias based on the $L_2$ loss function. Going beyond the $L_2$ loss, we present a set of alternative attentional bias configurations along with their effective approximations. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on the choice of attentional bias objective, retention gate, associative memory architecture, and memory learning algorithm. Our experiments show that different designs yield models with varying strengths. Furthermore, special instances of Miras achieve exceptional performance in language modeling, commonsense reasoning, recall-intensive, and time series tasks, outperforming Transformers and other modern linear recurrent models.
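For reference, the $L_2$ attentional bias alluded to above can be written in standard associative-memory notation; this is the common formulation with memory $M$, key $k_t$, and value $v_t$ at step $t$, not necessarily the paper's exact objective.

```latex
% L2 attentional bias: the memory is updated to reduce the squared error of
% mapping the current key to its value,
\mathcal{L}_t(M) = \lVert M k_t - v_t \rVert_2^2 ,
% with forgetting acting as a retention regulariser toward the previous memory,
M_t = \arg\min_{M} \; \lVert M k_t - v_t \rVert_2^2 + \lambda_t \lVert M - M_{t-1} \rVert_2^2 .
```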
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4935. Spinning Straw into Gold: Relabeling LLM Agent Trajectories in Hindsight for Successful Demonstrations
๐ฌ 9
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ hindsight learning agentic LLM LLM post training RL
Large language model agents operate in partially observable, long-horizon settings where obtaining supervision remains a major bottleneck. We address this by leveraging a source of supervision overlooked in existing post-training methods: ``unintended yet successful'' goals embedded within agent rollouts. We introduce Hindsight Supervised Learning (HSL), where an auxiliary LLM reviews each completed trajectory and relabels it with natural-language goals the agent actually achieved. HSL then pairs the trajectory with its relabeled goals and uses these pairs for additional fine-tuning. To mitigate suboptimality in the relabeled data, HSL incorporates irrelevant-action masking and sample reweighting. We show that HSL is flexible and compatible with existing post-training pipelines. It improves both SFT and DPO, with larger gains on long-horizon embodied and web agent tasks such as ALFWorld and WebShop. Moreover, HSL is sample-efficient: on ALFWorld, it surpasses baselines trained on the full dataset while using only one quarter of the ground-truth demonstrations.
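A minimal sketch of the relabeling step, assuming a `relabel_llm` callable that returns (goal, confidence) pairs for a trajectory; the irrelevant-action masking and the exact reweighting rule are omitted and the confidence-based weight is illustrative.

```python
def hindsight_relabel(trajectories, relabel_llm, min_weight=0.1):
    """Turn rollouts into training pairs by relabeling goals achieved in hindsight.

    `trajectories` is a list of dicts with an "actions" text field;
    `relabel_llm(trajectory_text) -> list[(goal, confidence)]` is an assumed interface.
    """
    pairs = []
    for traj in trajectories:
        for goal, conf in relabel_llm(traj["actions"]):
            weight = max(conf, min_weight)   # down-weight low-confidence relabels
            pairs.append({"goal": goal, "trajectory": traj["actions"], "weight": weight})
    return pairs
```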
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4936. RPM: Reasoning-Level Personalization for Black-Box Large Language Models
๐ฌ 23
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Personalization Large Language Model Reasoning-Level Personalization LLM LLM Personalization Black-Box LLM
While black-box large language models are widely deployed, they produce generic outputs that overlook individual user preferences.
Current personalization methods are fundamentally limited to response-level personalization; they only match final outputs, failing to model the underlying reasoning that connects user behavior to responses.
To address this, this work introduces reasoning-level personalization as a new paradigm and proposes RPM, the first systematic framework designed to guide the model's reasoning process using structured rationales constructed from patterns in a user's behavior.
RPM constructs a structured model of user behavior (built from response-influential features and statistical factors) to create personalized reasoning paths and retrieve beneficial examples for guiding inference through a feature-based retrieval mechanism.
Extensive experiments across four diverse tasks demonstrate that RPM consistently outperforms existing response-level methods while simultaneously enhancing both personalization performance and interpretability, providing a promising direction for black-box LLM personalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4937. Are Reasoning LLMs Robust to Interventions on their Chain-of-Thought?
๐ฌ 6
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model reasoning model robustness chain of thought
Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes reasoning transparent. But how robust are these reasoning traces to disruptions that occur within them? To address this question, we introduce a controlled evaluation framework that perturbs a modelโs own CoT at fixed timesteps. We design seven interventions (benign, neutral, and adversarial) and apply them to multiple open-weight RLLMs across MATH, SCIENCE, and LOGIC tasks. Our results show that RLLMs are generally robust, reliably recovering from diverse perturbations, with robustness improving with model size and degrading when interventions occur early. However, robustness is not style-invariant: paraphrasing suppresses doubt-like expressions and reduces performance, while other interventions trigger doubt and support recovery. Recovery also carries a cost: neutral and adversarial noise can inflate CoT length by more than 200%, whereas paraphrasing shortens traces but harms accuracy. These findings provide new evidence on how RLLMs maintain reasoning integrity, identify doubt as a central recovery mechanism, and highlight trade-offs between robustness and efficiency that future training methods should address.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4938. UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy
๐ฌ 13
โญ 4.50
๐ generative models
๐ท๏ธ calligraphy
Computational replication of Chinese calligraphy, a cornerstone of cultural heritage, remains challenging. Existing methods split into two flawed camps: some render high-quality isolated characters yet miss page-level aesthetics (ligatures, spacing, scale), while others attempt page/column synthesis but sacrifice calligraphic correctness. We introduce UniCalli, a unified diffusion framework for column-level recognition and generation. Training both tasks in one model is deliberate: recognition constrains the generator to preserve character identity and stroke structure, while generation supplies strong style/layout priorsโtogether fostering concept-level abstractions (radicals, stroke configurations) that improve both tasks under long-tail, limited-label regimes. We curate a dataset of 8,000+ digitized pieces, with ~4,000 densely annotated (script labels, character boxes, transcriptions). UniCalli employs asymmetric noising and a rasterized box map to inject spatial priors, and is trained on a mix of synthetic, labeled, and unlabeled data. The model is robust to rare styles, better disentangles style from script, and attains state-of-the-art generative quality with clear gains in ligature continuity and layout fidelity, alongside stronger recognition. The framework extends to other ancient scripts, demonstrated by successful transfer to Oracle bone inscriptions and Egyptian hieroglyphs. Code and data will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4939. OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
๐ฌ 42
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Multimodal Reasoning MLLM Benchmark Text Audio Video
Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4940. Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems
๐ฌ 29
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ security and privacy security/privacy red teaming
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by incorporating external knowledge bases, but this may expose them to extraction attacks, leading to potential copyright and privacy risks.
However, existing extraction methods typically rely on malicious inputs such as prompt injection or jailbreaking, making them easily detectable via input- or output-level detection.
In this paper, we introduce **I**mplicit **K**nowledge **E**xtraction **A**ttack (**IKEA**), which conducts *Knowledge Extraction* on RAG systems through benign queries.
Specifically, **IKEA** first leverages anchor concepts (keywords related to internal knowledge) to generate queries with a natural appearance, and then designs two mechanisms that lead anchor concepts to thoroughly "explore" the RAG's knowledge:
(1) Experience Reflection Sampling, which samples anchor concepts based on past query-response histories, ensuring their relevance to the topic;
(2) Trust Region Directed Mutation, which iteratively mutates anchor concepts under similarity constraints to further exploit the embedding space.
Extensive experiments demonstrate **IKEA**'s effectiveness under various defenses, surpassing baselines by over 80% in extraction efficiency and 90\% in attack success rate. Moreover, the substitute RAG system built from **IKEA**'s extractions shows close performance to the original RAG and outperforms those based on baselines across multiple evaluation tasks, underscoring the stealthy copyright infringement risk in RAG systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4941. SE-Diff: Simulator and Experience Enhanced Diffusion Model for Comprehensive ECG Generation
๐ฌ 17
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Diffusion Model; ECG; Simulator; LLM
Cardiovascular disease (CVD) is a leading cause of mortality worldwide. Electrocardiograms (ECGs) are the most widely used non-invasive tool for cardiac assessment, yet large, well-annotated ECG corpora are scarce due to cost, privacy, and workflow constraints. Generating ECGs can aid mechanistic understanding of cardiac electrical activity, enable the construction of large, heterogeneous, and unbiased datasets, and facilitate privacy-preserving data sharing. Generating realistic ECG signals from clinical context is important yet underexplored. Recent work has leveraged diffusion models for text-to-ECG generation, but two challenges remain: (i) existing methods often overlook physiological simulator knowledge of cardiac activity; and (ii) they ignore broader, experience-based clinical knowledge grounded in real-world practice. To address these gaps, we propose **SE-Diff**, a physiological simulator- and experience-enhanced diffusion model for comprehensive ECG generation. SE-Diff integrates a lightweight ordinary differential equation (ODE)โbased ECG simulator into the diffusion process via a beat decoder and simulator-consistent constraints, injecting mechanistic priors that promote physiologically plausible waveforms. In parallel, we design an LLM-powered, experience retrievalโaugmented strategy to inject clinical knowledge, providing stronger guidance for ECG generation. Extensive experiments on real-world ECG datasets demonstrate that SE-Diff improves both signal fidelity and textโECG semantic alignment over baselines. We further show that simulator-based and experience-based knowledge benefit downstream ECG classification.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4942. Doloris: Dual Conditional Diffusion Implicit Bridges with Sparsity Masking Strategy for Unpaired Single-Cell Perturbation Estimation
๐ฌ 17
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ unpair transition diffusion Single-Cell Perturbation
Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell's phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired, creating a critical yet unresolved problem in single-cell perturbation modeling. Moreover, the high dimensionality and sparsity of single-cell expression make direct modeling prone to focusing on zeros and neglecting meaningful patterns. To address these problems, we propose a new paradigm for single-cell perturbation modeling. Specifically, we leverage dual diffusion models to learn the control and perturbed distributions separately, and implicitly align them through a shared Gaussian latent space, without requiring explicit cell pairing. Furthermore, we introduce a sparsity masking strategy in which the mask model learns to predict zero-expressed genes, allowing the diffusion model to focus on capturing meaningful patterns among expressed genes and thereby preserving diversity in high-dimensional sparse data. We introduce \textbf{Doloris}, a generative framework that defines a new paradigm for modeling unpaired, high-dimensional, and sparse single-cell perturbation data. It leverages dual conditional diffusion models for separate learning of control and perturbed distributions, complemented by a sparsity masking strategy to enhance prediction of silent genes. The results on publicly available datasets show that our model effectively captures the diversity of single-cell perturbations and achieves state-of-the-art performance. To facilitate reproducibility, we include the code in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4943. ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models
๐ฌ 30
โญ 4.50
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ LoRA Low-rank adaptation PEFT Parameter-Efficient Fine-Tuning
Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce **ABBA**, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget, a property we validate through matrix reconstruction experiments.
Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is available anonymously at: https://anonymous.4open.science/r/abba-anonymous-8359.
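A minimal PyTorch sketch of the ABBA reparameterization, where the weight update is the Hadamard product of two independently learnable low-rank factors; the initialization and scaling choices here are assumptions, not the paper's prescribed settings.

```python
import torch
import torch.nn as nn

class ABBALinear(nn.Module):
    """Frozen linear layer with an ABBA-style update: dW = (B1 @ A1) * (B2 @ A2)."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pre-trained weights stay fixed
        out_f, in_f = base.weight.shape
        self.A1 = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B1 = nn.Parameter(torch.randn(out_f, rank) * 0.01)
        self.A2 = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B2 = nn.Parameter(torch.zeros(out_f, rank))  # zero init keeps dW = 0 at start
        self.scale = scale

    def forward(self, x):
        # Hadamard product of two low-rank matrices, fully decoupled from base.weight.
        delta_w = (self.B1 @ self.A1) * (self.B2 @ self.A2)
        return self.base(x) + self.scale * nn.functional.linear(x, delta_w)
```

Zero-initializing `B2` keeps the update at zero when fine-tuning begins, mirroring common PEFT practice, while both factor pairs remain free to move independently of the frozen weights.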
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4944. Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding
๐ฌ 17
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ hypergraph group interaction modeling imitation learning MAPF
Multi-Agent Path Finding (MAPF) is a representative multi-agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP-hard, leading to the adoption of learning-based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to *pairwise* message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e. beyond just two agents) coordination is most critical. Despite the importance of such higher-order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi-Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100$\times$ less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT's attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi-agent problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4945. Spatial Structure and Selective Text Jointly Facilitate Image Clustering
๐ฌ 22
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Image clustering
Image clustering is a fundamental task in visual machine learning. A key research direction in this field is the incorporation of prior knowledge. Recently, such prior knowledge has evolved from internal compactness constraints to external textual guidance. In particular, the introduction of textual modalities through CLIP has demonstrated impressive performance. However, CLIP is designed primarily for imageโtext alignment and may not be sufficient to capture clustering structures. Moreover, existing approaches often assume that textual features are universally beneficial, overlooking their varying suitability for different datasets. To address these issues, we propose to use spatial structure and selective text to jointly facilitate image clustering (SATC). Specifically, we design a graph attention network (GAT)-based encoder to capture relational dependencies among image patches, thereby extracting spatial features to facilitate clustering. In addition, we introduce a textual feature selector that uses the potential clustering compactness of textual features as the selection criterion and adaptively integrates them into the clustering process. Theoretical guidance is provided for this selector. Finally, the cluster assignment is produced through Tri-modal mutual distillation. Extensive experiments on 18 benchmark datasets demonstrate the effectiveness of SATC. The experimental results further verify the rationality of the textual feature selector. The code will be published.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4946. Unified Analyses for Hierarchical Federated Learning: Topology Selection under Data Heterogeneity
๐ฌ 19
โญ 4.50
๐ optimization
๐ท๏ธ Hierarchical Federated Learning Convergence Analysis Heterogeneous Data
Hierarchical Federated Learning (HFL) addresses critical scalability limitations in conventional federated learning by incorporating intermediate aggregation layers, yet optimal topology selection under varying data heterogeneity and network conditions remains an open challenge. This paper establishes the first unified convergence framework for all four HFL topologies (Star-Star, Star-Ring, Ring-Star, and Ring-Ring) with full/partial client participation under non-convex objectives and different intra/inter-group data heterogeneity. Our theoretical analysis reveals three fundamental principles for topology selection: (1) The top-tier aggregation topology exerts greater influence on convergence than the intra-group topology, with ring-based top-tier configurations generally outperforming star-based alternatives; (2) Optimal topology strongly depends on client grouping characteristics, where Ring-Star excels with numerous small groups while Star-Ring is superior for large, client-dense clusters; and (3) Inter-group heterogeneity dominates convergence dynamics across all topologies, necessitating clustering strategies that minimize inter-group divergence. Extensive experiments on CIFAR-10/CINIC-10/Fashion-MNIST/SST-2 with ResNet-18/VGG-9/ResNet-10/MLP validate these insights and provide practitioners with theoretically grounded guidance for HFL system design in real-world deployments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4947. PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression
๐ฌ 17
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ Explainable AI Kernel SHAP Shapley Value
Shapley values have emerged as a central game-theoretic tool in explainable AI (XAI). However, computing Shapley values exactly requires $2^d$ game evaluations for a model with $d$ features. Lundberg and Lee's KernelSHAP algorithm has emerged as a leading method for avoiding this exponential cost. KernelSHAP approximates Shapley values by approximating the game as a linear function, which is fit using a small number of game evaluations for random feature subsets.
In this work, we extend KernelSHAP by approximating the game via higher degree polynomials, which capture non-linear interactions between features. Our resulting PolySHAP method yields empirically better Shapley value estimates for various benchmark datasets, and we prove that these estimates are consistent.
Moreover, we connect our approach to paired sampling (antithetic sampling), a ubiquitous modification to KernelSHAP that improves empirical accuracy. We prove that paired sampling outputs exactly the same Shapley value approximations as second-order PolySHAP, without ever fitting a degree 2 polynomial. To the best of our knowledge, this finding provides the first strong theoretical justification for the excellent practical performance of the paired sampling heuristic.
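The snippet below sketches one plausible reading of the degree-2 idea: sample coalitions, fit a polynomial with pairwise interaction terms, and split each interaction coefficient's Shapley credit equally between its two features. The uniform coalition sampling (instead of the Shapley kernel weights), the toy game, and the coefficient-to-Shapley conversion are assumptions made to keep the sketch short, not the paper's exact estimator.

```python
import numpy as np
from itertools import combinations

d = 4
rng = np.random.default_rng(0)

def game(z):
    # toy cooperative game with a pairwise interaction (stand-in for a masked model)
    return 2.0 * z[0] + 1.0 * z[1] + 3.0 * z[0] * z[2]

# Sample random coalitions; KernelSHAP would weight them with the Shapley kernel.
n_samples = 2000
Z = rng.integers(0, 2, size=(n_samples, d)).astype(float)
y = np.array([game(z) for z in Z])

# Degree-2 design matrix: intercept, linear terms, and all pairwise products.
pairs = list(combinations(range(d), 2))
X = np.column_stack([np.ones(n_samples)]
                    + [Z[:, i] for i in range(d)]
                    + [Z[:, i] * Z[:, j] for i, j in pairs])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# For a multilinear fit v(z) ~ c0 + sum_i c_i z_i + sum_{i<j} c_ij z_i z_j,
# each monomial z_i z_j splits its Shapley credit equally between i and j.
phi = coef[1:1 + d].copy()
for k, (i, j) in enumerate(pairs):
    phi[i] += coef[1 + d + k] / 2
    phi[j] += coef[1 + d + k] / 2
print(np.round(phi, 2))  # approx [3.5, 1.0, 1.5, 0.0] for this toy game
```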
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4948. Enhancing Stability of Physics-Informed Neural Network Training Through Saddle-Point Reformulation
๐ฌ 22
โญ 4.50
๐ optimization
๐ท๏ธ Physics-informed neural networks Multi-task learning Saddle-point problems Scientific machine learning
Physics-informed neural networks (PINNs) have gained prominence in recent years and are now effectively used in a number of applications. However, their performance remains unstable due to the complex landscape of the loss function. To address this issue, we reformulate PINN training as a nonconvex-strongly concave saddle-point problem. After establishing the theoretical foundation for this approach, we conduct an extensive experimental study, evaluating its effectiveness across various tasks and architectures. Our results demonstrate that the proposed method outperforms the current state-of-the-art techniques.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4949. TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
๐ฌ 17
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ VLA Bimanual manipulation Imitation Learning
Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation, leveraging public single-arm data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4950. ZeroGR: A Generalizable and Scalable Framework for Zero-Shot Generative Retrieval
๐ฌ 13
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Information Retrieval
Generative retrieval (GR) reformulates information retrieval (IR) by framing it as the generation of document identifiers (docids), thereby enabling an end-to-end optimization and seamless integration with generative language models (LMs).
Despite notable progress under supervised training, GR still struggles to generalize to zero-shot IR scenarios, which are prevalent in real-world applications.
To tackle this challenge, we propose ZeroGR, a zero-shot generative retrieval framework that leverages natural language instructions to extend GR across a wide range of IR tasks.
Specifically, ZeroGR is composed of three key components: (i) an LM-based docid generator that unifies heterogeneous documents (e.g., text, tables, code) into semantically meaningful docids; (ii) an instruction-tuned query generator that generates diverse types of queries from natural language task descriptions to enhance corpus indexing; and (iii) a reverse annealing decoding strategy to balance precision and recall during docid generation.
We investigate the impact of instruction fine-tuning scale and find that performance consistently improves as the number of IR tasks encountered during training increases.
Empirical results on the BEIR and MAIR benchmarks demonstrate that \textsc{ZeroGR} outperforms strong dense retrieval and generative baselines in zero-shot settings, establishing a new state-of-the-art for instruction-driven GR.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4951. Large Language Model Compression with Global Rank and Sparsity Optimization
๐ฌ 23
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Low-rank and sparse approximations Model Compression Probabilistic Pruning Global Sparsity-Rank Co-optimization
Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of existing methods. The first challenge relates to the interaction and cooperation between low-rank and sparse matrices, while the second involves determining weight allocation across different layers, as redundancy varies considerably among them. To address these challenges, we propose a novel two-stage LLM compression method with the capability of global resource allocation for rank and sparsity. It is noteworthy that the overall optimization space is vast, making comprehensive optimization computationally prohibitive. Therefore, to reduce the optimization space, our first stage utilizes robust principal component analysis to decompose the weight matrices of LLMs into low-rank and sparse components, which span the low dimensional and sparse spaces containing the resultant low-rank and sparse matrices, respectively. In the second stage, we propose a probabilistic global allocation strategy to jointly identify the low-rank and sparse structures within the above two spaces. The appealing feature of our approach is its ability to automatically detect the redundancy across different layers and to manage the interaction between the sparse and low-rank components. Extensive experimental results indicate that our method significantly surpasses state-of-the-art techniques for sparsification and composite approximation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4952. Opponent Shaping in LLM Agents
๐ฌ 13
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Agents Opponent Shaping Multi-agent Systems
Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players' learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner's Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner's Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4953. SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
๐ฌ 12
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ LLM evaluation search-augmented LLMs question answering reasoning
We introduce SealQA, a challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) SEAL-0 (main) and (2) SEAL-HARD, both of which assess factual accuracy and reasoning capabilities, where SEAL-0 targets the most challenging questions that frontier non-reasoning models (e.g., GPT-4.1) answer with near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models. Even frontier reasoning models face significant challenges across SealQA flavors. On SEAL-0, GPT-5 with tools achieves only 43.2% accuracy at its best reasoning effort. We also find that even advanced reasoning models (e.g., DeepSeek-R1) can be vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across GPT-5 and the o-series of models, with performance often plateauing or even declining early. Finally, while current models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at https://anonymous.4open.science/r/SealQA.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4954. Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning
๐ฌ 10
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Language models reasoning synthetic data contamination-proof human-like errors cognitive fallacies Erotetic Theory of Reasoning PyETR logical fallacies human-like reasoning patterns reasoning evaluation question-driven inference inverse scaling laws human cognition rationality vs fallibility cognitive biases order effects reasoning benchmarks cognitive science alignment AI alignment systematic deviations from logic normative vs descriptive reasoning reasoning tasks disjunction fallacy modus ponens modus tollens syllogistic inference logical validity data contamination natural language reasoning tasks formal semantics mental models evaluation harness Chatbot Arena medical diagnosis legal reasoning high-stakes decision-making alignment benchmarks robust reasoning systems AI evaluation frameworks
We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open-source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR-predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model's incorrect answers are ETR-predicted fallacies ($\rho=0.360, p=0.0265$), while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects. Methodologically, PyETR provides an open-source pipeline for unbounded, synthetic, contamination-resistant reasoning tests linked to a cognitive theory, enabling analyses that focus on error composition rather than error rate.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4955. Revisual-R1: Advancing Multimodal Reasoning From Optimized Cold Start to Staged Reinforcement Learning
๐ฌ 17
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multimodal Reasoning Multimodal Reinforcement Learning
Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning.
In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning.
This staged training approach effectively balances perceptual grounding and cognitive reasoning development.
By incorporating the above insights and addressing multimodal RL issues, we introduce \textbf{ReVisual-R1}, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4956. Fostering Video Reasoning via Next-Event Prediction
๐ฌ 22
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Video Instruction Fine-tuning
Next-token prediction serves as the foundational learning task that enables reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video captioning primarily promote modality alignment, while video question answering typically relies on annotations from humans or much stronger MLLMs. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts events in the future, thereby encouraging the model to reason temporally in order to complete the task. To study this learning task, we curate V1-33K, a dataset comprising 33,000 automatically extracted videos spanning diverse real-world scenarios. Using the same videos, we further explore a range of video instruction-tuning tasks to provide controlled comparisons and isolate the effect of NEP. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training task for fostering temporal reasoning in MLLMs.
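The construction below sketches how a past/future split could be turned into a next-event-prediction training example. It is a minimal sketch under assumptions: the split ratio, the prompt wording, and the source of the future-event caption are illustrative, not the paper's actual data pipeline.

```python
# Minimal sketch of building a next-event-prediction (NEP) training example.
# `frames` is a list of decoded video frames and `event_caption` an automatically
# extracted description of what happens in the held-out future segment; both the
# split ratio and the prompt wording here are illustrative assumptions.
def make_nep_example(frames, event_caption, past_fraction=0.5):
    cut = max(1, int(len(frames) * past_fraction))
    past, future = frames[:cut], frames[cut:]
    return {
        "input_frames": past,                                   # what the MLLM sees
        "prompt": "Given these frames, predict the next event.",
        "target": event_caption,                                # supervision from the future segment
        "num_future_frames_held_out": len(future),
    }

example = make_nep_example(frames=[f"frame_{i}" for i in range(8)],
                           event_caption="the cup falls off the table")
print(example["input_frames"], "->", example["target"])
```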
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4957. Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization
๐ฌ 19
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Delayed Reinforcement Learning Offline-to-Online Adaption
Offline-to-online deployment of reinforcement learning (RL) agents often stumbles over two fundamental gaps: (1) the sim-to-real gap, where real-world systems exhibit latency and other physical imperfections not captured in simulation; and (2) the interaction gap, where policies trained purely offline face out-of-distribution (OOD) issues during online execution, as collecting new interaction data is costly or risky. As a result, agents must generalize from static, delay-free datasets to dynamic, delay-prone environments.
In this work, we propose $\textbf{DT-CORL}$ ($\textbf{D}$elay-$\textbf{T}$ransformer belief policy $\textbf{C}$onstrained $\textbf{O}$ffline $\textbf{RL}$), a novel framework for learning delay-resilient policies solely from static, delay-free offline data. DT-CORL introduces a transformer-based belief model to infer latent states from delayed observations and jointly trains this belief with a constrained policy objective, ensuring that value estimation and belief representation remain aligned throughout learning. Crucially, our method does not require access to delayed transitions during training and outperforms naive history-augmented baselines, SOTA delayed RL methods, and existing belief-based approaches.
Empirically, we demonstrate that DT-CORL achieves strong delay-robust generalization across both locomotion and goal-conditioned tasks in the D4RL benchmark under varying delay regimes. Our results highlight that joint belief-policy optimization is essential for bridging the sim-to-real latency gap and achieving stable performance in delayed environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4958. Predictive Differential Training Guided by Training Dynamics
๐ฌ 22
โญ 4.50
๐ optimization
๐ท๏ธ Training Dynamics Koopman Operator Theory Predictive Training Deep Neural Networks
This paper centers around a novel concept proposed recently by researchers from the control community where the training process of a deep neural network can be considered a nonlinear dynamical system acting upon the high-dimensional weight space. Koopman operator theory (KOT), a data-driven dynamical system analysis framework, can then be deployed to discover the otherwise non-intuitive training dynamics. Taking advantage of the predictive power of KOT, the time-consuming Stochastic Gradient Descent (SGD) iterations can then be bypassed by directly predicting network weights a few epochs later. This "predictive training" framework, however, often suffers from gradient explosion, especially for more extensive and complex models. In this paper, we incorporate the idea of "differential learning" into the predictive training framework and propose the so-called "predictive differential training" (PDT) for accelerated learning even for complex network structures. The key contribution is the design of an effective masking strategy based on a dynamic consistency analysis, which selects only those predicted weights whose local training dynamics align with the global dynamics. We refer to these predicted weights as high-fidelity predictions. PDT also includes the design of an acceleration scheduler to adjust the prediction interval and rectify deviations from off-predictions. We demonstrate that PDT can be seamlessly integrated as a plug-in with a diverse array of existing optimizers (SGD, Adam, RMSprop, LAMB, etc.). The experimental results show consistent performance improvement across different network architectures and various datasets, in terms of faster convergence and reduced training time (10-40%) to achieve the baseline's best loss, while maintaining (if not improving) final model accuracy. As the idiom goes, a rising tide lifts all boats; in our context, a subset of high-fidelity predicted weights can accelerate the training of the entire network!
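The sketch below illustrates the general idea of predicting future weights from a short history of snapshots with a linear, DMD-style surrogate for the Koopman operator. It is a toy reading of "predictive training": the snapshot window, least-squares fit, and prediction horizon are assumptions, and the paper's high-fidelity masking and acceleration scheduler are not reproduced.

```python
import numpy as np

def predict_weights(snapshots, horizon):
    """Predict future flattened weight vectors from past training snapshots.

    snapshots: array of shape (T, n_params), one row per epoch.
    Fits a linear operator A with snapshots[t+1] ~ A @ snapshots[t]
    (a DMD-style surrogate for the Koopman operator) and rolls it forward,
    bypassing `horizon` epochs of actual optimizer updates.
    """
    X, Y = snapshots[:-1].T, snapshots[1:].T      # (n_params, T-1) each
    A = Y @ np.linalg.pinv(X)                     # least-squares one-step propagator
    w = snapshots[-1]
    for _ in range(horizon):
        w = A @ w
    return w

# Toy demo: a mock SGD trajectory that contracts geometrically toward an optimum.
rng = np.random.default_rng(0)
w_star = rng.normal(size=8)
delta = rng.normal(size=8)
snaps = np.array([w_star + 0.8 ** t * delta for t in range(6)])

pred = predict_weights(snaps, horizon=3)
# The extrapolated weights land closer to the optimum than the last real snapshot.
print(np.linalg.norm(snaps[-1] - w_star), np.linalg.norm(pred - w_star))
```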
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4959. NAB: Neural Adaptive Binning for Sparse-View CT reconstruction
๐ฌ 4
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Binning Rotation Reconstruction Computed Tomography
Computed Tomography (CT) plays a vital role in inspecting the internal structures of industrial objects. Furthermore, achieving high-quality CT reconstruction from sparse views is essential for reducing production costs. While classic implicit neural networks have shown promising results for sparse reconstruction, they are unable to leverage shape priors of objects. Motivated by the observation that numerous industrial objects exhibit rectangular structures, we propose a novel \textbf{N}eural \textbf{A}daptive \textbf{B}inning (\textbf{NAB}) method that effectively integrates rectangular priors into the reconstruction process. Specifically, our approach first maps coordinate space into a binned vector space. This mapping relies on an innovative binning mechanism based on differences between shifted hyperbolic tangent functions, with our extension enabling rotations around the input-plane normal vector. The resulting representations are then processed by a neural network to predict CT attenuation coefficients. This design enables end-to-end optimization of the encoding parameters---including position, size, steepness, and rotation---via gradient flow from the projection data, thus enhancing reconstruction accuracy. By adjusting the smoothness of the binning function, NAB can generalize to objects with more complex geometries. This research provides a new perspective on integrating shape priors into neural network-based reconstruction. Extensive experiments demonstrate that NAB achieves superior performance on two datasets.
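To make the binning mechanism concrete, the snippet below builds a soft bin from the difference of two shifted hyperbolic tangents, as the abstract describes. The exact parameterization (scaling, how rotation enters, how bins feed the downstream network) is an assumption for illustration only.

```python
import numpy as np

def soft_bin(x, center, width, steepness):
    """Soft membership of coordinate x in a bin: ~1 inside [center +/- width/2], ~0 outside.
    The difference of two shifted tanh step functions gives a differentiable box,
    so center, width, and steepness can all be optimized by gradient descent."""
    left = np.tanh(steepness * (x - (center - width / 2)))
    right = np.tanh(steepness * (x - (center + width / 2)))
    return 0.5 * (left - right)

x = np.linspace(-2, 2, 9)
print(np.round(soft_bin(x, center=0.0, width=1.0, steepness=8.0), 3))
# values near 1 for |x| < 0.5 and near 0 elsewhere; lowering `steepness`
# smooths the box, which is one way to handle less rectangular geometries
```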
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4960. Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization
๐ฌ 18
โญ 4.50
๐ optimization
๐ท๏ธ deep neural networks stochastic gradient descent sharpness-aware minimization
Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, unlike standard training such as stochastic gradient descent, we show that the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.
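For readers unfamiliar with the base optimizer this work builds on, the snippet below shows the standard two-step SAM update on a toy regression problem. It is the generic SAM procedure only; the calibration-oriented CSAM modification proposed in the abstract is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=128)

def loss_and_grad(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

w = np.zeros(5)
rho, lr = 0.05, 0.1
for _ in range(200):
    _, g = loss_and_grad(w)
    # SAM step 1: ascend to the (approximate) worst-case point in an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # SAM step 2: descend using the gradient evaluated at the perturbed weights.
    _, g_sharp = loss_and_grad(w + eps)
    w -= lr * g_sharp

print(np.round(w - w_true, 3))  # close to zero on this toy regression
```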
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4961. Human-LLM Collaborative Feature Engineering for Tabular Data
๐ฌ 6
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ human-ai interaction human-centered evaluation
Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance. However, current approaches typically assign the LLM as a black-box optimizer, responsible for both proposing and selecting operations based solely on its internal heuristics, which often lack calibrated estimations of operation utility and consequently lead to repeated exploration of low-yield operations without a principled strategy for prioritizing promising directions. In this paper, we propose a human-LLM collaborative feature engineering framework for tabular learning. We begin by decoupling the transformation operation proposal and selection processes, where LLMs are used solely to generate operation candidates, while the selection is guided by explicitly modeling the utility and uncertainty of each proposed operation. Since accurate utility estimation can be difficult, especially in the early rounds of feature engineering, we design a mechanism within the framework that selectively elicits and incorporates human expert preference feedback (comparing which operations are more promising) into the selection process to help identify more effective operations.
Our evaluations on both the synthetic study and the real user study demonstrate that the proposed framework improves feature engineering performance across a variety of tabular datasets and reduces users' cognitive load during the feature engineering process.
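The toy class below sketches one way such uncertainty-aware operation selection with a preference nudge could look: a UCB-style score over LLM-proposed transformations, with expert comparisons shifting the prior means. The scoring rule, the exploration bonus, and the way feedback is folded in are assumptions for illustration, not the paper's mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

class OperationSelector:
    """Track utility estimates for LLM-proposed feature transformations and pick
    the next one to try; human preference feedback nudges the prior means."""
    def __init__(self, op_names, beta=1.0):
        self.ops = op_names
        self.mean = np.zeros(len(op_names))      # estimated utility (e.g., CV-score gain)
        self.count = np.zeros(len(op_names))
        self.beta = beta                         # exploration weight

    def select(self):
        bonus = self.beta / np.sqrt(self.count + 1.0)   # high for rarely tried operations
        return int(np.argmax(self.mean + bonus))

    def update(self, op_idx, observed_gain):
        self.count[op_idx] += 1
        self.mean[op_idx] += (observed_gain - self.mean[op_idx]) / self.count[op_idx]

    def add_preference(self, preferred_idx, other_idx, strength=0.1):
        # Expert judges `preferred_idx` as more promising than `other_idx`.
        self.mean[preferred_idx] += strength
        self.mean[other_idx] -= strength

sel = OperationSelector(["log(income)", "age*bmi", "onehot(city)"])
sel.add_preference(preferred_idx=1, other_idx=2)      # early human guidance
op = sel.select()
sel.update(op, observed_gain=rng.normal(0.02, 0.01))  # downstream CV improvement
print(sel.ops[op], np.round(sel.mean, 3))
```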
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4962. BrowseNet: Knowledge Graph-Based Associative Memory for Contextual Information Retrieval
๐ฌ 14
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ retrieval augmented generation knowledge graph continual learning large language models
Associative memory systems face significant challenges in efficiently retrieving semantically related information from large document collections, particularly when queries require traversing complex relationships between concepts. Traditional retrieval-augmented generation (RAG) approaches often struggle to capture intricate associative patterns and relationships embedded within textual data. To address this limitation, we propose BrowseNet, a novel associative memory framework that leverages query-specific subgraph exploration within a named-entity based graph for enhanced information retrieval. Our method transforms unstructured text into a graph-of-chunks representation, where nodes encode document chunks with semantic embeddings and edges capture lexical relationships between content segments. By dynamically traversing the graph-of-chunks based on query characteristics, BrowseNet emulates content-addressable memory systems that enable efficient pattern matching and associative recall. The framework incorporates both structural similarity derived from lexical relationships and semantic similarity based on embedding representations to optimize retrieval performance. We evaluate BrowseNet against established RAG baselines and state-of-the-art (SOTA) pipelines using publicly available datasets that require associative reasoning across multiple information sources. Experimental results demonstrate that BrowseNet achieves SOTA performance in exact match score over both the graph-based RAG approaches and the dense retrieval methods. The two-pronged approach combining structural graph traversal with semantic embeddings enables more effective associative memory retrieval, particularly for queries requiring the integration of disparate but related information.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4963. DSA: Efficient Inference For Video Generation Models via Distributed Sparse Attention
๐ฌ 10
โญ 4.50
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ Distributed System Diffusion Inference Sparsity
Diffusion Transformer models have driven the rapid advances in video generation, achieving state-of-the-art quality and flexibility. However, their attention mechanism remains a major performance bottleneck, as its dense computation scales quadratically with the sequence length. To overcome this limitation and reduce the generation latency, we propose DSA, a novel attention mechanism that integrates sparse attention with distributed inference for diffusion-based video generation. By leveraging carefully-designed parallelism strategies and scheduling, DSA significantly reduces redundant computation while preserving global context. Extensive experiments on benchmark datasets demonstrate that, when deployed on 8 GPUs, DSA achieves up to a 1.43$\times$ inference speedup over the existing distributed method and runs 10.79$\times$ faster than single-GPU inference.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4964. CompassNav: Steering From Path Imitation to Decision Understanding In Navigation
๐ฌ 18
โญ 4.50
๐ applications to robotics, autonomy, planning
๐ท๏ธ Embodied AI Goal-Driven Navigation Large Vision-Language Models Reinforcement Fine-Tuning
The dominant paradigm for training Large Vision-Language Models (LVLMs) in navigation relies on imitating expert trajectories. This approach reduces the complex navigation task to a sequence-to-sequence replication of a single correct path, fundamentally limiting the agent's ability to explore and generalize. In this work, we argue for and introduce a new paradigm: a shift from Path Imitation to Decision Understanding. The goal of this paradigm is to build agents that do not just follow, but truly understand how to navigate. We materialize this through two core contributions: first, we introduce Compass-Data-22k, a novel 22k-trajectory dataset. Its Reinforcement Fine-Tuning (RFT) subset provides a panoramic view of the decision landscape by annotating all feasible actions with A* geodesic distances. Second, we design a novel gap-aware hybrid reward function that dynamically adapts its feedback to decision certainty, shifting between decisive signals for optimal actions and nuanced scores to encourage exploration. Integrated into an SFT-then-RFT recipe, our CompassNav agent is trained not to memorize static routes, but to develop an internal 'compass' that constantly intuits the direction to the goal by evaluating the relative quality of all possible moves. This approach enables our 7B agent to set a new state-of-the-art on navigation benchmarks, outperforming even larger proprietary models, and achieve robust real-world goal navigation on a physical robot.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4965. A Brain-Inspired Gating Mechanism Unlocks Robust Computation in Spiking Neural Networks
๐ฌ 15
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks (SNNs) Dynamic Gated Neurons Noise Robustness Brain-Inspired Computing
While spiking neural networks (SNNs) provide a biologically inspired and energy-efficient computational framework, their robustness and the dynamic advantages inherent to biological neurons remain significantly underutilized owing to oversimplified neuron models. In particular, conventional leaky integrate-and-fire (LIF) neurons often omit the dynamic conductance mechanisms inherent in biological neurons, thereby limiting their capacity to cope with noise and temporal variability. In this work, we revisit dynamic conductance from a functional perspective and uncover its intrinsic role as a bio-inspired gating mechanism that modulates information flow. Building on this insight, we introduce the Dynamic Gated Neuron~(DGN), a novel spiking unit in which membrane conductance evolves in response to neuronal activity, enabling selective input filtering and adaptive noise suppression. We provide a theoretical analysis showing that DGNs possess enhanced stochastic stability compared to standard LIF models, with dynamic conductance intriguingly acting as a disturbance rejection mechanism. DGN-based SNNs demonstrate superior performance across extensive evaluations on anti-noise tasks and temporal benchmarks such as TIDIGITS and SHD, consistently exhibiting excellent robustness. To the best of our knowledge, for the first time, our results establish bio-inspired dynamic gating as a key mechanism for robust spike-based computation, providing not only theoretical guarantees but also strong empirical validations. This work thus paves the way for more resilient, efficient, and biologically inspired spiking neural networks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4966. Adaptive Data-Knowledge Alignment in Genetic Perturbation Prediction
๐ฌ 19
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Genetic Perturbation Gene Expression Prediction AI for Biology Neuro-Symbolic AI
The transcriptional response to genetic perturbation reveals fundamental insights into complex cellular systems. While current approaches have made progress in predicting genetic perturbation responses, they provide limited biological understanding and cannot systematically refine existing knowledge. Overcoming these limitations requires an end-to-end integration of data-driven learning and existing knowledge. However, this integration is challenging due to inconsistencies between data and knowledge bases, such as noise, misannotation, and incompleteness. To address this challenge, we propose ALIGNED (Adaptive aLignment for Inconsistent Genetic kNowledgE and Data), a neuro-symbolic framework based on the Abductive Learning (ABL) paradigm. This end-to-end framework aligns neural and symbolic components and performs systematic knowledge refinement. We introduce a balanced consistency metric to evaluate the predictions' consistency against both data and knowledge. Our results show that ALIGNED outperforms state-of-the-art methods by achieving the highest balanced consistency, while also re-discovering biologically meaningful knowledge. Our work advances beyond existing methods to enable both the transparency and the evolution of mechanistic biological understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4967. Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs
๐ฌ 23
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models; Mixture of Experts; Manifold Regularization
Sparse Mixture-of-Experts (MoE) architectures have been widely adopted in recent large language models since they can efficiently scale up model capacity without increasing the inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings via post-training can effectively reduce the gap and improve MoE LLMs' generalization performance. Our method, Routing Manifold Alignment (RoMA), introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieve better generalization. Moreover, RoMA demonstrates the advantage of unifying the task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in two recent MoE LLMs using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
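The toy function below sketches the kind of manifold regularizer the abstract describes: each sample's routing weights are pulled toward those of its successful neighbors in a task-embedding space. The neighborhood rule, number of neighbors, and lack of similarity weighting are assumptions; this is not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_task, n_experts = 32, 16, 8

task_emb = rng.normal(size=(n, d_task))          # task embeddings of the samples
routing = rng.dirichlet(np.ones(n_experts), n)   # per-sample routing weights (one layer)
correct = rng.random(n) > 0.3                    # which samples were answered correctly

def roma_style_reg(routing, task_emb, correct, k=4):
    """Average squared distance between each sample's routing weights and those of
    its k nearest *successful* neighbors in task-embedding space."""
    succ = np.where(correct)[0]
    reg = 0.0
    for i in range(len(routing)):
        d = np.linalg.norm(task_emb[succ] - task_emb[i], axis=1)
        nbrs = succ[np.argsort(d)[:k]]
        reg += np.sum((routing[i] - routing[nbrs]) ** 2)
    return reg / len(routing)

# Added to the post-training loss with only the router parameters trainable.
print(roma_style_reg(routing, task_emb, correct))
```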
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4968. Layerwise Federated Learning for Heterogeneous Quantum Clients using Quorus
๐ฌ 15
โญ 4.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Federated Learning Heterogeneity Quantum
Quantum machine learning (QML) holds the promise to solve classically intractable problems, but, as critical data can be fragmented across private clients, there is a need for distributed QML in a quantum federated learning (QFL) format. However, the quantum computers that different clients have access to can be error-prone and have heterogeneous error properties, requiring them to run circuits of different depths. We propose a novel solution to this QFL problem, Quorus, that utilizes a layerwise loss function for effective training of varying-depth quantum models, which allows clients to choose models for high-fidelity output based on their individual capacity. Quorus also presents various model designs based on client needs that optimize for shot budget, qubit count, midcircuit measurement, and optimization space. Our simulation and real-hardware results show the promise of Quorus: it increases the magnitude of gradients of higher depth clients and improves testing accuracy by 12.4% on average over the state-of-the-art.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4969. TimeSeriesExamAgent: Creating TimeSeries Reasoning Benchmarks at Scale
๐ฌ 18
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ Time Series Reasoning Scalable Benchmarking AI Agent Multimodal Learning
Large Language Models (LLMs) have shown promising performance in time series modeling tasks, but do they truly understand time series data? While multiple benchmarks have been proposed to answer this fundamental question, most are manually curated and focus on narrow domains or specific skill sets.
To address this limitation, we propose scalable methods for creating comprehensive time series reasoning benchmarks that combine the flexibility of templates with the creativity of LLM agents. We first develop TimeSeriesExam, a multiple-choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories: pattern recognition, noise understanding, similarity analysis, anomaly detection, and causality. We then scale our approach by automatically generating benchmarks from real-world datasets spanning healthcare, finance and weather domains. Through multi-dimensional quality evaluation, we demonstrate that our automatically generated benchmarks achieve diversity comparable to manually curated alternatives. However, our experiments reveal that LLM performance remains limited in both abstract time series reasoning and domain-specific applications, highlighting ongoing challenges in enabling effective time series understanding in these models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4970. Beyond Grid-Locked Voxels: Neural Response Functions for Continuous Brain Encoding
๐ฌ 18
โญ 4.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ Neural encoding model Computational Neuroscience Neuroimaging Medical imaging Implicit neural representation
Neural encoding models aim to predict fMRI-measured brain responses to natural images. fMRI data is acquired as a 3D volume of voxels, where each voxel has a defined spatial location in the brain. However, conventional encoding models often flatten this volume into a 1D vector and treat voxel responses as independent outputs. This removes spatial context, discards anatomical information, and ties each model to a subject-specific voxel grid. We introduce the Neural Response Function (NRF), a framework that models fMRI activity as a continuous function over anatomical space rather than a flat vector of voxels. NRF represents brain activity as a continuous implicit function: given an image and a spatial coordinate (x, y, z) in standardized MNI space, the model predicts the response at that location. This formulation decouples predictions from the training grid, supports querying at arbitrary spatial resolutions, and enables resolution-agnostic analyses. By grounding the model in anatomical space, NRF exploits two key properties of brain responses: (1) local smoothness: neighboring voxels exhibit similar response patterns, so modeling responses continuously captures these correlations and improves data efficiency; and (2) cross-subject alignment: MNI coordinates unify data across individuals, allowing a model pretrained on one subject to be fine-tuned on new subjects. In experiments, NRF outperformed baseline models in both intrasubject encoding and cross-subject adaptation, achieving high performance while reducing the required data size by orders of magnitude. To our knowledge, NRF is the first anatomically aware encoding model to move beyond flattened voxels, learning a continuous mapping from images to brain responses in 3D space.
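The sketch below shows the shape of such a coordinate-conditioned encoding model: an image embedding and an (x, y, z) query are combined and mapped to a predicted response. The feature extractor, Fourier positional encoding, layer sizes, and random (untrained) weights are all illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(xyz, n_freq=4):
    """Encode a (normalized) MNI coordinate with sines and cosines of several frequencies."""
    freqs = 2.0 ** np.arange(n_freq)                       # (n_freq,)
    ang = xyz[:, None, :] * freqs[None, :, None]           # (B, n_freq, 3)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1).reshape(len(xyz), -1)

# Random stand-ins for learned parameters; in a real model these would be trained
# to regress the measured voxel response from (image embedding, coordinate encoding).
d_img, d_pos, d_hid = 16, 2 * 4 * 3, 32
W1 = rng.normal(size=(d_img + d_pos, d_hid)) * 0.1
W2 = rng.normal(size=(d_hid, 1)) * 0.1

def predict_response(img_embedding, xyz):
    """Continuous encoding model: a response at any coordinate, at any resolution."""
    h = np.concatenate([np.tile(img_embedding, (len(xyz), 1)),
                        fourier_features(xyz)], axis=1)
    return np.tanh(h @ W1) @ W2                             # (B, 1) predicted responses

img = rng.normal(size=d_img)                                # one image's embedding
coords = rng.uniform(-1, 1, size=(5, 3))                    # query 5 arbitrary locations
print(predict_response(img, coords).ravel())
```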
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4971. SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling
๐ฌ 19
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ smote synthetic data generation privacy attacks
The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most widely used methods for addressing class imbalance and generating synthetic data.
Despite its popularity, little attention has been paid to its privacy implications; yet, it is used in the wild in many privacy-sensitive applications.
In this work, we conduct the first systematic study of privacy leakage in SMOTE:
We begin by showing that prevailing evaluation practices, i.e., naive distinguishing and distance-to-closest-record metrics, completely fail to detect any leakage and that membership inference attacks (MIAs) can be instantiated with high accuracy.
Then, by exploiting SMOTE's geometric properties, we build two novel attacks with very limited assumptions: DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets, and ReconSMOTE, which reconstructs real minority records from synthetic datasets with perfect precision and recall approaching one under realistic imbalance ratios.
We also provide theoretical guarantees for both attacks.
Experiments on eight standard imbalanced datasets confirm the practicality and effectiveness of these attacks.
Overall, our work reveals that SMOTE is inherently non-private and disproportionately exposes minority records, highlighting the need to reconsider its use in privacy-sensitive applications and as a baseline for assessing the privacy of modern generative models.
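The sketch below is not the paper's DistinSMOTE or ReconSMOTE attacks; it only illustrates the geometric property they exploit. Textbook SMOTE places every synthetic record on the segment between a real minority record and one of its nearest neighbors, so synthetic points can be flagged (and endpoints recovered) by checking exact collinearity with pairs of candidates, something that essentially never holds for real continuous data.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(X, n_new, k=3):
    """Textbook SMOTE: interpolate between a minority record and one of its k nearest neighbors."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])      # one of the k nearest neighbors
        lam = rng.uniform()
        out.append(X[i] + lam * (X[j] - X[i]))       # a point on the segment [X[i], X[j]]
    return np.array(out)

real = rng.normal(size=(20, 4))                      # stand-in minority-class records
synth = smote(real, n_new=10)

def on_some_segment(z, X, tol=1e-8):
    """True if z lies (numerically exactly) on a segment between two rows of X."""
    for a in range(len(X)):
        for b in range(a + 1, len(X)):
            seg = X[b] - X[a]
            lam = np.dot(z - X[a], seg) / np.dot(seg, seg)
            if 0 <= lam <= 1 and np.linalg.norm(X[a] + lam * seg - z) < tol:
                return True
    return False

print([on_some_segment(z, real) for z in synth])     # all True: synthetic records betray themselves
print([on_some_segment(x, np.delete(real, i, 0)) for i, x in enumerate(real)])  # all False for real data
```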
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4972. An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes
๐ฌ 11
โญ 4.50
๐ causal reasoning
๐ท๏ธ Causal Machine Learning Doubly Robust Estimation Neyman-Orthogonality Markov Decision Process
Predicting individualized potential outcomes in sequential decision-making is central for optimizing therapeutic decisions in personalized medicine (e.g., which dosing sequence to give to a cancer patient). However, predicting potential outcomes over long horizons is notoriously difficult. Existing methods that break the curse of the horizon typically lack strong theoretical guarantees such as orthogonality and quasi-oracle efficiency. In this paper, we revisit the problem of predicting individualized potential outcomes in sequential decision-making (i.e., estimating Q-functions in Markov decision processes with observational data) through a causal inference lens. In particular, we develop a comprehensive theoretical foundation for meta-learners in this setting with a focus on beneficial theoretical properties. As a result, we yield a novel meta-learner called DRQ-learner and establish that it is: (1) doubly robust (i.e., valid inference under model misspecification), (2) Neyman-orthogonal (i.e., insensitive to first-order estimation errors in the nuisance functions), and (3) quasi-oracle efficient (i.e., behaves asymptotically as if the ground-truth nuisance functions were known). Our DRQ-learner is applicable to settings with both discrete and continuous state spaces. Further, our DRQ-learner is flexible and can be used together with arbitrary machine learning models (e.g., neural networks). We validate our theoretical results through numerical experiments, thereby showing that our meta-learner outperforms state-of-the-art baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4973. Flipping the Dialogue: Training and Evaluating User Language Models
๐ฌ 19
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ User Language Models User Simulation Interactive Evaluation Post-Training
Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants - optimized to produce exhaustive and well-structured responses, often free of ambiguity. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique and indirect ways, making the least effort at each turn and refining on the fly. To better understand LM performance in a realistic setting, prior work has proposed to simulate users in multi-turn conversations, often prompting an assistant model to play the role of the user. However, we show in this work that assistant LMs make for poor user simulators, with surprising evidence that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4974. Topological Flow Matching
๐ฌ 18
โญ 4.50
๐ generative models
๐ท๏ธ Flow Matching Generative Models Topological Deep Learning Geometric Deep Learning Graphs Simplicial Complexes Schrödinger Bridge Optimal Transport
Flow matching is a powerful generative modeling framework, valued for its simplicity and strong empirical performance. However, its standard formulation treats signals on structured spaces---such as fMRI data on brain graphs---as points in Euclidean space, overlooking the rich topological features of their domains. To address this, we introduce \emph{topological flow matching}, a topology-aware generalization of flow matching. We interpret flow matching as a framework for solving a degenerate Schrödinger bridge problem and inject topological information by augmenting the reference process with a Laplacian-derived drift. This principled modification captures the structure of the underlying domain while preserving the desirable properties of flow matching: a stable, simulation-free objective and deterministic sample paths. As a result, our framework serves as a plug-and-play replacement for standard flow matching. We demonstrate its effectiveness on diverse structured datasets, including brain fMRIs, ocean currents, seismic events, and traffic flows.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4975. Developmental Federated Tuning: A Cognitive-Inspired Paradigm for Efficient LLM Adaptation
๐ฌ 24
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Federated Fine-Tuning Large Language Models Efficient Training
Federated fine-tuning enables Large Language Models (LLMs) to adapt to downstream tasks while preserving data privacy, but its resource-intensive nature limits deployment on edge devices. In this paper, we introduce Developmental Federated Tuning (DevFT), a resource-efficient approach inspired by cognitive development that progressively builds a powerful LLM from a compact foundation. DevFT decomposes the fine-tuning process into developmental stages, each optimizing a submodel with increasing parameter capacity. Knowledge acquired in earlier stages is transferred to subsequent submodels, providing optimized initialization parameters that prevent convergence to local minima and accelerate training. This paradigm mirrors human learning, gradually constructing comprehensive knowledge structure while refining existing skills. To efficiently build stage-specific submodels, DevFT introduces deconfliction-guided layer grouping and differential-based layer fusion to distill essential information and construct representative layers. Evaluations across multiple benchmarks demonstrate that DevFT significantly outperforms state-of-the-art methods, achieving up to $4.59\times$ faster convergence, $10.67\times$ reduction in communication overhead, and 9.07% average performance improvement, while maintaining compatibility with existing approaches. We submit the code with the paper for reproducibility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4976. Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
๐ฌ 4
โญ 4.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM post-training; off-policy RLVR
Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5\% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7\% of problems that base models consistently fail to solve.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4977. Feature segregation by signed weights in artificial vision systems and biological models
๐ฌ 9
โญ 4.50
๐ interpretability and explainable AI
๐ท๏ธ ventral stream circuit mechanisms interpretability deep learning visual system excitation inhibition neuroscience closed-loop optimization ablation
Signed connections are central to both artificial and biological intelligence: positive and negative weights in artificial networks, and excitatory and inhibitory synapses in the brain. Yet their representational role remains unclear.
Here, we investigate how signed weights shape visual representations in artificial and biological systems involved in object recognition.
Using sign consistency as a proxy for biological Dale's law, which requires each neuron to send exclusively excitatory or exclusively inhibitory outputs, we found that the accuracy of ImageNet-trained networks correlated positively with the Dale index of their output layer.
Ablation and feature visualization reveal a functional segregation: removing positive inputs disrupts object related, low frequency structure, while removing negative inputs mainly alters background textures. This segregation is more pronounced in adversarially robust models, persists with unsupervised learning, and vanishes with non-rectified activations.
In intermediate layers, the most positive Dale-like channels encoded localized, object-like features, whereas the most negative ones captured dispersed, background features.
We next performed $\textit{in vivo}$ feature visualization in monkey ventral visual cortex (V1, V4, and IT) and fitted linear models using the input layer to the neural networks' classification units. These models reproduced features similar to those preferred by the biological neurons. In the model neurons, removing positive inputs altered representations more than removing negative ones.
The most Dale-like positively projecting units exhibited localized features, while the negatively projecting units showed larger, more dispersed features, suited to carrying contextual input. Consistent with this, clearing the background around each neuron's preferred feature enhanced its response, likely by reducing inhibitory drive, supporting inhibition as a contextual modulation of the excitatory feature.
Our results demonstrate that both artificial and biological vision systems segregate features by weight sign: positive weights emphasize objects and low frequencies, negative weights encode context. This shows the convergence of representational strategies in brains and machines, yielding predictions for visual neuroscience.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4978. AbdCTBench: Learning Clinical Biomarker Representations from Abdominal Surface Geometry
๐ฌ 26
โญ 4.50
๐ datasets and benchmarks
๐ท๏ธ computer vision for healthcare radiology Computed Tomography (CT) vision transformers CNNs
Body composition analysis through CT and MRI imaging provides critical insights for cardio-metabolic health assessment but remains limited by accessibility barriers including radiation exposure, high costs, and infrastructure requirements. We present AbdCTBench, a large-scale dataset containing 23,506 CT-derived abdominal surface meshes from 18,719 patients, paired with 87 comorbidity labels, 31 specific diagnosis codes, and 16 CT-derived biomarkers. Our key insight is that external surface geometry is predictive of internal tissue composition, enabling accessible health screening through consumer devices. We establish comprehensive benchmarks across seven computer vision architectures (ResNet-18/34/50, DenseNet-121, EfficientNet-B0, ViT-Small, Swin Transformer-Base), demonstrating that models can learn robust surface-to-biomarker representations directly from 2D mesh projections. Our best-performing models achieve clinically relevant accuracy: age prediction with MAE 6.22 years (R²=0.757), mortality prediction with AUROC 0.839, and diabetes (with chronic complications) detection with AUROC 0.801. Notably, smaller architectures consistently matched or surpassed larger models, while medical-domain pre-training (RadImageNet) and self-supervised pre-training (DINOv2) showed competitive but not superior performance. AbdCTBench represents the largest publicly available dataset bridging external body geometry with internal clinical measurements, enabling future research in accessible medical AI. We plan to release the dataset, evaluation protocols, and baseline models to accelerate research in representation learning for medical applications, immediately following the review period.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4979. TSLM: Tree-Structured Language Modeling for Divergent Thinking
๐ฌ 12
โญ 4.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ language models reasoning planning Supervised Learning Inference-time scaling
Language models generate reasoning sequentially, preventing them from decoupling irrelevant exploration paths during search. We introduce Tree-Structured Language Modeling (TSLM), which uses special tokens to encode branching structure, enabling models to generate and selectively expand multiple search paths within a single generation process. By training on complete search trees including both successful and failed attempts, TSLM learns to internalize systematic exploration without redundant recomputation of shared prefixes. TSLM achieves 100\% accuracy on Game of 24 (vs. 17\% sequential baseline), robust extrapolation to 20×20 grids (91.5\% vs. 42.7\% for Tree-of-Thought), and superior inference efficiency by avoiding the multiple independent forward passes required by external search methods. These results suggest a new paradigm of inference-time scaling for robust reasoning, demonstrating that supervised learning on complete tree-structured traces provides an efficient alternative for developing systematic exploration capabilities in language models.
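A minimal sketch of the core idea of encoding a search tree in a single token stream with branch markers, so sibling continuations reuse rather than repeat their shared prefix. The tag names `<branch>`/`<end>` and the toy Game-of-24 content are illustrative placeholders, not TSLM's actual vocabulary or training format.

```python
# Illustrative sketch: serialize a search tree into one token stream with branch
# markers, so sibling branches share (and never repeat) their common prefix.
# The tags <branch>/<end> and the toy content are placeholders.

def linearize(node):
    """node = (tokens_for_this_step, [child_subtrees]); returns a flat token list."""
    tokens, children = node
    out = list(tokens)
    for child in children:
        out.append("<branch>")       # open an alternative continuation
        out.extend(linearize(child))
        out.append("<end>")          # close it; the shared prefix above is not repeated
    return out

# Toy tree: one shared prefix, two alternative continuations (one failed, one successful).
tree = (["state: 5 1 4 6"],
        [(["try 4*6=24", "fail: 5 and 1 left over"], []),
         (["try (5-1)*6", "= 24, success"], [])])

print(linearize(tree))
```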
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4980. Selective Rotary Position Embedding
๐ฌ 9
โญ 4.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ RoPE Linear Transformer Attention State Space Models Forget Gate
Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (\textit{RoPE}) encode positions through \textit{fixed-angle} rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce \textit{Selective RoPE}, an \textit{input-dependent} rotary embedding mechanism that generalizes \textit{RoPE} and enables rotation in \textit{arbitrary angles} for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with \textit{Selective RoPE}, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
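A minimal numpy sketch of the general mechanism the abstract describes: rotating (even, odd) feature pairs by angles that depend on the input rather than only on position. The angle projection `W_theta` and the cumulative-sum accumulation over positions are illustrative assumptions; the paper's exact parameterization for gated and softmax transformers differs.

```python
import numpy as np

def rotate_pairs(x, theta):
    """Rotate consecutive (even, odd) feature pairs of x by per-position angles theta.
    x: (seq, dim) with dim even; theta: (seq, dim // 2)."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[:, 0::2] = c * x1 - s * x2
    out[:, 1::2] = s * x1 + c * x2
    return out

rng = np.random.default_rng(0)
seq, dim = 6, 8
x = rng.normal(size=(seq, dim))               # token features
W_theta = rng.normal(size=(dim, dim // 2))    # stand-in for a learned angle projection

# Input-dependent per-step angles, accumulated over positions so the relative
# rotation between positions i and j depends on the tokens in between.
# (Standard RoPE instead uses fixed angles theta[t] = t * omega.)
theta = np.cumsum(x @ W_theta, axis=0)
q_rot = rotate_pairs(x, theta)                # keys would be rotated the same way
print(q_rot.shape)                            # (6, 8)
```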
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4981. CO3: CONTRASTING CONCEPTS COMPOSE BETTER
๐ฌ 21
โญ 4.50
๐ generative models
๐ท๏ธ Composition classifier-free guidance diffusion text2image
We propose to improve multi-concept prompt fidelity in text-to-image diffusion models. We begin with common failure cases: prompts like "a cat and a clock" that sometimes yield images where one concept is missing, faint, or colliding awkwardly with another. We hypothesize that this happens when the diffusion model drifts into mixed modes that over-emphasize a single concept it learned strongly during training. Instead of re-training, we introduce a corrective sampling strategy that steers away from regions where the joint prompt behavior overlaps too strongly with any single concept in the prompt. The goal is to steer towards "pure" joint modes where all concepts can coexist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. Our approach, CO3, is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts indicate improvements in concept coverage, balance and robustness, with fewer dropped or distorted concepts compared to standard baselines and prior compositional methods. Results suggest that lightweight corrective guidance can substantially mitigate brittle semantic alignment behavior in modern diffusion systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4982. From Cheap Geometry to Expensive Physics: Elevating Neural Operators via Latent Shape Pretraining
๐ฌ 21
โญ 4.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural Operator Pretraining Physics-agnostic Partial Differential Equation Autoencoder
Industrial design evaluation often relies on high-fidelity simulations of governing partial differential equations (PDEs). While accurate, these simulations are computationally expensive, making dense exploration of design spaces impractical. Operator learning has emerged as a promising approach to accelerate PDE solution prediction; however, its effectiveness is often limited by the scarcity of labeled physics-based data. At the same time, large numbers of geometry-only candidate designs are readily available but remain largely untapped. We propose a two-stage framework to better exploit this abundant, physics-agnostic resource and improve supervised operator learning under limited labeled data. In Stage 1, we pretrain an autoencoder on a geometry reconstruction task to learn an expressive latent representation without PDE labels. In Stage 2, the neural operator is trained in a standard supervised manner to predict PDE solutions, using the pretrained latent embeddings as inputs instead of raw point clouds. Transformer-based architectures are adopted for both the autoencoder and the neural operator to handle point cloud data and integrate both stages seamlessly. Across four PDE datasets and three state-of-the-art transformer-based neural operators, our approach consistently improves prediction accuracy compared to models trained directly on raw point cloud inputs. These results demonstrate that representations from physics-agnostic pretraining provide a powerful foundation for data-efficient operator learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4983. LoRA-S: An Efficient Low Rank Adaptation scheme via Sylvester equation
๐ฌ 14
โญ 4.50
๐ optimization
๐ท๏ธ optimization LoRA
Numerous studies on low-rank adaptation (LoRA) have emerged in recent years, with the aim of accelerating the convergence of the LoRA framework. In this paper, we leverage the horizontal lift theory from differential geometry to establish the general iteration scheme on the quotient manifold $\mathbb{R}_*^{m \times r} \times \mathbb{R}_*^{n \times r}/\sim$.
By endowing the LoRA framework with Riemannian quotient geometries, our theory not only guarantees efficient feature learning but also bridges the LoRA algorithms and the pre-training algorithms for large models.
Furthermore, we theoretically analyze the role of the weight decay matrix $\epsilon_{decay}I$ in efficient feature learning and then replace it with the Sylvester matrix $K$, indicating that the theory helps remove an important hyperparameter while generating accurate and computationally efficient optimizers.
Based on the general scheme, we propose two efficient LoRA optimizers with runtime analysis, Adam-Sylvester (AdamS) and LRACS, and conduct experiments on transformer-based networks. The results demonstrate clear improvements over existing optimizers.
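For reference, the two standard objects the abstract builds on, written out explicitly: the rank-$r$ LoRA factorization whose invariance defines the quotient manifold (shown with the usual GL($r$) identification; the paper's precise equivalence relation is defined there), and the generic Sylvester equation that motivates the matrix $K$ replacing $\epsilon_{decay}I$. The specific construction of $K$ and the AdamS/LRACS updates are given in the paper.

```latex
% Rank-r LoRA factorization; the quotient identifies factor pairs giving the same update:
W \;=\; W_0 + U V^{\top}, \qquad U \in \mathbb{R}_*^{m \times r},\ V \in \mathbb{R}_*^{n \times r},
\qquad (U, V) \sim \bigl(U G,\; V G^{-\top}\bigr) \ \text{for invertible } G \in \mathbb{R}^{r \times r}.

% Generic Sylvester equation (reference form for the matrix K that replaces eps_decay * I):
A X + X B = C .
```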
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4984. ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping
๐ฌ 31
โญ 4.50
๐ reinforcement learning
๐ท๏ธ Entropy based Multimodal Adaptive Reasoning
Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to *overthink* on simple problems, producing unnecessarily lengthy reasoning traces, while *under-exploring* on challenging ones, leading to missed solutions. To address this imbalance, we propose **ARES**, a unified open-source framework for *adaptive reasoning* that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, *high window-entropy (HWE) tokens* (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the *Adaptive Cold-Start* stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop *Adaptive Entropy Policy Optimization (AEPO)*, which uses HWE tokens as exploration triggers to decide *when to explore*, and a hierarchical entropy reward with dynamic KL control to decide *how much to explore*. Extensive experiments demonstrate that ARES achieves state-of-the-art performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs. The anonymous code repository is available at https://anonymous.4open.science/r/ARES-60728M.
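A small numpy sketch of the high window-entropy (HWE) statistic described above: per-token entropies from the next-token distribution, smoothed with a sliding-window average, with high-HWE positions flagged as exploration triggers. The window size and the quantile threshold are illustrative choices, not the paper's settings.

```python
import numpy as np

def window_entropy(logits, window=5):
    """Per-token entropies from next-token logits, plus a sliding-window average (HWE).
    logits: (seq, vocab). Returns (entropy, windowed_entropy), each of shape (seq,)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)
    hwe = np.convolve(ent, np.ones(window) / window, mode="same")
    return ent, hwe

rng = np.random.default_rng(1)
logits = rng.normal(size=(40, 1000))                 # stand-in for a decoded trace
ent, hwe = window_entropy(logits)
threshold = np.quantile(hwe, 0.8)                    # illustrative cutoff, not the paper's rule
print("HWE trigger positions:", np.where(hwe > threshold)[0])
```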
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4985. Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers
๐ฌ 27
โญ 4.50
๐ generative models
๐ท๏ธ Audio-to-Video Generation Multimodal Synthesis Temporal Synchronization Diffusion Transformer Video Generation Audio-Conditioned Generation
Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion.
In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation.
However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity.
We present Syncphony, which generates 380×640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization:
(1) Motion-aware Loss, which emphasizes learning at high-motion regions;
(2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality.
To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality.
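A schematic of the Audio Sync Guidance idea as described above: the audio-conditioned prediction is extrapolated away from the prediction of a visually aligned off-sync model, in the spirit of classifier-free guidance. The function name, the linear combination form, and the scale value are assumptions for illustration, not the paper's exact rule.

```python
import numpy as np

def audio_sync_guidance(eps_full, eps_offsync, scale=2.0):
    """Extrapolate the audio-conditioned denoiser output away from the prediction of a
    visually aligned off-sync model (no audio layers), analogous to classifier-free
    guidance. The linear form and scale are illustrative assumptions."""
    return eps_offsync + scale * (eps_full - eps_offsync)

eps_full = np.zeros((4, 16, 16))       # stand-ins for per-frame denoiser outputs
eps_off = 0.1 * np.ones((4, 16, 16))
print(audio_sync_guidance(eps_full, eps_off).shape)
```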
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4986. PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs
๐ฌ 18
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ multimodal grounding MLLM hallucination alignment
Multimodal Large Language Models (MLLMs) have shown remarkable performance in vision-language tasks, such as image captioning and visual question answering. However, these models often struggle with fine-grained visual understanding and are prone to hallucinations, primarily due to over-reliance on linguistic priors that distract them from leveraging actual visual information. This results in outputs that are often unanchored in the visual content, leading to errors. To address these challenges, we introduce MMGrounded-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities of MLLMs and mitigate hallucinations. In the framework, the visual grounding module identifies the referred objects in the image, while the textual grounding module generates the rationale for the final answer. This dual grounding approach ensures that outputs are firmly anchored in both visual and textual evidence. In particular, we incorporate a negative rejection mechanism within the visual grounding module to distinguish between grounded entities and non-existent objects influenced by linguistic biases. Moreover, we propose a selective reasoning mechanism within the textual grounding module to adjust the model's reasoning strategy based on the complexity of the query. These innovations together work to resolve the issues associated with hallucinations and enhance the overall alignment between visual and textual modalities. Extensive evaluations on benchmarks such as POPE, HaloQuest, ReasonSeg, MME, and MMBench demonstrate significant improvements in fine-grained visual understanding and hallucination suppression, showcasing the effectiveness of our approach in real-world multimodal tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4987. Learning from Synthetic Data Improves Multi-hop Reasoning
๐ฌ 12
โญ 4.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ multi-hop reasoning large language models reinforcement learning synthetic data
Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks.
However, RL fine-tuning requires abundant high-quality verifiable data, often obtained through human-annotated datasets and LLM-as-verifier loops.
Both of these data types have considerable limitations: human-annotated datasets are small and expensive to curate, while LLM verifiers have high scoring latency and are costly to operate.
In this work, we investigate the use of synthetic datasets in RL fine-tuning for multi-hop reasoning tasks.
We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, even though the synthetic data only contain fictional knowledge.
On stratifying model performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge, which we believe to be a fundamental and generalizable reasoning skill.
Our work thus highlights the utility of synthetic reasoning datasets in improving LLM reasoning capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4988. Compositional Generalization through Gradient Search in Nonparametric Latent Space
๐ฌ 25
โญ 4.40
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ compositional generalization variational Bayesian methods meta-learning abstract reasoning nonparametric representations
Neural network architectures have made considerable advances in their ability to solve reasoning problems, but many state-of-the-art methods fail at systematic compositional generalization. To address this, we propose a novel architecture which uses a nonparametric latent space, information-theoretic regularization of this space, and test-time gradient-based search to achieve strong performance on OOD compositional meta-learning tasks such as ARC-like program induction, Raven's progressive matrices, and linguistic systematicity tasks. Our proposed architecture, Abduction Transformer, uses nonparametric mixture distributions to represent inferred hidden causes of few-shot meta-learning instances. These representations are refined at test-time via gradient descent to better account for the observed few-shot examples, a form of variational posterior inference which allows Abduction Transformer to solve meta-learning tasks that require novel recombinations of knowledge acquired during training. Our method outperforms standard transformer architectures and previous test-time adaptive approaches, indicating a promising new direction for neural networks capable of systematic generalization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4989. Knowledge Fusion of Large Language Models via Modular SkillPacks
๐ฌ 40
โญ 4.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ Knowledge Fusion Model Merging Large Language Model Task Vector
Cross-capability transfer represents a key challenge in large language model (LLM) research, particularly in multi-task integration, model compression, and knowledge fusion. Recent works such as FuseLLM and FuseChat have shown the potential of transferring multiple model capabilities to lightweight models, thereby enhancing adaptability and efficiency. This motivates our investigation into more efficient methods for cross-capability transfer. However, existing merging approaches primarily focus on small, homogeneous models, limiting their applicability.
For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's inherent capability and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs.
To address these issues, we introduce **GraftLLM**, a novel grafting-based method that stores source model capabilities in a target model + SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy for parameter updates, ensuring efficient storage while **preserving task-specific knowledge**. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for **heterogeneous LLM fusion**.
Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4990. The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution
๐ฌ 22
โญ 4.40
๐ infrastructure, software libraries, hardware, systems, etc.
๐ท๏ธ AI programming assistants large language models code generation Matthew effect software ecosystem evolution programming languages and frameworks multilingual benchmarking agentic coding
AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) enabling new paradigms such as vibe coding and agentic coding. While prior works have focused on prompt design and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. In this paper, we conduct large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks to systematically investigate how AI-assisted programming interacts with the software ecosystem. Our analysis quantifies a substantial performance asymmetry: mainstream languages and frameworks achieve significantly higher success rates than niche ones. This disparity suggests a feedback loop consistent with the Matthew Effect, where data-rich ecosystems gain superior AI support. While not the sole driver of adoption, current models introduce a non-negligible productivity friction for niche technologies, representing a hidden bias in software evolution.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4991. An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
๐ฌ 29
โญ 4.40
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Multimodal Large Language Models Visual Token Pruning
Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://anonymous.4open.science/w/AdaVTP-186A/
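For concreteness, the standard effective-rank (erank) measure referenced above, computed as the exponential of the entropy of the normalized singular-value spectrum of a token-feature matrix (Roy & Vetterli, 2007); whether the paper applies exactly this estimator, and to which layer's features, is specified in the paper.

```python
import numpy as np

def effective_rank(features):
    """Effective rank (Roy & Vetterli, 2007): exp of the Shannon entropy of the
    normalized singular-value spectrum of a (tokens x dim) feature matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

rng = np.random.default_rng(0)
diverse = rng.normal(size=(256, 64))                               # roughly isotropic tokens
redundant = np.outer(rng.normal(size=256), rng.normal(size=64))    # rank-1 token matrix
print(effective_rank(diverse), effective_rank(redundant))          # high vs. close to 1
```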
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4992. When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models
๐ฌ 27
โญ 4.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLMs Compression LRMs Compression Quantization Pruning Distillation
Compression methods, including quantization, distillation, and pruning, improve the computational efficiency of large reasoning models (LRMs). However, existing studies either fail to sufficiently compare all three compression methods on LRMs or lack in-depth interpretation analysis. In this paper, we investigate how the reasoning capabilities of LRMs are compromised during compression, through performance benchmarking and mechanistic interpretation. To uncover the effects of compression on reasoning performance, we benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue). To precisely locate compression effects on model weights, we adapt difference of means and attribution patching techniques, focusing on the activation of every linear component in compressed LRMs, to interpret fine-grained causal relationships between weights and various reasoning capabilities. This fine-grained interpretation addresses a fundamental question of compression: which weights are the most important for reasoning? Overall, we find dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. With empirical verification, we present three main findings that generalize across both Llama and Qwen: (1) Weight count has a greater impact on LRMs' knowledge memorization than reasoning, highlighting the risks of pruning and distillation; (2) The MLP up projection in the final layer of distilled LRMs is one of the most important components, offering a new perspective on locating critical weights - a fundamental problem in model compression; and (3) Current quantization methods overly compress the final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%, greatly surpassing the state-of-the-art.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4993. ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings
๐ฌ 32
โญ 4.40
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ proof auto-formalization mathematical theorem proving Lean 4 interactive theorem prover joint embedding cross-modal retrieval large language model (LLM)
Translating human-written mathematical theorems and proofs from natural language (NL) into formal languages (FLs) like Lean 4 has long been a significant challenge for AI. Most state-of-the-art methods either focus on theorem-only NL-to-FL auto-formalization or on FL proof synthesis from FL theorems. In practice, auto-formalization of both theorem and proof still requires human intervention, as seen in AlphaProof's silver-medal performance at the 2024 IMO, where problem statements were manually translated before automated proof synthesis.
We present ProofBridge, a unified framework for automatically translating entire NL theorems and proofs into Lean 4. At its core is a joint embedding model that aligns NL and FL (NL-FL) theorem-proof pairs in a shared semantic space, enabling cross-modal retrieval of semantically relevant FL examples to guide translation. Our training ensures that NL-FL theorems (and their proofs) are mapped close together in this space if and only if the NL-FL pairs are semantically equivalent. ProofBridge integrates retrieval-augmented fine-tuning with iterative proof repair, leveraging Lean's type checker and semantic equivalence feedback to ensure both syntactic correctness and semantic fidelity. Experiments show substantial improvements in proof auto-formalization over strong baselines (including GPT-5, Gemini-2.5, Kimina-Prover, DeepSeek-Prover), with our retrieval-augmented approach yielding significant gains in semantic correctness (SC, via proving bi-directional equivalence) and type correctness (TC, via type-checking theorem+proof) across pass@k metrics on miniF2F-Test-PF, a dataset we curated. In particular, ProofBridge improves cross-modal retrieval quality by up to 3.28x Recall@1 over all-MiniLM-L6-v2, and achieves +31.14% SC and +1.64% TC (pass@32) compared to the baseline Kimina-Prover-RL-1.7B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4994. Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents
๐ฌ 30
โญ 4.40
๐ datasets and benchmarks
๐ท๏ธ Epistemic Competence Evidence-Grounded Reasoning LLM Search Agents
Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open-domain question answering. However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence.
We introduce **SeekBench**, the first process-level evaluation framework for LLM search agents that operationalizes *epistemic competence* through metrics derived from an annotation schema.
We develop and validate our annotation schema using an expert-annotated dataset of 190 traces (over 1,800 steps).
To evaluate at scale, we introduce an LLM-as-judge pipeline.
Our framework provides granular analysis of whether agents demonstrate: (1) **groundedness**, by generating reasoning steps supported by observed evidence; (2) **recovery**, by adaptively reformulating searches to recover from low-quality results; and (3) **calibration**, by correctly assessing whether current evidence is sufficient to provide an answer.
By applying our evaluation framework to state-of-the-art search agents tuned on Qwen2.5-7B, we uncover critical behavioral gaps that answer-only metrics miss, as well as specialized skills such as Search-R1's synthesis abilities. These analyses highlight distinct epistemic competencies, offering actionable insights for the development of more capable and trustworthy agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4995. t-SNE Exaggerates Clusters, Provably
๐ฌ 15
โญ 4.40
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ nonlinear dimension reduction data visualization t-SNE
Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. To the contrary, we prove that (1) the strength of the input clustering, and (2) the extremity of outlier points, cannot be reliably inferred from the t-SNE output. We demonstrate the prevalence of these failure modes in practice as well.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4996. Are Deep Speech Denoising Models Robust to Adversarial Noise?
๐ฌ 5
โญ 4.40
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Adversarial Robustness Adversarial Perturbations Security Safety Speech Enhancement Speech Denoising Noise Suppression Deep Noise Suppression Psychoacoustic Masking
Deep noise suppression (DNS) models enjoy widespread use throughout a variety of high-stakes speech applications.
However, we show that four recent DNS models can each be reduced to outputting unintelligible gibberish through the addition of psychoacoustically hidden adversarial noise, even in low-background-noise and simulated over-the-air settings. For three of the models, a small transcription study with audio and multimedia experts confirms unintelligibility of the attacked audio; simultaneously, an ABX study shows that the adversarial noise is generally imperceptible, with some variance between participants and samples.
While we also establish several negative results around targeted attacks and model transfer, our results nevertheless highlight the need for practical countermeasures before open-source DNS systems can be used in safety-critical applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4997. MEGS^{2}: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning
๐ฌ 21
โญ 4.40
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ 3D vision 3D construction novel view synthesis 3D Gaussian Splatting lightweight 3DGS 3DGS compression
3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS², a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we fully replace the memory-intensive Spherical Harmonics with lightweight, arbitrarily oriented and prunable Spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS² achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4998. ViTSP: A Vision Language Models Guided Framework for Large-Scale Traveling Salesman Problems
๐ฌ 6
โญ 4.40
๐ optimization
๐ท๏ธ Machine Learning Large Language Model Traveling Salesman Problem Combinatorial Optimization
Solving Traveling Salesman Problem (TSP) is NP-hard yet fundamental for wide real-world applications. Classical exact methods face challenges in scaling, and heuristic methods often require domain-specific parameter calibration. While learning-based approaches have shown promise, they suffer from poor generalization and limited scalability due to fixed training data. This work proposes ViTSP, a novel framework that leverages pre-trained vision language models (VLMs) to visually guide the solution process for large-scale TSPs. The VLMs function to identify promising small-scale subproblems from a visualized TSP instance, which are then efficiently optimized using an off-the-shelf solver to improve the global solution. ViTSP bypasses the dedicated model training at the user end while maintaining effectiveness across diverse instances. Experiments on real-world TSP instances ranging from 1k to 88k nodes demonstrate that ViTSP consistently achieves solutions with average optimality gaps below 0.2\%, outperforming existing learning-based methods. Under the same runtime budget, it surpasses the best-performing heuristic solver, LKH-3, by reducing its gaps by 12\% to 100\%, particularly on very-large-scale instances with more than 10k nodes. Our framework offers a new perspective in hybridizing pre-trained generative models and operations research solvers in solving combinatorial optimization problems, with practical implications for integration into more complex logistics systems. The code is available at https://anonymous.4open.science/r/ViTSP_codes-6683.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
4999. Inference-Time Personalized Safety Control via Paired Difference-in-Means Intervention
๐ฌ 14
โญ 4.40
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ safety alignment personalized alignment
Safety preferences are inherently subjective, yet current LLM safety alignment methods often impose universal standards that fail to account for individual sensitivities. In this work, we propose an efficient, training-free method for personalized safety control via inference-time activation intervention. Our approach steers internal representations to suppress user-specific undesired content while preserving model utility. We systematically evaluate three strategies for estimating intervention directions: Instance-Level Contrast Shift (ILCS), Unpaired Mean Shift (UMS), and our primary method, Paired Contrast Mean Shift (PCMS). We provide theoretical insights into each approach and highlight the advantages of PCMS. Empirical results across diverse open-weight models demonstrate that our method effectively reduces undesired content in line with individual preferences, with minimal impact on helpfulness, enabling more adaptive and user-aligned LLM behavior.
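A minimal sketch of a paired difference-in-means intervention of the kind described above: a steering direction estimated from matched pairs of activations on undesired vs. desired completions, then removed (scaled by `alpha`) from hidden states at inference. The layer choice, scaling schedule, and how user preferences select the pairs are the paper's design and are not shown here.

```python
import numpy as np

def paired_dim_direction(h_undesired, h_desired):
    """Paired difference-in-means steering direction from matched activation pairs.
    h_*: (n_pairs, hidden), row i of each array coming from the same prompt."""
    v = (h_undesired - h_desired).mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h, v, alpha=1.0):
    """Suppress the undesired direction in a hidden state h by removing its projection."""
    return h - alpha * (h @ v) * v

rng = np.random.default_rng(0)
h_bad = rng.normal(size=(64, 128)) + 0.5       # activations on undesired completions
h_good = rng.normal(size=(64, 128))            # matched activations on desired ones
v = paired_dim_direction(h_bad, h_good)
h_new = steer(rng.normal(size=128), v)
print(h_new.shape, float(h_new @ v))           # projection onto v is ~0 for alpha=1
```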
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5000. A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
๐ฌ 14
โญ 4.40
๐ interpretability and explainable AI
๐ท๏ธ Interpretability Multimodal Learning Large Vision-Language Models Partial Information Decomposition Information Theory
Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the ``information spectrum'' of LVLMsโdecomposing a modelโs decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, Our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensionsโ\emph{breadth} (cross-model \& cross-task), \emph{depth} (layer-wise information dynamics), and \emph{time} (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, opposing family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer guidance for analyzing and designing the next generation of LVLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5001. Federated Learning of Quantile Inference under Local Differential Privacy
๐ฌ 14
โญ 4.40
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Confidence interval; Federated learning; Local differential privacy; Quantile; Self-normalization
In this paper, we investigate federated learning for quantile inference under local differential privacy (LDP). We propose an estimator based on local stochastic gradient descent (SGD), whose local gradients are perturbed via a randomized mechanism with global parameters, making the procedure tolerant of communication and storage constraints without compromising statistical efficiency. Although the quantile loss and its corresponding gradient do not satisfy standard smoothness conditions typically assumed in existing literature, we establish asymptotic normality for our estimator as well as a functional central limit theorem. The proposed method accommodates data heterogeneity and allows each server to operate with an individual privacy budget. Furthermore, we construct confidence intervals for the target value through a self-normalization approach, thereby circumventing the need to estimate additional nuisance parameters. Extensive numerical experiments and a real data application validate the theoretical guarantees of the proposed methodology.
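A toy sketch of locally private SGD on the quantile (pinball) loss: each client reports a Laplace-perturbed subgradient, and the server averages the noisy reports. The Laplace mechanism, single-sample-per-round schedule, and step size are illustrative assumptions; the paper's randomized mechanism with global parameters and its self-normalized inference are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, eps_priv, lr = 0.5, 1.0, 0.5          # quantile level, per-report privacy budget, step size
n_clients, n_rounds = 20, 500
data = [rng.normal(loc=2.0, scale=1.0, size=500) for _ in range(n_clients)]

theta = 0.0
for t in range(1, n_rounds + 1):
    reports = []
    for x_local in data:
        x = rng.choice(x_local)                                 # one local sample this round
        g = float(x < theta) - tau                              # pinball subgradient, in [-tau, 1-tau]
        reports.append(g + rng.laplace(scale=1.0 / eps_priv))   # Laplace noise (sensitivity 1)
    theta -= (lr / np.sqrt(t)) * np.mean(reports)               # averaged local SGD step
print("private estimate of the median:", round(theta, 2))       # should land near 2.0
```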
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5002. Model Predictive Adversarial Imitation Learning for Planning from Observation
๐ฌ 24
โญ 4.40
๐ applications to robotics, autonomy, planning
๐ท๏ธ Imitation Learning Reinforcement Learning Model Predictive Control
Humans can often perform a new task after observing a few demonstrations by inferring the underlying intent. For robots, recovering the intent of the demonstrator through a learned reward function can enable more efficient, interpretable, and robust imitation through planning. A common paradigm for learning how to plan-from-demonstration involves first solving for a reward via Inverse Reinforcement Learning (IRL) and then deploying it via Model Predictive Control (MPC). In this work, we unify these two procedures by introducing planning-based Adversarial Imitation Learning, which simultaneously learns a reward and improves a planning-based agent through experience while using observation-only demonstrations. We study advantages of planning-based AIL in generalization, interpretability, robustness, and sample efficiency through experiments in simulated control tasks and real-world navigation from few or single observation-only demonstration.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5003. Understanding and Relaxing the Limitations of Transformers for Linear Algebra
๐ฌ 12
โญ 4.40
๐ interpretability and explainable AI
๐ท๏ธ Numerical Linear Algebra Transformers Out-of-distribution Generalization
Matrix operations, such as linear solves, eigendecompositions, and log determinants, are foundational building blocks for any number of downstream applications. Therefore, any broadly capable learning system should be able to effectively approximate these operations in its internal representation. Accordingly, there is great motivation to study transformers for linear algebra --- for if transformers cannot even semi-competently perform matrix operations, then we cannot expect them to form a basis for a generally intelligent system. We demonstrate that current techniques developing transformers for linear algebra have striking failure modes, prohibitive scaling, and particularly poor out-of-distribution generalization to other matrix distributions, and matrices of different sizes. Investigating further, we find that current transformer approaches operate as statistical interpolators, rather than discovering algorithms that will generalize to matrices from other distributions. Based on our understanding of these limitations, we develop a sequence of interventions that substantially improve scaling and performance, including matrix embeddings through a learnable projection, linear attention, looping, and a data pre-training distribution of structured matrices. We term the resulting method the \emph{RangeFormer}, which we show has significantly improved scaling and performance on challenging OOD matrices from the \emph{matrix market}. Moreover, with RangeFormer we show for the first time that transformers can be successfully applied to downstream tasks that involve iterative matrix operations, including Gaussian process learning, and improving the sampling distribution of randomized methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5004. BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
๐ฌ 30
โญ 4.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reinforcement finetuning (RFT) Large Language Models (LLMs) Online task selection Bayesian inference
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce \textbf{BOTS}, a unified framework for \textbf{B}ayesian \textbf{O}nline \textbf{T}ask \textbf{S}election in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates \emph{explicit evidence} from direct evaluations of selected tasks and \emph{implicit evidence} inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation for task selection. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
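A minimal Beta-Bernoulli stand-in for Bayesian online task selection with Thompson sampling: per-task solve probabilities get Beta posteriors updated from rollout outcomes (the explicit evidence), and each round a Thompson draw picks the task closest to a target solve rate. The paper's actual posterior, its implicit-evidence interpolation plug-in, and the selection criterion differ; this only illustrates the explore/exploit mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, target = 50, 0.5                       # aim rollouts at tasks near a 50% solve rate
true_p = rng.uniform(0.0, 1.0, n_tasks)         # unknown per-task solve rates
alpha = np.ones(n_tasks)                        # Beta(1, 1) prior on each task's solve rate
beta = np.ones(n_tasks)

for step in range(500):
    draw = rng.beta(alpha, beta)                      # Thompson sample of each task's difficulty
    task = int(np.argmin(np.abs(draw - target)))      # prefer tasks near the target solve rate
    success = float(rng.random() < true_p[task])      # rollout outcome = explicit evidence
    alpha[task] += success
    beta[task] += 1.0 - success

picked = np.argsort(alpha + beta)[-5:]                # most frequently selected tasks
print("solve rates of the most-sampled tasks:", np.round(true_p[picked], 2))
```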
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5005. W-EDIT: A Wavelet-Based Frequency-Aware Framework for Text-Driven Image Editing
๐ฌ 34
โญ 4.40
๐ generative models
๐ท๏ธ Diffusion Transformers Text-driven Image Editing Training-free Method
While recent advances in Diffusion Transformers (DiTs) have significantly advanced text-to-image generation, text-driven image editing remains challenging. Existing approaches either struggle to balance structural preservation with flexible modifications or require costly fine-tuning of large models. To address this, we introduce W-Edit, a training-free framework for text-driven image editing based on wavelet-based frequency-aware feature decomposition. W-Edit employs wavelet transforms to decompose diffusion features into multi-scale frequency bands, disentangling structural anchors from editable details. A lightweight replacement module selectively injects these components into pretrained models, while an inversion-based frequency modulation strategy refines sampling trajectories using structural cues from attention features. Extensive experiments demonstrate that W-Edit achieves high-quality results across a wide range of editing scenarios, outperforming previous training-free approaches. Our method establishes frequency-based modulation as both a sound and efficient solution for controllable image editing.
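To make the frequency-band decomposition concrete, a self-contained single-level 2-D Haar split of a feature map into one low-frequency band and three detail bands (unnormalized averaging/difference filters). W-Edit's actual wavelet family, number of levels, and how the bands are injected into the pretrained model are as described in the paper.

```python
import numpy as np

def haar2d(x):
    """Single-level 2-D Haar split of an (H, W) array with even H and W into one
    low-frequency band (LL) and three detail bands (LH, HL, HH); unnormalized
    averaging/difference filters, so the split is exactly invertible."""
    lo, hi = (x[0::2, :] + x[1::2, :]) / 2.0, (x[0::2, :] - x[1::2, :]) / 2.0
    ll, lh = (lo[:, 0::2] + lo[:, 1::2]) / 2.0, (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    hl, hh = (hi[:, 0::2] + hi[:, 1::2]) / 2.0, (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, lh, hl, hh

feat = np.random.default_rng(0).normal(size=(32, 32))   # stand-in for a diffusion feature map
ll, lh, hl, hh = haar2d(feat)
print(ll.shape, lh.shape, hl.shape, hh.shape)            # four (16, 16) bands
```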
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5006. Defending against Backdoor Attacks via Module Switching
๐ฌ 19
โญ 4.40
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Backdoor attacks backdoor defense model merging
Backdoor attacks pose a serious threat to deep neural networks (DNNs), allowing adversaries to implant triggers for hidden behaviors in inference. Defending against such vulnerabilities is especially difficult in the post-training setting, since end-users lack training data or prior knowledge of the attacks. Model merging offers a cost-effective defense; however, the latest methods, such as weight averaging (WAG), provide reasonable protection when multiple homologous models are available, but are less effective with fewer models and place heavy demands on defenders. We propose a module-switching defense (MSD) for disrupting backdoor shortcuts. We first validate its theoretical rationale and empirical effectiveness on two-layer networks, showing that it achieves higher backdoor divergence than WAG while preserving utility. For deep models, we evaluate MSD on Transformer and CNN architectures and design an evolutionary algorithm to optimize fusion strategies with selective mechanisms to identify the most effective combinations. Experiments show that MSD achieves stronger defense with fewer models in practical settings, and even under an underexplored case of collusive attacks among multiple models--where some models share the same backdoors--switching strategies by MSD deliver superior robustness against diverse attacks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5007. From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
๐ฌ 17
โญ 4.40
๐ datasets and benchmarks
๐ท๏ธ education agent benchmark llm application visualisation
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5008. HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design
๐ฌ 24
โญ 4.40
๐ optimization
๐ท๏ธ Large Language Models Evolutionary Computation Guided Prompt Synthesis Knowledge Accumulation Automated Algorithm Design
This paper investigates the application of Large Language Models (LLMs) in Automated Heuristic Design (AHD), where their integration into evolutionary frameworks reveals a significant gap in global control and long-term learning. We propose the Hindsight-Foresight Prompt (HiFo-Prompt), a novel framework for LLM-based AHD designed to overcome these limitations. This is achieved through two synergistic strategies: Foresight and Hindsight. Foresight acts as a high-level meta-controller, monitoring population dynamics (e.g., stagnation and diversity collapse) to switch the global search strategy between exploration and exploitation explicitly. Hindsight builds a persistent knowledge base by distilling successful design principles from past generations, making this knowledge reusable. This dual mechanism ensures that the LLM is not just a passive operator but an active reasoner, guided by a global plan (Foresight) while continuously improving from its cumulative experience (Hindsight). Empirical results demonstrate that HiFo-Prompt significantly outperforms a comprehensive suite of state-of-the-art AHD methods, discovering higher-quality heuristics with substantially improved convergence speed and query efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5009. SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
๐ฌ 30
โญ 4.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM Reasoning
Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten. The code will be released publicly.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5010. EvolProver: Advancing Automated theorem proving by Evolving Formalized Problems via Symmetry and Difficulty
๐ฌ 21
โญ 4.40
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Formal Theorem Proving AI4Math LLM4Math Automated Theorem Proving Prover
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two complementary methods: **EvolAST**, an Abstract Syntax Tree (AST) based approach that targets syntactic symmetry to generate semantically equivalent problem variants, and **EvolDomain**, which leverages LLMs to address semantic symmetry by translating theorems across mathematical domains. From the difficulty perspective, we propose **EvolDifficulty**, which uses carefully designed evolutionary instructions to guide LLMs in generating new theorems with a wider range of difficulty. We then use the evolved data to train **EvolProver**, a 7B-parameter non-reasoning theorem prover. EvolProver establishes a new state-of-the-art (SOTA) on FormalMATH-Lite with a 53.8\% pass@32 rate, surpassing all models of comparable size, including reasoning-based models. It also sets new SOTA records for non-reasoning models on MiniF2F-Test (69.8\% pass@32), Ineq-Comp-Seed (52.2\% pass@32), and Ineq-Comp-Transformed (34.0\% pass@32). Ablation studies further confirm our data augmentation pipeline's effectiveness across multiple benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5011. From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
๐ฌ 21
โญ 4.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Multimodal Models Multi-token Prediction Non-Autoregressive Learning
Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on Audio-QA, ASR, AAC and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component. We will open-source our models, data and code to facilitate future research in this direction.
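A small sketch of a modality-aware attention mask of the kind described above: text positions attend causally, while positions inside the same contiguous audio span also attend bidirectionally to each other. Whether audio may additionally attend across spans or to future text, and how spans interact with block-wise diffusion, are the paper's design choices and are not encoded here.

```python
import numpy as np

def modality_aware_mask(is_audio):
    """Boolean attention mask (True = may attend). Text positions attend causally;
    positions inside the same contiguous audio span also attend to each other
    bidirectionally, matching non-autoregressive modeling over audio spans."""
    n = len(is_audio)
    span = -np.ones(n, dtype=int)                 # span id per position, -1 for text
    sid = 0
    for i, a in enumerate(is_audio):
        if a:
            if i == 0 or not is_audio[i - 1]:
                sid += 1                          # a new audio span begins
            span[i] = sid
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_span = (span[:, None] == span[None, :]) & (span[:, None] >= 0)
    return causal | same_span

# text, text, [audio span], text, [audio span]
mask = modality_aware_mask([False, False, True, True, True, False, True, True])
print(mask.astype(int))
```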
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5012. ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution
๐ฌ 17
โญ 4.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ Evolution LLMs Scientific discovery
We introduce ShinkaEvolve: a new framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and efficiency. The field of LLM-driven scientific discovery has seen significant progress, but has yet to overcome a critical limitation: sample inefficiency, requiring thousands of samples to identify effective solutions. ShinkaEvolve takes a concrete step towards addressing this critical limitation by introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. When applied to the canonical circle-packing optimization task, ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, orders of magnitude fewer than prior frameworks. Furthermore, applied to a broader set of engineering problems, ShinkaEvolve designs robust agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions to stabilize LLM training itself. We provide ShinkaEvolve's full code together with this submission, which will be open-sourced to accelerate open advancements to open-ended automated discovery across diverse computational problems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5013. GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs
๐ฌ 23
โญ 4.40
๐ generative models
๐ท๏ธ Hallucinations Multimodal Large Language Models Spurious Correlations
Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5014. Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization
๐ฌ 26
โญ 4.40
๐ optimization
๐ท๏ธ sharpness-aware minimization implicit bias gradient flow
We study the implicit bias of sharpness-aware minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically, even on a single-example dataset where we can analyze the dynamics. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $0$ or to any standard basis vector; this is in stark contrast to GD, whose limit aligns with the basis vector of the dominant coordinate in the data. For $\ell_2$-SAM, we uncover a phenomenon we call *sequential feature discovery*, in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization grows. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later. Synthetic and real-data experiments corroborate our findings.
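For reference, the standard $\ell_2$-SAM update whose normalization factor the analysis points to: because the ascent direction is the normalized gradient, coordinates with small gradients (minor features) receive a perturbation that is large relative to their own scale early in training.

```latex
% Standard SAM step with the \ell_2-normalized ascent direction (reference form):
\epsilon_t \;=\; \rho\, \frac{\nabla L(w_t)}{\lVert \nabla L(w_t) \rVert_2},
\qquad
w_{t+1} \;=\; w_t - \eta\, \nabla L\bigl(w_t + \epsilon_t\bigr).
```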
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5015. SAFA-SNN: Sparsity-Aware On-Device Few-Shot Class-Incremental Learning with Fast-Adaptive Structure of Spiking Neural Network
๐ฌ 25
โญ 4.40
๐ applications to neuroscience & cognitive science
๐ท๏ธ Few-Shot Class-Incremental Learning Spiking Neural Network Brain-Inspired Learning Edge Computing
Continuous learning of novel classes is crucial for edge devices to preserve data privacy and maintain reliable performance in dynamic environments. However, the scenario becomes particularly challenging when data samples are insufficient, requiring on-device few-shot class-incremental learning (FSCIL) to maintain consistent model performance. Although existing work has explored parameter-efficient FSCIL frameworks based on artificial neural networks (ANNs), their deployment is still fundamentally constrained by limited device resources. Spiking neural networks (SNNs) process spatiotemporal information efficiently, offering lower energy consumption, greater biological plausibility, and better compatibility with neuromorphic hardware than ANNs. In this work, we propose an SNN-based method with Sparsity-Aware neuronal dynamics and a Fast-Adaptive structure (SAFA-SNN) for on-device FSCIL. First, neurons are made sparsity-aware through dynamic thresholds, keeping most of them stable. As a result, the synaptic traces that encode base-class knowledge are naturally preserved, thereby alleviating catastrophic forgetting. To cope with spike non-differentiability in backpropagation, we employ a gradient-free estimation, i.e., zeroth-order optimization. Moreover, class prototypes can limit overfitting on few-shot data but introduce bias in incremental learning. We enhance prototype discriminability by subspace projection. Extensive experiments conducted on two standard benchmark datasets (CIFAR100 and Mini-ImageNet) and three neuromorphic datasets (CIFAR-10-DVS, DVS128gesture, and N-Caltech101) demonstrate that SAFA-SNN outperforms baseline methods, specifically achieving at least a 4.01\% improvement at the last incremental session on Mini-ImageNet and 20\% lower energy cost on Spiking ResNet-20 over baselines in a practical implementation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5016. How Transformers Learn Causal Structures In-Context: Explainable Mechanism Meets Theoretical Guarantee
๐ฌ 17
โญ 4.40
๐ foundation or frontier models, including LLMs
๐ท๏ธ transformers in-context learning interpretability Markov chain
Transformers have demonstrated remarkable in-context learning abilities, adapting to new tasks from just a few examples without parameter updates. However, theoretical understanding of this phenomenon typically assumes fixed dependency structures, while real-world sequences exhibit flexible, context-dependent relationships. We address this gap by investigating whether transformers can learn causal structures -- the underlying dependencies between sequence elements -- directly from in-context examples.
We propose a novel framework using Markov chains with randomly sampled causal dependencies, where transformers must infer which tokens depend on which predecessors to make accurate predictions. Our key contributions are threefold: (1) We prove that a two-layer transformer with relative position embeddings can implement Bayesian Model Averaging (BMA), the optimal statistical algorithm for causal structure inference; (2) Through extensive experiments and parameter-level analysis, we demonstrate that transformers trained on this task learn to approximate BMA, with attention patterns directly reflecting the inferred causal structures; (3) We provide information-theoretic guarantees showing how transformers recover causal dependencies and extend our analysis to continuous dynamical systems, revealing fundamental differences in representational requirements.
Our findings bridge the gap between empirical observations of in-context learning and theoretical understanding, showing that transformers can perform sophisticated statistical inference over structural uncertainty.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5017. Testing Most Influential Sets
๐ฌ 37
โญ 4.40
๐ interpretability and explainable AI
๐ท๏ธ attribution robustness auditing causal inference fairness least squares extreme value
Small influential data subsets can dramatically impact model conclusions, with a few data points overturning key findings. While recent work identifies these most influential sets, there is no formal way to tell when maximum influence is excessive rather than expected under natural random sampling variation. We address this gap by developing a principled framework for most influential sets. Focusing on linear least-squares, we derive a convenient exact influence formula and identify the extreme value distributions of maximal influence: the heavy-tailed Fréchet for constant-size sets and heavy-tailed data, and the well-behaved Gumbel for growing sets or light tails. This allows us to conduct rigorous hypothesis tests for excessive influence. We demonstrate the framework through applications across economics, biology, and machine learning benchmarks, resolving contested findings and replacing ad-hoc heuristics with rigorous inference.
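The "convenient exact influence formula" referenced above is not reproduced in this listing. As background, the sketch below (plain NumPy, hypothetical function and variable names) illustrates the standard block leave-k-out identity for ordinary least squares, which gives the exact coefficient change from removing a set of rows without refitting; it is offered only as an illustration of that style of computation, not as the paper's formula.

```python
import numpy as np

def leave_out_delta(X, y, idx):
    """Exact change in OLS coefficients when the rows in `idx` are removed.

    Uses the block leave-k-out identity
        beta_{-S} = beta - (X^T X)^{-1} X_S^T (I - H_S)^{-1} e_S,
    where H_S is the removed-rows block of the hat matrix and e_S their
    residuals.  Each candidate set costs O(|S|^3) once (X^T X)^{-1} is cached.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    Xs, ys = X[idx], y[idx]
    Hs = Xs @ XtX_inv @ Xs.T                     # |S| x |S| hat sub-block
    es = ys - Xs @ beta                          # residuals of removed rows
    delta = XtX_inv @ Xs.T @ np.linalg.solve(np.eye(len(idx)) - Hs, es)
    return beta, beta - delta                    # full-data and leave-S-out fits

# Toy check against a brute-force refit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
S = [3, 17, 42]
_, beta_fast = leave_out_delta(X, y, S)
beta_refit = np.linalg.lstsq(np.delete(X, S, 0), np.delete(y, S, 0), rcond=None)[0]
assert np.allclose(beta_fast, beta_refit)
```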
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5018. Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction
๐ฌ 22
โญ 4.33
๐ applications to neuroscience & cognitive science
๐ท๏ธ fMRI-to-Image Reconstruction Coarse-to-Fine Generation Scale-wise Autoregressive Modeling Scale-aware Neural Guidance
Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single neural embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively-aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67$\times$ faster inference, and more deterministic results than the diffusion-based baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5019. Bilateral Information-aware Test-time Adaptation for Vision-Language Models
๐ฌ 30
โญ 4.33
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Test-time Adaptation Vision Language Model
Test-time adaptation (TTA) fine-tunes models using new data encountered during inference, which enables vision-language models to handle test data with covariate shifts. Unlike training-time adaptation, TTA does not require a test-distributed validation set or consider the worst-case distribution within a given tolerance. However, previous methods primarily focus on adaptation-objective design, while the data tend to be fully utilized or simply filtered through a fixed low-entropy selection criterion. In this paper, we analyze the weakness of previous selection criteria and find that selecting only a fixed proportion of low-entropy samples fails to ensure optimal performance across various datasets and can lead the model to become over-confident in wrongly classified samples, showing unexpected overfitting to atypical features and compromising effective adaptation. To improve upon these methods, we propose Bilateral Information-aware Test-Time Adaptation (BITTA), which simultaneously leverages two distinct parts of the test inputs during adaptation. Specifically, a dynamic proportion of low-entropy samples is used to learn the core representation under covariate shifts, while high-entropy samples are adopted to unlearn atypical features. This dual approach prevents the model from undesired memorization and ensures strong performance across settings. Comprehensive experiments validate its effectiveness across various datasets and model architectures.
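As a rough illustration of the bilateral idea described above, the snippet below splits a test batch by prediction entropy into a confident subset (for the adaptation loss) and an uncertain subset (for unlearning atypical features). The fixed quantile thresholds and the entropy statistic are placeholder assumptions; BITTA's dynamic proportion rule is not specified here.

```python
import torch

def bilateral_split(logits, low_quantile=0.3, high_quantile=0.9):
    """Illustrative split of a test batch by prediction entropy.

    Low-entropy samples would drive a "learn" objective and high-entropy
    samples an "unlearn" term; the quantiles stand in for a dynamic
    proportion and are purely hypothetical.
    """
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    lo_thr = torch.quantile(entropy, low_quantile)
    hi_thr = torch.quantile(entropy, high_quantile)
    learn_mask = entropy <= lo_thr      # confident samples: fit core representation
    unlearn_mask = entropy >= hi_thr    # uncertain samples: push away atypical features
    return learn_mask, unlearn_mask, entropy
```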
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5020. FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing
๐ฌ 22
โญ 4.33
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Kolmogorov-Arnold Networks Geometric Deep Learning Group invariant neural network Weight sharing Parameter sharing Symmetries
Permutation equivariant neural networks employing parameter-sharing schemes have emerged as powerful models for leveraging a wide range of data symmetries, significantly enhancing the generalization and computational efficiency of the resulting models. Recently, Kolmogorov-Arnold Networks (KANs) have demonstrated promise through their improved interpretability and expressivity compared to traditional architectures based on MLPs. While equivariant KANs have been explored in recent literature for a few specific data types, a principled framework for applying them to data with permutation symmetries in a general context remains absent. This paper introduces Function Sharing KAN (FS-KAN), a principled approach to constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups, unifying and significantly extending previous work in this domain. We derive the basic construction of these FS-KAN layers by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup and provide a theoretical analysis demonstrating that FS-KANs have the same expressive power as networks that use standard parameter-sharing layers, allowing us to transfer well-known and important expressivity results from parameter-sharing networks to FS-KANs.
Empirical evaluations on multiple data types and symmetry groups show that FS-KANs exhibit superior data efficiency compared to standard parameter-sharing layers, by a wide margin in certain cases, while preserving the interpretability and adaptability of KANs, making them an excellent architecture choice in low-data regimes.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5021. FeDaL: Federated Dataset Learning for General Time Series Foundation Models
๐ฌ 23
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Analysis Time Series Foundation Models Federated Learning
Dataset-level heterogeneity introduces significant domain biases that fundamentally degrade the generalization of general Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethinks the from-scratch training of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a natural solution to decompose heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL's cross-dataset generalization has been extensively evaluated on real-world datasets spanning eight tasks (including various regression and classification tasks) against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5022. SimpleFold: Folding Proteins is Simpler than You Think
๐ฌ 16
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Generative models protein structure prediction
Protein folding models have achieved groundbreaking results, typically by integrating domain knowledge into both their architectural blocks and training pipelines. Nonetheless, given the success of generative models across different but related problems, it is natural to question whether these architectural designs are a necessary condition to build performant models. In this paper, we introduce SimpleFold, the first flow-matching-based protein folding model that uses solely general-purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations, or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow-matching objective with an additional structural term. We scale SimpleFold to 3B parameters and train it on approximately 9M distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold-3B achieves competitive performance compared to state-of-the-art baselines; in addition, SimpleFold demonstrates strong performance in ensemble prediction, which is typically difficult for models trained via deterministic reconstruction objectives. SimpleFold challenges the reliance on complex domain-specific architecture designs in protein folding, opening up an alternative design space for future progress.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5023. ICDiffAD: Implicit Conditioning Diffusion Model for Time Series Anomaly Detection
๐ฌ 18
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Anomaly Detection Diffusion Model Implicit Conditioning
Time series anomaly detection (TSAD) faces critical challenges from intrinsic data noisiness and temporal heterogeneity, which undermine the reconstruction fidelity of prevailing generative approaches.
While diffusion models offer theoretical advantages in capturing complex temporal dynamics, their inherent stochasticity introduces irreducible variance in reconstructions.
We present ICDiffAD, a novel method that synergizes adaptive noise scheduling with semi-deterministic generation to address these limitations. ICDiffAD introduces two key innovations:
(1) an *SNR Scheduler* that governs training through quantifiable noise scales, enabling robust learning of normative patterns across non-stationary regimes; and
(2) an *SNR Implicit Conditioning Mechanism* that initializes reverse diffusion from partially corrupted inputs, preserving signal coherence while attenuating anomalous components.
This dual strategy ensures high-fidelity reconstructions aligned with the input's manifold, reconciling generative flexibility with detection accuracy.
Across five multivariate benchmarks, ICDiffAD improves the F1 score by 20.2\% and reduces false positives by 60.23\% compared to existing diffusion model-based TSAD methods.
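The abstract describes initializing the reverse diffusion from a partially corrupted input rather than pure noise. A hedged, generic sketch of that pattern for an eps-predicting DDPM/DDIM-style model is shown below; the `denoiser` interface, the cumulative-alpha schedule, and the deterministic DDIM update are assumptions for illustration rather than ICDiffAD's exact sampler.

```python
import torch

@torch.no_grad()
def implicit_conditioned_reconstruction(denoiser, x, alphas_cumprod, t_start):
    """Sketch of reconstruction-based detection with a partially noised start.

    Instead of sampling from pure noise, the reverse process is initialised
    from the input corrupted only up to step `t_start` (a mid-range SNR),
    so normal structure is preserved while anomalous components are
    attenuated.  `denoiser(x_t, t)` is assumed to predict the added noise
    (DDPM eps-parameterisation); schedule details are placeholders.
    """
    a_bar = alphas_cumprod[t_start]
    x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * torch.randn_like(x)
    for t in range(t_start, 0, -1):
        eps = denoiser(x_t, t)
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        x0_hat = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
        x_t = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps  # DDIM, eta=0
    anomaly_score = (x - x_t).abs()   # per-feature reconstruction error
    return x_t, anomaly_score
```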
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5024. STRONGER TOGETHER: ON-POLICY REINFORCEMENT LEARNING FOR COLLABORATIVE LLMS
๐ฌ 24
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM multi agent system reinforcement learning
Multi-Agent System (MAS) and Reinforcement Learning (RL) are both widely adopted to improve large language model (LLM) agentic performance. MAS strengthens task-specialized performance via role-based orchestration; RL leverages environment rewards to train stronger policies, such as Group Relative Policy Optimization (GRPO)-style optimization. Yet applying on-policy RL training to MAS is underexplored. While promising, it poses several challenges. On the algorithm side, standard GRPO grouping assumptions fail in MAS because prompts differ by role and turn. On the system side, the training system needs to support MAS-workflow-based rollouts and on-policy updates for both single and multiple policy models. To address these issues, we introduce AT-GRPO, consisting of (i) an Agent- and Turn-wise grouped RL algorithm tailored for MAS and (ii) a system to support both single-policy and multi-policy training. Across game, planning, coding, and math tasks, AT-GRPO demonstrates substantial performance gains. Especially on long-horizon planning tasks, AT-GRPO boosts accuracy from a 14.0–47.0% single-agent RL baseline to 96.0–99.5%. Furthermore, it improves reasoning performance, with an average gain of 3.87–7.62% on coding and 9.0–17.93% on math.
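For readers unfamiliar with GRPO-style grouping, the toy sketch below normalizes rewards within groups keyed by (agent role, turn), so completions from different roles or turns are never compared against each other. It only illustrates the grouping idea; AT-GRPO's actual grouping rule and training system are more involved.

```python
from collections import defaultdict
import statistics

def grouped_advantages(rollouts):
    """Toy GRPO-style advantages grouped by (agent role, turn).

    `rollouts` is a list of dicts with keys "agent", "turn", and "reward".
    Standard GRPO normalises rewards within a group of completions of the
    same prompt; here the group key is (agent, turn) so that different
    roles and turns are never mixed.  Hypothetical interface, illustration only.
    """
    groups = defaultdict(list)
    for r in rollouts:
        groups[(r["agent"], r["turn"])].append(r["reward"])
    advantages = []
    for r in rollouts:
        rs = groups[(r["agent"], r["turn"])]
        mu = statistics.fmean(rs)
        sd = statistics.pstdev(rs) or 1.0   # avoid division by zero for degenerate groups
        advantages.append((r["reward"] - mu) / sd)
    return advantages
```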
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5025. Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning
๐ฌ 9
โญ 4.00
๐ reinforcement learning
๐ท๏ธ Offline Goal-Conditioned Reinforcement Learning Reward Shaping
While offline goal-conditioned reinforcement learning (GCRL) provides a simple recipe to train generalist policies from large unlabeled datasets, offline GCRL agents trained with sparse rewards typically struggle on long-horizon tasks. Manually designing task-specific reward functions undermines the simplicity, scalability, and generality of this paradigm. Moreover, prior approaches to learning rewards for effective credit assignment fail to adequately capture goal-reaching information as tasks scale in complexity. To address this gap, we propose $\textrm{\textbf{Occupancy Reward Shaping (ORS)}}$, a novel reward-shaping approach that leverages a learned occupancy measure, a distribution that naturally captures complex long-horizon temporal dependencies between states, and distills goal-reaching information from the occupancy measure into a general-purpose reward function for effective credit assignment. We demonstrate that ORS achieves a $\mathbf{2.3\times}$ improvement in performance on average over its base RL algorithm across a diverse set of long-horizon locomotion and manipulation tasks and outperforms prior state-of-the-art methods.
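The shaping rule itself is not given in this listing. As background, one standard way to use a learned occupancy-style quantity is potential-based shaping, sketched below with a hypothetical `log_occupancy(s, g)` estimator; this keeps the optimal policy unchanged while densifying credit assignment, but it should not be read as ORS's actual distillation objective.

```python
def shaped_reward(r_env, s, s_next, goal, log_occupancy, gamma=0.99):
    """Generic potential-based shaping with an occupancy-derived potential.

    `log_occupancy(s, g)` is assumed to return the log of a learned
    goal-reaching occupancy measure; using it as a potential densifies
    rewards without changing the optimal policy.  Illustration only.
    """
    phi_s = log_occupancy(s, goal)
    phi_next = log_occupancy(s_next, goal)
    return r_env + gamma * phi_next - phi_s
```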
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5026. From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
๐ฌ 15
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Video Object-Centric Learning Representation Learning Object-Centric Learning Unsupervised Learning
Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes.
However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder.
We identify that this discrepancy gives rise to a vicious cycle; the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks high-frequency details necessary to supervise encoder features.
To break this cycle, we introduce Synergistic Representation Learning (SRL) that establishes a virtuous cycle where the encoder and decoder mutually refine one another.
SRL leverages the encoder's sharpness to deblur the semantic boundary within the decoder output, while exploiting the decoder's spatial consistency to denoise the encoder's features.
This mutual refinement process is stabilized by a warm-up phase with a slot regularization objective that initially allocates distinct entities per slot.
By bridging the representational gap between the encoder and decoder, our approach achieves state-of-the-art results on challenging video object-centric learning benchmarks.
Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5027. MOLM: Mixture of LoRA Markers
๐ฌ 17
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Watermarking Diffusion models
Generative models can generate photorealistic images at scale. This raises urgent concerns about the ability to detect synthetically generated images and attribute these images to specific sources. While watermarking has emerged as a possible solution, existing methods remain fragile to realistic distortions, susceptible to adaptive removal, and expensive to update when the underlying watermarking key changes. We propose a general watermarking framework that formulates the encoding problem as key-dependent perturbation of the parameters of a generative model. Within this framework, we introduce Mixture of LoRA Markers (MOLM), a routing-based instantiation in which binary keys activate lightweight LoRA adapters inside residual and attention blocks. This design avoids key-specific re-training and achieves the desired properties such as imperceptibility, fidelity, verifiability, and robustness. Experiments on Stable Diffusion and FLUX show that MOLM preserves image quality while achieving robust key recovery against distortions, compression and regeneration, averaging attacks, and black-box adversarial attacks on the extractor.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5028. HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation
๐ฌ 16
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Representation learning Multimodal learning Contrastive learning Manifold learning Hierarchical modeling Geospatial AI
Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on large-scale retrieval, which requires storing a large number of image embeddings; grid-based classifiers, which ignore geographic continuity; or generative models, which diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning, which incorporates haversine distance into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, on which our method establishes new state-of-the-art performance. Compared to current methods in the literature, it reduces mean geodesic error by 19.5\% while improving fine-grained subregion accuracy by 43\%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5029. Guided Speculative Inference for Efficient Test-Time Alignment of LLMs
๐ฌ 10
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test-Time Scaling LLMs Large Language Models Speculative Decoding Inference Inference-Time Scaling Best-of-n Soft Best-of-n PRM Reward Models Reward Guidance KL Regularization GSI
We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models.
GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate both the optimal tilted policy
$\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y))$ of soft best-of-$n$ under the base model $\pi_B$, as well as the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K) and across different model families, our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$, while reducing end-to-end latency by up to 28%.
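For context, a minimal sketch of the soft best-of-$n$ baseline that GSI builds on is shown below: draw $n$ candidates, then select one with probability proportional to $\exp(\beta r)$. The `sample_fn` and `reward_fn` interfaces are placeholders, and GSI's additional machinery involving the base model $\pi_B$ is omitted.

```python
import math
import random

def soft_best_of_n(prompt, sample_fn, reward_fn, n=8, beta=4.0):
    """Soft best-of-n: sample n candidates, pick one ~ exp(beta * reward).

    `sample_fn(prompt)` draws a completion from the small model pi_S and
    `reward_fn(prompt, y)` scores it; beta -> infinity recovers hard
    best-of-n.  Sketch only; GSI's full procedure is not reproduced here.
    """
    ys = [sample_fn(prompt) for _ in range(n)]
    rs = [reward_fn(prompt, y) for y in ys]
    m = max(rs)
    ws = [math.exp(beta * (r - m)) for r in rs]   # shift by max for numerical stability
    return random.choices(ys, weights=ws, k=1)[0]
```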
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5030. Batch Pruning by Activation Stability
๐ฌ 24
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Batch Pruning Activation Stability Convolutional Neural Network CNN Activation Deep Learning
Training deep neural networks remains costly in terms of data, time, and energy, limiting their deployment in large-scale and resource-constrained settings. To address this, we propose Batch Pruning by Activation Stability (B-PAS), a dynamic plug-in strategy that accelerates training by adaptively removing batches of data that contribute less to learning. B-PAS monitors the stability of activation feature maps across epochs and prunes batches whose activation variance shows minimal change, indicating diminishing learning utility. Applied to ResNet-18, ResNet-50, and the Convolutional vision Transformer (CvT) on CIFAR-10, CIFAR-100, SVHN, and ImageNet-1K, B-PAS reduces training batch usage by up to 57\% with no loss in accuracy, and by 47\% while slightly improving accuracy. Moreover, it achieves as much as 61\% savings in GPU node-hours, outperforming prior state-of-the-art pruning methods with up to 29\% higher data savings and 21\% greater GPU node-hour savings. These results highlight activation stability as a powerful internal signal for efficient training via batch removal, offering a practical and sustainable path toward data- and energy-efficient deep learning.
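A minimal sketch of the kind of activation-stability check described above: per batch id, track the variance of a chosen feature map across epochs and mark the batch prunable once its relative change stalls. The layer choice, the variance statistic, and the tolerance are assumptions, not the exact B-PAS criterion.

```python
import torch

class ActivationStabilityTracker:
    """Sketch of pruning batches whose activation statistics stop changing.

    The statistic (feature-map variance), the comparison across epochs,
    and `tol` are illustrative assumptions only.
    """
    def __init__(self, tol=0.02):
        self.tol = tol
        self.prev_var = {}

    def should_prune(self, batch_id, feature_map):
        var = feature_map.float().var().item()
        prev = self.prev_var.get(batch_id)
        self.prev_var[batch_id] = var
        if prev is None:
            return False                      # need at least one prior epoch
        rel_change = abs(var - prev) / (abs(prev) + 1e-12)
        return rel_change < self.tol          # stable activations -> prunable
```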
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5031. Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling
๐ฌ 17
โญ 4.00
๐ generative models
๐ท๏ธ Guided Diffusion Sampling Plug-and-Play Conditional Diffusion Sampling Adaptive Moment Estimation
Guided diffusion sampling relies on approximating intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.
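The idea of stabilizing noisy likelihood scores with adaptive moments can be sketched generically as an Adam-style smoothing of the per-step guidance gradient, as below; the exact way the smoothed estimate enters the sampler is method-specific and not reproduced here.

```python
import torch

def adaptive_moment_guidance(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style smoothing of a noisy per-step guidance gradient.

    `grad` is the raw likelihood-score estimate (e.g., from a measurement
    likelihood in image restoration); `state` carries running moments
    across sampling steps.  Returning m_hat / (sqrt(v_hat) + eps) in place
    of the raw gradient is the generic adaptive-moment idea; the paper's
    exact scaling may differ.
    """
    state.setdefault("t", 0)
    state.setdefault("m", torch.zeros_like(grad))
    state.setdefault("v", torch.zeros_like(grad))
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (v_hat.sqrt() + eps)
```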
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5032. The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
๐ฌ 10
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ language agents tool use benchmark
Real-world language agents must handle complex, multi-step workflows across diverse applications. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database like BigQuery to detect anomalies and generate reports following a standard operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance.
To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse applications and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional applications like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we revised or implemented ourselves where necessary. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as multiple Canvas courses each with dozens of students or real-world financial spreadsheets. The Toolathlon benchmark includes 108 manually sourced or crafted tasks in total, requiring interaction with multiple applications over ~20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of state-of-the-art models highlights their significant shortcomings in performing real-world, long-horizon tasks: the best-performing model, Claude-4-Sonnet, achieves only a 29.9% success rate with 28 tool-calling turns on average, while the top open-weights model, DeepSeek-V3.1, reaches 13.9%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5033. To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking
๐ฌ 24
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ equivariance symmetry symmetry breaking canonicalization data augmentation
Symmetry-aware methods for machine learning, such as data augmentation and equivariant architectures, encourage correct model behavior on all transformations (e.g. rotations or permutations) of the original dataset. These methods can impart improved generalization and sample efficiency, under the assumption that the transformed datapoints are highly probable, or "important", under the test distribution. In this work, we develop a method for critically evaluating this assumption. In particular, we quantify the amount of anisotropy, or symmetry-breaking, in a dataset via a two-sample neural classifier test that distinguishes between the original dataset and its randomly augmented equivalent. We validate our metric on synthetic datasets, and then use it to uncover surprisingly high degrees of anisotropy in several benchmark point cloud datasets. In theory, this kind of distributional symmetry-breaking can actually preclude invariant methods from performing optimally even when the underlying labels truly are invariant, as we show for invariant ridge regression in the infinite-feature limit. In practice, we find that the implication for symmetry-aware methods is dataset-dependent: equivariant methods still impart benefits on some anisotropic datasets, but not others. Overall, these findings suggest that understanding equivariance, both when it works and why, may require rethinking symmetry biases in the data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5034. Knowledge Editing with Subspace-Aware Key-Value Mappings
๐ฌ 29
โญ 4.00
๐ interpretability and explainable AI
๐ท๏ธ Knowledge Editing Large Language Model Subspace
Knowledge editing aims to efficiently correct factual errors in Language Models (LMs). The popular locate-then-edit approach modifies an MLP layer by finding an optimal mapping between its input vector (key) and output vector (value) that leads to the expression of the edited knowledge. However, existing methods without any constraints on the key and value vectors cause significant perturbations to the edited model. To address this, we propose Subspace Knowledge Edit (SUIT), a method that identifies and modifies only the subspace of critical features relevant to the edit. Our empirical results on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B models show that SUIT dramatically improves knowledge preservation over strong baselines while maintaining high edit efficacy. This effectiveness confirms that SUIT successfully identifies the critical subspace for the edit. Further analyses provide additional validation for our approach. The source code and data will be released to the public upon publication of the paper.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5035. Outrageously Large Context Windows via RACE Attention -- A Family of Non-Linear Attention that can be calculated in Strictly Linear-Time
๐ฌ 16
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Sketching Locality Sensitive Hashing RACE Attention Linear Transformers
Softmax Attention has a quadratic time complexity in sequence length, which becomes prohibitive at long contexts, even with highly optimized GPU kernels. For example, FlashAttention2 and FlashAttention3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds $\sim4$ million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines while reducing wall-clock time and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon® Gold 5220R CPU, well beyond the practical limits of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today's hardware.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5036. DeMo: Decoupled Momentum Optimization
๐ฌ 22
โญ 4.00
๐ optimization
๐ท๏ธ deep learning large language models optimization training generative models pre-training foundational models distributed training
Scaling neural network training increasingly depends on synchronous data parallelism, yet full-precision gradient all-reduce imposes a severe communication bottleneck. We propose Decoupled Momentum Optimization (DeMo), a drop-in replacement for any momentum-based optimizer that significantly reduces communication bandwidth while maintaining convergence. DeMo (i) decouples local momentum updates, (ii) applies a fast orthonormal transform (e.g., DCT) followed by top-$k$ sparsification, and (iii) reuses the momentum buffer for error feedback via momentum subtraction. This design reduces per-step communication by up to two orders of magnitude with minimal computational overhead. Experiments on 300M- and 1B-parameter language models show that DeMo transmits up to 85× less data per GPU than AdamW-DDP while achieving comparable loss and accuracy. DeMo is topology-agnostic and enables training across multi-datacenter or Ethernet-based setups.
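A single-tensor sketch of the three steps listed above (local momentum update, orthonormal transform plus top-$k$ sparsification, and error feedback by subtraction) is given below using SciPy's DCT. The all-reduce of the sparse coefficients and the optimizer update itself are omitted, and the transform and selection details are assumptions for illustration.

```python
import torch
from scipy.fft import dctn, idctn

def demo_step(momentum, grad, k, beta=0.9):
    """One DeMo-style step for a single CPU parameter tensor (sketch only).

    The local momentum buffer is updated, mapped through an orthonormal
    DCT, and only the k largest-magnitude coefficients are kept as the
    quantity that would be all-reduced across workers; the inverse
    transform of that transmitted part is subtracted from the buffer so
    the residual acts as error feedback.
    """
    momentum.mul_(beta).add_(grad)                          # (i) local momentum update
    coeffs = torch.from_numpy(dctn(momentum.numpy(), norm="ortho"))
    flat = coeffs.flatten()
    idx = flat.abs().topk(k).indices                        # (ii) top-k sparsification
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    transmitted = torch.from_numpy(
        idctn(sparse.view_as(coeffs).numpy(), norm="ortho")).to(momentum.dtype)
    momentum.sub_(transmitted)                              # (iii) error feedback
    return sparse.view_as(coeffs), momentum
```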
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5037. MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs
๐ฌ 15
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ chemical language model chemical reasoning model chemistry large language model molecular graph molecular structure
Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. A molecule's properties are fundamentally determined by its composition and structure, encoded in its molecular graph; thus, reasoning about molecular properties requires understanding and reasoning over the molecular structure. Yet, most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions.
We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks.
MolecularIQ spans three orthogonal axes (molecular complexity, multi-task load, and reasoning complexity), covering feature counting, index-based feature attributions, and constrained generation.
MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and produces capability fingerprints that localize model failures to specific tasks and molecular regimes. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.
On MolecularIQ, large MoE models with higher reasoning budgets lead across categories, while chemistry-tuned LLMs underperform their generalist bases, indicating limited transfer from narrow task fine-tuning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5038. When and Where to Reset Matters for Long-Term Test-Time Adaptation
๐ฌ 27
โญ 4.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Test-Time Adaptation Continual Test-Time Adaptation
When continual test-time adaptation (TTA) persists over the long term, errors accumulate in the model and eventually lead it to predict only a few classes regardless of the input, a failure known as model collapse. Recent studies have explored reset strategies that erase these accumulated errors completely. However, their periodic resets lead to suboptimal adaptation, as they occur independently of collapse. Also, their full resets cause the catastrophic loss of knowledge acquired over time, even though it could be beneficial in the future. To this end, we propose 1) an Adaptive and Selective Reset (ASR) scheme that dynamically determines when and where to reset, 2) an importance-aware regularizer to recover essential knowledge lost from resets, and 3) an on-the-fly adaptation adjustment scheme to enhance adaptability under challenging domain shifts. Extensive experiments across long-term TTA benchmarks demonstrate the effectiveness of our approach, particularly under challenging conditions. Our code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5039. FastVGGT: Fast Visual Geometry Transformer
๐ฌ 24
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ 3D reconstruction
Scaling visual geometry transformers for long image sequences poses a significant computational and memory challenge. In this work, we diagnose this issue in the state-of-the-art model VGGT, and trace the primary bottleneck to its Global Attention layer. Our analysis reveals a "token collapse" phenomenon, where many tokens attend to nearly identical regions, resulting in redundant computation and inefficiency. Motivated by this finding, we propose FastVGGT, a training-free framework that strategically prunes these redundant tokens. Instead of uniform merging, FastVGGT employs a tailored, three-part token partitioning strategy. It preserves initial-frame tokens as a stable global reference, retains salient tokens to maintain fine details, and utilizes region-based random sampling to ensure spatially balanced coverage. Extensive experiments on multiple 3D geometry benchmarks validate our approach's effectiveness. Notably, on sequences of 1000 images, FastVGGT achieves a 4$\times$ speedup over the original VGGT while simultaneously mitigating error accumulation, demonstrating its efficiency and robustness for long-sequence scenarios. For further details, please visit our project page: https://fastvggt.github.io/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5040. DADA: Dual Averaging with Distance Adaptation
๐ฌ 9
โญ 4.00
๐ optimization
๐ท๏ธ Adaptive Optimization Universal Gradient Method Dual Averaging
We present a novel parameter-free universal gradient method for solving convex optimization problems. Our algorithm, Dual Averaging with Distance Adaptation (DADA), is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on the observed gradients and the distance from its iterates to the starting point, without needing to know any problem-specific parameters. DADA is a universal algorithm that simultaneously works for a wide range of problem classes as long as one is able to bound the local growth of the objective around its minimizer. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, Hölder-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and (L0, L1)-smooth functions. Furthermore, in contrast to many existing methods, DADA is suitable not only for unconstrained problems but also for constrained ones, possibly with unbounded domain, and it requires fixing neither the number of iterations nor the accuracy in advance.
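To make the dual-averaging structure concrete, the sketch below runs iterates of the form x_{k+1} = x0 - s_k / beta_k with a coefficient adapted from accumulated gradient norms and the largest observed distance of the iterates from the start. This specific schedule is a plausible stand-in chosen for illustration, not DADA's actual coefficient rule.

```python
import numpy as np

def dada_like(grad_fn, x0, iters=1000, eps=1e-6):
    """Illustrative parameter-free dual-averaging loop (stand-in schedule).

    `grad_fn(x)` returns a (sub)gradient as a NumPy array; `eps` seeds the
    distance estimate so the first steps are well defined.  The beta_k rule
    below is an assumption for illustration only.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    s = np.zeros_like(x0)
    grad_sq_sum, max_dist = 0.0, eps
    for _ in range(iters):
        g = grad_fn(x)
        s += g                                       # running gradient sum
        grad_sq_sum += float(g @ g)
        max_dist = max(max_dist, float(np.linalg.norm(x - x0)))
        beta = np.sqrt(grad_sq_sum) / max_dist       # distance-adaptive coefficient
        x = x0 - s / beta                            # dual-averaging iterate
    return x
```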
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5041. Joint Discriminative-Generative Modeling via Dual Adversarial Training
๐ฌ 25
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Joint Modeling Energy-Based Models (EBMs) Adversarial Training Robust Classification Generative Modeling PGD Attacks explainability
Simultaneously achieving robust classification and high-fidelity generative modeling within a single framework presents a significant challenge. Hybrid approaches, such as Joint Energy-Based Models (JEM), interpret classifiers as EBMs but are often limited by the instability and poor sample quality inherent in Stochastic Gradient Langevin Dynamics (SGLD)-based training. We address these limitations by proposing a novel training framework that integrates adversarial training (AT) principles for both discriminative robustness and stable generative learning. The proposed method introduces three key innovations: (1) the replacement of SGLD-based JEM learning with a stable, AT-based approach that optimizes the energy function by discriminating between real data and Projected Gradient Descent (PGD)-generated contrastive samples using the BCE loss; (2) synergistic adversarial training for the discriminative component that enhances classification robustness while eliminating the need for explicit gradient penalties; and (3) a two-stage training strategy that addresses normalization-related instabilities and enables leveraging pretrained robust classifiers, generalizing effectively across diverse architectures. Experiments on CIFAR-10/100 and ImageNet demonstrate that our approach: (1) is the first EBM-based hybrid to scale to high-resolution datasets with high training stability, simultaneously achieving state-of-the-art discriminative and generative performance on ImageNet 256$\times$256; (2) uniquely combines generative quality with adversarial robustness, enabling critical applications like robust counterfactual explanations; and (3) functions as a competitive standalone generative model, matching the generative quality of autoregressive methods (VAR-d16) and surpassing diffusion models while offering unique versatility.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5042. Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport
๐ฌ 12
โญ 4.00
๐ generative models
๐ท๏ธ Benchmark Schrödinger Bridge Entropic Optimal Transport Optimal Transport Unpaired Learning Discrete Spaces Discrete Diffusion Models Generative Modeling
The Entropic Optimal Transport (EOT) problem and its dynamic counterpart, the Schrödinger bridge (SB) problem, play an important role in modern machine learning, linking generative modeling with optimal transport theory. While recent advances in discrete diffusion and flow models have sparked growing interest in applying SB methods to discrete domains, there is still no reliable way to evaluate how well these methods actually solve the underlying problem. We address this challenge by introducing a benchmark for SB on discrete spaces. Our construction yields pairs of probability distributions with analytically known SB solutions, enabling rigorous evaluation. As a byproduct of building this benchmark, we obtain two new SB algorithms, DLightSB and DLightSB-M, and additionally extend prior related work to construct the $\alpha$-CSBM algorithm. We demonstrate the utility of our benchmark by evaluating both existing and new solvers in high-dimensional discrete settings. This work provides the first step toward proper evaluation of SB methods on discrete spaces, paving the way for more reproducible future studies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5043. sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals
๐ฌ 32
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Contrastive Learning Physiological Signal Sleep Medicine
Tasks ranging from sleep staging to clinical diagnosis traditionally rely on standard polysomnography (PSG) devices, bedside monitors and wearable devices, which capture diverse nocturnal biosignals (e.g., EEG, EOG, ECG, SpO$_2$). However, heterogeneity across devices and frequent sensor dropout pose significant challenges for unified modelling of these multimodal signals. We present sleep2vec, a foundation model for diverse and incomplete nocturnal biosignals that learns a shared representation via cross-modal alignment. sleep2vec is contrastively pre-trained on 42,249 overnight recordings spanning nine modalities using a Demography, Age, Site & History-aware InfoNCE objective that incorporates physiological and acquisition metadata (e.g., age, gender, recording site) to dynamically weight negatives and mitigate cohort-specific shortcuts. On downstream sleep staging and clinical outcome assessment, sleep2vec consistently outperforms strong baselines and remains robust to any subset of available modalities and sensor dropout. We further characterize, to our knowledge for the first time, scaling laws for nocturnal biosignals with respect to modality diversity and model capacity. Together, these results show that unified cross-modal alignment, coupled with principled scaling, enables label-efficient, general-purpose modelling of real-world nocturnal biosignals.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5044. MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning
๐ฌ 18
โญ 4.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Visual Reasoning Agent Neuro-Symbolic
Recent vision-language models have strong perceptual abilities, but their implicit reasoning is hard to explain and easily produces hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or a hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system for visual reasoning structured as a hierarchical finite-state automaton whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding a transparent execution history. To supervise the hyper agent's transition policy, we build transition-trajectory trees and transform them into memory-to-next-state pairs, forming the MATA-SFT-90K dataset for supervised finetuning (SFT). The finetuned LLM, acting as the transition policy, understands the query and the capabilities of the agents, and can efficiently choose the optimal agent to solve the task. Across multiple visual reasoning benchmarks, MATA achieves state-of-the-art results compared with monolithic and compositional baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5045. Complexity- and Statistics-Guided Anomaly Detection in Time Series Foundation Models
๐ฌ 13
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Timeseries anomaly detection Timeseries foundation model Reconstruction based anomaly detection
This paper introduces a methodology for anomaly detection in time series using Time Series Foundation Models (TFMs). While TFMs have achieved strong success in forecasting, their role in anomaly detection remains underexplored. We identify two key challenges when applying TFMs to reconstruction-based anomaly detection and propose solutions.
The first challenge is overgeneralization, where TFMs reconstruct both normal and abnormal data with similar accuracy, masking true anomalies. We find that this effect often occurs in data with strong low-frequency components. To address it, we propose a complexity metric, $\alpha$, that reflects how difficult the data is for TFMs and design a Complexity-Aware Ensemble (CAE) that adaptively balances TFMs with a statistical model.
The second challenge is overstationarization, caused by instance normalization layers that improve forecasting accuracy but remove essential statistical features such as mean and variance, which are critical for anomaly detection. We resolve this by reintroducing these features into the reconstruction process without retraining the TFMs.
Experiments on 23 univariate benchmark datasets demonstrate that our method significantly outperforms both deep learning and statistical baselines. Furthermore, we show that our complexity-based metric, $\alpha$, provides a theoretical foundation for improved anomaly detection, and we briefly explore prediction-based anomaly detection using TFMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5046. Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
๐ฌ 35
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ VisionโLanguage Models Error Notebook Specification-Aware Part Retrieval CoT Reasoning Human-Preference Dataset
Effective specification-aware part retrieval within complex CAD assemblies is essential for automated engineering tasks. However, using LLMs/VLMs for this task is challenging: the metadata sequences often exceed token budgets, and fine-tuning high-performing proprietary models (e.g., GPT, Gemini) is unavailable. Therefore, we need a framework that delivers engineering value by handling long, non-natural-language metadata associated with real 3D assemblies. We propose an inference-time adaptation framework that combines corrected Error Notebooks with RAG to substantially improve VLM-based part retrieval. Each Error Notebook is built by correcting initial CoTs through reflective refinement, and then filtering each trajectory using a grammar-constraint (GC) verifier to ensure structural well-formedness. The resulting notebook forms a high-quality repository of specification-CoT-answer triplets, from which RAG retrieves specification-relevant exemplars to condition the model's inference. We additionally contribute a CAD dataset with preference annotations. Experiments with proprietary models (GPT-4o, Gemini, etc.) show large gains, with GPT-4o (Omni) achieving up to +23.4 absolute accuracy points on the human-preference benchmark. The proposed GC verifier can further add +4.5 accuracy points. Our approach also surpasses other training-free baselines (standard few-shot learning, self-consistency) and yields substantial improvements for open-source VLMs (Qwen2-VL-2B-Instruct, Aya-Vision-8B). Under the cross-model GC setting, where the Error Notebook is constructed using GPT-4o (Omni), the 2B model inference achieves performance that comes within roughly 4 points of GPT-4o mini.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5047. Decomposing Extrapolative Problem Solving: Spatial Transfer and Length Scaling with Map Worlds
๐ฌ 14
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Compositional generalization problem solving
Someone who learns to walk shortest paths in New York can, upon receiving a map of Paris, immediately apply the same rule to navigate, despite never practicing there. This ability to recombine known rules to solve novel problems exemplifies compositional generalization (CG), a hallmark of human cognition. Yet our understanding of what drives the success or failure of such extrapolative problem solving, particularly the roles of training data properties and optimization paradigms, remains limited. In this work, we introduce a controlled map-navigation testbed that cleanly separates two dimensions of CG: ***spatial transfer*** (systematicity across environments) and ***length scaling*** (productivity along problem difficulty). Through quantitative experiments, we show that transfer is enabled by sufficient distinct questions with high coverage and modest diversity, while scaling critically depends on exposure to neighboring-but-longer examples. Finally, we find that reinforcement learning (RL) stabilizes optimization but does not surpass the ceiling set by supervised fine-tuning (SFT). Together, these results provide principled insights into how data properties and training paradigms shape extrapolative problem solving.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5048. Symmetric Space Learning for Combinatorial Generalization
๐ฌ 13
โญ 4.00
๐ generative models
๐ท๏ธ Generative Model Generalization Combinatorial Generalization Machine Learning Manifold Learning Representation Learning
Combinatorial generalization (CG)โgeneralizing to unseen combinations of known semantic factorsโremains a grand challenge in machine learning.
While symmetry-based methods are promising, they learn from observed data and thus fail at what we term $\textbf{symmetry generalization}$: extending learned symmetries to novel data.
We tackle this by proposing a novel framework that endows the latent space with the structure of a $\textbf{symmetric space}$, a class of manifolds whose geometric properties provide a principled way to extend these symmetries.
Our method operates in two steps: first, it imposes this structure by learning the underlying algebraic properties via the $\textbf{Cartan decomposition}$ of a learnable Lie algebra.
Second, it uses $\textbf{geodesic symmetry}$ as a powerful self-supervisory signal to ensure this learned structure extrapolates from observed samples to unseen ones.
A detailed analysis on a synthetic dataset validates our geometric claims, and experiments on standard CG benchmarks show our method significantly outperforms existing approaches.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5049. On Optimal Hyperparameters for Differentially Private Deep Transfer Learning
๐ฌ 13
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ differential privacy hyperparameters deep learning transfer learning
Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints.
We focus on two key hyperparameters in this setting: the clipping bound $C$ and batch size $B$.
We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires smaller $C$) and empirical outcomes (larger $C$ performs better under strong privacy), caused by changes in the gradient distributions.
Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning $B$ do not work, while cumulative DP noise better explains whether smaller or larger batches perform better.
We also highlight how the common practice of using a single $(C,B)$ setting across tasks can lead to suboptimal performance.
We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5050. Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering
๐ฌ 19
โญ 4.00
๐ optimization
๐ท๏ธ activation steering generation diversity manifold optimization
Language models often default to a narrow set of high-probability outputs, leaving their generation paths homogeneous and prone to mode collapse. Sampling-based strategies inject randomness but still struggle to guarantee diversity across multiple concurrent generation runs. We address this limitation by introducing STAR (**St**iefel-based **A**ctivation Steering for Diverse **R**easoning), a training-free, inference-time intervention method that transforms activation steering into an exploration engine. At each token, STAR collects the hidden activations of concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold. STAR maximizes the geometric volume of the steered activations, while the Stiefel manifold induces orthogonality of the steering interventions. This formulation explicitly promotes divergent activation vectors of concurrent generation runs, and implicitly promotes divergent generation trajectories. This manifold optimization formulation can be solved using a Riemannian gradient descent algorithm with convergence guarantees, but this algorithm is too time-consuming for real-time inference. To guarantee low latency, we further design a lightweight one-step update with an aggressive, closed-form stepsize. For test case generation and scientific discovery benchmarks, STAR consistently outperforms standard sampling methods, achieving greater diversity without sacrificing qualitative performance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5051. Physics-Inspired All-Pair Interaction Learning for 3D Dynamics Modeling
๐ฌ 13
โญ 4.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ 3D Dynamics Prediction Attention Mechanism
Modeling 3D dynamics is a fundamental problem in multi-body systems across scientific and engineering domains and has important practical implications in trajectory prediction and simulation. While recent GNN-based approaches have achieved strong performance by enforcing geometric symmetries, encoding high-order features or incorporating neural-ODE mechanics, they typically depend on explicitly observed structures and inherently fail to capture the unobserved interactions that are crucial to complex physical behaviors and dynamics mechanism. In this paper, we propose PAINET, a principled SE(3)-equivariant neural architecture for learning all-pair interactions in multi-body systems. The model comprises: (1) a novel physics-inspired attention network derived from the minimization trajectory of an energy function, and (2) a parallel decoder that preserves equivariance while enabling efficient inference.
Empirical results on diverse real-world benchmarks, including human motion capture, molecular dynamics, and large-scale protein simulations, show that PAINET consistently outperforms recently proposed models, yielding 4.7% to 41.5% error reductions in 3D dynamics prediction with comparable computation costs in terms of time and memory. Our codes, baseline models and datasets are available at https://anonymous.4open.science/r/PAINET-4668.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5052. MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
๐ฌ 12
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ RLVR RL Reasoning Math LLM Evaluation
With the advent of DeepSeek-R1, a new wave of reinforcement learning (RL) methods has emerged that seem to unlock stronger mathematical reasoning. However, a closer look at the open-source ecosystem reveals a critical limitation: with sufficiently many draws (e.g., $\texttt{pass@1024}$), existing base models already solve nearly all questions on widely used math benchmarks such as MATH-500 and AIME 2024. This suggests that the RL fine-tuning methods prevalent in the LLM reasoning literature largely sharpen existing solution modes rather than discovering entirely new ones. Such sharpening stands in contrast to the broader promise of RL: to foster exploration and to acquire new skills. To move beyond this plateau, we introduce MATH-Beyond (MATH-B), a benchmark deliberately constructed to defeat common open-source models of up to 8B parameters even under large sampling budgets. Improving performance on our benchmark via RL requires methods that learn to reason in ways that go beyond base model capabilities in repeated sampling. Since the problems are drawn from subsets of DAPO-Math-17K and DeepScaleR datasets, they remain topically equivalent to standard high-school math. Validating our premise, RL fine-tuned models such as Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B-Preview perform poorly on MATH-B at $\texttt{pass@1024}$, showing how existing approaches fall short on tackling harder instances. We hope MATH-B will catalyze exploration-driven RL approaches that elicit deeper reasoning capabilities.
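For readers unfamiliar with the $\texttt{pass@k}$ numbers quoted above, the standard unbiased estimator from the code-generation literature is easy to state; the sketch below is generic and not specific to MATH-B.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples drawn, c of them correct,
    estimate the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem stays unsolved at pass@1024 only if no correct sample appears at all;
# the estimate rises quickly once even a few of the n samples are correct.
print(pass_at_k(n=2048, c=0, k=1024))
print(pass_at_k(n=2048, c=3, k=1024))
```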
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5053. GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning
๐ฌ 14
โญ 4.00
๐ reinforcement learning
๐ท๏ธ inverse reinforcement learning large language models evolution
Inverse Reinforcement Learning (IRL) aims to recover reward models from expert demonstrations, but traditional methods yield "black-box" models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using code Large Language Models (LLMs) within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically demonstrate that GRACE can efficiently learn highly accurate rewards in multi-task setups defined by two benchmarks, BabyAI and AndroidWorld. Further, we demonstrate that the resulting reward leads to strong policies compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5054. Reinforcement Learning from Dynamic Critic Feedback for Free-Form Generations
๐ฌ 12
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ large language model reinforcement learning dynamic critics language model post-training open-ended generation
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning from Dynamic Critic Feedback (RLDCF), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic jointly. By training both the generator and the critic, this game enhances the critic's error detection and the generator's output quality while reducing required verifications. Our experiments demonstrate that RLDCF improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLDCF for scaling RL post-training to free-form generation tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5055. RFS: Reinforcement learning with Residual flow steering for dexterous manipulation
๐ฌ 4
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robotics reinforcement learning sim-to-real
Imitation learning has been an effective tool for bootstrapping sequential decision making behavior, showing surprisingly strong results as methods are scaled up to high-dimensional, dexterous problems in robotics. These ``behavior cloning'' methods have been further bolstered by the integration of generative modeling techniques such as diffusion modeling or flow matching for training expressive multimodal behavior policies. However, these pretrained models do not always generalize perfectly, and require finetuning to maximize deployment-time performance. This finetuning procedure must retain the strengths of pretraining for exploration, while being able to quickly correct for local inaccuracies in model performance. In this work, we propose an efficient reinforcement learning (RL) framework for fast adaptation of pretrained generative policies. Specifically, our proposed methodology, residual flow steering, instantiates an efficient RL technique that quickly adapts a pretrained flow-matching model by jointly optimizing a policy that selects both a latent noise distribution and a residual action. Doing so allows policies to perform both local exploration (residual actions) and global exploration (latent noise), enabling data-efficient adaptation. We demonstrate that this technique is effective for dexterous manipulation problems, serving both as a tool to pretrain behaviors in simulation and to efficiently finetune them in the real world.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5056. Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing
๐ฌ 3
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Large Language Models Data Attribution Model Auditing
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to prohibitively noisy signals and computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representations and their gradients, which operates directly in the model's activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method for tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs, paving the way for more reliable and aligned AI systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5057. Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
๐ฌ 15
โญ 4.00
๐ generative models
๐ท๏ธ Video Generative Model Video Diffusion Model Intuitive Physics Understanding
Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human
preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale up.
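A minimal sketch of the likelihood-preference idea, assuming a caller-supplied `denoising_loss` that averages the diffusion denoising loss (the ELBO surrogate) over timesteps for a given video; the exact weighting and pairing protocol LikePhys uses are not specified in the abstract.

```python
def plausibility_preference_error(model, pairs, denoising_loss):
    """pairs: iterable of (valid_video, invalid_video).
    A pair counts as an error when the ELBO surrogate (lower denoising loss
    means higher likelihood) prefers the physically impossible video."""
    errors = 0
    total = 0
    for valid, invalid in pairs:
        if denoising_loss(model, valid) > denoising_loss(model, invalid):
            errors += 1
        total += 1
    return errors / max(total, 1)
```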
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5058. Evaluating Language Models' Evaluations of Games
๐ฌ 21
โญ 4.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ game AI; meta-reasoning; cognitive science; problem evaluation
Reasoning is not just about solving problems---it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to the game-theoretic optimum, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5059. Certified vs. Empirical Adversarial Robustness via Hybrid Convolutions with Attention Stochasticity
๐ฌ 3
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Certified Defense Empirical Defense Adversarial Robustness
We introduce Hybrid Convolutions with Attention Stochasticity (HyCAS), an adversarial defense that narrows the long-standing gap between provable robustness under $\ell_2$ certificates and empirical robustness against strong $\ell_\infty$ attacks, while preserving strong generalization across diverse imaging benchmarks. HyCAS unifies deterministic and randomized principles by coupling 1-Lipschitz, spectrally normalized convolutions with two stochastic components (spectral normalized random-projection filters and a randomized attention-noise mechanism) to realize a randomized defense. Injecting smoothing randomness inside the architecture yields an overall $\leq 2$-Lipschitz network with formal certificates. Extensive experiments on diverse imaging benchmarks, including CIFAR-10/100, ImageNet-1k, NIH Chest X-ray, and HAM10000, show that HyCAS surpasses prior leading certified and empirical defenses, boosting certified accuracy by up to $\approx 7.3\%$ (on NIH Chest X-ray) and empirical robustness by up to $\approx 3.1\%$ (on HAM10000), without sacrificing clean accuracy. These results show that a randomized Lipschitz-constrained architecture can simultaneously improve both certified $\ell_2$ and empirical $\ell_\infty$ adversarial robustness, thereby supporting safer deployment of deep models in high-stakes applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5060. Q-Learning with Adjoint Matching
๐ฌ 20
โญ 4.00
๐ reinforcement learning
๐ท๏ธ Reinforcement learning flow-matching
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion/flow-matching based policy with respect to a parameterized value function (i.e., the critic $Q_\phi(s, a)$). Effective optimization requires exploiting the first-order information of the critic (i.e., the action gradient, $\nabla_a Q_\phi(s, a)$), but it is especially challenging to do so for flow/diffusion policy because direct gradient-based optimization via backpropagation through their multi-step denoising process is unstable.
Existing methods work around this either by only using the value and discarding the gradient information,
or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference (TD) backup for critic learning, QAM consistently outperforms prior approaches across challenging, sparse reward tasks in both offline and offline-to-online RL settings.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5061. Token-based Audio Inpainting via Discrete Diffusion
๐ฌ 14
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Audio inpainting Discrete diffusion models Transformer-based diffusion Audio tokenization Generative modeling Music restoration
Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across a range of gap lengths of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training.
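The span-based absorbing transition can be pictured with a small sketch: contiguous token spans are replaced by an absorbing mask id, with the masked fraction growing with the diffusion step. The mask id, the exponential span lengths, and the linear schedule below are illustrative assumptions, not the paper's exact corruption process.

```python
import random

MASK_ID = 0  # assumed absorbing-state token id

def span_absorb(tokens, t, max_t, mean_span=8):
    """Corrupt a token sequence for diffusion step t by absorbing random
    contiguous spans; the expected masked fraction grows linearly with t."""
    out = list(tokens)
    if not out or max_t <= 0:
        return out
    target_frac = t / max_t
    masked = 0
    while masked < target_frac * len(out):
        span = max(1, int(random.expovariate(1.0 / mean_span)))
        start = random.randrange(len(out))
        for i in range(start, min(start + span, len(out))):
            if out[i] != MASK_ID:
                out[i] = MASK_ID
                masked += 1
    return out
```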
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5062. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
๐ฌ 14
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Data Selection Large Language Models
Data curation is a critical yet underexplored component in large language model (LLM) training. Existing approaches (such as data selection and data mixing) operate in an offline paradigm, decoupled from the training process. This separation introduces extra engineering overhead and makes curated subsets brittle: once the model or task changes, the entire pipeline must be re-run. Moreover, offline methods alter dataset size through hard filtering or resampling, often discarding data diversity, and thus face generalization issues. We propose to rethink data curation as an online reweighting problem, where sample importance is dynamically adjusted during training via loss weighting rather than static preprocessing. This view preserves data diversity, adapts continuously to evolving model states, and yields a better performance-FLOPs tradeoff. Thus, we introduce ADAPT (Adaptive Data reweighting for Pretraining and FineTuning), a dynamic online framework that reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without changing the number of training samples. ADAPT integrates reweighting directly into the optimization loop with minimal overhead. Experiments on both instruction tuning and large-scale pretraining show that ADAPT consistently outperforms offline selection/mixing and prior online methods, achieving stronger cross-benchmark generalization under equal FLOPs.
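A minimal sketch of online per-sample reweighting, assuming the quality signal is cosine similarity to a reference embedding and the weights come from a softmax over the batch; ADAPT's actual signal and learning-rate adaptation may differ.

```python
import torch

def reweighted_loss(per_sample_loss, sample_emb, ref_emb, temperature=1.0):
    """per_sample_loss: (B,) unreduced losses; sample_emb: (B, d); ref_emb: (d,).
    Weight each sample by a softmax over its cosine similarity to a reference
    embedding, keeping the batch size unchanged (reweighting, not filtering)."""
    sim = torch.nn.functional.cosine_similarity(sample_emb, ref_emb.unsqueeze(0), dim=-1)
    weights = torch.softmax(sim / temperature, dim=0) * per_sample_loss.numel()
    # detach the weights so only the weighted loss, not the signal, is backpropagated
    return (weights.detach() * per_sample_loss).mean()
```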
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5063. Oversmoothing, "Oversquashing", Heterophily, Long-Range, and more: Demystifying Common Beliefs in Graph Machine Learning
๐ฌ 15
โญ 4.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ oversmoothing oversquashing heterophily long-range propagation graph neural networks graph machine learning
After a renaissance phase in which researchers revisited the message-passing paradigm through the lens of deep learning, the graph machine learning community shifted its attention towards a deeper and practical understanding of message-passing's benefits and limitations. In this paper, we notice how the fast pace of progress around the topics of oversmoothing and oversquashing, the homophily-heterophily dichotomy, and long-range tasks, came with the consolidation of commonly accepted beliefs and assumptions that are not always true nor easy to distinguish from each other. We argue that this has led to ambiguities around the investigated problems, preventing researchers from focusing on and addressing precise research questions while causing a good amount of misunderstandings. Our contribution wants to make such common beliefs explicit and encourage critical thinking around these topics, supported by simple but noteworthy counterexamples. The hope is to clarify the distinction between the different issues and promote separate but intertwined research directions to address them.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5064. Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
๐ฌ 17
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Positional Encoding Large Language Model Transformer Long Context Attention Mechanism
Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation.
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model.
BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization.
Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization by more than $25\times$ in retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
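One concrete way to realize a positional prior as an additive attention bias is sketched below; the generalized-Gaussian form, its parameters, and the observation that $\beta = 1$ gives an ALiBi-style linear penalty are illustrative assumptions based on the abstract, not the paper's exact parameterization.

```python
import torch

def generalized_gaussian_bias(seq_len, alpha=64.0, beta=1.5):
    """Additive attention bias b[i, j] = -(|i - j| / alpha)^beta, i.e. the
    log-density (up to a constant) of a generalized Gaussian prior over
    relative position. beta = 1 is an ALiBi-style linear penalty; beta = 2
    is a Gaussian penalty; larger alpha flattens the prior for long contexts."""
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).abs().float()
    bias = -((dist / alpha) ** beta)
    # usage: scores = q @ k.transpose(-1, -2) / d**0.5 + bias, then softmax
    return bias
```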
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5065. GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings
๐ฌ 18
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Embedding Model; Domain Adaptation; Domain Pruning
Domain-specific embedding models have shown promise for applications that require specialized semantic understanding, such as coding agents and financial retrieval systems, often achieving higher performance gains than general models. However, state-of-the-art embedding models are typically based on LLMs, which contain billions of parameters, making deployment challenging in resource-constrained environments. Model compression through pruning offers a promising solution, but existing pruning methods treat all parameters uniformly, failing to distinguish between general semantic representations and domain-specific patterns, leading to suboptimal pruning decisions. Thus, we propose GAPrune, a pruning framework that addresses this challenge by considering both domain importance and preserving general linguistic foundation. Our method uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, then combines these signals using our Domain Alignment Importance (DAI) scoring. Lower DAI scores indicate that the parameter is either less important for the domain task or creates conflicts between domain and general objectives. Experiments on two domain benchmarks, FinMTEB and ChemTEB, show that GAPrune maintains performance within 2.5\% of dense models in one-shot pruning at 50\% sparsity, while outperforming all baselines. With retraining in 100 steps, GAPrune achieves +4.51\% improvement on FinMTEB and +1.73\% on ChemTEB, demonstrating that our pruning strategy not only preserves but enhances domain-specific capabilities. Our findings demonstrate that principled pruning strategies can achieve model compression and enhanced domain specialization, providing the research community with a new approach for development.
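A minimal sketch of how Fisher-style importance and gradient alignment could be combined into a per-parameter score; the multiplicative combination and sign-based alignment below are assumptions, since the abstract does not give the exact DAI formula.

```python
import torch

def dai_scores(domain_grads, general_grads):
    """domain_grads, general_grads: lists of gradients with the weight's shape.
    Fisher importance = mean squared domain gradient (diagonal approximation);
    alignment = sign of the elementwise product of the averaged domain and
    general gradients (positive when the two objectives agree)."""
    d = torch.stack(domain_grads)
    g = torch.stack(general_grads)
    fisher = d.pow(2).mean(dim=0)
    align = d.mean(dim=0) * g.mean(dim=0)
    return fisher * torch.sign(align)   # low (negative) score => prune first

def prune_mask(scores, sparsity=0.5):
    """Keep the (1 - sparsity) fraction of parameters with the highest scores."""
    k = max(1, int(scores.numel() * sparsity))
    threshold = scores.flatten().kthvalue(k).values
    return scores > threshold
```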
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5066. Value Matching: Scalable and Gradient-Free Reward-Guided Flow Adaptation
๐ฌ 17
โญ 4.00
๐ generative models
๐ท๏ธ diffusion models flow models black-box reward optimization molecular design image generation stochastic optimal control reinforcement learning
Adapting large-scale flow and diffusion models to downstream tasks through reward optimization is essential for their adoption in real-world applications, including scientific discovery and image generation. While recent fine-tuning methods based on reinforcement learning and stochastic optimal control achieve compelling performance, they face severe scalability challenges due to high memory demands that scale with model complexity. In contrast, methods that disentangle reward adaptation from base model complexity, such as Classifier Guidance (CG), offer flexible control over computational resource requirements. However, CG suffers from limited reward expressivity and a train-test distribution mismatch due to its offline nature. To overcome the limitations of fine-tuning methods and CG, we propose Value Matching (VM), an online algorithm for learning the value function within an optimal control setting. VM provides tunable memory and compute demands through flexible value network complexity, supports optimization of non-differentiable rewards, and operates on-policy, which enables going beyond the data distribution to discover high-reward regions. Experimentally, we evaluate VM across image generation and molecular design tasks. We demonstrate improved stability and sample efficiency over CG and achieve comparable performance to fine-tuning approaches while requiring less than 5% of their memory usage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5067. SkyEvents: A Large-Scale Event-enhanced UAV Dataset for Robust 3D Scene Reconstruction
๐ฌ 25
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Event 3D Scene Reconstruction
Recent advances in large-scale 3D scene reconstruction using unmanned aerial vehicles (UAVs) have spurred increasing interest in neural rendering techniques. However, existing approaches with conventional cameras struggle to capture consistent multi-view images of scenes, particularly in extremely blurred and low-light environments, due to the inherent limitations in dynamic range caused by long exposure and motion blur resulting from camera motion. As a promising solution, bio-inspired event cameras exhibit robustness in extreme scenarios, thanks to their high dynamic range and microsecond-level temporal resolution. Nevertheless, dedicated event datasets specifically tailored for large-scale UAV 3D scene reconstruction remain limited. To bridge this gap, we introduce \textbf{SkyEvents}, a pioneering large-scale event-enhanced UAV dataset for 3D scene reconstruction, incorporating RGB, event, and LiDAR data. SkyEvents encompasses 22 sequences, spanning over 8 hours of video, captured across a diverse set of illumination conditions, scenarios, and flight altitudes. To facilitate the event-based 3D scene reconstruction with SkyEvents, we propose the Geometry-constrained Timestamp Alignment (\textbf{GTA}) module to align timestamps between the event and RGB cameras. Furthermore, we introduce Region-wise Event Rendering (\textbf{RER}) loss for supervising the rendering optimization. With SkyEvents, we aim to motivate and equip researchers to advance large-scale 3D scene reconstruction in challenging environments, harnessing the unique strengths of event cameras. Dataset and code will be available at \url{https://}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5068. From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
๐ฌ 4
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Database Agents LLM Agents EHR-QA DB-QA
Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while agents achieve high Pass@5 of 90-95% (at least one of five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.
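The two metrics quoted above differ only in how per-trial outcomes are aggregated; a tiny sketch:

```python
def pass_at_n(trials):
    """trials: list of booleans for n i.i.d. attempts on one task.
    Pass@n: at least one attempt succeeded."""
    return any(trials)

def pass_hat_n(trials):
    """Pass^n: every attempt succeeded (consistency across trials)."""
    return all(trials)

# e.g. with outcomes [True, False, True, True, True]:
# Pass@5 is True (the agent can solve the task), Pass^5 is False (it is not reliable).
```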
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5069. Generalized Spherical Neural Operators: Green's Function Formulation
๐ฌ 33
โญ 4.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Fourier neural operator Green function Spherical harmonic
Neural operators offer powerful approaches for solving parametric partial differential equations, but extending them to spherical domains remains challenging due to the need to preserve intrinsic geometry while avoiding distortions that break rotational consistency. Existing spherical operators rely on rotational equivariance but often lack the flexibility for real-world complexity. We propose a general operator-design framework based on the designable spherical Green's function and its harmonic expansion, establishing a solid operator-theoretic foundation for spherical learning. Based on this, we propose an absolute and relative position-dependent Green's function that enables a flexible balance of equivariance and invariance for real-world modeling. The resulting operator, the Green's-function Spherical Neural Operator (GSNO) with a novel spectral learning method, can adapt to anisotropic, constraint-rich systems while retaining spectral efficiency. To exploit GSNO, we develop GSHNet, a hierarchical architecture that combines multi-scale spectral modeling with spherical up-down sampling, enhancing global feature representation. In evaluations on diffusion MRI, shallow water dynamics, and global weather forecasting, GSNO and GSHNet consistently outperform state-of-the-art methods. Our results position GSNO as a principled and general framework for spherical operator learning, bridging rigorous theory with real-world complexity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5070. Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner
๐ฌ 19
โญ 4.00
๐ reinforcement learning
๐ท๏ธ in-context reinforcement learning
Recent progress in in-context reinforcement learning (ICRL) has demonstrated its potential for training generalist agents that can acquire new tasks directly at inference. Algorithm Distillation (AD) pioneered this paradigm and was subsequently scaled to multi-domain settings, although its ability to generalize to unseen tasks remained limited. The Decision Pre-Trained Transformer (DPT) was introduced as an alternative, showing stronger in-context reinforcement learning abilities in simplified domains, but its scalability had not been established. In this work, we extend DPT to diverse multi-domain environments, applying Flow Matching as a natural training choice that preserves its interpretation as Bayesian posterior sampling. As a result, we obtain an agent trained across hundreds of diverse tasks that achieves clear gains in generalization to the held-out test set. This agent improves upon prior AD scaling and demonstrates stronger performance in both online and offline inference, reinforcing ICRL as a viable alternative to expert distillation for training generalist agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5071. Battery Fault: A Comprehensive Dataset and Benchmark for Battery Fault Diagnosis
๐ฌ 8
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Lithium-ion batteries Fault diagnosis Benchmark dataset Generative modeling Time series analysis
With the accelerated popularization of electric vehicles (EV), battery safety issues have become an important research focus. Data-driven battery fault diagnosis algorithms, built on real-world operational data, are critical methods for reducing safety risks. However, existing battery datasets have limitations such as insufficient scale, coarse-grained labels, and lack of coverage of real-world operating conditions, which seriously restrict the development of data-driven fault diagnosis algorithms. To address these issues, this paper introduces a large-scale benchmark dataset named CH-BatteryGen, which is, to the best of our knowledge, the first EV battery system fault diagnosis dataset based on real-world operating conditions. This dataset integrates real on-board operation data with mechanism-constrained generative modeling technology, balancing authenticity and scalability. It covers two mainstream battery chemistries, namely nickel-cobalt-manganese (NCM) lithium batteries and lithium iron phosphate (LFP) batteries, and involves charging, discharging, and operation data of 1000 electric vehicles. It provides four fault labels (normal, self-discharge, high-resistance, low-capacity) and three severity level annotations, supporting two benchmark tasks: fault classification and fault grading. Through systematic validation using traditional machine learning methods (random forest (RF), support vector machine (SVM)) and deep learning models (long short-term memory (LSTM), convolutional neural network (CNN)), the results show that the CNN model performs best in the fault classification task, achieving an F1-score of 0.9280 in the LFP discharging scenario; in the fault grading task, the F1-score reaches 0.8813. The CH-BatteryGen dataset has been open-sourced, aiming to provide a standardized evaluation platform for battery fault diagnosis algorithms, promote research development in this field, and contribute to the transformation of sustainable transportation systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5072. Efficient Agent Training for Computer Use
๐ฌ 13
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Agents; Computer Use; Large Language Models; Vision Language Models
Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further augment them by synthesizing diverse alternative action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, and even surpassed Claude 3.7 Sonnet by 10% on WindowsAgentArena-V2, an improved benchmark we also released. By integrating robust human computer use skills with automated AI data synthesis capabilities, our method not only brought substantial improvements over training on human trajectories alone, but also significantly surpassed direct distillation from Claude 3.7 Sonnet.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5073. Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting
๐ฌ 12
โญ 4.00
๐ optimization
๐ท๏ธ Optimization Regression trees Newton method Convergence
Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics.
We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power.
The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss-Newton) method within fixed partitions.
We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization.
We further prove that HRT's model class is a universal approximator with an explicit $O(\delta^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.
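A minimal sketch of the node-level model, fitting $\max(w_1^\top x + b_1, w_2^\top x + b_2)$ by alternating the active-piece assignment with ridge least squares; the damping, line search, and convergence safeguards analyzed in the paper are omitted here.

```python
import numpy as np

def fit_hinge_split(X, y, iters=20, ridge=1e-6):
    """Fit f(x) = max(w1^T x + b1, w2^T x + b2) at a single tree node.
    X: (n, d) features, y: (n,) targets."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    rng = np.random.default_rng(0)
    w = rng.normal(size=(2, Xb.shape[1])) * 0.01
    for _ in range(iters):
        preds = Xb @ w.T                             # (n, 2) predictions of both pieces
        assign = preds.argmax(axis=1)                # which linear piece is active
        for k in (0, 1):
            idx = assign == k
            if idx.sum() < Xb.shape[1]:
                continue                             # too few points to refit this piece
            A = Xb[idx]
            w[k] = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y[idx])
    return w                                         # rows: [w1 | b1], [w2 | b2]
```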
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5074. Evaluating GFlowNet from partial episodes for stable and flexible policy-based training
๐ฌ 26
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Probabilistic Inference Reinforcement Learning Combinatorial Optimization
Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating that divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the policy divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, the flow balance condition not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5075. Uncovering Semantic Selectivity of Latent Groups in Higher Visual Cortex with Mutual Information-Guided Diffusion
๐ฌ 11
โญ 4.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Neural Latent Discovery Selectivity Visual Cortex Diffusion Models
Understanding how neural populations in higher visual areas encode object-centered visual information remains a central challenge in computational neuroscience. Prior works have investigated representational alignment between artificial neural networks and the visual cortex. Nevertheless, these findings are indirect and offer limited insights to the structure of neural populations themselves. Similarly, decoding-based methods have quantified semantic features from neural populations but have not uncovered their underlying organizations. This leaves open a scientific question: ``how feature-specific visual information is distributed across neural populations in higher visual areas, and whether it is organized into structured, semantically meaningful subspaces" To tackle this problem, we present MIG-Vis, a method that leverages the generative power of diffusion models to visualize and validate the visual-semantic attributes encoded in neural latent subspaces. Our method first uses a variational autoencoder to infer a **group-wise disentangled** neural latent subspace from neural populations. Subsequently, we propose a **mutual information** (MI)โguided diffusion synthesis procedure to visualize the specific visual-semantic features encoded by each latent group. We validate MIG-Vis on multi-session neural spiking datasets from the inferior temporal (IT) cortex of two macaques. The synthesized results demonstrate that our method identifies neural latent groups with clear semantic selectivity to diverse visual features, including object pose, inter-category transformations, and intra-class content. These findings provide direct, interpretable evidence of structured semantic representation in the higher visual cortex and advance our understanding of its encoding principles.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5076. Jacobian Aligned Random Forests
๐ฌ 11
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Random forests; Decision trees; Axis-aligned splits; Oblique decision boundaries; Feature interactions; Supervised preconditioning; Gradient-based feature transforms
Axis-aligned decision trees are fast and stable but struggle on datasets with rotated or interaction-dependent decision boundaries, where informative splits require linear combinations of features rather than single-feature thresholds. Oblique forests address this with per-node hyperplane splits, but at added computational cost. We propose a simple alternative: JARF, Jacobian-Aligned Random Forests. Concretely, we fit a random forest to estimate class probabilities or regression outputs, compute finite-difference gradients with respect to each feature, form an expected Jacobian outer product/expected gradient outer product, and use it as a single global linear preconditioner for all inputs. This preserves the simplicity of axis-aligned trees while applying a single global rotation to capture oblique boundaries and feature interactions that would otherwise require many axis-aligned splits to approximate. On tabular benchmarks, our preconditioned forest matches or surpasses oblique baselines while training faster. Our results suggest that supervised preconditioning can deliver the accuracy of oblique forests while keeping the simplicity of axis-aligned trees.
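A minimal sketch of the pipeline described above, using scikit-learn for the pilot forest; taking the symmetric square root of the expected gradient outer product as the global transform is one natural choice, and the finite-difference step size is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def jacobian_preconditioner(X, y, eps=1e-2, n_estimators=100, seed=0):
    """Estimate the expected gradient outer product of a pilot forest and return
    a global linear transform (matrix square root) for the features."""
    pilot = RandomForestRegressor(n_estimators=n_estimators, random_state=seed).fit(X, y)
    n, d = X.shape
    base = pilot.predict(X)
    grads = np.zeros((n, d))
    for j in range(d):
        Xp = X.copy()
        Xp[:, j] += eps
        grads[:, j] = (pilot.predict(Xp) - base) / eps    # finite-difference partials
    M = grads.T @ grads / n                               # expected gradient outer product
    vals, vecs = np.linalg.eigh(M + 1e-8 * np.eye(d))
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T         # symmetric square root

# usage sketch: refit an ordinary axis-aligned forest on the preconditioned inputs
# forest = RandomForestRegressor().fit(X @ jacobian_preconditioner(X, y), y)
```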
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5077. Towards Text-Mask Consistency in Medical Image Segmentation
๐ฌ 15
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Medical image segmentation Vision language models Multimodal learning Kolmogorov-Arnold Networks
Vision-language models for medical image segmentation often produce masks that conflict with the accompanying text, especially under multi-site/multi-lesion descriptions. We trace this failure to two factors: (i) highly templated and repetitive clinical language causes one-to-one hard contrastive learning to yield numerous false negatives, weakening cross-modal alignment; and (ii) predominantly vision-driven, one-way cross-attention lacks a language-dominant, spatially aware pathway, hindering effective injection of textual semantics into the spatial visual domain. To this end, we propose Consistency-enhanced Two-stage Segmentation (C2Seg). In the pretraining stage, Cluster-aware Contrastive Learning uses a frozen strong baseline to construct an intra-batch text similarity matrix as soft labels, thereby alleviating false negative conflicts and producing more discriminative visual representations. In the fusion stage, we introduce a Bidirectional Complementary Attention Module, where each modality dominates attention along its own path, fostering deep interaction and structural consistency between visual and textual representations. In order to enhance the expressive power of multimodal features, we further adopt KAN-based Attention Gating. Without updating the language encoder, our approach significantly improves text-mask consistency and segmentation accuracy on four public medical imaging datasets. Code is provided in the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5078. Multi-ReduNet: Interpretable Class-Wise Decomposition of ReduNet
๐ฌ 39
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ interpretable machine learning white-box neural networks ReduNet Multi-ReduNet
ReduNet has emerged as a promising white-box neural architecture grounded in the principle of maximal coding rate reduction, offering interpretability in deep feature learning. However, its practical applicability is hindered by computational complexity and limited ability to exploit class-specific structures, especially in undersampled regimes. In this work, we propose Multi-ReduNet and its variant Multi-ReduNet-LastNorm, which decompose the global learning objective into class-wise subproblems. These extensions preserve the theoretical foundation of ReduNet while improving training efficiency by reducing matrix inversion costs and enhancing feature separability. We provide a concise theoretical justification for the class-wise decomposition and show through experiments on diverse datasets that our models retain interpretability while achieving superior efficiency and discriminative power under limited supervision. Our findings suggest that class-wise extensions of ReduNet broaden its applicability, bridging the gap between interpretability and practical scalability in deep learning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5079. Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs' Memory
๐ฌ 12
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Knowledge-based QA Memory of LLMs
Large language models (LLMs) have shown promise as parametric knowledge bases, but often underperform on question answering (QA) tasks due to hallucinations and uncertainty. While prior work attributes these failures to knowledge gaps in the model's parameters, we uncover a complementary phenomenon: LLMs frequently retain correct knowledge even when generating incorrect or ``unsure'' answers.
By analyzing the token-level output distributions, we find that correct answers often appear among high-probability candidates, despite not being selected. Motivated by this, we propose Hits@k, a novel metric to evaluate latent knowledge retention independent of answer surface form. Our experiments reveal that LLMs possess significantly more factual knowledge than is reflected by standard QA accuracy.
Building on these insights, we further examine the prevailing few-shot QA paradigm. We find that prompting strategies which allow ``unsure'' outputs can inadvertently suppress correct answers by discouraging low-confidence generation. We design a set of quantitative experiments to measure this suppression effect, offering practical guidance for future prompt and decoding design in knowledge-intensive tasks.
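A minimal sketch of a Hits@k-style metric, assuming access to the model's ranked answer candidates and simple string normalization; the paper's exact candidate extraction from token-level distributions may differ.

```python
def hits_at_k(candidates, gold_answers, k):
    """candidates: model's answer candidates ranked by probability (best first);
    gold_answers: acceptable gold strings. Counts a hit if any of the top-k
    candidates matches a gold answer, regardless of which one was generated."""
    norm = lambda s: s.strip().lower()
    top_k = {norm(c) for c in candidates[:k]}
    return any(norm(g) in top_k for g in gold_answers)

def hits_at_k_score(dataset, k):
    """dataset: list of (candidates, gold_answers) pairs; corpus-level rate."""
    hits = [hits_at_k(c, g, k) for c, g in dataset]
    return sum(hits) / len(hits)
```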
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5080. Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
๐ฌ 18
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ reinforcement learning self-imitation learning large language model agentic learning llm agents
Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance under the guidance of the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL, where a replay buffer stores good experiences for off-policy updates, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics upon convergence towards familiarity with the environment. We also combine a bag of tricks of industrial RL optimizations into a strong baseline, Dr.BoT, to demonstrate our effectiveness. In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1\%/5.1\%/8.6\% and 20.7\%/11.8\%/13.9\%, respectively. In AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8\% and 6.1\%, respectively. Such gains incur only 10\%-25\% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
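For context, the vanilla self-imitation ingredient that SPEAR extends can be sketched as a small replay buffer that keeps only trajectories beating a running return baseline; the curriculum scheduling and intrinsic rewards described above are not shown.

```python
import random
from collections import deque

class SelfImitationBuffer:
    """Stores trajectories that beat a running return baseline, for off-policy
    imitation updates alongside the on-policy RL loss."""
    def __init__(self, capacity=512):
        self.buffer = deque(maxlen=capacity)
        self.baseline = 0.0
        self.count = 0

    def add(self, trajectory, ret):
        self.count += 1
        self.baseline += (ret - self.baseline) / self.count   # running mean return
        if ret > self.baseline:
            self.buffer.append((trajectory, ret))

    def sample(self, batch_size):
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)
```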
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5081. CREPE: Controlling diffusion with REPlica Exchange
๐ฌ 12
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ parallel tempering diffusion model inference-time control replica exchange
Inference-time control of diffusion models aims to steer model outputs to satisfy new constraints without retraining.
Previous approaches have mostly relied on heuristic guidance or have been coupled with Sequential Monte Carlo (SMC) for bias correction.
In this paper, we propose a flexible alternative based on replica exchange, an algorithm designed initially for sampling problems.
We refer to this method as CREPE (Controlling with REPlica Exchange). Unlike SMC, CREPE: (i) generates particles sequentially, (ii) maintains high diversity in the generated samples after a burn-in period, and (iii) enables online refinement or early termination.
We demonstrate its versatility across various tasks, including temperature annealing, reward tilting, model composition and classifier-free guidance debiasing, with competitive performance compared to prior SMC methods.
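The underlying replica-exchange move is the standard parallel-tempering swap; a minimal sketch follows, with inverse temperatures standing in for whatever control strengths are tempered over (how this interleaves with diffusion denoising steps is not shown).

```python
import math
import random

def replica_exchange_sweep(states, energies, betas):
    """states[i], energies[i]: current sample and its energy for replica i;
    betas: inverse temperatures, sorted. Attempts swaps between neighbors with
    the standard Metropolis acceptance exp((beta_i - beta_j) * (E_i - E_j))."""
    for i in range(len(states) - 1):
        j = i + 1
        log_acc = (betas[i] - betas[j]) * (energies[i] - energies[j])
        if math.log(random.random() + 1e-300) < log_acc:
            states[i], states[j] = states[j], states[i]
            energies[i], energies[j] = energies[j], energies[i]
    return states, energies
```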
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5082. Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
๐ฌ 19
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Computer-Use Agents CUA Multimodal Agents GUI Agents LLM Agents Agent Safety Agent Reliability Goal-Directedness Safety Security Alignment Benchmark Evaluation
Computer-Use Agents (CUAs) are an increasingly deployed class of agents that take actions on GUIs to accomplish user goals.
In this paper, we show that CUAs consistently exhibit *Blind Goal-Directedness* (BGD): a bias to pursue goals regardless of feasibility, safety, reliability, or context. We characterize three prevalent patterns of BGD: (i) lack of contextual reasoning, (ii) assumptions and decisions under ambiguity, and (iii) contradictory or infeasible goals. We develop BLIND-ACT, a benchmark of 90 tasks capturing these three patterns. Built on OSWorld, BLIND-ACT provides realistic environments and employs LLM-based judges to evaluate agent behavior, achieving 93.75% agreement with human annotations. We use BLIND-ACT to evaluate nine frontier models, including Claude Sonnet and Opus 4, Computer-Use-Preview, and GPT-5, observing high average BGD rates (80.8%) across them. We show that BGD exposes subtle risks that arise even when inputs are not directly harmful. While prompting-based interventions lower BGD levels, substantial risk persists, highlighting the need for stronger training- or inference-time interventions. Qualitative analysis reveals observed failure modes: execution-first bias (focusing on *how* to act over *whether* to act), thought-action disconnect (execution diverging from reasoning), and request-primacy (justifying actions due to user request). Identifying BGD and introducing BLIND-ACT establishes a foundation for future research on studying and mitigating this fundamental risk and ensuring safe CUA deployment.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5083. Probing in the Dark: State Entropy Maximization for POMDPs
๐ฌ 16
โญ 4.00
๐ reinforcement learning
๐ท๏ธ unsupervised RL State entropy maximization POMDPs Information states
Sample efficiency is one of the main bottlenecks for optimal decision making via reinforcement learning. Pretraining a policy to maximize the entropy of the state visitation can substantially speedup reinforcement learning of downstream tasks. It is still an open question how to maximize the state entropy in POMDPs, where the true states of the environment, or their entropy, are not observed. In this work, we propose to maximize the entropy of a sufficient statistic of the history, which is called an information state. First, we show that a recursive latent model that predicts future observations is an information state in this setting. Then, we provide a practical algorithm, called LatEnt, to simultaneously learn the latent model and a latent-based policy maximizing the corresponding entropy objective from reward-free interactions with the POMDP. We empirically show that our approach induces higher state entropy than existing methods, which translates to better performance on downstream tasks. As a byproduct, we open-source PROBE, the first benchmark to test reward-free pretraining in POMDPs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5084. LSA: Layer-wise Sparsity Allocation for Large Language Model Pruning Based on Minimal Linear Reconstruction Error
๐ฌ 28
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Layer-wise Sparsity Allocation Large Language Model Pruning Linear Reconstruction Error
Deploying large language models (LLMs) on platforms with insufficient computational resources remains a key challenge. Weight pruning is an efficient model compression technique that can reduce model size without retraining LLMs. However, due to the massive number of parameters, it is infeasible to estimate the importance of weights globally, and most prior studies assign a uniform sparsity ratio across all layers. Recent findings reveal that layers contribute unevenly to LLM performance, making it necessary to investigate layer-wise importance. Existing layer-wise sparsity allocation methods, such as OWL and DLP, rely on weight scoring and carefully designed score proxies to estimate layer-wise importance and sparsity ratios, while enforcing identical sparsity to blocks and projection weights within a layer to avoid performance degradation. In this work, we propose Layer-wise Sparsity Allocation (LSA) for LLM pruning, which quantifies layer-wise importance by evaluating the minimal linear reconstruction error (LSE) of each transformer layer under the assumption that 50\% of its least important weights are removed. Moreover, our method supports non-uniform sparsity allocation at block- or projection-level granularity within layers, without incurring catastrophic performance degradation. Experimental results demonstrate that LSA maintains high performance at high sparsity levels. At an overall sparsity ratio of 70\%, LSA surpasses state-of-the-art methods across language modeling tasks and seven zero-shot tasks.
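A minimal sketch of the two ingredients named above, with illustrative assumptions: layer importance as the relative output reconstruction error of a linear layer when its 50\% smallest-magnitude weights are zeroed on calibration data, and a simple linear rule that gives more important layers less sparsity around the target average; the exact estimator and allocation rule in the paper may differ.

```python
import torch

@torch.no_grad()
def layer_importance(weight, calib_x, prune_frac=0.5):
    """Relative output reconstruction error of one linear layer when the
    prune_frac smallest-magnitude weights are zeroed. calib_x: (n, in_features)."""
    k = max(1, int(weight.numel() * prune_frac))
    thresh = weight.abs().flatten().kthvalue(k).values
    pruned = torch.where(weight.abs() > thresh, weight, torch.zeros_like(weight))
    full_out = calib_x @ weight.T
    pruned_out = calib_x @ pruned.T
    return (full_out - pruned_out).norm() / (full_out.norm() + 1e-12)

def allocate_sparsity(importances, target_sparsity=0.7, spread=0.2):
    """Give less sparsity to layers with higher reconstruction error (more
    important), keeping the average sparsity near the target."""
    imp = torch.tensor(importances)
    centered = (imp - imp.mean()) / (imp.std() + 1e-12)
    sparsity = target_sparsity - spread * centered
    return sparsity.clamp(0.0, 0.95).tolist()
```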
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5085. TEN-DM: Topology-Enhanced Diffusion Model for Spatio-Temporal Event Prediction
๐ฌ 24
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Spatio-temporal point process Diffusion model Topological data analysis
Spatio-temporal point process (STPP) data appear in many domains. A natural way to model them is to describe how the instantaneous event rate varies over space and time given the observed history, which enables interpretation, interaction detection, and forecasting. Traditional parametric kernel-based models, while historically dominant, struggle to capture complex nonlinear patterns. In contrast, deep learning methods leverage the representational power of neural networks to aggregate historical events and integrate spatio-temporal point processes. However, existing deep learning methods often process space and time independently, overlooking the spatio-temporal dependencies. To address this limitation, we propose a novel method called Topology-ENhanced Diffusion Model (TEN-DM), including two key components, namely spatio-temporal graph construction and multimodal topological feature representation learning. Further, we use a temporal query technique to effectively capture periodic temporal patterns for learning effective temporal representations. Extensive experiments show the effectiveness of TEN-DM on multiple STPP datasets compared to state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5086. Features Emerge as Discrete States: The First Application of SAEs to 3D Representations
๐ฌ 14
โญ 4.00
๐ interpretability and explainable AI
๐ท๏ธ sparse autoencoders mechanistic interpretability computer vision
Sparse Autoencoders (SAEs) are a powerful dictionary learning technique for decomposing neural network activations, translating the hidden state into human ideas with high semantic value despite no external intervention or guidance. However, this technique has rarely been applied outside of the textual domain, limiting theoretical explorations of feature decomposition. We present the first application of SAEs to the 3D domain, analyzing the features used by a state-of-the-art 3D reconstruction VAE applied to 53k 3D models from the Objaverse dataset. We observe that the network encodes discrete rather than continuous features, leading to our key finding: such models approximate a discrete state space, driven by phase-like transitions from feature activations. Through this state transition framework, we address three otherwise unintuitive behaviors: the inclination of the reconstruction model towards positional encoding representations, the sigmoidal behavior of reconstruction loss from feature ablation, and the bimodality in the distribution of phase transition points. This final observation suggests the model actively manipulates the interference caused by superposition to prioritize the saliency of different features. Our work not only compiles and explains unexpected phenomena regarding feature decomposition, but also describes a generally applicable, state-based feature framework.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5087. EVEREST: A Transformer for Probabilistic Rare-Event Anomaly Detection with Evidential and Tail-Aware Uncertainty
๐ฌ 4
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Transformer models Uncertainty quantification Evidential deep learning Extreme value theory Imbalanced classification
Forecasting rare events in multivariate time-series data is a central challenge in machine learning, complicated by severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability through attention-based signal attribution. EVEREST integrates four key components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal-Inverse-Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimised with a composite loss combining focal loss, evidential negative log-likelihood, and a tail-sensitive EVT penalty, and act only at training time; deployment uses a single classification head with no inference overhead. We evaluate EVEREST on a real-world benchmark spanning a decade of space-weather data and demonstrate state-of-the-art performance, including True Skill Statistic (TSS) scores of 0.973, 0.970, and 0.966 at 24, 48, and 72-hour horizons for C-class flares. The model is compact (~0.81M parameters), efficient to train on commodity hardware, and applicable to other high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5088. ASTGI: Adaptive Spatio-Temporal Graph Interactions for Irregular Multivariate Time Series Forecasting
๐ฌ 14
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Irregular Multivariate Time Series Time Series Forecasting Dynamic Graph Neural Networks Spatio-Temporal Modeling Data-Driven Interaction
Irregular multivariate time series (IMTS) are prevalent in critical domains like healthcare and finance, where accurate forecasting is vital for proactive decision-making. However, the asynchronous sampling and irregular intervals inherent to IMTS pose two core challenges for existing methods: (1) how to accurately represent the raw information of irregular time series without introducing data distortion, and (2) how to effectively capture the complex dynamic dependencies between observation points. To address these challenges, we propose the Adaptive Spatio-Temporal Graph Interaction (ASTGI) framework. Specifically, the framework first employs a Spatio-Temporal Point Representation module to encode each discrete observation as a point within a learnable spatio-temporal embedding space. Second, a Neighborhood-Adaptive Graph Construction module adaptively builds a causal graph for each point in the embedding space via nearest neighbor search. Subsequently, a Spatio-Temporal Dynamic Propagation module iteratively updates information on these adaptive causal graphs by generating messages and computing interaction weights based on the relative spatio-temporal positions between points. Finally, a Query Point-based Prediction module generates the final forecast by aggregating neighborhood information for a new query point and performing regression. Extensive experiments on multiple benchmark datasets demonstrate that ASTGI outperforms various state-of-the-art methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5089. AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval
๐ฌ 15
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ memory-augmented LLM scalable retrieval memory question answering
Accurate recall from large-scale memories remains a core challenge for memory-augmented AI assistants performing question answering (QA), especially in similarity-dense scenarios where existing methods mainly rely on semantic distance to the query for retrieval. Inspired by how humans link information associatively, we propose AssoMem, a novel framework constructing an associative memory graph that anchors dialogue utterances to automatically extracted clues. This structure provides a rich organizational view of the conversational context and facilitates importance-aware ranking. Further, AssoMem integrates multi-dimensional retrieval signals (relevance, importance, and temporal alignment) using an adaptive mutual information (MI)-driven fusion strategy. Extensive experiments across three benchmarks and a newly introduced dataset, MeetingQA, demonstrate that AssoMem consistently outperforms state-of-the-art baselines, verifying its superiority in context-aware memory recall.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5090. Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
๐ฌ 19
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ red-teaming LLM safety reinforcement learning
Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods do not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5091. Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness
๐ฌ 22
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting Channel Dependence Asynchronous Sampling Missing Blocks
Real-world time series data are inherently multivariate, often exhibiting complex inter-channel dependencies. Each channel is typically sampled at its own period and is prone to missing values due to various practical and operational constraints. These characteristics pose three fundamental challenges involving channel dependency, sampling asynchrony, and missingness, all of which must be addressed simultaneously to enable robust and reliable forecasting in practical settings. However, existing architectures typically address only parts of these challenges in isolation and still rely on simplifying assumptions, leaving unresolved the combined challenges of asynchronous channel sampling, test-time missing blocks, and intricate inter-channel dependencies. To bridge this gap, we propose ChannelTokenFormer, a Transformer-based forecasting framework with a flexible architecture designed to explicitly capture cross-channel interactions, accommodate channel-wise asynchronous sampling, and effectively handle missing values. Extensive experiments on public benchmark datasets reflecting practical settings, along with one private real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world conditions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5092. When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
๐ฌ 11
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Inverse Reinforcement Learning Learning from Observations Learning from Constrained Expert Demonstrations Robot Learning
Learning from demonstrations enables experts to teach robots complex tasks using interfaces such as kinesthetic teaching, joystick control, and sim-to-real transfer. However, these interfaces often constrain the expert's ability to demonstrate optimal behavior due to indirect control, setup restrictions, and hardware safety. For example, a joystick can move a robotic arm only in a 2D plane, even though the robot operates in a higher-dimensional space. As a result, the demonstrations collected by constrained experts lead to suboptimal performance of the learned policies. This raises a key question: Can a robot learn a better policy than the one demonstrated by a constrained expert? We address this by allowing the agent to go beyond direct imitation of expert actions and explore shorter and more efficient trajectories. We use the demonstrations to infer a state-only reward signal that measures task progress, and self-label reward for unknown states using temporal interpolation. Our approach outperforms common imitation learning in both sample efficiency and task completion time. On a real WidowX robotic arm, it completes the task in 11 seconds, 10x faster than behavioral cloning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5093. Rethinking Continual Learning with Progressive Neural Collapse
๐ฌ 15
โญ 4.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Continual Learning Neural Collapse
Continual Learning (CL) seeks to build an agent that can continuously learn a sequence of tasks, where a key challenge, namely Catastrophic Forgetting, persists due to the potential knowledge interference among different tasks. On the other hand, deep neural networks (DNNs) are shown to converge to a terminal state termed Neural Collapse during training, where all class prototypes geometrically form a static simplex equiangular tight frame (ETF). These maximally and equally separated class prototypes make the ETF an ideal target for model learning in CL to mitigate knowledge interference. Thus inspired, several studies have emerged very recently to leverage a fixed global ETF in CL, which however suffers from key drawbacks, such as *impracticability* and *limited performance*. To address these challenges and fully unlock the potential of ETF in CL, we propose **Progressive Neural Collapse (ProNC)**, a novel framework that completely removes the need of a fixed global ETF in CL. Specifically, ProNC progressively expands the ETF target in a principled way by adding new class prototypes as vertices for new tasks, ensuring maximal separability across all encountered classes with minimal shifts from the previous ETF. We next develop a new CL framework by plugging ProNC into commonly used CL algorithm designs, where distillation is further leveraged to balance between target shifting for old classes and target aligning for new classes. Extensive experiments show that our approach significantly outperforms related baselines while maintaining superior flexibility, simplicity, and efficiency.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5094. Strategic Scaling of Test-Time Compute: A Bandit Learning Approach
๐ฌ 7
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Test-time scaling bandit learning large language models pure exploration
Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To address this inefficiency, we formulate test-time compute allocation as a novel bandit learning problem and propose adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly. Compared to uniform allocation, our algorithms allocate more compute to challenging queries while maintaining accuracy on easier ones. Among challenging queries, our algorithms further learn to prioritize solvable instances, effectively reducing excessive computation on unsolvable queries. We theoretically prove that our algorithms achieve better compute efficiency than uniform allocation and empirically validate their effectiveness on math and code benchmarks. Specifically, our algorithms achieve up to an 11.10\% performance improvement (15.04\% relative) on the MATH-500 dataset, up to 10.82\% (14.44\% relative) on the AIME25 dataset, and up to an 11.23\% performance improvement (15.29\% relative) on the LiveCodeBench dataset.
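As a toy illustration of difficulty-adaptive allocation (not the paper's algorithm), one can keep sampling a query while its empirical solve rate is still uncertain and stop early once it looks either reliably solvable or effectively unsolvable; the Hoeffding-style stopping rule and thresholds below are assumptions for the sketch.

```python
import math

def adaptive_budget(sample_once, max_budget=64, delta=0.1):
    """sample_once() -> bool: one model attempt, judged correct or not.
    Returns (successes, samples_used)."""
    successes, n = 0, 0
    while n < max_budget:
        successes += int(sample_once())
        n += 1
        p_hat = successes / n
        radius = math.sqrt(math.log(2 / delta) / (2 * n))  # Hoeffding confidence radius
        if p_hat - radius > 0.5:    # confidently solvable by majority vote: stop early
            break
        if p_hat + radius < 0.05:   # confidently (near-)unsolvable: stop spending compute
            break
    return successes, n
```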
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5095. CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis
๐ฌ 20
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Large Language Models LLM Agent Single-cell RNA sequencing Spatial transcriptomics
Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) data analysis are pivotal for advancing biological research, enabling precise characterization of cellular heterogeneity. However, existing analysis approaches require extensive manual programming and complex tool integration, posing significant challenges for researchers. To address this, we introduce CellAgent, an autonomous, LLM-driven approach that performs end-to-end scRNA-seq and spatial transcriptomics data analysis through natural language interactions. CellAgent employs a multi-agent hierarchical decision-making framework, simulating a "deep-thinking" workflow to ensure that analytical steps are logically coherent and aligned with the overarching research goal. To further enhance its capabilities, we develop sc-Omni, a high-performance, expert-curated toolkit that consolidates essential tools for scRNA-seq and spatial transcriptomics analysis. Additionally, we introduce a self-reflective optimization mechanism, enabling automated, iterative refinement of results through specialized evaluation methods, effectively replacing traditional manual assessments. Benchmarking against human experts demonstrates that CellAgent achieves significant improvement in efficiency across multiple downstream applications while maintaining excellent performance comparable to existing approaches and preserving natural language interactions. By translating high-level scientific questions into optimized computational workflows, CellAgent represents a step toward a new, more accessible paradigm in bioinformatics, allowing researchers to perform complex data analyses autonomously. In lowering technical barriers, CellAgent serves to advance the democratization of the scientific discovery process in genomics.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5096. Accelerated Parallel Tempering via Neural Transports
๐ฌ 33
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ MCMC Parallel Tempering Diffusion Generative Models Neural Samplers Normalising Flows
Markov Chain Monte Carlo (MCMC) algorithms are essential tools in computational statistics for sampling from unnormalised probability distributions, but can be fragile when targeting high-dimensional, multimodal, or complex target distributions. Parallel Tempering (PT) enhances MCMC's sample efficiency through annealing and parallel computation, propagating samples from tractable reference distributions to intractable targets via state swapping across interpolating distributions. The effectiveness of PT is limited by the often minimal overlap between adjacent distributions in challenging problems, which requires increasing the computational resources to compensate. We introduce a framework that accelerates PT by leveraging neural samplers---including normalising flows, diffusion models, and controlled diffusions---to reduce the required overlap. Our approach utilises neural samplers in parallel, circumventing the computational burden of neural samplers while preserving the asymptotic consistency of classical PT. We demonstrate theoretically and empirically on a variety of multimodal sampling problems that our method improves sample quality, reduces the computational cost compared to classical PT, and enables efficient free energy/normalising constant estimation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5097. Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents
๐ฌ 22
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM-based Agents Process Supervision Curriculum Learning
Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems.
Aligning these agents via preference-based methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch.
Trajectory-level DPO provides stable signals but blurs where credit should be assigned within long trajectories, whereas step-level DPO offers fine-grained supervision but can be statistically noisy and data-inefficient when Monte Carlo rollouts are limited, and struggles to fully exploit multi-step structured behaviors that only reveal their effect over several actions.
To balance this trade-off, we introduce **H**ierarchical **P**reference **L**earning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities.
While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum.
Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level.
Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex.
This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups.
Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods.
Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5098. Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
๐ฌ 17
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Attention Mechanism; Sequence Modeling; Test-Time Training; Local Linear Regression; Associative Memory; Hardware-Efficient Attention
Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2d)$ and $\Theta(nd^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall, and state tracking tasks. Experimental results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5099. Verification and Co-Alignment via Heterogeneous Consistency for Preference-Aligned LLM Annotations
๐ฌ 20
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Verification Co-Alignment Preference-Aligned LLM Annotations Reference-Free Metric
Large Language Models (LLMs) are increasingly expected to be culturally customizable and personally aligned for natural language understanding (NLU). However, existing methods, from supervised fine-tuning (SFT) to personalized RLHF and prompting, either require costly large-scale annotations or remain constrained by pretraining distributions. Moreover, acquiring annotations that reflect subjective, diverse, and evolving user preferences is both expensive and labor-intensive. To address these limitations, we propose **H**eterogeneous-**C**onsistency **C**o-Alignment (HCC), a training-free annotation paradigm that pairs two heterogeneous models: an LLM, rich in knowledge yet often prone to overconfidence, and a task-specialised lightweight model guided by a small user-preference set, to verify and co-align misaligned outputs over unlabeled corpora. For verification, HCC introduces the reference-free **C**onsistent-**A**nd-**I**nconsistent (**CAI**) Ratio, an uncertainty signal derived from inter-model agreements (consistent samples) and disagreements (inconsistent samples) to determine when refinement is needed. For co-alignment, HCC employs a non-parametric, embedding-based preference assignment scheme to recalibrate inconsistent samples according to user preferences. Across eight NLU datasets and both open- and closed-source LLMs, HCC consistently improves annotation quality and, in several tasks, even enables *Llama-3-8B* to surpass *GPT-3.5/4o* after co-alignment. Moreover, CAI correlates strongly with accuracy and reliably tracks pre-/post-alignment gains, offering a reference-free signal for scaling preference-aligned annotation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5100. Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding
๐ฌ 11
โญ 4.00
๐ optimization
๐ท๏ธ Learning-to-Optimize Deep Unfolding Nonlinear Programming
We propose an always-feasible ``flexible'' quadratic programming (QP) optimizer, FlexQP, which is based on an exact relaxation of the QP constraints. If the original constraints are feasible, then the optimizer finds the optimal solution to the original QP. On the other hand, if the constraints are infeasible, the optimizer identifies a solution that minimizes the constraint violation in a sparse manner. FlexQP scales favorably with respect to the problem dimension, is robust to both feasible and infeasible QPs with minimal assumptions on the problem data, and can be effectively warm-started. We subsequently apply deep unfolding to improve our optimizer through data-driven techniques, leading to an accelerated version called Deep FlexQP. By learning dimension-agnostic feedback policies for the parameters from a small number of training examples, Deep FlexQP generalizes to problems with larger dimensions and can optimize for many more iterations than it was initially trained for. Our approach outperforms two recently proposed state-of-the-art accelerated QP approaches on a suite of benchmark systems including portfolio optimization, classification, and regression problems. We provide guarantees on the expected performance of our deep QP optimizer through probably approximately correct (PAC) Bayes generalization bounds. These certificates are used to design an accelerated sequential quadratic programming solver that solves nonlinear optimal control and predictive safety filter problems faster than traditional approaches. Overall, our approach is very robust and greatly outperforms existing non-learning and learning-based optimizers in terms of both runtime and convergence to the optimal solution across multiple classes of NLPs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5101. Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization
๐ฌ 21
โญ 4.00
๐ learning theory
๐ท๏ธ generalization sharpness sharpness aware minimization
Sharpness (of the loss minima) is widely believed to be a good indicator of generalization of neural networks. Unfortunately, the correlation between existing sharpness measures and generalization is not as strong as expected, and contradictions sometimes occur. To address this problem, a key observation in this paper is: what really matters for generalization is the *average spread* (or unevenness) of the spectrum of the loss Hessian $\mathbf{H}$. For this reason, the conventional sharpness measures, such as the trace sharpness $\operatorname{tr}(\mathbf{H})$, which cares about the *average value* of the spectrum, or the max-eigenvalue sharpness $\lambda_{\max}(\mathbf{H})$, which concerns the *maximum spread* of the spectrum, are not sufficient to predict generalization well. To finely characterize the average spread of the Hessian spectrum, we leverage the notion of *Rényi entropy* from information theory, which is capable of capturing the unevenness of a probability vector and can thus be extended to describe the unevenness of a general non-negative vector (which is the case for the Hessian spectrum at the loss minima). Specifically, we propose the *Rényi sharpness*, which is defined as the negative of the Rényi entropy of the loss Hessian $\mathbf{H}$.
Extensive experiments demonstrate that Rényi sharpness exhibits a *strong* and *consistent* correlation with generalization in various scenarios. Moreover, on the theoretical side, two generalization bounds with respect to the Rényi sharpness are established by exploiting its desirable reparametrization-invariance property. Finally, as an initial attempt to leverage the Rényi sharpness for regularization, we propose the Rényi Sharpness-Aware Minimization (RSAM) algorithm, which uses a variant of the Rényi sharpness as the regularizer. RSAM is competitive with state-of-the-art SAM algorithms and far better than the conventional SAM algorithm based on the max-eigenvalue sharpness.
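A minimal numerical sketch of the quantity defined above, assuming the Hessian eigenvalues are available, clipped to be non-negative, and normalized into a probability vector before the Rényi entropy is taken; the order alpha = 2 is an illustrative choice.

```python
import numpy as np

def renyi_sharpness(eigvals, alpha=2.0, eps=1e-12):
    """Negative Rényi entropy of the normalized Hessian spectrum.
    A more even (flatter) spectrum -> higher entropy -> lower sharpness."""
    lam = np.clip(np.asarray(eigvals, dtype=float), 0.0, None)
    p = lam / (lam.sum() + eps)                      # normalize the spectrum
    entropy = np.log((p ** alpha).sum() + eps) / (1.0 - alpha)
    return -entropy

# An even spectrum is "flatter" than a spiked one with the same trace:
print(renyi_sharpness(np.ones(10)))                  # ~ -2.30 (low sharpness)
print(renyi_sharpness([9.1] + [0.1] * 9))            # ~ -0.19 (high sharpness)
```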
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5102. MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation
๐ฌ 16
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Microscale Simulation Video Generation Benchmark Text-to-Video Dataset
Recent advances in video generation have opened new avenues for macroscopic simulation of complex dynamic systems, but their application to microscopic phenomena remains largely unexplored. Microscale simulation holds great promise for biomedical applications such as drug discovery, organ-on-chip systems, and disease mechanism studies, while also showing potential in education and interactive visualization. In this work, we introduce **MicroWorldBench**, a multi-level rubric-based benchmark for microscale simulation tasks. MicroWorldBench enables systematic, rubric-based evaluation through 459 unique expert-annotated criteria spanning multiple microscale simulation tasks (e.g., organ-level processes, cellular dynamics, and subcellular molecular interactions) and evaluation dimensions (e.g., scientific fidelity, visual quality, instruction following). MicroWorldBench reveals that current SOTA video generation models fail in microscale simulation, showing violations of physical laws, temporal inconsistency, and misalignment with expert criteria.
To address these limitations, we construct **MicroSim-10K**, a high-quality, expert-verified simulation dataset. Leveraging this dataset, we train **MicroVerse**, a video generation model tailored for microscale simulation. MicroVerse can accurately reproduce complex microscale phenomena. Our work is the first to introduce the concept of **Micro-World Simulation** and presents a **proof of concept**, paving the way for applications in biology, education, and scientific visualization, and it demonstrates the potential of educational microscale simulations of biological mechanisms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5103. Action Chunking and Data Augmentation Yield Exponential Improvements for Imitation Learning in Continuous Spaces
๐ฌ 8
โญ 4.00
๐ learning theory
๐ท๏ธ Imitation learning compounding errors distribution shift control theory dynamical systems robotics action chunking noise injection
This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of *action-chunking* (predicting sequences of actions in open-loop) and *exploratory augmentation* of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound *exponentially* with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5104. PERSISTENCE SPHERES: BI-CONTINUOUS REPRESENTATIONS OF PERSISTENCE DIAGRAMS.
๐ฌ 24
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ topological data analysis topological machine learning linearization lift zonoid
Persistence spheres are a new functional representation of persistence diagrams. In contrast to existing embeddings such as persistence images, landscapes, or kernel-based methods, persistence spheres define a bi-continuous mapping: they are Lipschitz continuous with respect to the 1-Wasserstein distance and admit a continuous inverse on their image. This provides, in a theoretically optimal sense, both stability and geometric fidelity, so that persistence spheres are the representation that most closely preserves the Wasserstein geometry of PDs in a linear space.
We derive explicit formulas for persistence spheres, show that they can be computed efficiently, and note that they parallelize with minimal overhead. Empirically, we evaluate them on clustering, regression, and classification tasks involving functional data, time series, graphs, meshes, and point clouds. Across these benchmarks, persistence spheres consistently achieve state-of-the-art or competitive performance when compared with persistence images, persistence landscapes, persistence splines, and the sliced Wasserstein kernel. Additional simulations in the appendices further support the method and provide practical guidance for tuning its parameters.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5105. Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization
๐ฌ 38
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Masked Image Modeling Masked Autoencoders Representation Learning Mutual Information Retinal Imaging Medical Imaging
We propose a frequency-oriented perspective on retinal representation learning by analyzing masked autoencoders (MAE) through the lens of spatial frequency. Our analysis shows that MAE favors low-frequency content while under-encoding diagnostically critical high-frequency structures in retinal images. Because retinal pathology often manifests in high-frequency detail, this bias limits diagnostic performance and motivates frequency-balanced representations. Within a mutual-information (MI) formulation of MAE, we introduce the *Frequency-Balanced Retinal Masked Autoencoder (RetMAE)*, which augments the reconstruction objective with an MI regularizer that suppresses low-frequency redundancy and accentuates clinically salient high-frequency information. Without altering the architecture, RetMAE learns frequency-balanced features that surpass those of MAE-based retinal encoders in both quantitative and qualitative evaluations. These results suggest that a frequency-oriented view provides a principled foundation for future advances in ophthalmic modeling, offering new insight into how MAE's reconstruction objective amplifies ViT's low-pass tendencies in spatially heterogeneous retinal images and enabling a simple MI-based correction that improves retinal encoders.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5106. Realtime Video Frame Interpolation using One-Step Diffusion Sampling
๐ฌ 11
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Frame Interpolation; Diffusion Models; Realtime Processing
Recent research on Video Frame Interpolation (VFI) shows that a pretrained Video Diffusion Model (VDM) can solve many challenging scenarios, including large or complex motion. However, VDMs require tedious diffusion sampling, making inference slow. One possible way to accelerate is to distill a multi-step model into a one-step model, but additional modules are often introduced during distillation, which significantly increase training overhead. Instead, we propose a Real-time Diffusion-based Video Frame Interpolation pipeline that achieves efficient interpolation by disentangling the task into two subproblems: motion and appearance generation. Specifically, our method first calculates pixel movements across frames with continuous motion fields, utilizing only a few sparse key frames. As a result, the diffusion model is forwarded only for these sparse key frames rather than for each intermediate frame, effectively reducing the one-step training cost. In the second, appearance estimation step, our method then only needs to create intermediate frames by warping the input frames with optical flows sampled from the continuous motion field estimated in the first step. Because our diffusion model generates motion only, it can work at a fixed and relatively small resolution, leading to superior training and inference efficiency. Extensive experiments show that our method generates comparable or superior interpolation quality compared with existing multi-step solutions. It also offers outstanding inference efficiency, interpolating at 17 FPS at $1024\times 576$ resolution and achieving a **50$\times$ acceleration** over the fastest diffusion-based generation by Wan.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5107. Compositional amortized inference for large-scale hierarchical Bayesian models
๐ฌ 29
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Amortized Bayesian Inference Hierarchical Models Compositional Modeling Score Matching
Amortized Bayesian inference (ABI) has emerged as a powerful simulation-based approach for estimating complex mechanistic models, offering fast posterior sampling via generative neural networks. However, extending ABI to hierarchical models, a cornerstone of modern Bayesian analysis, remains a major challenge due to the need to simulate massive data sets and estimate thousands of parameters. In this work, we build on compositional score matching (CSM), a divide-and-conquer strategy for Bayesian updating using diffusion models. To address existing stability issues of CSM in dealing with large data sets, we couple adaptive solvers with a novel, error-damping compositional estimator. Our estimator remains stable even with hundreds of thousands of data points and parameters. We validate our approach on a controlled toy example, a high-dimensional autoregressive model, and a real-world advanced microscopy application involving over 750,000 parameters.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5108. Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
๐ฌ 27
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Group Relative Policy Optimization
Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of base models. This issue arises because it requires the model to have sufficient capability to perform high-quality exploration, which involves both effectiveness and diversity. Unfortunately, existing methods address this issue by imitating expert trajectories, which improve effectiveness but neglect diversity. To address this, we argue that the expert needs to provide guidance only at critical decision points rather than over the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points to perform effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models to capture the essence of expert strategies rather than surface imitation, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5109. ATGen: Adversarial Reinforcement Learning for Test Case Generation
๐ฌ 22
โญ 4.00
๐ generative models
๐ท๏ธ Test Case Generation Reinforcement Learning Large Language Models Code Generation
Large Language Models (LLMs) show remarkable code generation capabilities but often produce imperfect code with subtle bugs. A critical bottleneck for improving code quality is the scarcity of high-quality test cases. Existing approaches, primarily based on Supervised Fine-Tuning (SFT) over static datasets, are limited in their ability to discover novel bugs and struggle with the fundamental trade-off between generating error-triggering inputs and maintaining correct expected outputs. To address these limitations, we reframe test case generation as an iterative, adversarial process. We introduce ATGEN (Adversarial Test Generator), a novel framework that trains a test case generator via Reinforcement Learning (RL) in an adversarial loop with an evolving code generator. Instead of learning from a fixed set of bugs, our test generator is dynamically trained to create "attacking" I/O pairs for buggy code that is itself being iteratively generated. This process is guided by a reward function that explicitly balances the dual objectives of maximizing the bug detection rate and maintaining high output accuracy. Extensive experiments show that ATGEN dramatically outperforms the state-of-the-art SFT-based approach, UTGen, improving IO Accuracy by nearly 40 absolute points (71.56% vs. 31.83%) and more than doubling the Attack Rate (34.02% vs. 16.24%). The adversarial curriculum is particularly effective for hard-to-detect bugs, achieving an attack rate more than double that of the strongest baseline. Furthermore, tests generated by ATGEN serve as a more effective filter in Best-of-N code generation, significantly closing the gap to the human expert upper bound. Our work establishes a new and more effective paradigm for automated test generation and debugging for LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5110. PiCa: Parameter-Efficient Fine-Tuning with Column Space Projection
๐ฌ 4
โญ 4.00
๐ generative models
๐ท๏ธ Large Language Models Fine Tuning Deep Learning
Fine-tuning large foundation models is essential for building expert models tailored to specialized tasks and domains, but fully updating billions of parameters is computationally prohibitive. Reducing the number of trainable parameters using parameter-efficient fine-tuning is therefore crucial not only to reduce training costs but also to mitigate storage, caching, and serving overheads during deployment. Prior works, such as Singular Vectors-guided Fine-Tuning, have shown that exploiting the geometry of pre-trained weights can significantly improve parameter-efficiency, but they lack a solid theoretical foundation. In this paper, we introduce Parameter-efficient Fine-tuning with Column Space Projection (PiCa), a novel theoretically grounded PEFT method. We prove that projecting gradients onto the principal column space of pre-trained weights provides an effective inductive bias for adaptation and further enhance parameter efficiency through a novel weight-sharing strategy. Across diverse NLP and vision tasks, PiCa consistently outperforms state-of-the-art baselines under comparable or smaller parameter budgets, demonstrating both theoretical rigor and practical effectiveness.
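A minimal sketch of the projection idea described above, assuming the principal column space is taken from a rank-r truncated SVD of the frozen weight and each gradient is projected onto it before the optimizer step; the rank and the placement in the training loop are illustrative assumptions.

```python
import torch

def column_space_projector(w0: torch.Tensor, r: int) -> torch.Tensor:
    """P = U_r U_r^T, the orthogonal projector onto the top-r column space of w0."""
    U, _, _ = torch.linalg.svd(w0, full_matrices=False)
    U_r = U[:, :r]
    return U_r @ U_r.T

def project_gradient(grad: torch.Tensor, projector: torch.Tensor) -> torch.Tensor:
    # Keep only the gradient component lying in the principal column space of w0.
    return projector @ grad

# Usage sketch: P = column_space_projector(w0, r=16); grad = project_gradient(grad, P)
```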
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5111. Routing Channel-Patch Dependencies in Time Series Forecasting with Graph Spectral Decomposition
๐ฌ 31
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time Series Forecasting Channel Dependency Graph Learning
Time series forecasting has attracted significant attention in the field of AI. Previous works have revealed that the Channel-Independent (CI) strategy improves forecasting performance by modeling each channel individually, but it often suffers from poor generalization and overlooks meaningful inter-channel interactions. Conversely, Channel-Dependent (CD) strategies aggregate all channels, which may introduce irrelevant information and lead to oversmoothing. Despite recent progress, few existing methods offer the flexibility to adaptively balance CI and CD strategies in response to varying channel dependencies. To address this, we propose a generic plugin xCPD, that can adaptively model the channel-patch dependencies from the perspective of graph spectral decomposition. Specifically, xCPD first projects multivariate signals into the frequency domain using a shared graph Fourier basis, and groups patches into low-, mid-, and high-frequency bands based on their spectral energy responses. xCPD then applies a channel-adaptive routing mechanism that dynamically adjusts the degree of inter-channel interaction for each patch, enabling selective activation of frequency-specific experts. This facilitates fine-grained, input-aware modeling of smooth trends, local fluctuations, and abrupt transitions. xCPD can be seamlessly integrated on top of existing CI and CD forecasting models, consistently enhancing both accuracy and generalization across benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5112. Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaption
๐ฌ 12
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models parameter-efficient fine tuning low-rank adaption
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Language Models. The weight matrix is updated as $W=W_0+sBA$, where $W_0$ is the original frozen weight, $s$ is a scaling factor and $A$, $B$ are trainable low-rank matrices. In this paper, we first theoretically show that LoRA can naturally achieve and sustain stable feature learning (i.e., can be self-stabilized) given appropriate hyper-parameters and initializations of $A$ and $B$. However, we also claim that the non-zero initialization of $A$ could potentially compromise self-stability. To address this issue, we propose Stable-LoRA, a weight-shrinkage optimization strategy that enhances the stability of LoRA feature learning. By progressively shrinking $A$ in the earliest training steps, Stable-LoRA is theoretically proven and empirically verified to prevent potential instability of LoRA while preserving the benefits of the non-zero start. With only 3 lines of code modification, Stable-LoRA consistently outperforms classical LoRA and other baselines in accuracy across various tasks, with no extra memory usage and negligible additional computation costs.
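A minimal sketch of the update form and the early-step shrinkage described above; the multiplicative shrinkage factor and warm-up length are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def lora_forward(x, w0, A, B, s):
    """W = W0 + s * B @ A, with W0 frozen and only A, B trainable.
    Shapes: w0 (out, in), A (r, in), B (out, r), x (batch, in)."""
    return x @ (w0 + s * (B @ A)).T

def shrink_A_early(A: torch.Tensor, step: int, warmup_steps: int = 100, factor: float = 0.99):
    """Progressively shrink A during the earliest steps to damp the instability
    that a non-zero initialization of A can introduce."""
    if step < warmup_steps:
        with torch.no_grad():
            A.mul_(factor)
```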
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5113. From Sequential to Parallel: Reformulating Dynamic Programming as GPU Kernels for Large-Scale Stochastic Combinatorial Optimization
๐ฌ 27
โญ 4.00
๐ optimization
๐ท๏ธ CUDA GPU computing Large-scale optimization Dynamic Programming Stochastic Optimization
Dynamic programming (DP) is central to combinatorial optimization, optimal control, and reinforcement learning, yet its perceived sequentiality has long hindered scalability. We introduce a general-purpose GPU framework that reformulates broad classes of forward DP recursions as batched min--plus matrix--vector products over layered DAGs, collapsing actions into masked state-to-state transitions that map directly to GPU kernels. This approach removes a major bottleneck in scenario-based stochastic programming (SP), where the use of DP has traditionally restricted the number of scenarios due to excessive computational cost.
Our framework exposes massive parallelism across scenarios, transition layers, and, when applicable, route or action options, via self-designed GPU kernels that implement Bellman updates with warp-/block-level reductions and numerically safe masking. In a single GPU pass, these kernels can process over $10^6$ uncertainty realizations, far beyond the capacity of prior scenario-based methods. We demonstrate the approach in two canonical SP applications: (i) a vectorized split operator for the capacitated vehicle routing problem with stochastic demand, exploiting **2D** parallelism (scenarios $\times$ transitions); and (ii) a forward inventory reinsertion DP under an order-up-to policy, exploiting **3D** parallelism (scenarios $\times$ inventory transitions $\times$ route options). Across benchmarks, the implementation scales nearly linearly in the number of scenarios and achieves one to three orders of magnitude speedups over multithreaded CPU baselines, yielding tighter SAA estimates and consistently stronger first-stage decisions under identical wall-clock budgets. Viewed as hardware-aware software primitives, our min--plus DP kernels offer a drop-in path to scalable, GPU-accelerated stochastic discrete optimization.
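A CPU-side sketch of the core primitive described above: a batched min-plus (tropical) matrix-vector product that performs one Bellman layer update for many scenarios at once. The +inf masking convention for forbidden transitions is an assumption mirroring the "numerically safe masking" mentioned in the abstract; the actual implementation uses custom GPU kernels.

```python
import torch

def minplus_mv(cost: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    """Batched min-plus product for one DP layer.
    cost:  (S, n_next, n_prev) transition costs per scenario, +inf where masked.
    value: (S, n_prev) values of the previous layer.
    returns (S, n_next) values of the next layer."""
    return (cost + value.unsqueeze(1)).min(dim=-1).values

# Toy forward DP over a layered DAG: one min-plus product per layer,
# parallel across S scenarios and all next-layer states.
S, n = 4, 8
value = torch.zeros(S, n)
for _ in range(5):
    cost = torch.rand(S, n, n)
    cost[:, :, ::2] = float("inf")   # mask some forbidden predecessor states
    value = minplus_mv(cost, value)
```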
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5114. Histopathology-Genomics Multi-modal Structural Representation Learning for Data-Efficient Precision Oncology
๐ฌ 47
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ multi-modal learning histopathology image representation learning genomic data graph structure learning
Fusing histopathology images and genomics data with deep learning has significantly advanced precision oncology. However, genomics data is often missing due to its high acquisition cost and complexity in real-world clinical scenarios. Existing solutions aim to reconstruct genomics data from histopathology images. Nevertheless, these methods typically rely only on individual cases and overlook the potential relationships among cases. Additionally, they fail to take advantage of the authentic genomics data of diagnostically related cases that is available from the training set at inference time. In this work, we propose a novel Multi-modal Structural Representation Learning (MSRL) framework for data-efficient precision oncology. We pre-train a histopathology-genomics multi-modal representation graph, adopting Graph Structure Learning (GSL) to construct inter-case relevance inherently from the data. During the fine-tuning stage, we dynamically capture structural relevance between the training cases and the acquired authentic cases for precise prediction. MSRL leverages prior inter-case associations and authentic genomics data from diagnosed cases based on the graph, which contributes to effective inference based on the single histopathology image modality. We evaluated MSRL on public TCGA datasets with 7,263 cases across various tasks, including survival prediction, cancer grading, and gene mutation prediction. The results demonstrate that MSRL significantly outperforms existing missing-genomics generation approaches with improvements of 1.44% to 3.12% in C-Index on survival prediction tasks and achieves comparable performance to multi-modal fusion methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5115. Stochastic Neural Networks for Causal Inference with Missing Confounders
๐ฌ 18
โญ 4.00
๐ causal reasoning
๐ท๏ธ Adaptive Stochastic Gradient MCMC Causal Inference Latent Variable Model Missing Confounder Stochastic Neural Network
One of the major challenges in causal inference with observational data is handling missing confounders. Latent variable modeling offers a valid framework to address this challenge, but existing approaches within this framework often suffer from consistency issues in causal effect estimation and are difficult to extend to more complex application scenarios. To bridge this gap, we propose a new latent variable modeling approach, Confounder Imputation with Stochastic Neural Networks (CI-StoNet). The CI-StoNet utilizes a stochastic neural network to jointly model the outcome function and the missing confounders, and employs an adaptive stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm to impute the missing confounders and train the neural networks simultaneously. Under mild conditions, we show that the causal effect remains identifiable through CI-StoNet, even though the missing confounders are non-identifiable -- these confounders can only be identified up to an unknown loss-invariant transformation due to the non-identifiability inherent in neural network models. The CI-StoNet provides state-of-the-art performance on benchmarks for causal effect estimation and showcases its adaptability to proxy variable and multiple-cause scenarios. This new approach also serves as a versatile tool for modeling various causal relationships, leveraging the flexibility of stochastic neural networks in natural process modeling.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5116. When Language Models Lose Their Mind: The Consequences of Brain Misalignment
๐ฌ 13
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ language models brain alignment brain misalignment linguistic competence neuroscience fMRI
While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for their potential to enhance safety and trustworthiness in AI, the role of this brain alignment in linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models--LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks encompassing diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream performance, highlighting the critical role of brain alignment in achieving robust linguistic competence. These findings underscore the importance of brain alignment in LLMs and offer novel insights into the relationship between neural representations and linguistic processing.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5117. Triangle Multiplication is All You Need for Biomolecular Structure Representations
๐ฌ 16
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ structure prediction cofolding triangle multiplication
AlphaFold has transformed protein structure prediction, but emerging applications such as virtual ligand screening, proteome-wide folding, and de novo binder design demand predictions at a massive scale, where runtime and memory costs become prohibitive.
A major bottleneck lies in the Pairformer backbone of AlphaFold3-style models, which relies on computationally expensive triangular primitives, especially triangle attention, for pairwise reasoning.
We introduce Pairmixer, a streamlined alternative that eliminates triangle attention while preserving higher-order geometric reasoning capabilities that are critical for structure prediction.
Pairmixer substantially improves computational efficiency, matching state-of-the-art structure predictors across folding and docking benchmarks, delivering up to 4x faster inference on long sequences while reducing training cost by 34%.
Its efficiency alleviates the computational burden of downstream applications such as modeling large protein complexes, high-throughput ligand and binder screening, and hallucination-based design.
Within BoltzDesign, for example, Pairmixer delivers over 2x faster sampling and scales to sequences 30% longer than the memory limits of Pairformer.
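For context, a minimal sketch of the triangular multiplicative update ("outgoing edges" variant) that pair representations of this kind rely on, written in the general AlphaFold-style formulation; Pairmixer's exact gating and normalization may differ.

```python
import torch
import torch.nn as nn

class TriangleMultiplyOutgoing(nn.Module):
    """Pair update z_ij <- gate_ij * proj( sum_k a_ik * b_jk )."""
    def __init__(self, d_pair: int, d_hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_pair)
        self.proj_a, self.gate_a = nn.Linear(d_pair, d_hidden), nn.Linear(d_pair, d_hidden)
        self.proj_b, self.gate_b = nn.Linear(d_pair, d_hidden), nn.Linear(d_pair, d_hidden)
        self.out_norm = nn.LayerNorm(d_hidden)
        self.proj_out, self.gate_out = nn.Linear(d_hidden, d_pair), nn.Linear(d_pair, d_pair)

    def forward(self, z):                                   # z: (n, n, d_pair)
        z = self.norm(z)
        a = torch.sigmoid(self.gate_a(z)) * self.proj_a(z)  # edges i -> k
        b = torch.sigmoid(self.gate_b(z)) * self.proj_b(z)  # edges j -> k
        upd = torch.einsum("ikc,jkc->ijc", a, b)            # combine over shared node k
        return torch.sigmoid(self.gate_out(z)) * self.proj_out(self.out_norm(upd))
```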
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5118. Reinforcing General Reasoning Without Verifiers
๐ฌ 16
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ General Reasoning Reinforcement Learning Large Language Models
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (**VeriFree**) that bypasses answer verification and instead directly maximizes the probability of generating the reference answer, derived in a principled way from the RL objective. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks.
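A minimal sketch of the verifier-free signal described above: score a sampled reasoning trace by the log-likelihood the policy assigns to the reference answer, instead of calling a verifier. How this quantity is weighted and combined across traces in VeriFree is not reproduced here; the function below is only the likelihood term.

```python
import torch
import torch.nn.functional as F

def reference_answer_logprob(answer_logits: torch.Tensor,
                             answer_ids: torch.Tensor) -> torch.Tensor:
    """answer_logits: (T, vocab) logits at the positions that should emit the
    reference answer (conditioned on the question and a sampled reasoning trace);
    answer_ids: (T,) token ids of the reference answer.
    Returns the total log-probability of generating the reference answer."""
    logprobs = F.log_softmax(answer_logits, dim=-1)
    return logprobs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1).sum()

# Training maximizes this quantity (e.g., as a reward-like weight on the trace),
# removing the need to keep a separate verifier model in memory.
```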
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5119. In-Context Learning of Temporal Point Processes with Foundation Inference Models
๐ฌ 10
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ temporal point processes zero-shot inference in-context learning zero-shot parameter estimation inference of point processes foundation models foundation inference models
Modeling event sequences with multiple event types with marked temporal point processes (MTPPs) provides a principled way to uncover governing rules and predict future events. Current neural network approaches to MTPP inference rely on training separate, specialized models for each target system. We pursue a radically different approach: drawing on amortized inference and in-context learning, we pretrain a deep neural network to infer, *in-context*, the conditional intensity functions of event histories from a context defined by sets of event sequences. Pretraining is performed on a large synthetic dataset of MTPPs sampled from a broad distribution over Hawkes processes. Once pretrained, our Foundation Inference Model for Point Processes (FIM-PP) can estimate MTPPs from real-world data without any additional training, or be rapidly finetuned to target systems. Experiments show that this amortized approach matches the performance of specialized models on next-event prediction across common benchmark datasets. We provide the pretrained model weights with the supplementary material.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5120. VITA: Vision-to-Action Flow Matching Policy
๐ฌ 15
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Flow Matching Robot Learning Imitation Learning Robotics Robotics Policy Manipulation
Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA (Vision-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that directly flows from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent space collapse, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equation) solving steps. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5x-2x faster inference compared to conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies.
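A hedged sketch of the core idea of flowing from visual latents to action latents: the straight-line interpolant and MSE velocity target below follow standard flow matching, and the random tensors stand in for encoder outputs; none of this is the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Small MLP predicting the flow velocity at (x_t, t)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def vision_to_action_fm_loss(v_net, z_vis, z_act):
    """Flow matching with the visual latent as the source distribution (no Gaussian noise)."""
    t = torch.rand(z_vis.size(0), 1)
    x_t = (1 - t) * z_vis + t * z_act        # straight-line interpolant from vision to action
    target = z_act - z_vis                   # constant velocity along that path
    return ((v_net(x_t, t) - target) ** 2).mean()

dim = 32
v_net = VelocityNet(dim)
z_vis, z_act = torch.randn(8, dim), torch.randn(8, dim)   # stand-ins for visual / action latents
loss = vision_to_action_fm_loss(v_net, z_vis, z_act)
loss.backward()
```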
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5121. Remaining-data-free Machine Unlearning by Suppressing Sample Contribution
๐ฌ 29
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Machine Unlearning
Machine unlearning (MU) aims to remove the influence of specific training samples from a well-trained model, a task of growing importance due to the "right to be forgotten." The unlearned model should approach the retrained model, where the forgetting data do not contribute to the training process. Therefore, unlearning should withdraw their contribution from the pre-trained model. However, quantifying and disentangling a sample's contribution to the overall learning process is highly challenging, leading most existing MU approaches to adopt other heuristic strategies such as random labeling or knowledge distillation. These operations inevitably degrade model utility, requiring additional maintenance with remaining data. To advance MU towards better utility and efficiency for practical deployment, we seek to approximate sample contribution using only the pre-trained model. We theoretically and empirically reveal that a sample's contribution during training manifests in the learned model's increased sensitivity to it. In light of this, we propose MU-Mis (Machine Unlearning by Minimizing input sensitivity), which directly suppresses the contribution of forgetting data. This straightforward suppression enables MU-Mis to successfully unlearn without degrading model utility on the remaining data, thereby eliminating the need for access to the remaining data. To the best of our knowledge, this is the first time that a remaining-data-free method can outperform state-of-the-art (SOTA) unlearning methods that utilize the remaining data.
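A hedged sketch of suppressing input sensitivity on the forgetting set only. The sensitivity proxy (squared gradient norm of the selected logit with respect to the input) and the toy MLP are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def input_sensitivity(model, x, y):
    """Mean squared norm of d logit_y / d x, a proxy for a sample's training contribution."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    sel = logits.gather(1, y.unsqueeze(1)).sum()
    grads, = torch.autograd.grad(sel, x, create_graph=True)
    return (grads.flatten(1) ** 2).sum(dim=1).mean()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x_forget, y_forget = torch.randn(16, 20), torch.randint(0, 10, (16,))

for _ in range(10):                       # unlearning loop: no remaining data required
    loss = input_sensitivity(model, x_forget, y_forget)
    opt.zero_grad()
    loss.backward()
    opt.step()
```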
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5122. KGOT: Unified Knowledge Graph and Optimal Transport Pseudo-Labeling for Molecule-Protein Interaction Prediction
๐ฌ 17
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Knowledge graph molecule protein interaction optimal transport
Predicting molecule-protein interactions (MPIs) is a fundamental task in computational biology, with crucial applications in drug discovery and molecular function annotation. However, existing MPI models face two major challenges. First, the scarcity of labeled molecule-protein pairs significantly limits model performance, as available datasets capture only a small fraction of biologically relevant interactions.
Second, most methods rely solely on molecular and protein features, ignoring broader biological context, such as genes, metabolic pathways, and functional annotations, that could provide essential complementary information. To address these limitations, our framework first aggregates diverse biological datasets, including molecular, protein, gene, and pathway-level interactions, and then develops an optimal transport-based approach to generate high-quality pseudo-labels for unlabeled molecule-protein pairs, leveraging the underlying distribution of known interactions to guide label assignment. By treating pseudo-labeling as a mechanism for bridging disparate biological modalities, our approach enables the effective use of heterogeneous data to enhance MPI prediction. We evaluate our framework on multiple MPI datasets, including virtual screening and protein retrieval tasks, demonstrating substantial improvements over state-of-the-art methods in prediction accuracy and zero-shot ability on unseen interactions. Beyond MPI prediction, our approach provides a new paradigm for leveraging diverse biological data sources to tackle problems traditionally constrained by single- or bi-modal learning, paving the way for future advances in computational biology and drug discovery.
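A hedged illustration of OT-based pseudo-labeling with a tiny Sinkhorn solver. The cost matrix, the class marginals, and the final argmax assignment are simplifications chosen for the example, not necessarily what the paper uses.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, n_iter=200):
    """Entropic optimal transport: plan between unlabeled pairs (rows) and candidate labels (cols)."""
    K = np.exp(-cost / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy: 6 unlabeled molecule-protein pairs, 2 classes (interact / not interact).
# The cost could be, e.g., distance of each pair's embedding to known positive / negative centroids.
cost = np.random.rand(6, 2)
plan = sinkhorn(cost, a=np.full(6, 1 / 6), b=np.array([0.5, 0.5]))
pseudo_labels = plan.argmax(axis=1)       # assign each pair to its highest-mass class
```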
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5123. Keep the Best, Forget the Rest: Reliable Alignment with Order-Aware Preference Optimization
๐ฌ 19
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Language Models; Preference Optimization; RLHF
Direct Preference Optimization (DPO) has emerged as a powerful framework for aligning large language models (LLMs) with human preferences via pairwise comparisons. However, its performance is highly sensitive to the quality of training samples: when the reference policy is poorly aligned with human preferences, ambiguous pairs can dominate the gradient signal and degrade generalization. To address this, we propose RAPPO ($\textbf{R}$eliable $\textbf{A}$lignment for $\textbf{P}$reference $\textbf{P}$olicy $\textbf{O}$ptimization), a simple sample-aware modification of the DPO loss that mitigates reference-policy misalignment by filtering out the hardest, most ambiguous samples. We theoretically show that RAPPO yields improved generalization guarantees. RAPPO is lightweight and requires only a few lines of code to be integrated into any existing DPO-type algorithm. Surprisingly, with this simple modification, our simulations across a broad suite of alignment tasks and benchmarks show consistent gains over DPO and recent state-of-the-art baselines. On the PKU-SafeRLHF benchmark, RAPPO attains helpfulness $0.693$ ($+34.8\%$ over DPO) and harmlessness $0.357$ ($-21.0\%$ vs. DPO).
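A hedged sketch of a sample-aware DPO-style loss that drops the hardest pairs in each batch. The keep-fraction rule below is an assumption made for illustration, not the authors' exact filtering criterion.

```python
import torch
import torch.nn.functional as F

def filtered_dpo_loss(logratio_chosen, logratio_rejected, beta=0.1, keep_frac=0.8):
    """DPO-style loss computed only on the easier (less ambiguous) fraction of the batch.

    logratio_* = log pi_theta(y|x) - log pi_ref(y|x) for chosen / rejected responses.
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    per_sample = -F.logsigmoid(margin)                 # standard DPO per-sample loss
    k = max(1, int(keep_frac * per_sample.numel()))
    kept_neg, _ = torch.topk(-per_sample, k)           # keep the k smallest-loss (easiest) samples
    return (-kept_neg).mean()

loss = filtered_dpo_loss(torch.randn(32), torch.randn(32))
```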
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5124. Analyzing and Evaluating Unbiased Language Model Watermark
๐ฌ 17
โญ 4.00
๐ generative models
๐ท๏ธ LLM watermarking
Verifying the authenticity of AI-generated text has become increasingly important with the rapid advancement of large language models, and unbiased watermarking has emerged as a promising approach due to its ability to preserve output distribution without degrading quality. However, recent work reveals that unbiased watermarks can accumulate distributional bias over multiple generations and that existing robustness evaluations are inconsistent across studies. To address these issues, we introduce UWBench, the first open-source benchmark dedicated to the principled evaluation of unbiased watermarking methods. Our framework combines theoretical and empirical contributions: we propose a statistical metric to quantify multi-batch distribution shift, prove an impossibility result showing that no unbiased watermark can perfectly preserve the distribution under infinite queries, and develop a formal analysis of robustness against token-level modification attacks. Complementing this theory, we establish a three-axis evaluation protocol (unbiasedness, detectability, and robustness) and show that token modification attacks provide more stable robustness assessments than paraphrasing-based methods. Together, UWBench offers the community a standardized and reproducible platform for advancing the design and evaluation of unbiased watermarking algorithms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5125. SPRIG: Improving Large Language Model Performance by System Prompt Optimization
๐ฌ 17
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ prompting system prompt prompt optimization
Large Language Models (LLMs) have shown impressive capabilities in many scenarios, but their performance depends, in part, on the choice of prompt. Past research has focused on optimizing prompts specific to a task. However, much less attention has been given to optimizing the general instructions included in a prompt, known as a system prompt. To address this gap, we propose SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios. We evaluate the performance of system prompts on a collection of 47 different types of tasks to ensure generalizability. Our study finds that a single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages. This study provides insights into the role of system-level instructions in maximizing LLM potential.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5126. SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing
๐ฌ 12
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Ethical AI Bayesian Experimental Design System-level Evaluation
As autonomous systems such as drones become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment: failure to do so poses imminent danger to human lives and risks long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined evaluation metrics and due to stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes, and uses a novel acquisition strategy to propose interesting test candidates based on both models.
We validate our approach for ethical benchmarking of autonomous agents on two applications and find that our method performs best. It provides an interpretable and efficient trade-off between exploration and exploitation, generating up to $2\times$ more optimal test candidates than baselines, with a $1.25\times$ improvement in coverage of high-dimensional search spaces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5127. UnLoc: Leveraging Depth Uncertainties for Floorplan Localization
๐ฌ 10
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ floorplan localization sequential localization depth uncertainties mono-depth networks
We propose UnLoc, an efficient data-driven solution for sequential camera localization within floorplans. Floorplan data is readily available, long-term persistent, and robust to changes in visual appearance. We address key limitations of recent methods, such as the lack of uncertainty modeling in depth predictions and the necessity for custom depth networks trained for each environment. We introduce a novel probabilistic model that incorporates uncertainty estimation, modeling depth predictions as explicit probability distributions. By leveraging off-the-shelf pre-trained monocular depth models, we eliminate the need to rely on per-environment-trained depth networks, enhancing generalization to unseen spaces. We evaluate UnLoc on large-scale synthetic and real-world datasets, demonstrating significant improvements over existing methods in terms of accuracy and robustness. Notably, we achieve $2.7$ times higher localization recall on long sequences (100 frames) and $42.2$ times higher on short ones (15 frames) than the state of the art on the challenging LaMAR HGE dataset.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5128. Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition
๐ฌ 14
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Speech Emotion Recognition Time-Frequency Analysis Learnable Fractional Superlets Spectro-Temporal Encoding Representation Learning End-to-End Neural Networks
Speech emotion recognition (SER) hinges on front-ends that expose informative time-frequency (TF) structure from raw speech. Classical short-time Fourier and wavelet transforms impose fixed resolution trade-offs, while prior "superlet" variants rely on integer orders and hand-tuned hyperparameters. We revisit TF analysis from first principles and formulate a learnable continuum of superlet transforms. Starting from DC-corrected analytic Morlet wavelets, we define superlets as multiplicative ensembles of wavelet responses and realize learnable fractional orders via softmax-normalized weights over discrete orders, computed as a log-domain geometric mean. We establish admissibility (zero mean) and continuity in order and frequency, and characterize approximate analyticity by bounding negative-frequency leakage as a function of an effective cycle parameter. Building on these results, we introduce the Learnable Fractional Superlet Transform (LFST), a fully differentiable front-end that jointly optimizes (i) a monotone, log-spaced frequency grid, (ii) frequency-dependent base cycles, and (iii) learnable fractional-order weights, all trained end-to-end. LFST further includes a learnable asymmetric hard-thresholding (LAHT) module that promotes sparse, denoised TF activations while preserving transients; we provide sufficient conditions for boundedness and stability under mild cycle and grid constraints. To exploit LFST for SER, we design the Spectro-Temporal Emotion Encoder (STEE), which consumes two-channel TF maps, magnitude $S$ and phase-congruency $\kappa$, through a compact multi-scale stack with residual temporal and depthwise-frequency blocks, Adaptive FiLM gating, axial (time-axis) self-attention, global attentive pooling, and a lightweight classifier. The full LFST+STEE system is trained in a standard train-validate-test regime using focal loss with optional class rebalancing, and is validated on IEMOCAP, EMO-DB, and the private NSPL-CRISE dataset under standard protocols. By unifying a principled, learnable TF transform with a compact encoder, LFST+STEE replaces ad hoc front-ends with a mathematically grounded alternative that is differentiable, stable, and adaptable to data, enabling systematic ablations over frequency grids, cycle schedules, and fractional orders within a single end-to-end model. The source code for this paper is shared in this anonymous repository: https://anonymous.4open.science/r/LFST-for-SER-C5D2.
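A hedged sketch of the fractional-order mechanism: softmax-normalized weights over integer superlet orders combined as a log-domain (geometric-mean) ensemble. The response tensor, shapes, and floor constant are illustrative assumptions.

```python
import torch

def fractional_superlet(responses: torch.Tensor, order_logits: torch.Tensor) -> torch.Tensor:
    """Learnable fractional-order superlet as a weighted geometric mean of integer-order responses.

    responses:    (orders, time) non-negative wavelet magnitude responses |W_k(t)|
    order_logits: (orders,) learnable logits; softmax gives the fractional-order weights.
    """
    w = torch.softmax(order_logits, dim=0)                  # convex weights over integer orders
    log_resp = torch.log(responses.clamp_min(1e-8))
    return torch.exp((w[:, None] * log_resp).sum(dim=0))    # exp( sum_k w_k * log |W_k(t)| )

responses = torch.rand(5, 200) + 1e-3                       # stand-in for 5 integer-order responses
order_logits = torch.zeros(5, requires_grad=True)
slt = fractional_superlet(responses, order_logits)
slt.sum().backward()                                        # gradients flow into the order weights
```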
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5129. Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration
๐ฌ 42
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Multi-agent System Federated Learning LLM-based Agent
Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel LLM-based multi-agent framework that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised generative agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of LLM-driven agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5130. Channel-Aware Mixed-Precision Quantization for Efficient Long-Context Inference
๐ฌ 10
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Long Context Efficiency
The key-value (KV) cache plays a vital role in accelerating autoregressive inference for large language models (LLMs). However, its linear memory growth with sequence length poses significant memory bottlenecks, especially in long-context scenarios.
Quantization offers a promising solution for memory efficiency. While existing methods typically apply channel-wise quantization to the key cache and token-wise quantization to the value cache, they suffer from severe performance degradation under low-bit configurations. Our analysis reveals that quantization sensitivity varies across individual KV channels, presenting an opportunity for non-uniform bit allocation. Following this finding, we propose ChanMix, a mixed-precision quantization framework that supports channel-wise quantization in the 2-bit setting with FP8 precision, implemented with a custom Triton kernel. To improve low-bit quantization performance, we introduce a channel-aware bit reallocation strategy, which allocates bits according to channel sensitivity.
Through extensive evaluation, ChanMix demonstrates superior performance across the NIAH, RULER, and InfiniteBench benchmarks for the Llama, Mistral, and Qwen model families, achieving improvements of at least 5 absolute percentage points on RULER compared to all baseline methods. Additionally, ChanMix enables a 2.3× increase in batch size and supports a 1.5× longer context length during inference.
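A hedged sketch of channel-aware bit allocation for a key cache. The sensitivity proxy (per-channel dynamic range), the two bit-widths, and the high-precision fraction are illustrative assumptions rather than the paper's criterion or kernel.

```python
import torch

def allocate_bits(key_cache: torch.Tensor, low_bits=2, high_bits=8, high_frac=0.1):
    """Give more bits to the most quantization-sensitive key channels.

    key_cache: (tokens, channels); sensitivity here is approximated by per-channel dynamic range.
    """
    sensitivity = key_cache.amax(dim=0) - key_cache.amin(dim=0)
    n_high = max(1, int(high_frac * key_cache.size(1)))
    bits = torch.full((key_cache.size(1),), low_bits)
    bits[sensitivity.topk(n_high).indices] = high_bits
    return bits

def quantize_channelwise(x, bits):
    """Uniform per-channel quantization at the allocated bit-widths (returns dequantized values)."""
    out = torch.empty_like(x)
    for c in range(x.size(1)):
        levels = 2 ** int(bits[c]) - 1
        lo, hi = x[:, c].min(), x[:, c].max()
        scale = (hi - lo).clamp_min(1e-8) / levels
        out[:, c] = torch.round((x[:, c] - lo) / scale) * scale + lo
    return out

k = torch.randn(128, 64)
k_hat = quantize_channelwise(k, allocate_bits(k))
```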
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5131. Bridging ML and algorithms: comparison of hyperbolic embeddings
๐ฌ 12
โญ 4.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ hyperbolic embeddings network theory social networks
Hyperbolic embeddings are well-studied both in the machine learning and algorithm community. However, as the research proceeds independently in those two communities, comparisons and even awareness seem to be currently lacking. We compare the performance (time needed to compute embeddings) and the quality of the embeddings obtained by the popular approaches, both on real-life hierarchies and networks and simulated networks. In particular, according to our results, the algorithm by Bläsius et al. (ESA 2016) is about 100 times faster than the Poincaré embeddings (NIPS 2017) and Lorentz embeddings (ICML 2018) by Nickel and Kiela, while achieving results of similar (or, in some cases, even better) quality.
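For reference, the quantity both families of methods ultimately optimize is the hyperbolic distance; a minimal Poincaré-ball distance in NumPy (points must lie strictly inside the unit ball):

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance in the Poincare ball: arccosh(1 + 2||u-v||^2 / ((1-||u||^2)(1-||v||^2)))."""
    nu, nv = np.dot(u, u), np.dot(v, v)
    delta = np.dot(u - v, u - v)
    return float(np.arccosh(1.0 + 2.0 * delta / ((1.0 - nu) * (1.0 - nv))))

u, v = np.array([0.1, 0.2]), np.array([-0.3, 0.05])
print(poincare_distance(u, v))
```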
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5132. Robust Multi-Objective Controlled Decoding of Large Language Models
๐ฌ 18
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Inference-time Alignment Robustness
We introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that robustly aligns Large Language Models (LLMs) to multiple human objectives (e.g., instruction-following, helpfulness, safety) by maximizing the worst-case rewards. RMOD formulates the robust decoding problem as a maximin two-player game between adversarially computed reward weights and the sampling policy, solvable through a Nash equilibrium. We demonstrate that this game reduces to a convex optimization problem to identify the worst-case reward weights, with the optimal sampling policy analytically derived. For practical applications, we propose an efficient algorithm of RMOD tailored for contemporary LLMs, introducing minimal computational overhead compared to standard non-robust Controlled Decoding methods. Experimental results across the range of popular alignment datasets with up to 10 objectives show the effectiveness of RMOD and its distilled version, consistently outperforming baselines in worst-case rewards and win rates.
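A hedged reading of the maximin formulation, written as a KL-regularized controlled-decoding game; the notation $r_i$, $\beta$, and $\Delta_K$ and the exact regularizer are assumptions inferred from the abstract rather than the paper's equations:

$$\max_{\pi}\ \min_{w \in \Delta_K}\ \mathbb{E}_{y \sim \pi(\cdot\mid x)}\Big[\sum_{i=1}^{K} w_i\, r_i(x, y)\Big] - \beta\, \mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big), \qquad \pi_{w}^{*}(y\mid x) \propto \pi_{\mathrm{ref}}(y\mid x)\,\exp\Big(\tfrac{1}{\beta}\sum_{i} w_i\, r_i(x, y)\Big).$$

For fixed weights $w$ the inner policy has the usual closed form shown on the right, so the game reduces to a convex problem over $w$ alone, matching the abstract's claim that the worst-case weights can be found by convex optimization.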
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5133. MOAI: Module-Optimizing Architecture for Non-Interactive Secure Transformer Inference
๐ฌ 5
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ fully homomorphic encryption secure transformer inference
Privacy concerns have been raised in Large Language Model (LLM) inference when models are deployed by Cloud Service Providers (CSPs). Homomorphic encryption (HE) offers a promising solution by enabling secure inference directly over encrypted inputs. However, the high computational overhead of HE remains a major bottleneck. To address this challenge, we propose MOAI, an efficient HE-based, non-interactive framework for secure transformer inference. MOAI gains significant efficiency improvements from: (1) a novel evaluation flow that combines column and diagonal packing with consistent strategies across all layers, eliminating expensive format conversions; (2) rotation-free algorithms for Softmax and LayerNorm that significantly reduce the number of costly HE rotations, removing 2448 HE rotations in BERT-base inference; (3) column packing that removes rotations in plaintext-ciphertext matrix multiplications, with interleaved batching further reducing the rotations in ciphertext-ciphertext matrix multiplications. MOAI uses at least 1.7x fewer HE rotations than the state-of-the-art works across all matrix multiplications of BERT-base. As a result, we achieve a 52.8\% reduction in evaluation time compared to the state-of-the-art HE-based non-interactive secure transformer inference, THOR (Moon et al., CCS'25). We then apply MOAI to Powerformer's framework and achieve a 55.7\% reduction in evaluation time compared to Powerformer (Park et al., ACL'25), which approximates Softmax and LayerNorm with simpler functions in the transformer and proposes HE-based non-interactive transformer inference. We report an amortized time of 2.36 minutes per input in a single-GPU environment. We show extensibility by applying MOAI to LLaMA-3-8B. Our implementation is publicly available as open source.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5134. Flow-based Conformal Prediction for Multi-dimensional Time Series
๐ฌ 19
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Conformal Prediction Time Series Prediction
Time series prediction underpins a broad range of downstream tasks across many scientific domains. Recent advances and increasing adoption of black-box machine learning models for time series prediction highlight the critical need for reliable uncertainty quantification. While conformal prediction has gained attention as a reliable uncertainty quantification method, conformal prediction for time series faces two key challenges: (1) adaptively leveraging correlations in features and non-conformity scores to overcome the exchangeability assumption, and (2) constructing prediction sets for multi-dimensional outcomes. To address these challenges jointly, we propose a novel conformal prediction method for time series using flow with classifier-free guidance. We provide coverage guarantees by establishing exact non-asymptotic marginal coverage and a finite-sample bound on conditional coverage for the proposed method. Evaluations on real-world time series datasets demonstrate that our method constructs significantly smaller prediction sets than existing conformal prediction methods while maintaining target coverage.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5135. Are we measuring oversmoothing in graph neural networks correctly?
๐ฌ 4
โญ 4.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ graph neural networks oversmoothing low-rank
Oversmoothing is a fundamental challenge in graph neural networks (GNNs): as the number of layers increases, node embeddings become increasingly similar, and model performance drops sharply. Traditionally, oversmoothing has been quantified using metrics that measure the similarity of neighbouring node features, such as the Dirichlet energy. We argue that these metrics have critical limitations and fail to reliably capture oversmoothing in realistic scenarios. For instance, they provide meaningful insights only for very deep networks, while typical GNNs show a performance drop already with as few as 10 layers. As an alternative, we propose measuring oversmoothing by examining the numerical or effective rank of the feature representations. We provide extensive numerical evaluation across diverse graph architectures and datasets to show that rank-based metrics consistently capture oversmoothing, whereas energy-based metrics often fail. Notably, we reveal that drops in the rank align closely with performance degradation, even in scenarios where energy metrics remain unchanged. Along with the experimental evaluation, we provide theoretical support for this approach, clarifying why Dirichlet-like measures may fail to capture performance drop and proving that the numerical rank of feature representations collapses to one for a broad family of GNN architectures.
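A minimal sketch of the two kinds of metric being contrasted: an entropy-based effective rank of the node features versus the classical Dirichlet energy, both computed on toy data (the specific effective-rank definition below is one common choice, not necessarily the paper's exact one).

```python
import torch

def effective_rank(features: torch.Tensor, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of node features; collapse toward 1 signals oversmoothing."""
    s = torch.linalg.svdvals(features)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return float(torch.exp(entropy))

def dirichlet_energy(features: torch.Tensor, edge_index: torch.Tensor) -> float:
    """Classical energy-based metric: sum of squared feature differences over edges."""
    src, dst = edge_index
    return float(((features[src] - features[dst]) ** 2).sum())

x = torch.randn(100, 16)                       # node embeddings after some GNN layer
edges = torch.randint(0, 100, (2, 300))        # random edge index for illustration
print(effective_rank(x), dirichlet_energy(x, edges))
```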
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5136. Culture in Action: Evaluating Text-to-Image Models through Social Activities
๐ฌ 14
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Text-to-Image Models Cross Cultural Evaluation Fairness Bias Social Activities
Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing textโimage measures.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5137. U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs
๐ฌ 24
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Universal multimodal retrieval Multimodal large language models Embedding
Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on this, we explore various aspects of the details in embedding generation and training strategies, including progressive transition, hard negative mining and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Building on these discoveries, we introduce a unified framework termed U-MARVEL (Universal MultimodAl RetrieVal via Embedding Learning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings, and also exhibits strong zero-shot performance on several tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across various embedding-based retrieval tasks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5138. Robust Adaptive Multi-Step Predictive Shielding
๐ฌ 20
โญ 4.00
๐ reinforcement learning
๐ท๏ธ Safe Reinforcement Learning Control Barrier functions Model Predictive shielding
Reinforcement learning for safety-critical tasks requires policies that are both high-performing and safe throughout the learning process. While model-predictive shielding is a promising approach, existing methods are often computationally intractable for the high-dimensional, nonlinear systems where deep RL excels, as they typically rely on a patchwork of local models. We introduce **RAMPS**, a scalable shielding framework that overcomes this limitation by leveraging a learned, linear representation of the environment's dynamics. This model can range from a linear regression in the original state space to a more complex operator learned in a high-dimensional feature space. The key is that this linear structure enables a robust, look-ahead safety technique based on a *multi-step Control Barrier Function (CBF)*. By moving beyond myopic one-step formulations, **RAMPS** accounts for model error and control delays to provide reliable, real-time interventions. The resulting framework is minimally invasive, computationally efficient, and built upon robust control-theoretic foundations. Our experiments demonstrate that **RAMPS** significantly reduces safety violations compared to existing safe RL methods while maintaining high task performance in complex control environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5139. TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding
๐ฌ 24
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Table Understanding Large Language Models
Modeling semantic and structural information from tabular data remains a core challenge for effective table understanding. Existing Table-as-Text approaches flatten tables for large language models (LLMs), but lose crucial structural cues, while Table-as-Image methods preserve structure yet struggle with precise semantics. Recent Table-as-Multimodality strategies attempt to combine textual and visual views, but they (1) statically process both modalities for every query-table pair within large multimodal LLMs (MLLMs), inevitably introducing redundancy and even conflicts, and (2) depend on costly fine-tuning of MLLMs. In light of this, we propose TableDART, a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. TableDART introduces a lightweight 2.59M-parameter MLP gating network that dynamically selects the optimal path (Text-only, Image-only, or Fusion) for each table-query pair, reducing redundancy and avoiding conflicts that arise when textual and visual views of the same table provide inconsistent cues. By routing to the most appropriate view, our framework improves both accuracy and efficiency. In addition, we propose a novel agent to mediate cross-modal knowledge integration by analyzing outputs from text- and image-based models, either selecting the best result or synthesizing a new answer through reasoning. This design avoids the prohibitive costs of full MLLM fine-tuning. Extensive experiments on seven benchmarks show that TableDART establishes new state-of-the-art performance among open-source models, surpassing the strongest baseline by an average of 4.02%. The code is available at: https://anonymous.4open.science/r/TableDART-C52B.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5140. Let's (not) just put things in Context: Test-time Training for Long-context LLMs
๐ฌ 10
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ long-context language models test-time training inference-time scaling
Advances in training and architectural design have enabled LLMs with million-token context windows, yet in practice these models often read far more than they can reliably use. While inference-time compute scaling, typically via "thinking tokens", can help on short multi-step reasoning tasks, our controlled long-context experiments show rapidly diminishing returns that collapse as context grows. We trace this to score dilution in static self-attention and prove that, in such regimes, decoding more tokens cannot reliably recover buried evidence. We propose query-only test-time training (qTTT): a cache-preserving adaptation that performs a single prefill to fix keys/values and then applies a handful of gradient updates to the query projections. qTTT provably increases the target-distractor margin and, empirically, delivers consistent gains across model sizes and benchmarks. On Qwen3-4B, qTTT improves average accuracy by +12.6 and +14.1 absolute points on LongBench-v2 and ZeroSCROLLS, respectively. The practical takeaway is simple: for long contexts, spending a small inference-time budget on context-specific adaptation is a more effective use of compute than generating additional thinking tokens.
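A hedged toy sketch of the mechanism: a single prefill fixes the KV cache, then a few gradient steps update only the query projection. The single-head attention module and the placeholder adaptation loss are illustrative assumptions, not the authors' objective.

```python
import torch
import torch.nn as nn

class TinyAttn(nn.Module):
    """Single-head attention reading from a frozen KV cache."""
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.k_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)

    def forward(self, x, k_cache, v_cache):
        q = self.q_proj(x)
        attn = torch.softmax(q @ k_cache.T / k_cache.size(-1) ** 0.5, dim=-1)
        return attn @ v_cache

d, ctx_len = 32, 512
attn = TinyAttn(d)
context = torch.randn(ctx_len, d)
with torch.no_grad():                              # single prefill: keys/values computed once and frozen
    k_cache, v_cache = attn.k_proj(context), attn.v_proj(context)

for p in attn.parameters():
    p.requires_grad_(False)
attn.q_proj.weight.requires_grad_(True)            # adapt only the query projection
opt = torch.optim.SGD([attn.q_proj.weight], lr=1e-3)

query_tokens = torch.randn(8, d)
for _ in range(5):                                 # a handful of test-time gradient steps
    out = attn(query_tokens, k_cache, v_cache)
    loss = ((out - query_tokens) ** 2).mean()      # placeholder self-supervised objective (assumption)
    opt.zero_grad()
    loss.backward()
    opt.step()
```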
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5141. Causally Robust Preference Learning with Reasons
๐ฌ 11
โญ 4.00
๐ causal reasoning
๐ท๏ธ Preference-based learning causal confusion learning from human feedback
Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because identical rationales can arise across multiple tasks (e.g., "_it avoids collisions with a fragile object_"), ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning.
Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5142. LLMS ON TRIAL: Evaluating Judicial Fairness For Large Language Models
๐ฌ 12
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Fairness LLM-as-judge
Large Language Models (LLMs) are increasingly used in high-stakes fields, such as law, where their decisions can directly impact people's lives. When LLMs act as judges, the ability to fairly resolve judicial issues is necessary to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. We further compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics (inconsistency, bias, and imbalanced inaccuracy) and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness.
Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit to support future research in evaluating and improving LLM fairness, along with a full technical analysis included as an appendix.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5143. Test-Time Adaptation without Source Data for Out-of-Domain Bioactivity Prediction
๐ฌ 27
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ out-of-domain bioactivity prediction source data-absent test-time adaptation
Accurate prediction of protein-ligand bioactivity is a cornerstone of modern drug discovery, yet current deep learning methods often struggle with out-of-domain (OOD) generalization. The existing methods rely on access to source data, making them impractical in scenarios where data cannot be accessed due to confidentiality, privacy concerns, or intellectual property restrictions. In this paper, we provide the first exploration of a more realistic setting for bioactivity prediction, where models are expected to adapt to out-of-domain distributions without access to source data. Motivated by the critical role of binding-relevant interactions in determining ligand-protein bioactivity, we introduce an uncertainty-weighted consistency strategy, in which original samples with high confidence guide their augmented counterparts by minimizing feature distance. This encourages the model to focus on informative interaction regions while suppressing reliance on spurious or non-causal substructures. To further enhance representation discriminability and prevent feature collapse, we integrate a contrastive optimization objective that pulls together augmented views of the same complex and pushes away views from different complexes. Together, these two components enable the learning of invariant, bioactivity-aware representations, allowing robust adaptation under distribution shifts. Extensive experiments across DTIGN, SIU 0.6, and DrugOOD demonstrate that our framework consistently outperforms state-of-the-art baselines under scaffold-, protein-, and assay-based OOD settings. Especially on the eight subsets of DTIGN, it improves Pearson's $R$ by 8.2\% and Kendall's Tau $\tau$ by 5.8\% on average over the best baseline, underscoring its effectiveness as a source data-absent solution for OOD bioactivity prediction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5144. HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection
๐ฌ 39
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ HSIC Bottleneck;Cross-Generator Synthetic Image Detection;Domain-Incremental Learning
Synthetic image generators evolve rapidly, challenging detectors to generalize across current methods and adapt to new ones. We study domain-incremental synthetic image detection with a two-phase evaluation. Phase I trains on either diffusion- or GAN-based data and tests on the combined group to quantify bidirectional cross-generator transfer. Phase II sequentially introduces renders from 3D Gaussian Splatting (3DGS) head avatar pipelines, requiring adaptation while preserving earlier performance. We observe that CLIP-based detectors inherit text-image alignment semantics that are irrelevant to authenticity and hinder generalization. We introduce a Hilbert-Schmidt Independence Criterion (HSIC) bottleneck loss on intermediate CLIP ViT features, encouraging representations predictive of real versus synthetic while independent of generator identity and caption alignment. For domain-incremental learning, we propose HSIC-Guided Replay (HGR), which selects per-class exemplars via a hybrid score combining HSIC relevance with k-center coverage, yielding compact memories that mitigate forgetting. Empirically, the HSIC bottleneck improves transfer between diffusion and GAN families, and HGR sustains prior accuracy while adapting to 3DGS renders. These results underscore the value of information-theoretic feature shaping and principled replay for resilient detection under shifting generative regimes.
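A hedged sketch of the penalty at the core of the method: a standard biased empirical HSIC estimator with RBF kernels, used here to discourage dependence between intermediate features and a generator-identity label. The bandwidth, batch size, and label encoding are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hsic(x: torch.Tensor, z: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased empirical HSIC with RBF kernels: trace(K H L H) / (n - 1)^2."""
    n = x.size(0)

    def rbf(a):
        d2 = torch.cdist(a, a) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    H = torch.eye(n) - torch.ones(n, n) / n
    K, L = rbf(x), rbf(z)
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

feat = torch.randn(64, 128, requires_grad=True)                 # stand-in for intermediate ViT features
gen_id = F.one_hot(torch.randint(0, 3, (64,)), 3).float()       # generator-identity labels
penalty = hsic(feat, gen_id)                                    # minimized alongside the real/fake loss
penalty.backward()
```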
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5145. Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning
๐ฌ 20
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ machine learning privacy
In machine learning, data curation is used to select the most valuable data for improving both model accuracy and computational efficiency. Recently, curation has also been explored as a solution for private machine learning: rather than training directly on sensitive data, which is known to leak information through model predictions, the private data is used only to guide the selection of useful public data. The resulting model is then trained solely on curated public data. It is tempting to assume that such a model is privacy-preserving because it has never seen the private data. Yet, we show that without further protection, curation pipelines can still leak private information. Specifically, we introduce novel attacks against popular curation methods, targeting every major step: the computation of curation scores, the selection of the curated subset, and the final trained model. We demonstrate that each stage reveals information about the private dataset, and that even models trained exclusively on curated public data leak membership information about the private data that guided curation. These findings highlight the inherent privacy risks in data curation that were previously overlooked, and suggest that (1) in the context of curation, privacy analysis must extend beyond the training procedure to include the data selection process, and (2) true privacy-preserving curation will require new methods with formal privacy guarantees.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5146. Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
๐ฌ 28
โญ 4.00
๐ optimization
๐ท๏ธ Automated control policy discovery Evolutionary computation Multimodal large language models
Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called **M**ultimodal Large **L**anguage Model-assisted **E**volutionary **S**earch (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5147. Reconstructing KV Caches with Cross-Layer Fusion for Enhanced Transformers
๐ฌ 42
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ KV Cache Pretraining LLM
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA.
To understand the root cause, we investigate the information flow of keys and values in the top layers.
Our preliminary analysis reveals a clear pattern: values are predominantly derived from the bottom layer, while keys draw more information from both the bottom and middle layers.
Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers.
This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings.
To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys.
Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity.
In experiments on LLMs ranging from 332M to 4B parameters, our proposed methods reduce KV cache memory by 50\% while achieving lower validation perplexity than the standard Transformer decoder, establishing them as memory-efficient, high-performance architectural alternatives. We have made our Triton implementation available.
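A hedged sketch of cross-layer KV fusion: a top layer's keys and values formed as a learnable mix of cached bottom- and middle-layer ones. The per-dimension sigmoid gates are an assumption made for illustration, not the paper's exact fusion operator.

```python
import torch
import torch.nn as nn

class CrossLayerKVFusion(nn.Module):
    """Top-layer K/V as a learnable convex combination of bottom- and middle-layer caches."""
    def __init__(self, d):
        super().__init__()
        self.gate_k = nn.Parameter(torch.zeros(d))
        self.gate_v = nn.Parameter(torch.zeros(d))

    def forward(self, k_bottom, k_middle, v_bottom, v_middle):
        ak, av = torch.sigmoid(self.gate_k), torch.sigmoid(self.gate_v)
        k_top = ak * k_bottom + (1 - ak) * k_middle   # would operate on post-RoPE keys
        v_top = av * v_bottom + (1 - av) * v_middle
        return k_top, v_top

fuse = CrossLayerKVFusion(64)
kb, km, vb, vm = (torch.randn(128, 64) for _ in range(4))   # cached lower-layer K/V
k_top, v_top = fuse(kb, km, vb, vm)                         # no separate KV cache stored for the top layer
```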
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5148. Equivariant Splitting: Self-supervised learning from incomplete data
๐ฌ 21
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ inverse problems self-supervised imaging equivariant neural networks
Self-supervised learning for inverse problems makes it possible to train a reconstruction network from noisy and/or incomplete data alone. These methods have the potential of enabling learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in unbiased estimates of the supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, sparse-view computed tomography, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5149. Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
๐ฌ 27
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ VLMs Visual Understanding
Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5\% to 82.8\% under the $2 \times 2$ setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1\%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. Code and datasets will be released soon.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5150. INTIMA: A Benchmark for Human-AI Companionship Behavior
๐ฌ 10
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ AI companionship benchmark
AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with both positive and concerning implications. We introduce the Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o4-mini, GPT5-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5151. Mechanistic Detection and Mitigation of Hallucination in Large Reasoning Models
๐ฌ 24
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Reasoning Hallucination Mechanistic Interpretability
Large Reasoning Models (LRMs) have shown impressive capabilities in multi-step reasoning tasks. However, alongside these successes, a more deceptive form of model error has emerged: **Reasoning Hallucination**, where logically coherent but factually incorrect reasoning traces lead to persuasive yet faulty conclusions. Unlike traditional hallucinations, these errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. In this work, we investigate reasoning hallucinations from a mechanistic perspective. We propose the **Reasoning Score**, which quantifies the depth of reasoning by measuring the divergence between logits obtained from projecting late layers of LRMs to the vocabulary space, effectively distinguishing shallow pattern-matching from genuine deep reasoning. Using this score, we conduct an in-depth analysis on the ReTruthQA dataset and identify two key reasoning hallucination patterns: early-stage fluctuation in reasoning depth and incorrect backtracking to flawed prior steps. These insights motivate our **R**easoning **H**allucination **D**etection (**RHD**) framework, which achieves state-of-the-art performance across multiple domains. To mitigate reasoning hallucinations, we further introduce **GRPO-R**, an enhanced reinforcement learning algorithm that incorporates step-level deep reasoning rewards via potential-based shaping. Our theoretical analysis establishes stronger generalization guarantees, and experiments demonstrate improved reasoning quality and reduced hallucination rates.
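A hedged sketch of a logit-lens style score in the spirit described above: project two late hidden states through the unembedding and measure the divergence between the resulting vocabulary distributions. The layer choice, KL direction, and toy shapes are illustrative assumptions, not the paper's exact Reasoning Score.

```python
import torch
import torch.nn.functional as F

def late_layer_divergence(hidden_states, unembed):
    """Mean KL divergence between vocabulary distributions of the last two layers ("logit lens")."""
    h_prev, h_last = hidden_states[-2], hidden_states[-1]     # each (seq, d_model)
    p = F.log_softmax(h_prev @ unembed, dim=-1)
    q = F.log_softmax(h_last @ unembed, dim=-1)
    kl = F.kl_div(p, q, log_target=True, reduction="none").sum(-1)   # KL(last-layer || prev-layer)
    return kl.mean()                                           # low divergence ~ shallow pattern matching

d_model, vocab = 64, 1000
hs = [torch.randn(10, d_model) for _ in range(4)]              # stand-in per-layer hidden states
W_U = torch.randn(d_model, vocab)                              # stand-in unembedding matrix
print(float(late_layer_divergence(hs, W_U)))
```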
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5152. REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?
๐ฌ 26
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Robot task planning vagueness LLMs
Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks. Although recent large language model (LLM)-based task planners achieve amazing performance, they assume that human instructions are clear and straightforward. However, real-world users are not experts, and their instructions to robots often contain significant vagueness. Linguists suggest that such vagueness frequently arises from referring expressions (REs), whose meanings depend heavily on dialogue context and environment. This vagueness is even more prevalent among the elderly and children, whom robots should serve more. This paper studies how such vagueness in REs within human instructions affects LLM-based robot task planning and how to overcome this issue. To this end, we propose the first robot task planning benchmark with vague REs (REI-Bench), where we discover that the vagueness of REs can severely degrade robot planning performance, leading to success rate drops of up to 36.9\%. We also observe that most failure cases stem from missing objects in planners. To mitigate the REs issue, we propose a simple yet effective approach: task-oriented context cognition, which generates clear instructions for robots, achieving state-of-the-art performance compared to aware prompts, chains of thought, and in-context learning. By tackling the overlooked issue of vagueness, this work contributes to the research community by advancing real-world task planning and making robots more accessible to non-expert users, e.g., the elderly and children.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5153. Arbitrary Generative Video Interpolation
๐ฌ 12
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Frame Interpolation ROPE Video Generation
Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesizing a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we develop comprehensive benchmarks for multi-scale frame interpolation (2× to 32×) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity.
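A hedged sketch of timestamp-aware rotary embedding: the rotation angles are driven by continuous (normalized) frame timestamps instead of integer frame indices. The timestamp scaling, base, and feature dimensions are illustrative assumptions.

```python
import torch

def rope_at_timestamps(x: torch.Tensor, timestamps: torch.Tensor, base: float = 10000.0):
    """Apply rotary position embedding at continuous timestamps.

    x:          (frames, d) features, d even
    timestamps: (frames,) continuous positions (e.g., normalized timestamps times a scale)
    """
    d = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    ang = timestamps[:, None] * inv_freq[None, :]          # (frames, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

frames = torch.randn(5, 64)
t = torch.tensor([0.0, 0.17, 0.33, 0.66, 1.0]) * 100.0     # arbitrary normalized timestamps, rescaled
rotated = rope_at_timestamps(frames, t)
```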
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5154. GRAM-DTI: Adaptive Multimodal Representation Learning for DrugโTarget Interaction Prediction
๐ฌ 32
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Drug-target interaction prediction Multimodal representation learning Adaptive modality dropout
Drug–target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES–protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. Inspired by recent successes in multimodal molecular property prediction, we introduce GRAM-DTI, a pre-training framework that integrates multimodal small molecule and protein inputs into a unified representation. GRAM-DTI extends volume-based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle modality informativeness, we propose adaptive modality dropout, dynamically regulating each modality's contribution during pretraining. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAM-DTI consistently outperforms state-of-the-art baselines. Our results highlight the benefits of higher-order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5155. Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning
๐ฌ 22
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ GraphRAG Schema Complex QA
Graph retrieval-augmented generation (GraphRAG) has effectively enhanced large language models in complex reasoning by organizing fragmented knowledge into explicitly structured graphs. Prior efforts have been made to improve either graph construction or graph retrieval in isolation, yielding suboptimal performance, especially when domain shifts occur. In this paper, we propose a vertically unified agentic paradigm, $\texttt{UniGraphRAG}$, to jointly connect the entire framework as an intricate integration. Specifically, $(i)$ a seed graph schema is introduced to bound the automatic extraction agent with targeted entity types, relations and attribute types, also continuously expanded for scalability over unseen domains; $(ii)$ To obtain higher-level knowledge upon the schema, we develop novel dually-perceived community detection, fusing structural topology with subgraph semantics for comprehensive knowledge organization. This naturally yields a hierarchical knowledge tree that supports both top-down filtering and bottom-up reasoning with community summaries; $(iii)$ An agentic retriever is designed to interpret the same graph schema to transform complex queries into tractable and parallel sub-queries. It iteratively performs reflection for more advanced reasoning; $(iv)$ To alleviate the knowledge leakage problem in pre-trained LLMs, we propose a tailored anonymous dataset and a novel 'Anonymity Reversion' task that deeply measures the real performance of GraphRAG frameworks. Extensive experiments across six challenging benchmarks demonstrate the robustness of $\texttt{UniGraphRAG}$, remarkably moving the Pareto frontier of performance and efficiency with up to 33.6% cost saving and 16.62% higher accuracy over state-of-the-art baselines. The results indicate the adaptability of our framework, allowing seamless domain transfer with minimal intervention on the schema.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5156. Polynomial Convergence of Riemannian Diffusion Models
๐ฌ 8
โญ 4.00
๐ learning theory
๐ท๏ธ diffusion model Riemannian manifold gradient estimate
Diffusion generative models have demonstrated remarkable empirical success in recent years and are now considered the state-of-the-art generative models in modern AI. These models consist of a forward process, which gradually diffuses the data distribution to a noise distribution spanning the whole space, and a backward process, which inverts this transformation to recover the data distribution from noise. Most of the existing literature assumes that the underlying space is Euclidean. However, in many practical applications, the data are constrained to lie on a submanifold of Euclidean space. Addressing this setting, de Bortoli et al. (2022) introduced Riemannian diffusion models and proved that using an exponentially small step size yields small sampling error in Wasserstein distance, provided the data distribution is smooth and strictly positive.
In this paper, we prove that a *polynomially small step size* suffices to guarantee small sampling error in total variation distance, without any assumption on the smoothness or positivity of the data distribution. Our analysis only requires mild and standard curvature assumptions on the underlying manifold. The main ingredients in our proof are the Li-Yau estimate for the log-gradient of the heat kernel and the Minakshisundaram-Pleijel parametrix expansion for the perturbed heat equation. Our approach opens the door to a sharper analysis of diffusion models on non-Euclidean spaces.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5157. E²LoRA: Efficient and Effective Low-Rank Adaptation with Entropy-Guided Adaptive Sharing
๐ฌ 22
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ LoRA PEFT
As large pre-trained models rapidly scale, Parameter-Efficient Fine-Tuning (PEFT) through methods like Low-Rank Adaptation (LoRA) becomes increasingly crucial. While LoRA has emerged as a cornerstone of PEFT, excelling at preserving performance with minimal additional parameters, exploring parameter-sharing mechanisms of LoRA remains critical to pushing efficiency boundaries. However, existing naive LoRA sharing methods often degrade performance due to sacrificed representational diversity and weakened model expressiveness. To overcome this issue, we conduct an in-depth analysis of pre-trained models using gradient-based proxy entropy, and uncover two critical, previously overlooked properties: Local Similarity and Layer-wise Information Heterogeneity. Building on these insights, we propose E²LoRA, a novel dual-adaptive sharing framework. It enables adaptive sharing interval partitioning, guided by inter-layer proxy entropy similarity, and adaptive rank allocation, informed by layer-wise absolute proxy entropy. This unique design leverages inherently informative properties of pre-trained models to significantly reduce parameter redundancy while maintaining or enhancing expressiveness. Comprehensive evaluations across diverse tasks, modalities, and models consistently demonstrate that E²LoRA achieves an excellent balance of efficiency and effectiveness, consistently matching or surpassing baselines with approximately 50% fewer trainable parameters.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5158. CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?
๐ฌ 28
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Embodied Urban Navigation Vision-Language Models Urban Intelligence Spatial Cognition
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required for tackling "last-mile" navigation challenges.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5159. Learning Explicit Single-Cell Dynamics Using ODE Representations
๐ฌ 24
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ AI4Science AI4Biology gene interaction discovery single-cell dynamics dynamical systems
Modeling the dynamics of cellular differentiation is fundamental to advancing the understanding and treatment of diseases associated with this process, such as cancer. With the rapid growth of single-cell datasets, this has also become a particularly promising and active domain for machine learning.
Current state-of-the-art models, however, rely on computationally expensive optimal transport preprocessing and multi-stage training, while also not discovering explicit gene interactions.
To address these challenges we propose Cell-Mechanistic Neural Networks (*Cell-MNN*), an encoder-decoder architecture whose latent representation is a *locally linearized ODE* governing the dynamics of cellular evolution from stem to tissue cells. Cell-MNN is fully end-to-end (besides a standard PCA pre-processing) and its ODE representation explicitly learns biologically consistent and interpretable gene interactions.
Empirically, we show that Cell-MNN achieves competitive performance on single-cell benchmarks, surpasses state-of-the-art baselines in scaling to larger datasets and joint training across multiple datasets, while also learning interpretable gene interactions that we validate against the TRRUST database of gene interactions.
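The following is a minimal sketch of a latent ODE with locally linearized dynamics of the kind described above, where the state-dependent matrix A(z) can be read off as (local) interaction strengths. The two-layer network producing A(z) and the explicit Euler integrator are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LocallyLinearODE(nn.Module):
    """Latent dynamics dz/dt = A(z) z, with A(z) produced by a small network.

    The entries of A(z) can be inspected as (local) interaction strengths
    between latent dimensions (e.g. gene programs after a PCA step).
    """
    def __init__(self, d, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, d * d)
        )
        self.d = d

    def interaction_matrix(self, z):
        return self.net(z).view(-1, self.d, self.d)     # (B, d, d)

    def forward(self, z, dt=0.1, steps=10):
        # Simple explicit Euler integration of the locally linear ODE.
        for _ in range(steps):
            A = self.interaction_matrix(z)               # (B, d, d)
            dz = torch.bmm(A, z.unsqueeze(-1)).squeeze(-1)
            z = z + dt * dz
        return z

model = LocallyLinearODE(d=20)        # illustrative latent dimension
z0 = torch.randn(8, 20)               # e.g. PCA-reduced cell states
zT = model(z0)
A_local = model.interaction_matrix(z0)  # inspect local interactions
```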
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5160. World2Minecraft: Occupancy-Driven simulated scenes Construction
๐ฌ 23
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Embodied AI; World2Minecraft; 3D Semantic Occupancy Prediction; MinecraftOcc Dataset; Vision-Language Navigation
Embodied intelligence requires high-fidelity simulation environments to support perception and decision-making, yet existing platforms often suffer from data contamination and limited flexibility. To mitigate this, we propose World2Minecraft to convert real-world scenes into structured Minecraft environments based on 3D semantic occupancy prediction. In the reconstructed scenes, we can effortlessly perform downstream tasks such as Vision-Language Navigation (VLN). However, we observe that reconstruction quality heavily depends on accurate occupancy prediction, which remains limited by data scarcity and poor generalization in existing models. We introduce a low-cost, automated, and scalable data acquisition pipeline for creating customized occupancy datasets, and demonstrate its effectiveness through MinecraftOcc, a large-scale dataset featuring 100,165 images from 156 richly detailed indoor scenes. Extensive experiments show that our dataset provides a critical complement to existing datasets and poses a significant challenge to current SOTA methods. These findings contribute to improving occupancy prediction and highlight the value of World2Minecraft in providing a customizable and editable platform for personalized embodied AI research. We will publicly release the dataset and the complete generation framework to ensure reproducibility and encourage future work.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5161. Prediction with Expert Advice under Local Differential Privacy
๐ฌ 14
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ privacy differential privacy online learning online linear optimization local differential privacy
We study the classic problem of prediction with expert advice under the constraint of local differential privacy (LDP). In this context, we first show that a classical algorithm naturally satisfies LDP and then design two new algorithms that improve it: RW-AdaBatch and RW-Meta. For RW-AdaBatch, we exploit the limited-switching behavior induced by LDP to provide a novel form of privacy amplification that grows stronger on easier data, analogous to the shuffle model in offline learning. Drawing on the theory of random walks, we prove that this improvement carries essentially no utility cost. For RW-Meta, we develop a general method for privately selecting between experts that are themselves non-trivial learning algorithms, and we show that in the context of LDP this carries no extra privacy cost. In contrast, prior work has only considered data-independent experts. We also derive formal regret bounds that scale inversely with the degree of independence between experts. Our analysis is supplemented by evaluation on real-world data reported by hospitals during the COVID-19 pandemic; RW-Meta outperforms both the classical baseline and a state-of-the-art \textit{central} DP algorithm by 1.5-3$\times$ on the task of predicting which hospital will report the highest density of COVID patients each week.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5162. Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition
๐ฌ 10
โญ 4.00
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Superposition Linear Representation Hypothesis Feature Geometry Feature Manifold
Recent advances in mechanistic interpretability have shown that many features represented by deep learning models can be captured by dictionary learning approaches such as sparse autoencoders. However, our understanding of the structures formed by these internal representations is still limited. Initial "toy-model" analyses showed that in an idealized setting features can be arranged in local structures, such as small regular polytopes, through a phenomenon known as _superposition_. However, these local structures have not been observed in real language models. In contrast, language models display rich structures, such as semantically clustered representations or ordered circles for the months of the year, which are not predicted by current theories. In this work, we introduce Bag-of-Words Superposition (BOWS), a framework in which autoencoders (AEs) with a non-linearity are trained to compress sparse, binary bag-of-words vectors drawn from Internet-scale text.
Our framework reveals that under restrictive bottlenecks, or when trained with weight decay, non-linear AEs linearly encode the low-rank structure in the data, arranging feature representations according to their co-activation patterns. This _linear superposition_ gives rise to structures like ordered circles and semantic clusters, similar to those observed in language models. Our findings suggest that the semantically meaningful structures observed in language models could arise from compression alone, without necessarily having a functional role beyond efficiently arranging feature representations.
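Below is a minimal sketch of the kind of setup the abstract describes: a small non-linear autoencoder compressing sparse binary bag-of-words vectors through a narrow bottleneck with weight decay. Vocabulary size, bottleneck width, and the synthetic data are placeholders; real co-occurrence statistics from web text are what produce the geometry of interest.

```python
import torch
import torch.nn as nn

vocab, bottleneck = 1000, 16            # placeholder sizes

class BoWAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(vocab, bottleneck)
        self.dec = nn.Linear(bottleneck, vocab)

    def forward(self, x):
        h = torch.relu(self.enc(x))     # non-linearity on the bottleneck
        return self.dec(h)

model = BoWAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(1000):
    # Sparse binary bag-of-words batch; in practice drawn from web text,
    # so that word co-activation statistics are realistic.
    x = (torch.rand(256, vocab) < 0.01).float()
    loss = loss_fn(model(x), x)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The geometry of interest lives in the encoder weights: features that
# co-activate in the data end up arranged according to their correlations.
feature_directions = model.enc.weight.detach()   # (bottleneck, vocab)
```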
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5163. HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
๐ฌ 16
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Human Evaluation Embeddings MTEB Benchmarking NLP Multilingual
Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, though with substantial variation: models reach high performance on some datasets while struggling on notably low-resource languages. Our human annotation also reveals multiple notable dataset issues. We also benchmark nine LLMs as annotators, finding they fall short of human performance (76.1% vs. 81.2%) despite offering scalability advantages. We provide human performance baselines, insights into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of model scores and informs the development of both models and benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5164. Many Eyes, One Mind: Temporal Multi-Perspective and Progressive Distillation for Spiking Neural Networks
๐ฌ 21
โญ 4.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Spiking Neural Networks Knowledge Distillation Neuromorphic Computing
Spiking Neural Networks (SNNs), inspired by biological neurons, are attractive for their event-driven energy efficiency but still fall short of Artificial Neural Networks (ANNs) in accuracy. Knowledge distillation (KD) has emerged as a promising approach to narrow this gap by transferring ANN knowledge into SNNs. Temporal-wise distillation (TWD) leverages the temporal dynamics of SNNs by providing supervision across timesteps, but it applies a constant teacher output to all timesteps, mismatching the inherently evolving temporal process of SNNs. Moreover, while TWD improves per-timestep accuracy, truncated inference still suffers from full-length temporal information loss due to the progressive accumulation process. We propose **MEOM** (**M**any **E**yes, **O**ne **M**ind), a unified KD framework that enriches supervision with diverse temporal perspectives through mask-weighted teacher features and progressively aligns truncated predictions with the full-length prediction, thereby enabling more reliable inference across all timesteps. Extensive experiments and theoretical analyses demonstrate that MEOM achieves state-of-the-art performance on multiple benchmarks. Code will be released on GitHub.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5165. Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding
๐ฌ 35
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Multimodal Modeling GraphโLLM Alignment Molecule Understanding Backbone-Free Tuning
Molecular understanding is central to advancing areas such as scientific and drug discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph–LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, which was originally designed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization.
We introduce **EDT-Former**, an **E**ntropy-guided **D**ynamic **T**oken Trans**former** that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient finetuning, and achieves state-of-the-art results on MoleculeQA, Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5166. Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
๐ฌ 5
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ representation learning for language datasets and benchmarks reward modeling reinforcement learning natural language processing large language models reasoning alignment
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce $\textbf{Rubrics as Rewards (\textit{RaR})}$, an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to 31\% on HealthBench and 7\% on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.
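As a concrete (and deliberately simplified) picture of rubric-based rewards, the sketch below turns per-criterion judge verdicts into a scalar reward via a weighted fraction of satisfied criteria. The paper compares several aggregation strategies; this particular weighting and the hard-coded verdicts are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str       # e.g. "mentions contraindications"
    weight: float        # importance of this criterion
    satisfied: bool      # judge's verdict for the model response

def rubric_reward(items: list[RubricItem]) -> float:
    """Weighted fraction of satisfied rubric criteria, in [0, 1]."""
    total = sum(it.weight for it in items)
    if total == 0:
        return 0.0
    return sum(it.weight for it in items if it.satisfied) / total

# Example: an instance-specific rubric scored by an LLM judge
# (verdicts hard-coded here purely for illustration).
rubric = [
    RubricItem("states the correct diagnosis", 2.0, True),
    RubricItem("recommends appropriate follow-up", 1.0, True),
    RubricItem("avoids unsupported claims", 1.0, False),
]
print(rubric_reward(rubric))   # 0.75 -> used as the reward for this rollout
```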
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5167. KernelFusion: Zero-Shot Blind Super-Resolution via Patch Diffusion
๐ฌ 17
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Kernel Estimation Super Resolution
Traditional super-resolution (SR) methods assume an "ideal" downscaling SR-kernel (e.g., bicubic downscaling) between the high-resolution (HR) image and the low-resolution (LR) image. Such methods fail once the LR images are generated differently. Current blind-SR methods aim to remove this assumption, but are still fundamentally restricted to rather simplistic downscaling SR-kernels (e.g., anisotropic Gaussian kernels), and fail on more complex (out of distribution) downscaling degradations. However, using the correct SR-kernel is often more important than using a sophisticated SR algorithm. In "KernelFusion" we introduce a zero-shot diffusion-based method that uses an unrestricted kernel. Our method recovers the unique image-specific SR-kernel directly from the LR input image, while simultaneously recovering its corresponding HR image. KernelFusion exploits the principle that the correct SR-kernel is the one that maximizes patch similarity across different scales of the LR image. We first train an image-specific patch-based diffusion model on the single LR input image, capturing its unique internal patch statistics. We then reconstruct a larger HR image with the same learned patch distribution, while simultaneously recovering the correct downscaling SR-kernel that maintains this cross-scale relation between the HR and LR images. Empirical results demonstrate that KernelFusion handles complex downscaling degradations where existing Blind-SR methods fail, achieving robust kernel recovery and superior SR quality. By breaking free from predefined kernel assumptions and training distributions, KernelFusion establishes a new paradigm of zero-shot Blind-SR that can handle unrestricted, image-specific kernels previously thought impossible.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5168. Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization
๐ฌ 25
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ safety alignment
As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is important to ensure their behaviors align with human values, societal norms, and ethical principles. However, safety alignment under Reinforcement Learning (RL) often suffers from forgetting learned general abilities, which is also known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework for LLM safety alignment while preserving their core abilities. The safety policy gradients are geometrically projected into the null space of general tasks, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model's original core capabilities, while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms existing methods by a large margin, achieving state-of-the-art safety performance without sacrificing accuracy on general tasks, including math, code, and instruction-following tasks.
Notably, NSPO is data-efficient, requiring only 40% of the public human-annotated safety data from PKU-SafeRLHF to achieve promising safety performance, without the large amount of mixed general-task data used by existing alignment methods.
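A minimal sketch of the underlying geometric operation, projecting a safety gradient onto the (approximate) null space of general-task gradients, is given below. Flattening parameters into a single vector and using a truncated SVD basis are simplifications; the paper's exact projection and its theoretical guarantees are not reproduced here.

```python
import torch

def null_space_project(safety_grad: torch.Tensor,
                       general_grads: torch.Tensor,
                       rank: int = 8) -> torch.Tensor:
    """Remove from `safety_grad` its component along general-task gradients.

    safety_grad:   (p,) flattened gradient of the safety objective.
    general_grads: (k, p) flattened gradients collected on general tasks.
    rank:          number of principal general-task directions to protect.
    """
    # Orthonormal basis U of the dominant general-task gradient directions.
    U, _, _ = torch.linalg.svd(general_grads.T, full_matrices=False)
    U = U[:, :rank]                                   # (p, rank)
    # Null-space projection: g - U U^T g.
    return safety_grad - U @ (U.T @ safety_grad)

# Illustrative shapes only.
p = 512
general = torch.randn(16, p)                          # e.g. math/code batches
safety = torch.randn(p)                               # safety-alignment batch
g_proj = null_space_project(safety, general, rank=16) # protect all 16 directions
print((general @ g_proj).abs().max())                 # ~0: no drift along general tasks
```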
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5169. Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction
๐ฌ 11
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Dynamic Benchmark Construction LLM Evaluation
The evaluation of code-generating Large Language Models (LLMs) is fundamentally constrained by two intertwined challenges: a reliance on static, easily contaminated problem sources and the use of superficial, low-rigor testing. This paper introduces a new benchmark construction philosophy, Dual Scaling, designed to systematically address both limitations. Our approach involves continuously scaling the source of problems from dynamic, real-world code repositories and systematically scaling the rigor of tests via automated, high-coverage Property-Based Testing (PBT). We instantiate this philosophy in CODE2BENCH, an end-to-end framework that leverages Scope Graph analysis for principled dependency classification and a 100% branch coverage quality gate to ensure test suite integrity. Using this framework, we construct CODE2BENCH-2509, a new benchmark suite with native instances in both Python and Java. Our extensive evaluation of 10 state-of-the-art LLMs on CODE2BENCH-2509, powered by a novel "diagnostic fingerprint" visualization, yields three key insights: (1) models exhibit a fundamental performance gap, excelling at API application (Weakly Self-Contained tasks) but struggling with algorithmic synthesis (Self-Contained tasks); (2) a model's performance is profoundly shaped by the target language's ecosystem, a nuance we are the first to systematically quantify; and (3) our rigorous, scaled testing is critical in uncovering an "illusion of correctness" prevalent in simpler benchmarks. Our work presents a robust, scalable, and diagnostic paradigm for the next generation of LLM evaluation in software engineering. The code, data, and results are available at https://code2bench.github.io/.
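For readers unfamiliar with property-based testing, the sketch below shows the style of test such a framework scales up, using the `hypothesis` library as one concrete PBT tool. The function under test and its properties are placeholders, not instances from the benchmark.

```python
from hypothesis import given, strategies as st

def normalize_whitespace(s: str) -> str:
    """Toy function under test: collapse runs of whitespace to single spaces."""
    return " ".join(s.split())

@given(st.text())
def test_idempotent(s):
    # Property 1: applying the function twice changes nothing further.
    once = normalize_whitespace(s)
    assert normalize_whitespace(once) == once

@given(st.text())
def test_no_double_spaces(s):
    # Property 2: the output never contains two consecutive spaces.
    assert "  " not in normalize_whitespace(s)
```

Unlike a handful of hand-picked input/output pairs, such properties are checked over hundreds of generated inputs, which is what makes "scaling the rigor of tests" tractable to automate.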
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5170. PILOT-Bench: Probabilistic Interaction for LLM Operations in Tool-driven Scenarios
๐ฌ 4
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Workflow execution LLM robustness Probabilistic Tool Behavior
We introduce PILOT-Bench, a benchmark that evaluates LLM workflow execution under simulated realistic conditions of instruction quality variability and tool execution uncertainty. Unlike existing benchmarks that encounter these challenges incidentally, our work makes uncertainty the primary focus of systematic study. The benchmark incorporates three key aspects: (1) modeling of probabilistic tool behaviors through parameterized error models that simulate real-world API failure patterns, (2) provision of MDP-derived workflows that maximize expected success rates, and (3) systematic evaluation of model robustness through controlled perturbations of workflow instruction quality. Our construction pipeline generates 5,040 tasks from a tool library of 30 APIs. The evaluation conducted across widely used large language models under conditions of probabilistic tool failures and varying instruction quality reveals notable performance differences. Specifically, MDP-optimal workflow prompts achieve an average success rate of 62.1\%, Chain-of-Thought prompts yield an average success rate of 50.8\%, and flawed workflow prompts result in an average success rate of 54.3\%. Our benchmark is available at \url{https://github.com/PilotBenchAnonymous/PilotBench}.
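A small worked example of why an MDP view of tool-driven workflows is useful: given per-call success probabilities and a retry budget, the expected success rate of a linear workflow can be computed exactly, which is the quantity an MDP-derived plan maximizes. The tool names, probabilities, and retry policy below are illustrative assumptions, not values from the benchmark.

```python
def step_success(p: float, retries: int) -> float:
    """Probability that a single tool call eventually succeeds,
    allowing up to `retries` additional attempts after a failure."""
    return 1.0 - (1.0 - p) ** (retries + 1)

def workflow_success(step_probs: list[float], retries: int = 1) -> float:
    """Expected success of a linear workflow: every step must succeed."""
    prob = 1.0
    for p in step_probs:
        prob *= step_success(p, retries)
    return prob

# Illustrative 3-step workflow with tools of different reliability.
tools = {"search_api": 0.95, "geocode_api": 0.90, "booking_api": 0.85}
print(workflow_success(list(tools.values()), retries=0))  # ~0.727 without retries
print(workflow_success(list(tools.values()), retries=1))  # ~0.965 with one retry per step
```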
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5171. Steering Diffusion Models Towards Credible Content Recommendation
๐ฌ 36
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Credible content recommendation Societal Considerations Diffusion models
In recent years, diffusion models (DMs) have achieved remarkable success in recommender systems (RSs), owing to their strong capacity to model the complex distributions of item content and user behaviors. Despite their effectiveness, existing methods pose the danger of generating uncredible content recommendations (e.g., fake news, misinformation) that may significantly harm social well-being, as they primarily emphasize recommendation accuracy while neglecting the credibility of the recommended content. To address this issue, in this paper, we propose Disco, a novel method to steer diffusion models towards credible content recommendation. Specifically, we design a novel disentangled diffusion model to mitigate the harmful influence of uncredible content on the generation process while preserving high recommendation accuracy. This is achieved by reformulating the diffusion objective to encourage generation conditioned on preference-related signals while discouraging generation conditioned on uncredible content-related signals. In addition, to further improve the recommendation credibility, we design a progressively enhanced credible subspace projection that suppresses uncredible content by projecting diffusion targets into the null space of uncredible content. Extensive experiments on real-world datasets demonstrate the effectiveness of Disco in terms of both accurate and credible content recommendations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5172. NEO – No-Optimization Test-Time Adaptation through Latent Re-Centering
๐ฌ 8
โญ 4.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ test-time adaptation domain adaptation on-device
Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO, a hyperparameter-free fully TTA method that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6\% to 59.2\% after adapting on just one batch of 64 samples. When adapting on 512 samples NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63\% and memory usage by 9\% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
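A minimal sketch of the re-centering idea, subtracting the target-batch mean so that shifted embeddings are re-aligned before classification, is shown below. The feature-extraction interface in the usage comment is an assumption; this is not the method's released code.

```python
import torch

@torch.no_grad()
def recentered_predict(features: torch.Tensor, classifier_head) -> torch.Tensor:
    """features: (N, d) pre-logit embeddings of a target-domain batch.

    Re-center the batch at the origin of the latent space, then classify.
    No optimization, no hyperparameters, negligible extra compute.
    """
    centered = features - features.mean(dim=0, keepdim=True)
    return classifier_head(centered).argmax(dim=-1)

# Illustrative usage with an assumed interface:
#   feats = vit.forward_features(batch)   # (64, 768) embeddings
#   preds = recentered_predict(feats, vit.head)
```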
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5173. Learning Dynamics of Logits Debiasing for Long-Tailed Semi-Supervised Learning
๐ฌ 24
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ learning dynamics; semi-supervised learning; long-tailed; logits debiasing
Long-tailed distributions are prevalent in real-world semi-supervised learning (SSL), where pseudo-labels tend to favor majority classes, leading to degraded generalization. Although numerous long-tailed SSL (LTSSL) methods have been proposed, the underlying mechanisms of class bias remain underexplored. In this work, we investigate LTSSL through the lens of learning dynamics and introduce the notion of baseline images to characterize accumulated bias during training. We provide a step-wise decomposition showing that baseline predictions are determined solely by shallow bias terms, making them reliable indicators of class priors. Building on this insight, we propose a novel framework, DyTrim, which leverages baseline images to guide data pruning. Specifically, we perform class-aware pruning on labeled data to balance class distribution and label-agnostic soft pruning with confidence filtering on unlabeled data to mitigate error accumulation. Theoretically, we show that our method implicitly realizes risk reweighting, effectively suppressing class bias. Extensive experiments on public benchmarks show that DyTrim consistently enhances the performance of existing LTSSL methods by improving representation quality and prediction accuracy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5174. TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning
๐ฌ 19
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Reinforcement Learning with Verifiable Rewards Semi-supervised Learning Large Language Model
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to **guide** RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm TraPO that selects reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on nine advanced benchmarks. With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even **outperforms the fully supervised model** trained on the full 45K labeled samples on all benchmarks, while using only **10%** of the labeled data.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5175. FlexLoRA: Entropy-Guided Flexible Low-Rank Adaptation
๐ฌ 21
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ PEFT; Dynamic Rank; LoRA
Large pre-trained models achieve remarkable success across diverse domains, yet full fine-tuning incurs prohibitive computational and memory costs.
Parameter-efficient fine-tuning (PEFT) has thus become a mainstream paradigm.
Among them, Low-Rank Adaptation (LoRA) introduces trainable low-rank matrices and shows strong performance; nevertheless, its fixed-rank design limits flexibility.
Dynamic rank allocation methods mitigate this issue by pruning redundant directions; however, they often rely on heuristic, element-level metrics that globally sort rank directions without matrix-wise distinction, and they lack mechanisms to expand capacity in layers requiring additional adaptation.
To overcome these limitations, we propose FlexLoRA, an entropy-guided flexible low-rank adaptation framework that (i) evaluates matrix importance via spectral energy entropy, (ii) supports rank pruning and expansion under a global budget, and (iii) employs zero-impact initialization for newly added singular directions to ensure stability.
By addressing granularity, flexibility, and stability limitations, FlexLoRA provides a more principled solution for PEFT.
Extensive experiments show that FlexLoRA consistently outperforms state-of-the-art baselines across benchmarks.
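As a sketch of what a spectral-energy-entropy score can look like for a LoRA update ΔW = BA, the snippet below normalizes the squared singular values into a probability vector and takes its entropy: low entropy suggests a few dominant directions (a pruning candidate), high entropy suggests the layer uses its rank budget broadly. The thresholds, expansion rule, and global budget reallocation of the actual method are not shown.

```python
import torch

def spectral_energy_entropy(A: torch.Tensor, B: torch.Tensor) -> float:
    """Entropy of the normalized squared singular values of the LoRA
    update delta_W = B @ A (A: r x d_in, B: d_out x r)."""
    delta_w = B @ A
    s = torch.linalg.svdvals(delta_w)
    energy = s.pow(2)
    p = energy / energy.sum().clamp_min(1e-12)
    return float(-(p * p.clamp_min(1e-12).log()).sum())

# Illustrative comparison of two layers with rank-8 adapters.
r, d = 8, 256
A1, B1 = torch.randn(r, d), torch.randn(d, r)                      # spread-out update
A2, B2 = torch.randn(r, d), torch.randn(d, 1) @ torch.randn(1, r)  # ~rank-1 update
print(spectral_energy_entropy(A1, B1))   # higher entropy: uses many directions
print(spectral_energy_entropy(A2, B2))   # close to 0: one dominant direction
```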
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5176. ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
๐ฌ 19
โญ 4.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Lean 4 Evaluation Metric Operator Trees Autoformalization
Statement autoformalization, the automated translation of statements from natural language into formal languages, has seen significant advancements, yet the development of automated evaluation metrics remains limited. Existing metrics for formal statement similarity often fail to balance semantic and structural information. String-based approaches capture syntactic structure but ignore semantic meaning, whereas proof-based methods validate semantic equivalence but disregard structural nuances and, critically, provide no graded similarity score in the event of proof failure. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which comprehensively integrates semantic and structural information to provide a continuous similarity score. Our framework first transforms formal statements into Operator Trees to capture their syntactic structure and then computes a similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric, which enhances traditional Tree Edit Distance by incorporating semantic awareness through transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a new benchmark of 1,247 expert-annotated formal statement pairs derived from miniF2F and ProofNet, with labels for both semantic provability and structural likeness. Experiments on EPLA demonstrate that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient. The code, benchmark and experimental results are available at https://anonymous.4open.science/r/ASSESS-1103/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5177. Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
๐ฌ 20
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ LLM RL Human Behavior Simulation Online Shopping Reward Hacking Prevention
Large Language Models (LLMs) have recently demonstrated strong potential in generating "believable human-like" behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5178. Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI
๐ฌ 36
โญ 4.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ neurosymbolic AI hybrid AI formal reasoning large language models AI safety verifiable AI embodied AI robotics
While Large Language Models (LLMs) show immense promise as planners for embodied AI, their stochastic nature and lack of formal reasoning capabilities prevent the strict safety guarantees required for physical deployment. Current approaches fall short: they either rely on other unreliable LLMs for safety checks or simply reject unsafe plans without offering a path to success. This work bridges this critical gap by introducing the Verifiable Iterative Refinement Framework (VIRF), a neuro-symbolic architecture that shifts the paradigm from a passive safety gatekeeper to an active safety collaborator. Where prior verifiers simply reject failures, our framework provides causal, pedagogical feedback that teaches the LLM why its plan was unsafe, enabling intelligent repairs rather than mere avoidance. Our core contribution is a novel tutor-apprentice dialogue, where a deterministic Logic Tutor, grounded in a formal safety ontology, provides causal and explanatory feedback to an LLM Apprentice planner. This pedagogical interaction allows the apprentice to perform intelligent, creative plan repairs, resolving safety conflicts rather than merely avoiding them. To ground this dialogue in verifiable truth, we introduce a scalable knowledge acquisition pipeline that synthesizes a comprehensive safety knowledge base from real-world documents, a process that simultaneously reveals and corrects significant blind spots in existing benchmarks. On a new suite of challenging home safety tasks, VIRF achieves a perfect 0\% Hazardous Action Rate (HAR), completely eliminating unsafe actions while attaining a 77.3\% Goal-Condition Rate (GCR), the highest among all baselines. It does so with remarkable efficiency, requiring only 1.1 correction iterations on average. By acting as a verifiable safety scaffold, VIRF demonstrates a principled and robust pathway toward building embodied agents that are not just capable, but fundamentally trustworthy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5179. Let's Explore Step by Step: Generating Provable Formal Statements with Deductive Exploration
๐ฌ 23
โญ 4.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ large language model math formal theorem proving problem generation automated conjecturing
Mathematical problem synthesis shows promise in resolving data exhaustion, contamination, and leakage for AI training and evaluation. Despite enormous efforts, an **expressiveness-validity-complexity trilemma** remains an open question. Existing methods either lack whole-process verifiability, are constrained to a particular domain, or are bounded by external models.
This paper breaks the trilemma by proposing the framework of **DExploration** _(**D**eductive **Exploration**)_, which formulates problem synthesis as a step-by-step exploration process instead of one-shot generation. Agents are equipped with three simple yet powerful atomic actions: _introducing_ variables/hypotheses, _deducing_ new facts, and _submitting_ derived facts. The entire exploration process is formally verified by Lean 4, which encompasses most mathematical domains up to the research level.
Once a conclusion is submitted, the framework outputs a formal statement with guaranteed provability, reducing the need for external models.
To bootstrap training data for DExploration, we propose **Exploratory Transformation** to distill exploration trajectories from existing large-scale theorem-proving data. It rewrites formal proofs into a deductive style, parses dependencies among variables, hypotheses, and proof steps, then reassembles them into exploration trajectories by a topological order.
Experiments validate the effectiveness and efficiency of our methods, achieving an improved success rate ($40.70\% \mapsto 54.52\%$), reduced token cost ($52.9\text{K} \mapsto 8.8\text{K}$, $83\%\downarrow$), broader complexity and difficulty distributions, and Pareto optimality.
In $2726$ valid generations, three state-of-the-art provers fail on $60$ (Pass@4) and $8$ (Pass@64). Code, data, and models will be available.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5180. CARPRT: Class-Aware Zero-Shot Prompt Reweighting for Vision-Language Model
๐ฌ 25
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Prompt Weighting Prompt Ensembling Pre-trained Models Vision-Language Models
Pre-trained vision-language models (VLMs) enable zero-shot image classification by computing the similarity score between an image and textual descriptions, typically formed by inserting a class label (e.g., "cat") into a prompt (e.g., "a photo of a"). Existing studies have shown that the score between a given image-class pair is highly sensitive to the choice of prompt, and they proposed a scheme using a weighting vector to reassemble scores regarding different prompts. We observe that these studies assign the same weighting vector across all classes, by implicitly assuming the conditional independence of classes and weights, which, however, often does not hold in practice. For instance, a prompt like "an aerial view of" might be apt for "airport" but ill-suited for "apple". To address this, we propose class-aware zero-shot prompt reweighting (CARPRT), a scoring scheme that adjusts the weighting vector for each class by capturing the class-specific relevance of different prompts in a training-free manner. For each class and every available prompt, it first identifies the maximum image-text relevance score using that prompt-class pair across the dataset. These maximum scores are then normalized to estimate class-specific weights that reflect how effectively a prompt represents different semantic labels. Evaluations on standard fine-grained image classification benchmarks show that CARPRT outperforms existing class-independent reweighting, confirming that modeling prompt-class dependency is crucial for effective zero-shot prediction and even broader VLM-based application settings that rely on prompt ensembling.
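A minimal sketch of class-aware prompt reweighting as described above: take, for each class and prompt, the maximum image-text similarity over an unlabeled set, normalize across prompts to obtain class-specific weights, and ensemble per-prompt scores with those weights. The softmax normalization and the array shapes are illustrative assumptions, not the method's exact scoring rule.

```python
import numpy as np

def class_aware_weights(sims: np.ndarray) -> np.ndarray:
    """sims: (n_images, n_prompts, n_classes) image-text similarity scores.

    Returns class-specific prompt weights of shape (n_prompts, n_classes):
    the per-class maximum score of each prompt, normalized over prompts.
    """
    max_per_prompt_class = sims.max(axis=0)              # (P, C)
    w = np.exp(max_per_prompt_class)
    return w / w.sum(axis=0, keepdims=True)              # softmax over prompts (one possible normalization)

def classify(sims_one_image: np.ndarray, weights: np.ndarray) -> int:
    """Weighted prompt ensemble for a single image.

    sims_one_image: (n_prompts, n_classes); weights: (n_prompts, n_classes).
    """
    scores = (weights * sims_one_image).sum(axis=0)      # (C,)
    return int(scores.argmax())

# Illustrative shapes: 100 unlabeled images, 5 prompts, 10 classes.
sims = np.random.rand(100, 5, 10)
W = class_aware_weights(sims)     # a prompt like "an aerial view of" can now
pred = classify(sims[0], W)       # weight "airport" highly and "apple" weakly
```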
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5181. Leveraging Discrete Function Decomposability for Scientific Design
๐ฌ 10
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ scientific protein design generative model decomposability
In the era of AI-driven science and engineering, we often want to design discrete objects (e.g., circuits, proteins, materials) in silico according to user-specified properties (e.g., that a protein binds its target). Given a property predictive model, in silico design typically involves training a generative model over the design space (e.g., over the set of all length-L proteins) to concentrate on designs with the desired properties. Distributional optimization, formalized as an estimation of distribution algorithm or as reinforcement learning policy optimization, maximizes an objective function in expectation over samples. Optimizing a distribution over discrete-valued designs is in general challenging due to the combinatorial nature of the design space. However, many property predictors in scientific applications are decomposable in the sense that they can be factorized over design variables in a way that will prove useful. For example, the active site amino acids in a catalytic protein may need to only loosely interact with the rest of the protein for maximal catalytic activity. Current distributional optimization algorithms are unable to make use of such structure, which could dramatically improve the optimization. Herein, we propose and demonstrate use of a new distributional optimization algorithm, Decomposition-Aware Distributional Optimization (DADO), that can leverage any decomposability defined by a junction tree on the design variables. At its core, DADO employs a factorized "search distribution", a learned generative model, for efficient navigation of the search space, and invokes graph message passing to coordinate optimization across all variables.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5182. Proximal Supervised Fine-Tuning
๐ฌ 18
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ SFT generalization language models vision language models
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on specific tasks. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT), a fine-tuning objective that incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical, human-value, and multimodal domains show that PSFT matches standard SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.
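The sketch below illustrates the view that motivates the method: SFT as policy gradient with a constant positive advantage, wrapped in a PPO-style clipped importance ratio against a frozen reference model so that policy drift is constrained. The token-level ratio definition and the clipping threshold are illustrative choices, not the paper's exact objective.

```python
import torch

def proximal_sft_loss(logp_theta: torch.Tensor,
                      logp_ref: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    """Per-token clipped objective with constant advantage A = 1.

    logp_theta: (T,) log-probs of the target tokens under the current policy.
    logp_ref:   (T,) log-probs of the same tokens under the frozen reference.
    eps and the token-level ratio are illustrative choices.
    """
    ratio = torch.exp(logp_theta - logp_ref.detach())
    unclipped = ratio                                   # ratio * A, with A = 1
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic objective; negate to obtain a loss to minimize.
    return -torch.minimum(unclipped, clipped).mean()

# Plain SFT maximizes logp_theta directly; here, tokens whose likelihood
# ratio already exceeds 1 + eps stop contributing gradient, which limits
# drift away from the reference policy during fine-tuning.
```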
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5183. CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer
๐ฌ 17
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Style Transfer Diffusion Model Computer Vision
Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at the global level and overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose **CoCoDiff**, a novel *training-free* and *low-cost* style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. A cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object- and region-level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5184. FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
๐ฌ 9
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Vision Language Model Cross-Modal Alignment and Integration Text-Guided Vision Encoding Context-Aware Decoding Dual-Semantic Mapping Loss Text-Driven VQA Synthesis
We introduce FLARE, a family of vision language models (VLMs) with a fully vision-language alignment and integration paradigm. Unlike existing approaches that rely on single MLP projectors for modality alignment and defer cross-modal interaction to LLM decoding, FLARE achieves deep, dynamic integration throughout the pipeline. Our key contributions include: (1) Text-Guided Vision Encoding that incorporates textual information during vision encoding to achieve pixel-level alignment; (2) Context-Aware Alignment Decoding that aggregates visual features conditioned on textual context during decoding for query-level integration; (3) Dual-Semantic Mapping Loss to supervise feature mapping from both modalities and enable modality-level bridging; and (4) Text-Driven VQA Synthesis that leverages high-quality text to generate VQA pairs and synthesize corresponding images, enabling data-level optimization. We train FLARE at 3B and 8B scales under both fixed and dynamic resolution settings, demonstrating that our full-modality alignment significantly outperforms existing methods while maintaining strong generalizability. FLARE 3B surpasses Cambrian-1 8B and Florence-VL 8B using only 630 vision tokens. Ablation studies reveal that FLARE achieves superior performance over existing methods with minimal computational cost. Even without dynamic resolution, FLARE outperforms LLaVA-NeXT, validating the effectiveness of our approach.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5185. SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
๐ฌ 29
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Post-training Transferability Sparse Autoencoder Large Language Models
In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability \textit{before} fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.75 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5186. SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
๐ฌ 22
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Large Language Model safety alignment scientific knowledge misuse
Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically either focus on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios.
To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 84.9% for Deepseek-R1 and 50.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5187. CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk
๐ฌ 28
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Uncertainty Quantification Prediction intervals Epistemic Uncertainty Aleatoric Uncertainty Conditional Coverage Calibration
Existing methods typically address either aleatoric uncertainty due to measurement noise or epistemic uncertainty resulting from limited data, but not both in a balanced manner. We propose CLEAR, a calibration method with two distinct parameters, $\gamma_1$ and $\gamma_2$, to combine the two uncertainty components and improve the conditional coverage of predictive intervals for regression tasks. CLEAR is compatible with any pair of aleatoric and epistemic estimators; we show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability (PCS) framework for epistemic uncertainty. Across 17 diverse real-world datasets, CLEAR achieves an average improvement of 28.2\% and 17.4\% in the interval width compared to the two individually calibrated baselines while maintaining nominal coverage. Similar improvements are observed when applying CLEAR to Deep Ensembles (epistemic) and Simultaneous Quantile Regression (aleatoric). The benefits are especially evident in scenarios dominated by high aleatoric or epistemic uncertainty.
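A minimal sketch of the two-parameter idea, assuming the simplest additive combination of interval half-widths and a grid-search calibration; the paper's actual interval construction and fitting procedure may differ.

```python
import numpy as np

def clear_intervals(y_pred, alea_width, epi_width, g1, g2):
    """Combine aleatoric and epistemic half-widths with two calibration parameters."""
    half = g1 * alea_width + g2 * epi_width
    return y_pred - half, y_pred + half

def calibrate(y_cal, y_pred, alea_width, epi_width, alpha=0.1,
              grid=np.linspace(0.0, 3.0, 31)):
    """Grid-search (g1, g2) on a calibration split: among settings reaching at
    least 1 - alpha empirical coverage, pick the narrowest intervals.
    Only a sketch of the idea, not the paper's fitting procedure."""
    best, best_width = None, np.inf
    for g1 in grid:
        for g2 in grid:
            lo, hi = clear_intervals(y_pred, alea_width, epi_width, g1, g2)
            cover = np.mean((y_cal >= lo) & (y_cal <= hi))
            width = np.mean(hi - lo)
            if cover >= 1 - alpha and width < best_width:
                best, best_width = (g1, g2), width
    return best
```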
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5188. Breaking Gradient Temporal Collinearity for Robust Spiking Neural Networks
๐ฌ 25
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Neuromorphic computing Spiking neural networks robustness
Spiking Neural Networks (SNNs) have emerged as an efficient neuromorphic computing paradigm, offering low energy consumption and strong representational capacity through binary spike-based information processing. However, their performance is heavily shaped by the input encoding method. While direct encoding has gained traction for its efficiency and accuracy, it proves less robust than traditional rate encoding. To illuminate this issue, we introduce Gradient Temporal Collinearity (GTC), a principled measure that quantifies the directional alignment of gradient components across time steps, and we show, both empirically and theoretically, that elevated GTC in direct encoding undermines robustness. Guided by this insight, we propose Structured Temporal Orthogonal Decorrelation (STOD), which integrates parametric orthogonal kernels with structured constraints into the input layer of direct encoding to diversify temporal features and effectively reduce GTC. Extensive experiments on visual classification benchmarks show that STOD consistently outperforms state-of-the-art methods in robustness, highlighting its potential to drive SNNs toward safer and more reliable deployment.
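One plausible way to compute a GTC-like quantity, assuming per-time-step gradient contributions are available as separate tensors; the paper's exact definition is not given in the abstract.

```python
import torch

def gradient_temporal_collinearity(grads):
    """grads: list of T tensors, each the gradient contribution of one time step
    w.r.t. the same parameters. GTC is illustrated here as the mean pairwise
    cosine similarity across time steps -- one reading of 'directional alignment
    of gradient components', not necessarily the paper's exact formula."""
    g = torch.stack([x.flatten() for x in grads])           # (T, P)
    g = torch.nn.functional.normalize(g, dim=1)
    sim = g @ g.T                                            # cosine similarities
    t = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (t * (t - 1))                          # average over pairs
```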
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5189. JailbreakLoRA: Your Downloaded LoRA from Sharing Platforms might be Unsafe
๐ฌ 28
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Jailbreak LoRA Large Language Models
Low-Rank Adaptation (LoRA), thanks to its plug-and-play nature, enables large language models (LLMs) to achieve significant performance gains at low cost, which has driven the development of LoRA-sharing platforms. However, the jailbreak and backdoor concerns associated with LoRA-sharing platforms remain underexplored. Existing LoRA-based attacks primarily focus on achieving high attack success rates, while neglecting the core reason why LoRA is adopted by users, i.e., to gain downstream task capabilities. However, achieving effective attacks while preserving strong multi-task performance remains challenging, as the largely unrelated objectives tend to interfere with each other during optimization. In this paper, we propose JailbreakLoRA, a multi-task jailbreak LoRA training method that balances task utility and attack capability; it resolves training interference by uncertainty-weighting the losses and mitigating gradient conflicts. Additionally, JailbreakLoRA is designed to generate an affirmative prefix upon trigger activation, exploiting inference-time hallucinations to enhance the effectiveness of the jailbreak. Experimental results demonstrate that our method outperforms SOTA LoRA-based attacks, achieving a 16.0\% improvement in attack success rate while also enhancing performance on multiple downstream tasks by 16.5\% on average.
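The abstract mentions uncertainty-weighting the losses; a standard way to do this is homoscedastic uncertainty weighting, sketched below as an assumption (the gradient-conflict mitigation part is not shown).

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty weighting (Kendall et al.-style) for balancing
    a utility loss and an attack loss. The abstract only says the method
    'uncertainty-weights' the losses; this is one common realization, not
    necessarily the paper's exact formulation."""
    def __init__(self, n_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))  # learned log-variances

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total
```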
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5190. WavePolyp: Video Polyp Segmentation via Hierarchical Wavelet-Based Feature Aggregation and Inter-Frame Divergence Perception
๐ฌ 9
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Video Polyp Segmentation
Automatic polyp segmentation from colonoscopy videos is a crucial technique that assists clinicians in improving the accuracy and efficiency of diagnosis, preventing polyps from developing into cancer.
However, video polyp segmentation (VPS) is a challenging task due to (1) the significant inter-frame divergence in videos, (2) the high camouflage of polyps in normal colon structures and (3) the clinical requirement of real-time performance.
In this paper, we propose a novel segmentation network, WavePolyp, which consists of two innovative components: a hierarchical wavelet-based feature aggregation (HWFA) module and inter-frame divergence perception (IDP) blocks.
Specifically, HWFA excavates and amplifies discriminative information from high-frequency and low-frequency features decomposed by wavelet transform, hierarchically aggregating them into refined spatial representations within each frame.
This module enhances the representation capability of intra-frame spatial features, effectively addressing the high camouflage of polyps in normal colon structures.
Furthermore, IDP perceives and captures inter-frame polyp divergence through a temporal divergence perception mechanism, enabling accurate polyp tracking while mitigating temporal inconsistencies caused by significant variations across frames.
Extensive experiments conducted on the SUN-SEG and CVC-612 datasets demonstrate that our method outperforms other state-of-the-art methods.
Codes will be released upon publication.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5191. STORK: Faster Diffusion and Flow Matching Sampling by Resolving both Stiffness and Structure-Dependence
๐ฌ 11
โญ 4.00
๐ generative models
๐ท๏ธ diffusion model fast sampling method stabilized Runge--Kutta training-free
Diffusion models (DMs) and flow-matching models have demonstrated remarkable performance in image and video generation. However, such models require a significant number of function evaluations (NFEs) during sampling, leading to costly inference. Consequently, quality-preserving fast sampling methods that require fewer NFEs have been an active area of research. However, prior training-free sampling methods fail to simultaneously address two key challenges: the stiffness of the ODE (i.e., the non-straightness of the velocity field) and dependence on the semi-linear structure of the DM ODE (which limits their direct applicability to flow-matching models). In this work, we introduce the Stabilized Taylor Orthogonal Runge-Kutta (STORK) method, addressing both design concerns. We demonstrate that STORK consistently improves the quality of diffusion and flow-matching sampling for image and video generation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5192. Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting
๐ฌ 19
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Gaussian Splatting Novel View Synthesis Decoupled Radiance Fields View-dependent Opacity
Due to the real-time rendering performance, 3D Gaussian Splatting (3DGS) has emerged as the leading method for radiance field reconstruction. However, its reliance on spherical harmonics for color encoding inherently limits its ability to separate diffuse and specular components, making it challenging to accurately represent complex reflections. To address this, we propose a novel enhanced Gaussian kernel that explicitly models specular effects through view-dependent opacity. Meanwhile, we introduce an error-driven compensation strategy to improve rendering quality in existing 3DGS scenes. Our method begins with 2D Gaussian initialization and then adaptively inserts and optimizes enhanced Gaussian kernels, ultimately producing an augmented radiance field. Experiments demonstrate that our method not only surpasses state-of-the-art NeRF methods in rendering performance but also achieves greater parameter efficiency. Code will be released.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5193. Causal-Steer: Disentangled Continuous Style Control without Parallel Corpora
๐ฌ 26
โญ 4.00
๐ generative models
๐ท๏ธ Controllable Generation Activation Steering Style Control Large Language Models
Controlling stylistic attributes of Large Language Models (LLMs), such as formality or conceptual complexity, is crucial for effective human-AI interaction. However, current methods often suffer from discreteness, reliance on expensive parallel corpora, and instability, limiting their practical utility. This paper introduces a novel framework for robust activation steering that eliminates the need for parallel corpora, enabling continuous, fine-grained, and linear control over LLM outputs. Our key insight is to reframe Low-Rank Adaptation (LoRA) as a causal intervention tool. By contrasting activations on identical inputs with and without a LoRA perturbation trained via a contrastive objective, we separate the influence of content. To enhance reliability, we introduce a robust aggregation pipeline that uses Principal Component Analysis (PCA) for denoising and the geometric median for centrality estimation, yielding a stable and disentangled style vector. At inference, this vector allows for precise bidirectional control via activation steering with negligible computational overhead. We demonstrate state-of-the-art performance on controlling conceptual complexity, text detoxification, and formality control. Our method not only provides superior control but also generalizes across different models and tasks, and enables simultaneous multi-attribute control.
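The aggregation pipeline described above (activation differences, PCA denoising, then the geometric median) can be sketched as follows; shapes, component counts, and the steering mechanics are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def weiszfeld_geometric_median(x, n_iter=100, eps=1e-8):
    """Geometric median of the rows of x via Weiszfeld's algorithm."""
    m = x.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(x - m, axis=1) + eps
        m = (x / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return m

def style_vector(acts_with_lora, acts_without_lora, n_components=16):
    """Contrast activations on identical inputs with/without the LoRA
    perturbation, denoise per-example differences with PCA, and take the
    geometric median as a robust style direction (a sketch, with assumed shapes
    of (n_examples, hidden_dim))."""
    diffs = acts_with_lora - acts_without_lora
    pca = PCA(n_components=n_components).fit(diffs)
    denoised = pca.inverse_transform(pca.transform(diffs))
    return weiszfeld_geometric_median(denoised)

# At inference, steering would add alpha * style_vector to the residual stream
# at a chosen layer; alpha controls strength and its sign gives the direction.
```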
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5194. Genomic Foundationless Models: Pretraining Does Not Promise Performance
๐ฌ 11
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ ai4science foundation models genomics biology pretraining deep learning
The success of Large Language Models has inspired the development of Genomic Foundation Models (GFMs) through similar pretraining techniques. However, the relationship between pretraining performance and effectiveness in downstream genomic tasks remains unclear. Additionally, the high computational cost of pretraining raises questions about its cost-efficiency. To assess the usefulness of pretraining in genomics, we evaluated seven different GFMs across 52 diverse genomic tasks, comparing them to their counterparts with randomly initialized weights. Across benchmarks, we find that randomly initialized models provide surprisingly strong baselines, and tokenizer and architecture choices strongly shape both these baselines and the gains from pretraining. Specifically, character-token models often match or exceed the performance of larger pretrained k-mer or BPE models, whereas subword models appear to benefit from pretraining. We also find that the evaluated GFMs fail to capture clinically relevant genetic mutations, with embeddings and log-likelihood ratios showing limited sensitivity to annotated variants. For the tasks we study, these results suggest that current NLP-style pretraining strategies provide modest, tokenizer-gated improvements over strong random baselines and motivate more biologically informed tokenization and variant-aware objectives. Our code is available at github.com/z6JfFK/gfm.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5195. Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking
๐ฌ 20
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Efficient Reasoning Overthink
Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning paths during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. Besides, it uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. In particular, JET delivers a 4.6% accuracy improvement while reducing the output length by 46.3% on the Olympiad benchmark using DeepSeek-R1-Distill-Qwen-1.5B. Our code is available on GitHub.
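A toy sketch of what a quality-controlled length reward could look like; the constants and functional form are assumptions rather than the paper's reward.

```python
def jet_style_reward(is_correct: bool, length: int, max_length: int, beta: float = 0.5) -> float:
    """Reward that only grants a length bonus when the answer is correct:
    wrong answers get no reward, correct answers are rewarded more the shorter
    they are. Illustrative only; not the paper's exact reward design."""
    if not is_correct:
        return 0.0
    length_bonus = beta * (1.0 - min(length, max_length) / max_length)
    return 1.0 + length_bonus
```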
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5196. Aurelius: Relation Aware Text-to-Audio Generation At Scale
๐ฌ 18
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Relation Aware Text-to-Audio Generation Audio Event Corpus Relation Corpus
We present Aurelius, a new framework that enables relation aware text-to-audio (TTA) generation research at scale. Given the lack of essential audio event and relation corpora, \emph{Aurelius} contributes a large-scale audio event corpus \emph{AudioEventSet} and another large-scale relation corpus \emph{AudioRelSet}. Comprising 110 event categories, AudioEventSet maximally covers all commonly heard audio events, and each event is unique, realistic and of high quality. AudioRelSet consists of 100 relations, comprehensively covering the relations that are present in the physical world or can be neatly described by text. As the two corpora provide audio events and relations independently, they can be combined to create massive pairs with our pair generation strategy to support relation aware TTA investigation at scale. We comprehensively benchmark all existing TTA models from both general and relation aware evaluation perspectives. We further provide an in-depth investigation on scaling up existing TTA models' relation aware generation by either training from scratch or leveraging cross-domain general TTA knowledge. The introduced corpora and the findings of this work can facilitate future research on relation aware TTA generation.
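The pair-generation idea, combining an event corpus with a relation corpus to produce relation aware prompts, can be illustrated with a toy sketch; the corpora contents and prompt template below are stand-ins, not samples from the released datasets.

```python
import itertools
import random

# Toy stand-ins for the released corpora; the real AudioEventSet has 110 event
# categories and AudioRelSet has 100 relations.
audio_events = ["dog barking", "door slamming", "rain falling"]
relations = ["before", "after", "overlapping with"]

def generate_relation_aware_prompts(n=5, seed=0):
    """Sketch of pairing independent event and relation corpora into
    relation aware text-to-audio prompts at scale (template is an assumption)."""
    rng = random.Random(seed)
    pairs = list(itertools.permutations(audio_events, 2))
    prompts = []
    for _ in range(n):
        a, b = rng.choice(pairs)
        rel = rng.choice(relations)
        prompts.append(f"A sound of {a} {rel} a sound of {b}.")
    return prompts

print(generate_relation_aware_prompts())
```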
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5197. FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing
๐ฌ 12
โญ 4.00
๐ generative models
๐ท๏ธ flow models image editing inversion-free
Recent inversion-free, flow-based image editing methods such as FlowEdit leverage a pre-trained noise-to-image flow model such as Stable Diffusion 3, enabling text-driven manipulation by solving an ordinary differential equation (ODE). While the lack of exact latent inversion is a core advantage of these methods, it often results in unstable editing trajectories and poor source consistency. To address this limitation, we propose {\em FlowAlign}, a novel inversion-free flow-based framework for consistent image editing with principled trajectory control. FlowAlign introduces a flow-matching loss as a regularization mechanism to promote smoother and more stable trajectories during the editing process. Notably, the flow-matching loss is shown to explicitly balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. Furthermore, FlowAlign naturally supports reverse editing by simply reversing the ODE trajectory, highlighting the reversible and consistent nature of the transformation. Extensive experiments demonstrate that FlowAlign outperforms existing methods in both source preservation and editing controllability.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5198. Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
๐ฌ 16
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ hyperparameter optimal transport trajectory inference manifold learning interpolation
Neural networks (NNs) often have critical behavioural trade-offs that are set at design time with hyperparameters, such as reward weighting in reinforcement learning or quantile targets in regression.
Post-deployment, however, user preferences can evolve, making initially optimal settings undesirable, necessitating expensive retraining.
To circumvent this, we introduce the task of Hyperparameter Trajectory Inference (HTI), which learns, from observed data, how an NN's conditional output distribution changes as a function of its hyperparameters, such that a surrogate model can approximate the NN at unobserved hyperparameter settings.
HTI requires extending existing trajectory inference approaches to incorporate conditions, posing key challenges to ensure meaningful inferred conditional probability paths.
We propose an approach grounded in conditional Lagrangian optimal transport theory, jointly learning the Lagrangian function governing hyperparameter-induced dynamics along with the associated optimal transport maps and geodesics, which form the surrogate model.
We incorporate inductive biases based on the manifold hypothesis and least-action principles into the learned Lagrangian, improving surrogate model feasibility.
We empirically demonstrate that our approach reconstructs NN behaviour across hyperparameter spectra better than alternative approaches, enabling effective inference-time adaptation of NNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5199. Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts
๐ฌ 14
โญ 4.00
๐ applications to robotics, autonomy, planning
๐ท๏ธ Dynamic Value Preference Inference Cognitive Modeling Adaptive Decision-Making
Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function.
In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward.
In this work, we study the sequential decision-making problem in which these preference weights are unobserved latent variables that drift with context.
Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interaction, and conditions its policy on inferred preferences.
We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs.
In queueing, gridworld maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5200. A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
๐ฌ 22
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Reasoning Logic Artificial Intelligence Large Language Models Abduction
Although Large Language Models (LLMs) have demonstrated impressive formal reasoning abilities, they often break down when problems require complex proof planning. One promising approach for improving LLM reasoning abilities involves translating problems into formal logic and using a logic solver. Although off-the-shelf logic solvers are in principle substantially more efficient than LLMs at logical reasoning, they assume that all relevant facts are provided in a question and are unable to deal with missing commonsense relations. In this work, we propose a novel method that uses feedback from the logic solver to augment a logic problem with commonsense relations provided by the LLM, in an iterative manner. This involves a search procedure through potential commonsense assumptions to maximize the chance of finding useful facts while keeping cost tractable. On a collection of pure-logical reasoning datasets, from which some commonsense information has been removed, our method consistently achieves considerable improvements over existing techniques, demonstrating the value in balancing neural and symbolic elements when working in human contexts.
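A schematic of the iterative solver-feedback loop described above, with `llm_propose_facts` and `solver_prove` as hypothetical callables standing in for the LLM and the logic solver; the real search procedure over candidate assumptions is more elaborate.

```python
def abductive_solve(problem, llm_propose_facts, solver_prove, max_rounds=5, beam=3):
    """Sketch of the neuro-symbolic loop: try to prove the goal, and when the
    solver fails, ask the LLM for a small beam of plausible commonsense facts
    (guided by the solver's feedback) and add them as assumptions before retrying."""
    facts = set(problem["facts"])
    for _ in range(max_rounds):
        ok, feedback = solver_prove({**problem, "facts": facts})
        if ok:
            return True, facts
        candidates = llm_propose_facts(problem, feedback)[:beam]
        if not candidates:
            break
        facts.update(candidates)
    return False, facts
```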
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5201. Once-More: Continuous Self-Correction for Large Language Models via Perplexity-Guided Intervention
๐ฌ 28
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model Natural Language Processing Self-Correction Agent Guided Generation Post-Hoc Refinement
Large Language Models (LLMs) often experience compounding errors during long text generation. Early mistakes can propagate and lead to drift, faulty reasoning, or repetition. While scaling up models improves capabilities, it requires substantial computational resources, and the resulting self-correction behaviour remains unpredictable at inference time. Self-correction is a promising technique for addressing this issue. However, existing approaches have limitations. Supervised training methods can build self-correcting behaviours into models, but require training data collection and lack cross-domain generalizability. Current post-hoc iterative refinement methods operate only at inference time, but must wait for substantial portions of the draft to be generated before providing feedback. This feedback does not guarantee effective guidance, and the same mistake patterns can still reappear. In this paper, we introduce Once-More, a model-agnostic post-hoc self-correction framework that intervenes during generation. Once-More leverages token-level perplexity and feedback from verifiers to provide continuous guided steering of the generation path through a logit redistribution mechanism. This approach essentially helps accumulate "more correct" steps throughout the generation process. Evaluation on multiple benchmarks demonstrates that Once-More achieves state-of-the-art results compared to other self-correction methods. To our knowledge, Once-More is the first post-hoc method to leverage token perplexity and external feedback to perform continuous guided self-correction.
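A minimal sketch of the perplexity-guided trigger, assuming access to the logits of the tokens already generated; the logit-redistribution mechanism itself is not reproduced, and the window size and threshold below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def token_perplexities(logits, target_ids):
    """Per-token perplexity from next-token logits.
    logits: (seq_len, vocab); target_ids: (seq_len,) tokens actually generated."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs[torch.arange(target_ids.numel()), target_ids]
    return torch.exp(-token_logp)

def should_intervene(logits, target_ids, window=16, threshold=20.0):
    """If the recent window of generated tokens has high average perplexity,
    hand control to the verifier/steering step (a sketch of the trigger only)."""
    ppl = token_perplexities(logits, target_ids)
    return ppl[-window:].mean().item() > threshold
```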
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5202. CAR-LoRA: Training Compression-Aware and Robust LoRA Adapters for Evolving LLMs
๐ฌ 9
โญ 4.00
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ Low Rank Adaptation Edge Devices Quantization Compression Efficient Fine-tuning
The deployment of large language models (LLMs) for specialized tasks on resource-constrained edge devices like smartphones and sensors presents a significant scalability problem. To run on such hardware, these massive models must be compressed using techniques like \emph{quantization or pruning} to reduce their memory and computational footprint. Concurrently, foundational LLMs are periodically updated by their developers with new data, making their $\textit{internal parameters shift over time}$. While parameter-efficient methods like Low-Rank Adaptation (LoRA) streamline personalization by fine-tuning only a small fraction of parameters, the resulting adapters are $\textbf{brittle}$; a LoRA trained for one specific compression scheme is incompatible with another, and an adapter trained on an older base model performs poorly on an updated one. This forces a costly cycle of retraining for each unique device and every new model release. To address this, we introduce a novel framework that creates a single, universally portable adapter that is both $\textbf{\textit{(i)} compression-aware and \textit{(ii)} temporally robust}$. We achieve this by augmenting the training process with a variety of simulated compression techniques during a single run, utilizing a quantized forward pass to build resilience while maintaining a full-precision backward pass for stable gradient optimization. $\textit{This method yields a unified adapter robust to diverse compression artifacts and the subtle parameter shifts from model evolution}$. Extensive experiments on models such as $\texttt{Llama-2, Llama-3.1, Gemma-2}$, and $\texttt{Mistral}$ across reasoning benchmarks like $\textit{SQA, MATH, and GSM8K}$ demonstrate that our single adapter achieves performance comparable to specialized adapters ($\textit{e.g.}$, QLoRA) that are individually retrained for each compression scheme. Furthermore, we show this single adapter maintains its high performance when applied to future, evolved versions of the base model, eliminating the need for periodic retraining. Our work pioneers an efficient paradigm for edge AI, creating portable model patches that bridge the gap between cloud-based personalization, the diverse hardware ecosystem, and the lifecycle of evolving LLMs.
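The described training recipe, a quantized forward pass with a full-precision backward pass, is commonly implemented with a straight-through estimator; the sketch below assumes per-tensor symmetric quantization and a small set of simulated bit-widths, which may not match the paper's choices.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated low-bit quantization with a straight-through estimator: the
    forward pass sees quantization error, the backward pass stays full-precision."""
    @staticmethod
    def forward(ctx, w, n_bits=4):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None   # straight-through: gradient bypasses the rounding

def compression_aware_weight(w, bit_choices=(4, 8)):
    """Sample a simulated compression scheme per training step so the adapter
    becomes robust to the deployment-time choice (the set of simulated schemes
    is an assumption for illustration)."""
    n_bits = bit_choices[torch.randint(len(bit_choices), (1,)).item()]
    return FakeQuantSTE.apply(w, n_bits)
```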
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5203. SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
๐ฌ 19
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Spatial reasoning Vision-language models Large language models Reasoning Models LLM Evaluation Spatial Understanding
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce **_SpatiaLab_**, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
**_SpatiaLab_** comprises 1,400 visual question-answer pairs across six major categories: *Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation,* and *3D Geometry*, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation.
Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry.
By providing a diverse, real-world evaluation framework, **_SpatiaLab_** exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. We will release **_SpatiaLab_**.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5204. PETRI: Learning Unified Cell Embeddings from Unpaired Modalities via Early-Fusion Joint Reconstruction
๐ฌ 10
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ high-content screening cell biology single cell transcriptomics microscopy multimodal
Integrating multimodal screening data is challenging because biological signals only partially overlap and cell-level pairing is frequently unavailable. Existing approaches either require pairing or fail to capture both shared and modality-specific information in an end-to-end manner. We present PETRI, an early-fusion transformer that learns a unified cell embedding from unpaired cellular images and gene expression profiles. PETRI groups cells by shared experimental context into multimodal "documents" and performs masked joint reconstruction with cross-modal attention, permitting information sharing while preserving modality-specific capacity. The resulting latent space supports construction of perturbation-level profiles by simple averaging across modalities. Applying sparse autoencoders to the embeddings reveals learned concepts that are biologically meaningful, multimodal, and retain perturbation-specific effects. To support further machine learning research, we release a blinded, matched optical pooled screen (OPS) and Perturb-seq dataset in HepG2 cells.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5205. Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
๐ฌ 27
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Mathematical Reasoning Group Relative Policy Optimization Question Reformulation
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing the mathematical reasoning capabilities of large models.
However, we identify that harder questions lack sufficient attention in existing methods from both algorithmic and data perspectives.
Algorithmically, widely used Group Relative Policy Optimization (GRPO) and its variants exhibit a critical limitation: their advantage estimation introduces an implicit imbalance where the magnitude of policy updates is lower for harder questions.
From a data-centric viewpoint, existing augmentation approaches primarily rephrase questions to enhance diversity, without systematically increasing their intrinsic difficulty.
To address these issues, we propose MathForge, a dual-perspective framework that improves mathematical reasoning by targeting harder questions from both the algorithmic and the data side; it comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy.
Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation and further prioritizes more challenging questions by difficulty-aware question-level weighting.
Meanwhile, MQR reformulates questions across multiple aspects to increase their difficulty while maintaining the original gold answer.
Overall, MathForge creates a synergistic loop: MQR expands the data frontier, and DGPO efficiently masters the augmented data.
Extensive experiments demonstrate that MathForge markedly outperforms existing methods on various mathematical reasoning tasks.
The code and augmented data will be made available.
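For reference, the standard GRPO group advantage and one hypothetical way to add a difficulty-dependent weight are sketched below; DGPO's actual balancing and weighting are not specified in the abstract.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO-style group advantage: normalize each rollout's reward by
    the mean and std of rewards for the same question."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def difficulty_weighted_advantages(rewards, gamma=1.0):
    """Illustrative difficulty-aware variant: weight a question's advantages by
    a function of its difficulty, proxied here by the group failure rate.
    The exact balancing and weighting used by DGPO may differ."""
    adv = grpo_advantages(rewards)
    difficulty = 1.0 - np.mean(np.asarray(rewards) > 0)   # fraction of failed rollouts
    weight = (1.0 + difficulty) ** gamma                   # harder question -> larger weight
    return weight * adv
```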
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5206. Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling
๐ฌ 18
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Langevin sampling generative models constrained sampling duality Lagrangian data assimilation optimal control diffusion models inverse problems
Deep generative models hold great promise for representing complex physical systems, but their deployment is currently limited by the lack of guarantees on the physical plausibility of the generated outputs. Ensuring that known physical constraints are enforced is therefore critical when applying generative models to scientific and engineering problems. We address this limitation by developing a principled framework for sampling from a target distribution while rigorously satisfying physical constraints. Leveraging the variational formulation of Langevin dynamics, we propose Split Augmented Langevin (SAL), a novel primal-dual sampling algorithm that enforces constraints progressively through variable splitting, with convergence guarantees. While the method is developed theoretically for Langevin dynamics, we demonstrate its effective applicability to diffusion models. In particular, we use constrained diffusion models to generate physical fields satisfying energy and mass conservation laws. We apply our method to diffusion-based data assimilation on a complex physical system, where enforcing physical constraints substantially improves both forecast accuracy and the preservation of critical conserved quantities. We also demonstrate the potential of SAL for challenging feasibility problems in optimal control.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5207. String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation
๐ฌ 21
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Prompting Diversity
We introduce _String Seed of Thought (SSoT)_, a novel prompting method for LLMs that improves _Probabilistic Instruction Following (PIF)_. We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games.
It also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM to first output a random string to generate sufficient entropy. SSoT also instructs the LLM to extract randomness by manipulating this string to derive a final answer, thereby preserving diversity while adhering to specific constraints. We demonstrate that SSoT significantly improves the PIF performance of LLMs, approaching the ideal performance of a pseudo-random number generator. Notably, our experiments on NoveltyBench show SSoT's benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.
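A toy prompt constructor in the spirit of SSoT, where the instruction wording and the seed-to-answer mapping are assumptions based only on the two steps described above.

```python
def ssot_prompt(question: str, options_with_probs: dict) -> str:
    """Sketch of a String-Seed-of-Thought style prompt for probabilistic
    instruction following: ask for a random string first, then derive the
    final answer from it. Template text is illustrative, not the paper's."""
    dist = ", ".join(f"{o}: {p:.0%}" for o, p in options_with_probs.items())
    return (
        f"{question}\n"
        f"Target answer distribution: {dist}.\n"
        "Step 1: Output a random string of 20 characters to serve as a seed.\n"
        "Step 2: Convert the seed into a number (e.g., the sum of character codes "
        "modulo 100) and map it onto the target distribution to pick your answer.\n"
        "Finally, output only the chosen option."
    )

print(ssot_prompt("Pick a drink.", {"coffee": 0.7, "tea": 0.3}))
```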
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5208. AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
๐ฌ 27
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ code generation benchmark autocodebench llm
Large Language Models (LLMs) have shown impressive performance across diverse domains, with code generation emerging as a particularly prominent application. However, existing benchmarks designed to evaluate code generation exhibit several critical limitations. First, most rely on manual annotations, which are time-consuming and difficult to scale across programming languages and problem complexities. Second, the majority focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and imbalanced language coverage.
To overcome these challenges, we present AutoCodeGen, an automated framework for constructing high-difficulty, multilingual code generation datasets without manual annotations. Our approach guarantees correctness and completeness by generating test inputs with LLMs, obtaining test outputs within a multilingual sandbox, and further enhancing quality through reverse problem generation and multi-stage filtering.
Based on this novel method, we introduce AutoCodeBench, a large-scale benchmark suite spanning 20 programming languages with balanced coverage. AutoCodeBench is designed to rigorously evaluate LLMs on diverse, challenging, and realistic multilingual programming tasks. Extensive experiments reveal that even state-of-the-art models struggle on these tasks, particularly in low-resource languages.
Besides, we release complementary training and evaluation resources, including a large-scale, verifiable multilingual instruction dataset generated via the same pipeline, as well as a multilingual sandbox with high-concurrency support. We hope these contributions will provide a solid foundation for future research and inspire the community to explore more automatic and scalable approaches to multilingual code generation, with a particular emphasis on advancing progress in low-resource languages.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5209. Lifelong Learning with Behavior Consolidation for Vehicle Routing
๐ฌ 26
โญ 4.00
๐ optimization
๐ท๏ธ Neural Combinatorial Optimization Vehicle Routing Lifelong Learning for Optimization
Recent neural solvers have demonstrated promising performance in learning to solve routing problems. However, existing studies are primarily based on one-off training on one or a set of predefined problem distributions and scales, i.e., tasks.
When a new task arises, they typically rely on either zero-shot generalization, which may be poor due to the discrepancies between the new task and the training task(s), or fine-tuning the pretrained solver on the new task, which possibly leads to catastrophic forgetting of knowledge acquired from previous tasks. This paper explores a novel lifelong learning paradigm for neural VRP solvers, where multiple tasks with diverse distributions and scales arise sequentially over time. Solvers are required to effectively and efficiently learn to solve new tasks while maintaining their performance on previously learned tasks. Consequently, a novel framework called Lifelong Learning Router with Behavior Consolidation (LLR-BC) is proposed. LLR-BC consolidates prior knowledge effectively by aligning behaviors of the solver trained on a new task with the buffered ones in a decision-seeking way. To encourage more focus on crucial experiences, LLR-BC assigns greater consolidated weights to decisions with lower confidence. Extensive experiments on capacitated vehicle routing problems and traveling salesman problems demonstrate LLR-BC's effectiveness in training high-performance neural solvers in a lifelong learning setting, addressing the catastrophic forgetting issue, maintaining their plasticity, and improving zero-shot generalization ability.
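A minimal sketch of a decision-level consolidation loss with confidence-based weighting, consistent with but not copied from the description above.

```python
import torch
import torch.nn.functional as F

def behavior_consolidation_loss(new_logits, buffered_probs):
    """Keep the current solver's action distribution close to the buffered one
    at replayed decision points, with larger weights for decisions the buffered
    solver was less confident about. The KL form and the confidence weighting
    are assumptions for illustration.

    new_logits:     (batch, n_actions) logits of the solver being trained
    buffered_probs: (batch, n_actions) action distributions stored in the buffer
    """
    confidence = buffered_probs.max(dim=-1).values   # peakedness of the old decision
    weights = 1.0 - confidence                        # low confidence -> high weight
    kl = F.kl_div(F.log_softmax(new_logits, dim=-1), buffered_probs,
                  reduction="none").sum(dim=-1)
    return (weights * kl).mean()
```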
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5210. The Value of Information in Human-AI Decision-making
๐ฌ 16
โญ 4.00
๐ interpretability and explainable AI
๐ท๏ธ Explanation Human-AI complementarity Decision theory
Multiple agents are increasingly combined to make decisions with the expectation of achieving complementary performance, where the decisions they make together outperform those made individually. However, knowing how to improve the performance of collaborating agents requires knowing what information and strategies each agent employs. With a focus on human-AI pairings, we contribute a decision-theoretic framework for characterizing the value of information. By defining complementary information, our approach identifies opportunities for agents to better exploit available information in AI-assisted decision workflows. We present a novel explanation technique (ILIV-SHAP) that adapts SHAP explanations to highlight human-complementing information. We validate the effectiveness of the framework on examples from chest X-ray diagnosis and deepfake detection, and of ILIV-SHAP through a study of human-AI decision-making. We also find that presenting ILIV-SHAP alongside AI predictions leads to reliably greater reductions in error over non-AI-assisted decisions than vanilla SHAP.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5211. Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management
๐ฌ 28
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Long Context Active Context Management Tool Use Proactive Interference Reinforcement Learning
Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs' capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) precise search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on diverse long-context benchmarks demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs' inherent tool-calling and instruction-following capabilities. To further optimize these strategies, we introduce a novel dynamic context-aware reinforcement learning (RL) approach, advancing the training of an agent that actively modifies its own conversational history. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5212. MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
๐ฌ 20
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ spatial reasoning vision language model
We present MetaSpatial, the first reinforcement learning (RL) framework for enhancing 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene layout generation without post-processing. MetaSpatial addresses two key challenges: (i) the need for extensive post-processing, as existing VLMs lack inherent 3D spatial reasoning to generate realistic layouts; and (ii) the inefficiency of supervised fine-tuning (SFT) for layout generation due to scarcity of perfect annotations. Our core contribution is the 3D Spatial Policy Optimization (3D-SPO) algorithm, which incorporates physics-aware modulation into advantage estimates at the object level and trajectory-level reward from a training-only multi-turn refinement pipeline. This design enhances temporal credit assignment and encourages spatially consistent policy learning. Empirical evaluations across models of varying scales demonstrate that MetaSpatial improves spatial coherence, physical plausibility, and formatting stability, leading to more realistic and functionally coherent object placements applicable to metaverse environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5213. SafeMPO: Constrained Reinforcement Learning with Probabilistic Incremental Improvement
๐ฌ 15
โญ 4.00
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning Constrained Reinforcement Learning
Reinforcement Learning (RL) has demonstrated significant success in optimizing complex control and planning problems. However, scaling RL to real-world applications with multiple, potentially conflicting requirements requires an effective handling of constraints. We propose a novel approach to constraint satisfaction in RL algorithms, focusing on incrementally improving policy safety rather than directly projecting the policy onto a feasible region. We accomplish this by first solving a nonparametric surrogate problem which is guaranteed to contract towards the feasible set, and then cloning that solution into a neural network policy. As a result, our approach improves stability, particularly during early training stages, when the policy lacks knowledge of constraint boundaries. We provide general theoretical results guaranteeing convergence to the safe set for this class of incremental systems. Notably, even the simplest algorithm derived from our theory achieves comparable or superior performance compared to highly tuned constrained RL baselines in challenging constrained environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5214. Synthesising Counterfactual Explanations via Label-Conditional Gaussian Mixture Variational Autoencoders
๐ฌ 11
โญ 4.00
๐ interpretability and explainable AI
๐ท๏ธ Counterfactual Explanations Contrastive Explanations Explainable AI Trustworthy AI Algorithmic Recourse
Counterfactual explanations (CEs) provide recourse recommendations for individuals affected by algorithmic decisions. A key challenge is generating CEs that are robust against various perturbation types (e.g. input and model perturbations) while simultaneously satisfying other desirable properties. These include plausibility, ensuring CEs reside on the data manifold, and diversity, providing multiple distinct recourse options for single inputs. Existing methods, however, mostly struggle to address these multifaceted requirements in a unified, model-agnostic manner. We address these limitations by proposing a novel generative framework. First, we introduce the Label-conditional Gaussian Mixture Variational Autoencoder (L-GMVAE), a model trained to learn a structured latent space where each class label is represented by a set of Gaussian components with diverse, prototypical centroids. Building on this, we present LAPACE (LAtent PAth Counterfactual Explanations), a model-agnostic algorithm that synthesises entire paths of CE points by interpolating from inputs' latent representations to those learned latent centroids. This approach inherently ensures robustness to input changes, as all paths for a given target class converge to the same fixed centroids. Furthermore, the generated paths provide a spectrum of recourse options, allowing users to navigate the trade-off between proximity and plausibility while also encouraging robustness against model changes. In addition, user-specified actionability constraints can also be easily incorporated via lightweight gradient optimisation through the L-GMVAE's decoder. Comprehensive experiments show that LAPACE is computationally efficient and achieves competitive performance across eight quantitative metrics.
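The path-synthesis idea can be sketched as latent interpolation toward a class centroid followed by decoding; `encoder`, `decoder`, and `target_centroid` are placeholders for the L-GMVAE components, and linear interpolation stands in for the learned behaviour of the actual method.

```python
import torch

def lapace_style_path(encoder, decoder, x, target_centroid, n_steps=10):
    """Interpolate in latent space from the input's representation toward a
    learned centroid of the target class and decode each point, yielding a
    spectrum of counterfactuals from near-input to prototypical. A simplified
    sketch, not the paper's exact procedure."""
    with torch.no_grad():
        z0 = encoder(x)
        path = []
        for t in torch.linspace(0.0, 1.0, n_steps):
            z_t = (1 - t) * z0 + t * target_centroid
            path.append(decoder(z_t))
    return path
```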
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5215. VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code
๐ฌ 16
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Formal verification verifiable coding agents code generation large language models benchmark
Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs).
While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation.
Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue.
To address this, we introduce VeriEquivBench, a new benchmark with $2,389$ complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning.
Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code.
Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5216. Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics
๐ฌ 16
โญ 4.00
๐ reinforcement learning
๐ท๏ธ Multi-task reinforcement learning world model transformer mixture-of-world models
A fundamental challenge in multi-task reinforcement learning (MTRL) is achieving sample efficiency in visual domains where tasks exhibit significant heterogeneity in both observations and dynamics. Model-based RL (MBRL) offers a promising path to sample efficiency through world models, but standard monolithic architectures struggle to capture diverse task dynamics, leading to poor reconstruction and prediction accuracy. We introduce the Mixture-of-World Models (MoW), a scalable architecture that integrates three key components: i) modular VAEs for task-adaptive visual compression, ii) a hybrid Transformer-based dynamics model combining task-conditioned experts with a shared backbone, and iii) a gradient-based task clustering strategy for efficient parameter allocation. On the Atari 100k benchmark, \textbf{a single MoW agent} (trained once over the $26$ Atari games) achieves a mean human-normalized score of $\mathbf{110.4}$%, competitive with the $\mathbf{114.2}$% achieved by the recent STORM (an ensemble of $26$ task-specific models), while requiring $50$% fewer parameters. On Meta-World, MoW attains a $\mathbf{74.5}$% average success rate within 300k steps, establishing a new state-of-the-art. These results demonstrate that MoW provides a scalable and parameter-efficient foundation for generalist world models. Our code is available in the supplementary materials.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5217. A Resolution-Agnostic Geometric Transformer for Chromosome Modeling Using Inertial Frame
๐ฌ 9
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Chromosome Modeling Inertial Frame Resolution-Agnostic 3D Transformer AI for Biology
Chromosomes are the carriers of genetic information. Further understanding their 3D structure can help reveal gene-regulatory mechanisms and cellular functions. A standard pipeline for reconstructing the chromosome 3D structure first applies the single-cell Hi-C high-throughput screening method to measure pairwise interactions between DNA fragments at different resolutions; then it adopts computational methods to reconstruct the 3D structures from these contacts. These include traditional numerical methods and deep learning models, which struggle with limited model expressiveness and poor generalization across resolutions. To solve this issue, we propose InertialGenome, a novel transformer-based framework for robust and resolution-agnostic chromosome reconstruction. InertialGenome first adopts the inertial frame for pose canonicalization. Then, based on such an invariant frame, it proposes a Transformer with geometry-aware positional encoding, leveraging Nyström estimation. To verify the effectiveness of InertialGenome, we evaluate our model on two single-cell 3D reconstruction datasets with four resolutions, reaching superior performance over all four computational baselines. In addition to the structure metrics, we observe that InertialGenome also outperforms the baselines when analyzing the function of reconstructed structures on two validation tasks. Finally, we leverage InertialGenome for cross-resolution transfer learning, yielding up to a 5\% improvement from low to high resolution.
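Pose canonicalization with an inertial frame is, in its generic form, a rotation into the principal axes of the structure; below is a numpy sketch under that assumption (the paper's exact conventions for axis ordering, sign fixing, and weighting may differ).

```python
import numpy as np

def inertial_frame_canonicalize(coords):
    """Center a 3D structure and rotate it into the principal axes of its
    (mass-uniform) inertia so downstream models see a rotation- and
    translation-invariant pose.

    coords: (n_beads, 3) reconstructed chromatin bead coordinates (hypothetical input).
    """
    centered = coords - coords.mean(axis=0)
    # Principal axes from the covariance; for unit masses these coincide with
    # the inertia tensor's eigenvectors, up to axis ordering.
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, ::-1]                    # sort axes by decreasing variance
    if np.linalg.det(axes) < 0:                # keep a right-handed frame
        axes[:, -1] *= -1
    return centered @ axes
```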
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5218. Explainable LLM Unlearning through Reasoning
๐ฌ 43
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Unlearning
LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained Large Language Models (LLMs). Compared to preference alignment, it offers a more explicit way by removing undesirable knowledge characterized by specific unlearning datasets.
In previous works, Gradient Ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and the generation of incoherent responses, among other issues. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn.
To fill this gap, we introduce a novel unlearning target, the *reasoning-based unlearning target*, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose *Targeted Reasoning Unlearning* (TRU), which leverages the reasoning-based unlearning target as guidance. We train toward this target using a supervised cross-entropy loss combined with a GA-based loss, enabling the model to learn reasoning ability for precise knowledge removal while preserving unrelated abilities.
We evaluate TRU against strong baselines across multiple benchmarks and LLM backbones, and find that it achieves more reliable unlearning while preserving general capabilities. Moreover, TRU exhibits superior robustness under diverse attack scenarios, stemming from the reasoning ability learned through reasoning-based targets. Overall, our study establishes reasoning-augmented unlearning as a practical paradigm for reliable and explainable LLM unlearning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5219. SeRI: Gradient-Free Sensitive Region Identification in Decision-Based Black-Box Attacks
๐ฌ 22
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Machine learning AI safety Decision-based adversarial attacks Sensitive region
Deep neural networks (DNNs) are highly vulnerable to adversarial attacks, where small, carefully crafted perturbations are added to input images to cause misclassification. These perturbations are particularly effective when concentrated in sensitive regions of an image that strongly influence the modelโs prediction. However, in decision-based black-box settings, where only the top-1 predicted label is observable and query budgets are strictly limited, identifying sensitive regions becomes extremely challenging. This issue is critical because without accurate region information, decision-based attacks cannot refine adversarial examples effectively, limiting both their efficiency and accuracy.
We propose Sensitive Region Identification (SeRI), the first decision-based method that assigns a continuous sensitivity score to each image pixel. It enables fine-grained region discovery and substantially improves the efficiency of adversarial attacks, all without access to gradients, confidence scores, or surrogate models.
SeRI progressively partitions the image into finer sub-regions and refines a continuous sensitivity score to capture their true importance. At each iteration, it generates two perturbation variants of the selected region by scaling its magnitude up or down, and compares their decision boundaries to derive an accurate, continuous characterization of pixel sensitivity.
SeRI further divides the selected region into smaller sub-regions, recursively refining the search for sensitive areas. This recursive refinement process enables more precise sensitivity estimation through fine-grained analysis, distinguishing SeRI from prior binary or one-shot region selection approaches. Experiments on two benchmark datasets show that SeRI significantly enhances state-of-the-art decision-based attacks in both targeted and non-targeted attack scenarios. Additionally, SeRI generates precise heatmaps that identify sensitive image regions. The code is available at https://anonymous.4open.science/r/SeRI-5310.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5220. AlphaAgentEvo: Evolution-Oriented Alpha Mining via Self-Evolving Agentic Reinforcement Learning
๐ฌ 19
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Alpha Mining Agentic AI Quantitative Investment Self-evolving Agent
Alpha mining seeks to identify predictive alpha factors that generate excess returns beyond the market from a vast and noisy search space; however, existing approaches struggle to facilitate the systematic evolution of alphas. Traditional methods, such as genetic programming, are unable to interpret human natural-language instructions and often fail to extract valuable insights from unsuccessful attempts, leading to low interpretability and inefficient exploration. Analogously, without mechanisms for systematic evolution, e.g., long-term planning and reflection, multi-agent approaches may easily fall into repetitive evolutionary routines, thereby failing to realize efficient self-evolution. To overcome these limitations, we introduce **AlphaAgentEvo**, a self-evolving Agentic Reinforcement Learning (ARL) framework for alpha mining, which moves alpha mining beyond the brittle "search-backtest-restart" cycle toward a continuous trajectory of evolution. Instructed by a hierarchical reward function, our agent engages in self-exploration of the search space, progressively learning basic requirements (e.g., valid tool calls) and then harder objectives (e.g., continuous performance improvements). Through this process, the agent acquires advanced behaviors such as long-horizon planning and reflective reasoning, which enable it to actively react to the underlying state (e.g., market regime shifting) and realize a self-evolving agent, taking a step toward more principled and scalable alpha mining. Extensive experiments demonstrate that AlphaAgentEvo achieves more efficient alpha evolution and generates more diverse and transferable alphas, consistently surpassing a wide range of baselines. Notably, with only 4B parameters, it outperforms LLM-driven evolution methods configured with state-of-the-art closed-source reasoning models, highlighting the promise of ARL for next-generation alpha mining.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5221. Disentanglement of Variations with Multimodal Generative Modeling
๐ฌ 9
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Multimodal Variational Autoencoder Disentanglement Multi-view Information Bottleneck Diffusion Models
Multimodal data are prevalent across various domains, and learning robust representations of such data is paramount to enhancing generation quality and downstream task performance. To handle heterogeneity and interconnections among different modalities, recent multimodal generative models extract shared and private (modality-specific) information with two separate variables. Despite attempts to enforce disentanglement between these two variables, these methods struggle with challenging datasets where the likelihood model is insufficient. In this paper, we propose Information-disentangled Multimodal VAE (IDMVAE) to explicitly address this issue, with rigorous mutual information-based regularizations, including cross-view mutual information maximization for extracting shared variables, and a cycle-consistency style loss for redundancy removal using generative augmentations. We further introduce diffusion models to improve the capacity of latent priors. These newly proposed components are complementary to each other. Compared to existing approaches, IDMVAE shows a clean separation between shared and private information, demonstrating superior generation quality and semantic coherence on challenging datasets.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5222. InfoNCE Induces Gaussian Distribution
๐ฌ 20
โญ 4.00
๐ learning theory
๐ท๏ธ Contrastive learning Gaussian distribution InfoNCE
Contrastive learning has become a bedrock of unsupervised learning in recent years, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this paper we show that the feature embeddings which emerge from InfoNCE training can be well approximated by a multivariate Gaussian distribution. We justify this claim in two ways. First, we show that under certain alignment and concentration assumptions, finite projections of a high-dimensional representation approach a multivariate Gaussian distribution as the representation dimension approaches infinity. Next, under less strict assumptions, we show that by adding a small regularization term (which vanishes asymptotically) that promotes low feature norm and high feature entropy, we reach similar asymptotic results. We demonstrate experimentally, in a synthetic setting, on CIFAR-10, and on pretrained foundation models, that the features indeed follow an almost precisely Gaussian distribution. One can use the Gaussian model to easily derive analytic expressions in the representation space and to obtain very useful measures, such as likelihood, data entropy, and mutual information. Hence, we expect such theoretical grounding to be very useful in various applications involving contrastive learning.
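For context, here is a minimal sketch of the two ingredients discussed above: a standard InfoNCE loss, and a Gaussian fit to the resulting features that yields analytic log-likelihoods. This is generic illustration code under common conventions (L2-normalized embeddings, cross-entropy form of InfoNCE), not the paper's implementation.

```python
# Illustrative sketch: InfoNCE loss plus a Gaussian model of the features.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, d) embeddings of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def fit_gaussian(features, eps=1e-4):
    """Fit mean/covariance to features and return a distribution object that
    provides analytic log-likelihoods (and entropy) in representation space."""
    mu = features.mean(dim=0)
    centered = features - mu
    cov = centered.t() @ centered / (features.size(0) - 1)
    cov += eps * torch.eye(cov.size(0))           # regularize for stability
    return torch.distributions.MultivariateNormal(mu, covariance_matrix=cov)

# Usage: after contrastive training, score samples under the Gaussian model.
feats = torch.randn(1000, 32)                     # stand-in for real features
gauss = fit_gaussian(feats)
log_likelihood = gauss.log_prob(feats).mean()
```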
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5223. MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
๐ฌ 26
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Linear Attention Model Architecture Efficiency
While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. **Linear attention** offers an efficient alternative, but its direct application often degrades performance, and existing fixes typically re-introduce computational overhead through extra modules (e.g., depthwise separable convolution and a few self-attention blocks) that defeat the original purpose. In this work, we identify a key failure mode in these methods: **global context collapse**, where the model loses representational diversity. To address this, we propose **Multi-Head Linear Attention (MHLA)**, which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a **3.6%** improvement on ImageNet classification, a **6.3%** gain on NLP, a **12.6%** improvement in image generation tasks, and a **41%** enhancement in video generation tasks at the same computational complexity.
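A hedged sketch of one plausible reading of the mechanism follows: heads are divided along the token dimension, and each head runs kernelized linear attention over its own token group, so each group keeps its own global summary rather than one collapsed context for all tokens. The actual MHLA formulation may differ in detail.

```python
# Illustrative sketch, not the paper's implementation.
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: O(n * d^2) instead of O(n^2 * d)."""
    q, k = torch.relu(q) + eps, torch.relu(k) + eps   # simple positive kernel
    kv = torch.einsum("nd,ne->de", k, v)              # (d, e) global summary
    norm = q @ k.sum(dim=0)                           # (n,) normalizer
    return (q @ kv) / norm.unsqueeze(-1)

def token_multihead_linear_attention(q, k, v, num_heads=4):
    """Split the *token* axis into heads so each head maintains its own
    summary, one plausible way to counter global context collapse."""
    outs = []
    for qh, kh, vh in zip(q.chunk(num_heads, dim=0),
                          k.chunk(num_heads, dim=0),
                          v.chunk(num_heads, dim=0)):
        outs.append(linear_attention(qh, kh, vh))
    return torch.cat(outs, dim=0)

# Toy usage on a sequence of 16 tokens with dimension 8
q = k = v = torch.randn(16, 8)
out = token_multihead_linear_attention(q, k, v)
```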
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5224. GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
๐ฌ 20
โญ 4.00
๐ generative models
๐ท๏ธ diffusion models diffusion acceleration diffusion distillation ODE solvers adversarial training
While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5225. Escaping the Homophily Trap: A Threshold-free Graph Outlier Detection Framework via Clustering-guided Edge Reweighting
๐ฌ 29
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Outlier Detection Graph Neural Networks Clustering
Graph outlier detection is a critical task for identifying rare, deviant patterns in graph-structured data.
However, prevalent methods based on graph convolution are fundamentally challenged by the "Homophily Trap": the aggregation of features from neighboring nodes inadvertently contaminates the representations of normal nodes near anomalies, blurring their distinctions.
To overcome this limitation, we propose a Clustering-guided Edge Reweighting framework for Graph Outlier Detection (CER-GOD), which jointly optimizes a self-discriminative masking spoiler with an adaptive clustering-based outlier detector.
The masking spoiler learns to selectively weaken the influence of heterogeneous neighbors, preserving the discriminative power of node embeddings.
This process is guided by the clustering detector, which generates pseudo-labels in an unsupervised manner, thereby eliminating the need for predefined anomaly thresholds.
To ensure robust optimization and prevent class collapse, a failure mode exacerbated by the homophily trap, we introduce a diversity loss that stabilizes the clustering process.
Our end-to-end framework demonstrates superior performance on multiple benchmark datasets, establishing a new state-of-the-art by effectively dismantling the homophily trap.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5226. Entropy-preserving reinforcement learning
๐ฌ 9
โญ 4.00
๐ reinforcement learning
๐ท๏ธ Large language model reinforcement learning entropy GRPO PPO
Policy gradient algorithms have been a driver of much recent advancement in language model reasoning. One of their most appealing properties is the ability to learn from exploration on their own trajectories, a process crucial for discovering diverse approaches and fostering creative solutions. As we show in this paper, most policy gradient algorithms naturally reduce the entropy, and thus the diversity of explored trajectories, as part of training, yielding a policy increasingly limited in its ability to explore. However, not all algorithms exhibit this collapse in entropy equally. In this paper, we formally analyze how leading policy gradient objectives affect entropy, show which mechanisms they employ to implicitly limit entropy collapse, and propose a new regularization method, REPO, that stabilizes entropy over training through the use of an adaptive controller. Models trained with REPO preserve entropy throughout training, yielding final policies that are, on average, more performant. By preserving entropy in the final policy, REPO-trained models can even be re-trained on evolving data distributions in new environments, unlike their non-entropy-preserving counterparts.
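To make the adaptive-controller idea concrete, here is a minimal sketch under assumed names (`target_entropy`, a simple proportional update rule); it is not the paper's exact REPO mechanism, only a generic controller of the kind the abstract describes.

```python
# Illustrative sketch: a controller that keeps policy entropy near a target
# by adjusting the entropy-bonus coefficient during training.
class EntropyController:
    def __init__(self, target_entropy, coef=0.01, lr=1e-3,
                 coef_min=0.0, coef_max=1.0):
        self.target = target_entropy
        self.coef = coef
        self.lr = lr
        self.coef_min, self.coef_max = coef_min, coef_max

    def update(self, measured_entropy):
        # If entropy drops below the target, raise the bonus coefficient
        # (and vice versa), then clip to a safe range.
        self.coef += self.lr * (self.target - measured_entropy)
        self.coef = min(max(self.coef, self.coef_min), self.coef_max)
        return self.coef

# In a policy-gradient training loop the total loss could then look like:
#   loss = policy_loss - controller.update(entropy.detach()) * entropy
```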
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5227. FlowNIB: An Information Bottleneck Analysis of Bidirectional vs. Unidirectional Language Models
๐ฌ 32
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Bidirectional Language Models Information Bottleneck Mutual Information FlowNIB Layer-wise Analysis Context Understanding Natural Language Understanding
Bidirectional language models (LMs) consistently show stronger context understanding than unidirectional models, yet the theoretical reason remains unclear. We present a simple information bottleneck (IB) perspective: bidirectional representations preserve more mutual information (MI) about both the input and the target, yielding richer features for downstream tasks. We adopt a layer-wise view and hypothesize that, at comparable capacity, bidirectional layers retain more useful signal than unidirectional ones. To test this claim empirically, we present Flow Neural Information Bottleneck (FlowNIB), a lightweight, post-hoc framework capable of estimating comparable mutual information values for individual layers in LMs, quantifying how much mutual information each layer carries for a dataset. FlowNIB takes three inputs, (i) the original LM's inputs/dataset, (ii) ground-truth labels, and (iii) layer activations, and simultaneously estimates the mutual information for both the input-layer and layer-label pairs. Empirically, bidirectional LM layers exhibit higher mutual information than similar, and even larger, unidirectional LMs. As a result, bidirectional LMs outperform unidirectional LMs across extensive experiments on NLU benchmarks (e.g., GLUE), commonsense reasoning, and regression tasks, demonstrating superior context understanding.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5228. TRAC: Tensor-Train based Across-layer Compression for Parameter-Efficient Fine-Tuning
๐ฌ 22
โญ 4.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Parameter-efficient fine-tuning Low-rank adaptation Tensor decomposition
Fine-tuning large pre-trained models under resource constraints remains challenging due to the massive number of parameters involved. Existing parameter-efficient tuning methods, such as low-rank adaptation (LoRA) and its variants, rely heavily on matrix factorization and often struggle in extremely low-parameter regimes. In this work, we propose TRAC, a novel fine-tuning framework that leverages Tensor-Train decomposition with Across-layer Compression. Specifically, TRAC represents each adaptation module as a compact sequence of tensor-train cores and allows certain cores to be frozen or shared across layers, thereby exploiting the inherent similarity and redundancy among layer weight matrices. To retain layer-specific flexibility, lightweight controllers are introduced, enabling shared tensor cores to adaptively modulate representations. We evaluate TRAC on diverse architectures, including LLaMA-2, GPT, BERT, and ViT, across benchmarks covering text classification, text generation, and image classification. Experimental results demonstrate that TRAC achieves performance comparable to or better than LoRA and its variants, while substantially reducing trainable parameters and storage requirements.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5229. Decomposition of Concept-Level Rules in Visual Scenes
๐ฌ 16
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Compositionality Concept-Level Rules Decomposition Large Vision-Language Models
Human cognition is compositional, and one can parse a visual scene into independent concepts and the corresponding concept-changing rules. By contrast, many vision-language systems process images holistically, with limited support for explicit decomposition. Previous methods for decomposing concepts and rules often rely on hand-crafted inductive biases or human-designed priors. We introduce a framework (CRD) to decompose concept-level rules with Large Vision-Language Models (LVLMs), which explains visual input through LVLM-extracted concepts and the rules governing their variation. The proposed method operates in two stages: (1) a pretrained LVLM proposes visual concepts and concept values, which are used to instantiate a space of concept rule functions that model concept changes and spatial distributions; (2) an iterative process selects a concise set of concepts that best account for the input according to the rule function. We evaluate CRD on an abstract visual reasoning benchmark and a real-world image caption dataset. Across both settings, our approach outperforms baseline models while improving interpretability by explicitly revealing underlying concepts and compositional rules, advancing explainable and generalizable visual reasoning.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5230. Content-Aware Mamba for Learned Image Compression
๐ฌ 26
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Learned Image Compression
Recent learned image compression (LIC) methods leverage Mamba-style state-space models (SSMs) to obtain global receptive fields with linear complexity. However, the standard Mamba adopts content-agnostic, predefined raster (or multi-directional) scans under strict causality. This rigidity hinders its ability to effectively eliminate redundancy between tokens that are content-correlated but spatially distant.
We introduce Content-Aware Mamba (CAM), an SSM that dynamically adapts its processing to the image content.
Specifically, CAM overcomes prior limitations with two novel mechanisms. First, it replaces the rigid scan with a content-adaptive token permutation strategy to prioritize interactions between content-similar tokens regardless of their location. Second, it overcomes the sequential dependency by injecting sample-specific global priors into the state-space model, which effectively mitigates the strict causality without multi-directional scans.
These innovations enable CAM to better capture global redundancy while preserving computational efficiency. Our Content-Aware Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by -15.91\%, -21.34\%, and -17.58\% BD-rate on Kodak, Tecnick, and CLIC benchmarks, respectively.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5231. Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning
๐ฌ 22
โญ 4.00
๐ neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
๐ท๏ธ Information-Theoretic Bounds Compositional Circuits Reasoning Generalization CoT Training
Chain-of-Thought (CoT) training has markedly advanced the reasoning capabilities of large language models (LLMs), yet the mechanisms by which CoT training enhances generalization remain inadequately understood. In this work, we demonstrate that compositional generalization is fundamental: models systematically combine simpler learned skills during CoT training to address novel and more complex problems. Through a theoretical and structural analysis, we formalize this process: 1) Theoretically, information-theoretic generalization bounds based on distributional divergence can be decomposed into in-distribution (ID) and out-of-distribution (OOD) components. Specifically, non-CoT models fail on OOD tasks due to unseen compositional patterns, whereas CoT-trained models achieve strong generalization by composing previously learned skills. In addition, controlled experiments and real-world validation confirm that CoT training accelerates convergence and enhances generalization in both ID and OOD scenarios while maintaining robust performance even under tolerable noise. 2) Structurally, CoT training internalizes reasoning into a two-stage compositional circuit, where the number of stages corresponds to the explicit reasoning steps during training. Notably, CoT-trained models resolve intermediate results at shallower layers than non-CoT counterparts, freeing up deeper layers to specialize in subsequent reasoning steps. A key insight is that CoT training teaches models how to think, by fostering compositional reasoning, rather than merely what to think through the provision of correct answers alone. This paper offers valuable insights for designing CoT strategies to enhance LLMs' reasoning robustness.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5232. Spike-based Digital Brain: a novel fundamental model for brain activity analysis
๐ฌ 23
โญ 4.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ Brain activity Fundamental model Spike computing fMRI prediction Brain diseases.
Modeling the temporal dynamics of the human brain remains a core challenge in computational neuroscience and artificial intelligence. Traditional methods often ignore the biological spike characteristics of brain activity and find it difficult to reveal the dynamic dependencies and causal interactions between brain regions, limiting their effectiveness in brain function research and clinical applications. To address this issue, we propose a Spike-based Digital Brain (Spike-DB), a novel fundamental model that introduces the spike computing paradigm into brain time series modeling. Spike-DB encodes fMRI signals as spike trains and learns the temporal driving relationships between anchor and target regions to achieve high-precision prediction of brain activity and reveal underlying causal dependencies and dynamic relationship characteristics. Based on Spike-DB, we further conducted downstream tasks including brain disease classification, abnormal brain region identification, and effective connectivity inference. Experimental results on real-world epilepsy datasets and the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset show that Spike-DB outperforms existing mainstream methods in both prediction accuracy and downstream tasks, demonstrating its broad potential in clinical applications and brain science research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5233. Multi-Agent Debate with Memory Masking
๐ฌ 14
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ multi-agent debate memory selection robustness
Large language models (LLMs) have demonstrated impressive capabilities in various language-based reasoning tasks (e.g., math reasoning). Among all LLM reasoning frameworks, _multi-agent debate_ (MAD), which employs multiple LLM agents and performs reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to refine their reasoning iteratively in each debate round and facilitates LLMs in alleviating the potential intrinsic self-preference bias. Although MAD improves the reasoning capabilities of LLMs significantly, in this paper, however, we theoretically demonstrate that the performance of MAD is closely related to the quality of memories. This indicates that MAD is still vulnerable to wrong reasoning memories, which poses a threat to the robustness of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, _multi-agent debate with memory masking_ (MAD-M$^2$), to enhance the robustness of MAD by allowing LLM agents to select memories in the previous debate round before they perform reasoning in the current debate round. In this way, MAD-M$^2$ can polish the contextual information at the beginning of each debate round by preserving as many informative and meaningful memories as possible while dropping the noisy memories and, in turn, achieve better reasoning results. Extensive empirical results on several mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ is able to achieve better results than the typical MAD.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5234. Inducing Dyslexia in Vision Language Models
๐ฌ 11
โญ 4.00
๐ applications to neuroscience & cognitive science
๐ท๏ธ VLMs Dyslexia Reading Cognition Causal hypothesis testing NeuroAI
Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective (VWF-selective) units within VLMs and demonstrate that targeted ablation of these units, unlike ablation of random units, leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact.
In particular, the resulting model matches dyslexic humans' phonological deficits without a significant change in orthographic processing. Additionally, the model's VWF-selective units predict human-VWFA neural responses better than random units and the ablated model mirrors dyslexic behavior in font sensitivity.
Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating reading disorders.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5235. Efficient Adversarial Attacks on High-dimensional Offline Bandits
๐ฌ 11
โญ 4.00
๐ learning theory
๐ท๏ธ Offline Bandits Adversarial Attacks
Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model---often distributed with public weights on platforms such as Hugging Face---to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit's behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model's weights can drastically alter the bandit's behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable.
Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates.
To address computational challenges, we design efficient heuristics that preserve almost 100\% success while dramatically reducing attack cost. In parallel, we propose a practical defense mechanism that partially mitigates such attacks, paving the way for safer offline bandit evaluation. Finally, we validate our findings on the UCB bandit and provide theoretical evidence that adversaries can delay optimal arm selection proportionally to the input dimension. Code is available at the anonymous repository: [https://anonymous.4open.science/r/offline-bandit](https://anonymous.4open.science/r/offline-bandit).
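As a toy illustration of the high-dimensional effect claimed above, the following sketch (my own construction, not the paper's attack) computes the minimal-norm weight perturbation that makes a chosen target arm preferred under a linear reward model. With reward scale held fixed (unit-norm weights), the required perturbation norm tends to shrink as the dimension grows.

```python
# Illustrative sketch: minimal L2 perturbation of linear reward weights
# that flips which arm looks best.
import numpy as np

def minimal_weight_attack(w, x_target, x_best, margin=1e-3):
    """Return delta with smallest ||delta||_2 such that
    (w + delta) @ x_target >= (w + delta) @ x_best + margin."""
    diff = x_target - x_best
    gap = w @ diff                       # current score difference
    if gap >= margin:
        return np.zeros_like(w)          # target already preferred
    # Project the needed correction onto the direction x_target - x_best.
    return (margin - gap) * diff / (diff @ diff)

# Toy dimensionality effect: with unit-norm weights (bounded reward scale),
# ||x_target - x_best|| grows roughly like sqrt(d), so the average required
# perturbation norm decreases as d increases.
rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    norms = []
    for _ in range(50):
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        x_best, x_target = rng.normal(size=d), rng.normal(size=d)
        norms.append(np.linalg.norm(minimal_weight_attack(w, x_target, x_best)))
    print(d, np.mean(norms))
```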
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5236. ProtoTS: Learning Hierarchical Prototypes for Explainable Time Series Forecasting
๐ฌ 26
โญ 4.00
๐ learning on time series and dynamical systems
๐ท๏ธ Time series forecasting; Interpretability
While deep learning has achieved impressive performance in time series forecasting, it becomes increasingly crucial to understand its decision-making process for building trust in high-stakes scenarios. Existing interpretable models often provide only local and partial explanations, lacking the capability to reveal how heterogeneous and interacting input variables jointly shape the overall temporal patterns in the forecast curve. We propose ProtoTS, a novel interpretable forecasting framework that achieves both high accuracy and transparent decision-making through modeling prototypical temporal patterns. ProtoTS computes instance-prototype similarity based on a denoised representation that preserves abundant heterogeneous information. The prototypes are organized hierarchically to capture global temporal patterns with coarse prototypes while capturing finer-grained local variations with detailed prototypes, enabling expert steering and multi-level interpretability. Experiments on multiple realistic benchmarks, including a newly released LOF dataset, show that ProtoTS not only exceeds existing methods in forecast accuracy but also delivers expert-steerable interpretations for better model understanding and decision support. The source code is available at https://anonymous.4open.science/r/ProtoTS-D791/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5237. HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series
๐ฌ 26
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ SSL Wearables Interpretability Inductive Bias
Wearable sensors provide abundant physiological time series, yet the principles governing their predictive utility remain unclear. We hypothesize that temporal resolution is a fundamental axis of representation learning, with different clinical and behavioral outcomes relying on structure at distinct scales. To test this \emph{resolution hypothesis}, we introduce \emph{HiMAE} (Hierarchical Masked Autoencoder), a self-supervised framework that combines masked autoencoding with a hierarchical convolutional encoder-decoder. HiMAE produces multi-resolution embeddings that enable systematic evaluation of which temporal scales carry predictive signal, transforming resolution from a hyperparameter into a probe for interpretability. Across classification, regression, and generative benchmarks, HiMAE consistently outperforms state-of-the-art foundation models that collapse scale, while being orders of magnitude smaller. HiMAE is an efficient representation learner compact enough to run entirely on-watch, achieving sub-millisecond latency on smartwatch-class CPUs for true edge inference. Together, these contributions position HiMAE as both an efficient self-supervised learning method and a discovery tool for scale-sensitive structure in wearable health.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5238. TokMem: Tokenized Procedural Memory for Large Language Models
๐ฌ 21
โญ 4.00
๐ transfer learning, meta learning, and lifelong learning
๐ท๏ธ Tokenized Procedural Memory Memory tokens Continual adaptation Large language models
Large language models rely heavily on prompts to specify tasks, recall knowledge and guide reasoning. However, this reliance is inefficient as prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse. We introduce TokMem, a tokenized procedural memory that stores recurring procedures as compact, trainable embeddings. Each memory token encodes both an address to a procedure and a control signal that steers generation, enabling targeted behavior with constant-size overhead. To support continual adaptation, TokMem keeps the backbone model frozen, allowing new procedures to be added without interfering with existing ones. We evaluate TokMem on 1,000 tasks for atomic recall and multi-step function-calling for compositional recall, where it consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and fine-tuning with far fewer parameters. These results establish TokMem as a scalable and modular alternative to prompt engineering and fine-tuning, offering an explicit procedural memory for LLMs.
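A hedged sketch of the idea as described follows: a bank of trainable memory-token embeddings, one short sequence per procedure, is prepended to a frozen backbone's input embeddings, and only the bank is trained. The backbone interface names (`embed`, `forward`) are assumptions for the sketch, not the released API.

```python
# Illustrative sketch, not the paper's released code.
import torch
import torch.nn as nn

class TokenMemory(nn.Module):
    def __init__(self, num_procedures, tokens_per_procedure, hidden_size):
        super().__init__()
        # One compact, trainable embedding sequence per stored procedure.
        self.memory = nn.Parameter(
            torch.randn(num_procedures, tokens_per_procedure, hidden_size) * 0.02
        )

    def forward(self, input_embeds, procedure_ids):
        """input_embeds: (B, T, H); procedure_ids: (B,) indices into the bank.
        Returns (B, T + tokens_per_procedure, H)."""
        mem = self.memory[procedure_ids]              # (B, M, H)
        return torch.cat([mem, input_embeds], dim=1)

# Assumed usage with a frozen backbone exposing `embed(ids)` and `forward(embeds)`:
# for p in backbone.parameters():
#     p.requires_grad_(False)
# memory = TokenMemory(num_procedures=1000, tokens_per_procedure=8, hidden_size=768)
# embeds = memory(backbone.embed(token_ids), procedure_ids)
# logits = backbone.forward(embeds)   # train only `memory` with the task loss
```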
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5239. When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
๐ฌ 21
โญ 4.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Abstention Temporal and non-temporal reasoning Question answering
Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering (QA), where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by 3.46% and 5.80% in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by 20% over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5240. Omni-iEEG: A Large-Scale, Comprehensive iEEG Dataset and Benchmark for Epilepsy Research
๐ฌ 18
โญ 4.00
๐ datasets and benchmarks
๐ท๏ธ Computational neuroscience iEEG Epilepsy Computer Aided Diagnosis Neurophysiology
Epilepsy affects over 50 million people worldwide, and one-third of patients suffer drug-resistant seizures where surgery offers the best chance of seizure freedom. Accurate localization of the epileptogenic zone (EZ) relies on intracranial EEG (iEEG). Clinical workflows, however, remain constrained by labor-intensive manual review. At the same time, existing data-driven approaches are typically developed on single-center datasets that are inconsistent in format and metadata, lack standardized benchmarks, and rarely release pathological event annotations, creating barriers to reproducibility, cross-center validation, and clinical relevance. With extensive efforts to reconcile heterogeneous iEEG formats, metadata, and recordings across publicly available sources,
we present $\textbf{Omni-iEEG}$, a large-scale, pre-surgical iEEG resource comprising $\textbf{302 patients}$ and $\textbf{178 hours}$ of high-resolution recordings. The dataset includes harmonized clinical metadata such as seizure onset zones, resections, and surgical outcomes, all validated by board-certified epileptologists. In addition, Omni-iEEG provides over 36K expert-validated annotations of pathological events, enabling robust biomarker studies. Omni-iEEG serves as a bridge between machine learning and epilepsy research. It defines clinically meaningful tasks with unified evaluation metrics grounded in clinical priors, enabling systematic evaluation of models in clinically relevant settings. Beyond benchmarking, we demonstrate the potential of end-to-end modeling on long iEEG segments and highlight the transferability of representations pretrained on non-neurophysiological domains. Together, these contributions establish Omni-iEEG as a foundation for reproducible, generalizable, and clinically translatable epilepsy research.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5241. PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity
๐ฌ 25
โญ 4.00
๐ generative models
๐ท๏ธ prompted diffusion image restoration expert-in-the-loop scientific imaging
Scientific and environmental imagery is often degraded by multiple compounding factors related to sensor noise and environmental effects. Existing restoration methods typically treat these mixed effects by iteratively removing fixed categories, lacking the compositionality needed to handle real-world mixtures and often introducing cascading artifacts, overcorrection, or signal loss. Moreover, many supervised approaches rely on paired ground-truth data, which may be unavailable or impossible to simulate for real-world degradations. We present PRISM (Precision Restoration with Interpretable Separation of Mixtures), a prompted conditional diffusion framework for expert-guided restoration under compound degradations. PRISM combines (1) compound-aware supervision on mixtures of distortions and (2) a weighted contrastive disentanglement objective that aligns compound distortions with their constituent primitives to enable high-fidelity joint restoration. Our compositional latent space supports both prompt-guided and automated restoration in novel settings by generalizing to unseen combinations of degradations. We outperform image restoration baselines on unseen complex real-world degradations, including underwater visibility, under-display camera effects, and fluid distortions. PRISM also enables selective restoration. Across microscopy, wildlife monitoring, and urban weather datasets, our method enhances downstream analysis by letting experts remove only degradations that hinder analysis, avoiding black-box "over-restoration." Together, these results establish PRISM as a generalizable and controllable framework for high-fidelity restoration in domains where scientific utility is a priority.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5242. Interpolation-Based Conditioning of Flow Matching Models for Bioisosteric Ligand Design
๐ฌ 21
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ drug discovery 3D molecule generation bioisosteric fragment merging conditional generation flow matching generative models
Fast, unconditional 3D generative models can now produce high-quality molecules, but adapting them for specific design tasks often requires costly retraining. To address this, we introduce Interpolate-Integrate and Replacement Guidance, two training-free, inference-time conditioning strategies that provide control over E(3)-equivariant flow-matching models.
Our methods generate bioisosteric 3D molecules by conditioning on seed ligands or fragment sets to preserve key determinants like shape and pharmacophore patterns, without requiring the original fragment atoms to be present. We demonstrate their effectiveness on three drug-relevant tasks: natural product ligand hopping, bioisosteric fragment merging, and pharmacophore merging.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5243. Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis
๐ฌ 28
โญ 4.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Model KV Cache Compression Attention Pattern
Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns, such as retrieval heads, sink heads, and diagonal traces, but these observations remain fragmented and lack a unifying explanation. To bridge this gap, we provide a unifying framework to explain the existence of diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. Our work can both deepen the understanding of attention behavior and guide inference acceleration approaches. Specifically, this framework characterizes attention patterns as either predictable patterns, characterized by clear regularities, or unpredictable ones that appear random. Our analysis further reveals that the distinction between them can be explained by variations in query self-similarity across the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative predictable patterns in terms of the joint effect of queries, keys, and Rotary Positional Embeddings. To validate the framework, we apply it to KV cache compression and LLM pruning tasks. In these experiments, a simple metric inspired by our theory consistently improves performance over baseline methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5244. Flowing Through States: Neural ODE Regularization for Reinforcement Learning
๐ฌ 26
โญ 4.00
๐ reinforcement learning
๐ท๏ธ Neural ODE Reinforcement Learning MDP Regularization Actor-Critic A2C
Neural networks applied to sequential decision-making tasks typically rely on latent representations of environment states. While environment dynamics dictate how semantic states evolve, the corresponding latent transitions are usually left implicit, leaving room for misalignment between the two. We propose to model latent dynamics explicitly by drawing an analogy between Markov decision process (MDP) trajectories and ordinary differential equation (ODE) flows: in both cases, the current state fully determines its successors. Building on this view, we introduce a neural ODE-based regularization method that enforces latent embeddings to follow consistent ODE flows, thereby aligning representation learning with environment dynamics. Although broadly applicable to deep learning agents, we demonstrate its effectiveness in reinforcement learning by integrating it into Actor-Critic algorithms, where it yields major performance gains across various standard Atari benchmarks for A2C as well as gridworld environments for PPO.
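An illustrative sketch of such a regularizer, under my reading of the abstract: a small learned vector field defines an ODE over latents, and an auxiliary loss pushes the encoder's latent at step t+1 toward the (here Euler-integrated) flow of the latent at step t. The weighting term and field architecture are assumptions, not the paper's exact design.

```python
# Illustrative sketch: ODE-consistency regularizer on latent transitions.
import torch
import torch.nn as nn

class LatentODERegularizer(nn.Module):
    def __init__(self, latent_dim, hidden=128, euler_steps=4):
        super().__init__()
        # Learned vector field f_theta(h) defining dh/dt.
        self.field = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Tanh(), nn.Linear(hidden, latent_dim)
        )
        self.euler_steps = euler_steps

    def integrate(self, h, dt=1.0):
        step = dt / self.euler_steps
        for _ in range(self.euler_steps):
            h = h + step * self.field(h)           # explicit Euler steps
        return h

    def forward(self, h_t, h_next):
        """Consistency loss between the encoder latent at t+1 and the ODE
        flow of the latent at t."""
        return ((self.integrate(h_t) - h_next) ** 2).mean()

# total_loss = rl_loss + lambda_ode * regularizer(h_t, h_next)   # assumed weighting
```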
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5245. CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter
๐ฌ 10
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Retrieval-Augmented Generation Tree-RAG Cuckoo Filter Knowledge Retrieval
Although retrieval-augmented generation (RAG) significantly improves generation quality by retrieving from external knowledge bases and integrating the retrieved content, it faces computational efficiency bottlenecks, particularly in knowledge retrieval tasks involving hierarchical structures such as Tree-RAG. This paper proposes a Tree-RAG acceleration method based on an improved Cuckoo Filter, which optimizes entity localization during the retrieval process to achieve significant performance improvements. Tree-RAG effectively organizes entities through the introduction of a hierarchical tree structure, while the Cuckoo Filter serves as an efficient data structure that supports rapid membership queries and dynamic updates. Experimental results demonstrate that our method is much faster than baseline methods while maintaining high generative quality. For instance, our method is more than 800% faster than naive Tree-RAG on the DART dataset. Our work is available at https://github.com/TUPYP7180/CFT-RAG-2025.
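For readers unfamiliar with the data structure, a minimal textbook cuckoo filter (insert and membership query) is sketched below; the paper's improved variant is not reproduced here. Note that Python's built-in `hash` is salted per process, so fingerprints are only consistent within a single run.

```python
# Minimal cuckoo filter sketch: fast approximate membership checks,
# e.g. for entity names during tree retrieval.
import random

class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # num_buckets should be a power of two so the XOR partner index works.
        self.num_buckets, self.bucket_size, self.max_kicks = num_buckets, bucket_size, max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        return hash(("fp", item)) & 0xFF or 1          # 8-bit, never zero

    def _indices(self, item, fp):
        i1 = hash(item) % self.num_buckets
        i2 = (i1 ^ hash(("idx", fp))) % self.num_buckets
        return i1, i2

    def insert(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):                # evict and relocate
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ hash(("idx", fp))) % self.num_buckets
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                                   # filter considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

# cf = CuckooFilter(); cf.insert("Eiffel Tower"); cf.contains("Eiffel Tower")  -> True
```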
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5246. Frequency-Domain Better than Time-Domain for Causal Structure Recovery in Dynamical Systems on Networks
๐ฌ 17
โญ 4.00
๐ causal reasoning
๐ท๏ธ Causal Inference Wiener Filter Fast Fourier Transform Graphical Model
Learning causal effects from data is a fundamental and well-studied problem across science, especially when the cause-effect relationship is static in nature. However, causal effects are less explored when there are dynamical dependencies, i.e., when dependencies exist between entities across time. In general, it is not possible to reconstruct the causal graph from data alone. Conventional static causal structure recovery algorithms employ tests such as the Fisher-z test and the chi-square test to assess conditional independence (CI) in the data, which forms the basis for recovering Markov Equivalent Graphs (MEGs) wherein the causal structure can be recovered partially. For data that are dynamically related, multivariate least-squares estimation based on Wiener Filters (WFs), which rely on second-order statistics to estimate one data stream from the others, provides a means of recovering influence structures of the directed network underlying the data. WF-based projections can be determined in either the time domain or the frequency domain; the question this article sets out to answer is which is better. We obtain concentration bounds on the accuracy of WF estimation for both the time- and frequency-domain approaches. Exploiting the computational speed of the Fast Fourier Transform (FFT), we establish that the frequency domain provides distinct advantages. Moreover, frequency-domain projections involve complex numbers; we establish that the phase properties of the resulting estimates can be effectively leveraged for better recovery of the MEG in a large class of networks, whereas the time domain has no analogue of phase. Thus we report that the "Wiener-Phase" algorithm provides the best accuracy as well as computational advantages. We validate the theoretical analysis with numerical results, and performance comparisons with state-of-the-art algorithms are also provided. Further, the proposed algorithms are validated on a real field dataset known as the "river-runoff" dataset, collected from the online repository of CauseMe, and on measurement data from transistor-based circuits.
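A hedged sketch of the frequency-domain construction being compared: a Welch-style estimate of the Wiener filter as the cross-spectrum divided by the auto-spectrum, whose phase is exactly the quantity with no time-domain analogue. This is generic illustration code, not the authors' estimator.

```python
# Illustrative sketch: FFT-based Wiener filter estimate from stream x to y.
import numpy as np

def wiener_filter_fft(x, y, seg_len=256):
    """Estimate W(f) = S_yx(f) / S_xx(f) by averaging FFTs over segments."""
    n_seg = len(x) // seg_len
    S_xx = np.zeros(seg_len)
    S_yx = np.zeros(seg_len, dtype=complex)
    win = np.hanning(seg_len)
    for k in range(n_seg):
        xs = np.fft.fft(win * x[k * seg_len:(k + 1) * seg_len])
        ys = np.fft.fft(win * y[k * seg_len:(k + 1) * seg_len])
        S_xx += np.abs(xs) ** 2                    # auto-spectrum of x
        S_yx += ys * np.conj(xs)                   # cross-spectrum y vs x
    W = S_yx / np.maximum(S_xx, 1e-12)
    return W, np.angle(W)   # magnitude carries strength, phase carries timing/direction cues

# Toy usage: y is a delayed, noisy copy of x, so the phase of W is ~linear in f.
rng = np.random.default_rng(0)
x = rng.normal(size=8192)
y = np.roll(x, 3) + 0.1 * rng.normal(size=8192)
W, phase = wiener_filter_fft(x, y)
```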
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5247. Robustness in Text-Attributed Graph Learning: Insights, Trade-offs, and New Defenses
๐ฌ 7
โญ 4.00
๐ learning on graphs and other geometries & topologies
๐ท๏ธ Graph Robustness Graph Adversarial Attack Text Attributed Graph Large Language Model
While Graph Neural Networks (GNNs) and Large Language Models (LLMs) are powerful approaches for learning on Text-Attributed Graphs (TAGs), a comprehensive understanding of their robustness remains elusive.
Current evaluations are fragmented, failing to systematically investigate the distinct effects of textual and structural perturbations across diverse models and attack scenarios.
To address these limitations, we introduce a unified and comprehensive framework to evaluate robustness in TAG learning.
Our framework evaluates classical GNNs, robust GNNs (RGNNs), and GraphLLMs across ten datasets from four domains, under diverse text-based, structure-based, and hybrid perturbations in both poisoning and evasion scenarios.
Our extensive analysis reveals multiple findings, among which three are particularly noteworthy: 1) models have inherent robustness trade-offs between text and structure, 2) the performance of GNNs and RGNNs depends heavily on the text encoder and attack type, and 3) GraphLLMs are particularly vulnerable to training data corruption.
To overcome these trade-offs, we introduce SFT-auto, a novel framework that delivers superior and balanced robustness against both textual and structural attacks within a single model.
Our work establishes a foundation for future research on TAG security and offers practical solutions for robust TAG learning in adversarial environments.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5248. Reconstruct Anything Model: a lightweight foundation model for computational imaging
๐ฌ 17
โญ 4.00
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ computational imaging deep learning self-supervised learning foundation models
Most existing learning-based methods for solving imaging inverse problems can be roughly divided into two classes: iterative algorithms, such as plug-and-play and diffusion methods leveraging pretrained denoisers, and unrolled architectures that are trained end-to-end for specific imaging problems. Iterative methods in the first class are computationally costly and often yield suboptimal reconstruction performance, whereas unrolled architectures are generally problem-specific and require expensive training. In this work, we propose a novel non-iterative, lightweight architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling. Our model is trained to solve a wide range of inverse problems, such as deblurring, magnetic resonance imaging, computed tomography, inpainting, and super-resolution, and works on arbitrary image sizes and channels, such as grayscale, complex, and color data. The proposed model can be easily adapted to unseen inverse problems or datasets with a few fine-tuning steps (up to a few images) in a self-supervised way, without ground-truth references. Through a series of experiments, we demonstrate state-of-the-art performance from medical imaging to low-photon imaging and microscopy.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5249. Empowering Small VLMs to Think with Dynamic Memorization and Exploration
๐ฌ 14
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ GRPO SFT Small-scale VLM Computer Vision
Small-scale Vision-Language Models (SVLMs) are exceptionally well-suited for proprietary tasks. Equipping them with thinking capabilities is a critical step to enhance their performance and reliability in these specific domains. However, existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capacity of SVLMs. Consequently, directly applying these paradigms to SVLMs fails to instill the desired thinking abilities. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. Yet the core challenge lies in managing the inherent trade-off: excessive reliance on SFT can force the model to memorize pseudo thinking traces, while over-emphasizing RLVR can lead to unstable exploration (*i.e.,* advantage collapse). To address this, we propose *DyME*, a novel training paradigm that **Dy**namically selects between **M**emorization (via SFT) and **E**xploration (via RLVR) at each optimization step, ensuring that every update contributes to this trade-off. In doing so, *DyME* serves as a robust, standalone strategy that stabilizes SVLM learning. Complementing this paradigm, we further introduce a synergistic *Visual Supervision* mechanism (comprising a visual checker and refiner) designed to inject dynamically enhanced, image-grounded guidance during optimization. Extensive experiments across diverse domains demonstrate that *DyME* consistently achieves this balance, and thus delivers substantial performance improvements on specialized tasks. These results establish *DyME* as a practical and effective solution for empowering SVLMs with reliable thinking capabilities.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5250. Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions
๐ฌ 12
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ NLP: Generation NLP: Applications
Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying a lack of effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform on par with in-context-learning-based generations from GPT-4o. We have added code and data as Supplementary material, and will make them (and models) publicly available on acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5251. Beyond the Known: An Unknown-Aware Large Language Model for Open-Set Text Classification
๐ฌ 16
โญ 4.00
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ Large Language Models Open-set text classification
Open-set text classification (OSTC) requires models to correctly classify in-distribution (ID) samples while reliably rejecting out-of-distribution (OOD) inputs, an essential capability for real-world NLP systems. Most OSTC methods train on ID data under the closed assumption that all outputs belong to the known label space and then perform OOD detection with the biased representations, which inherently lack awareness of unknowns and thus yield overconfident predictions on OOD inputs. In this work, we present UnLLM, an Unknown-aware Large Language Model for OSTC. Instead of fixing classification to the entire known label space, we reformulate it into a subset-conditioned text generation task: the LLM is prompted with sampled subsets of known labels, and any instance outside the candidate set is explicitly assigned as "unknown". This reformulation transforms OOD detection from a post-hoc procedure into an intrinsic modeling capability. More importantly, our approach is the first to explicitly incorporate the unknown into classification, enabling systematic modeling of unknowns through a unified representation-logits-inference optimization, which progressively strengthens the model's capacity to capture open-set risk. Extensive experiments across six benchmarks show that UnLLM consistently outperforms state-of-the-art (SOTA) baselines. Code is available in an anonymous repository: https://anonymous.4open.science/r/UnLLM-03C2.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5252. Scaling Bayesian Experimental Design to High-Dimensions with Information-Guided Diffusion
๐ฌ 13
โญ 4.00
๐ probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
๐ท๏ธ Bayesian Experimental Design Active Data Acquisition Diffusion Models
We present DiffBED, a Bayesian experimental design (BED) approach that scales to problems with high-dimensional design spaces. Our key insight is that current BED approaches typically cannot be scaled to real high-dimensional design problems because of the need to specify a likelihood model that remains accurate throughout the design space. We show that without this, their design optimisation procedures exploit deficiencies in the likelihood and produce implausible designs. We overcome this issue by introducing a generative prior over feasible designs using a diffusion model. By guiding this diffusion model with principled information-theoretic experimental design objectives, we are then able to generate highly informative yet realistic designs at an unprecedented scale: while previous applications of BED have been restricted to design spaces with a handful of dimensions, we show that DiffBED can successfully scale to designing high-resolution images.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5253. Spectral-guided Physical Dynamics Distillation
๐ฌ 24
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ 3D Physical dynamics Knowledge distillation
The problem of physical dynamics, which involves predicting the 3D trajectories of particles, is a fundamental task with wide-ranging applications across science and engineering. However, accurately forecasting long-horizon trajectories from initial states remains challenging, due to complex particle interactions and entangled multiscale dynamics involving both low- and high-frequency components.
To address this, we propose a novel knowledge-distillation-based framework, SGDD (Spectral-Guided Dynamics Distillation), which integrates a spectral-guided enhancement to adaptively prioritize key frequency components within a unified spatio-temporal representation. Through knowledge distillation, SGDD leverages future trajectories as privileged information during training, guiding a teacher encoder to generate comprehensive dynamics representations while a student encoder approximates them using only the initial state. This enables the student to generate effective dynamics representations at inference, even without privileged information, thereby enabling accurate long-horizon trajectory prediction. Experimental results on molecule, protein, and human motion datasets demonstrate that our method achieves more accurate and stable long-term predictions than previous physical dynamics models, successfully capturing the complex spatio-temporal structures of real-world systems.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5254. EigenScore: OOD Detection using Posterior Covariance in Diffusion Models
๐ฌ 14
โญ 4.00
๐ generative models
๐ท๏ธ OOD detection diffusion models uncertainty estimation
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have recently emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection.
We propose *EigenScore*, a new OOD detection method that leverages the eigenvalue spectrum of the posterior covariance induced by a diffusion model.
We argue that posterior covariance provides a consistent signal of distribution shift, leading to larger trace and leading eigenvalues on OOD inputs, yielding a clear spectral signature. We further provide analysis explicitly linking posterior covariance to distribution mismatch, establishing it as a reliable signal for OOD detection.
To ensure tractability, we adopt a Jacobian-free subspace iteration method to estimate the leading eigenvalues using only forward evaluations of the denoiser.
Empirically, EigenScore achieves state-of-the-art performance, with up to 5% AUROC improvement over the best baseline. Notably, it remains robust in near-OOD settings such as CIFAR-10 vs CIFAR-100, where existing diffusion-based methods often fail.
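The matrix-free ingredient can be sketched as finite-difference power iteration on the denoiser Jacobian; how exactly that Jacobian maps to the posterior covariance follows the paper and is not reproduced here, and `denoiser(x, t)` is an assumed interface returning the denoised estimate at noise level `t`.

```python
# Illustrative sketch: leading eigenvalue via forward evaluations only.
import torch

def leading_eigenvalue(denoiser, x, t, iters=20, eps=1e-3):
    """Power iteration on v -> J v, where J is the Jacobian of the denoiser
    at x, approximated with finite differences (no autograd, no full Jacobian)."""
    v = torch.randn_like(x)
    v /= v.norm()
    f0 = denoiser(x, t)
    lam = 0.0
    for _ in range(iters):
        jv = (denoiser(x + eps * v, t) - f0) / eps          # finite-difference JVP
        lam = torch.dot(v.flatten(), jv.flatten()).item()   # Rayleigh quotient
        v = jv / (jv.norm() + 1e-12)
    return lam

# A larger leading eigenvalue (larger posterior variance along some direction)
# would then be read as evidence that the input is out-of-distribution.
```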
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5255. scDFM: Distributional Flow Matching Model for Robust Single-Cell Perturbation Prediction
๐ฌ 20
โญ 4.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Machine Learning Single Cell
A central goal in systems biology and drug discovery is to predict the transcriptional response of cells to perturbations. This task is challenging due to the noisy, sparse nature of single-cell measurements and the fact that perturbations often induce population-level shifts rather than changes in individual cells. Existing deep learning methods typically assume cell-level correspondences, limiting their ability to capture such global effects.
We present **scDFM**, a generative framework based on conditional flow matching that models the full distribution of perturbed cells conditioned on control states.
By incorporating an MMD objective, our method aligns perturbed and control populations beyond cell-level correspondences.
To further improve robustness to sparsity and noise, we propose the Perturbation-Aware Differential Transformer architecture (PAD-Transformer), a backbone that leverages gene interaction graphs and differential attention to capture context-specific expression changes.
**scDFM** outperforms prior methods across multiple genetic and drug perturbation benchmarks, excelling in both unseen and combinatorial settings. In the combinatorial setting, it reduces MSE by 19.6\% over the strongest baseline.
These results highlight the importance of distribution-level generative modeling for robust $\textit{in silico}$ perturbation prediction.
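The population-level alignment term the abstract refers to can be illustrated with the standard unbiased RBF-kernel MMD estimator between a generated and an observed cell population; the bandwidth and the synthetic data below are placeholders, and the paper's exact objective is not reproduced here.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=10.0):
    """Unbiased squared MMD between two cell populations X (n, g) and Y (m, g)
    under an RBF kernel; values near zero indicate the populations match.
    sigma is a rough bandwidth for this toy (a median heuristic is common)."""
    def sqdist(A, B):
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    kern = lambda A, B: np.exp(-sqdist(A, B) / (2.0 * sigma ** 2))
    Kxx, Kyy, Kxy = kern(X, X), kern(Y, Y), kern(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)      # drop i == j terms for the unbiased estimate
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
generated = rng.normal(0.0, 1.0, size=(200, 50))     # model's predicted perturbed cells
observed = rng.normal(0.5, 1.0, size=(200, 50))      # measured perturbed cells
print(rbf_mmd2(generated, observed), rbf_mmd2(observed[:100], observed[100:]))
```

Because the statistic compares whole point clouds, it never requires pairing an individual control cell with an individual perturbed cell.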
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5256. Improving 2D Diffusion Models for 3D Medical Imaging with Inter-Slice Consistent Stochasticity
💬 26
⭐ 4.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ diffusion models inverse problems CT reconstruction 3D medical imaging
3D medical imaging is in high demand and essential for clinical diagnosis and scientific research. Currently, diffusion models have become an effective tool for medical imaging reconstruction thanks to their ability to learn rich, high-quality data priors. However, learning the 3D data distribution with diffusion models in medical imaging is challenging, not only due to the difficulties in data collection but also because of the significant computational burden during model training. A common compromise is to train the diffusion model on 2D data priors and reconstruct stacked 2D slices to address 3D medical inverse problems. However, the intrinsic randomness of diffusion sampling causes severe inter-slice discontinuities in reconstructed 3D volumes. Existing methods often enforce continuity regularizations along the $z$-axis, which introduces sensitive hyper-parameters and may lead to over-smoothed results. In this work, we revisit the origin of stochasticity in diffusion sampling and introduce Inter-Slice Consistent Stochasticity (ISCS), a simple yet effective strategy that encourages inter-slice consistency during diffusion sampling. Our key idea is to control the consistency of stochastic noise components during diffusion sampling, thereby aligning their sampling trajectories without adding any new loss terms or optimization steps. Importantly, the proposed ISCS is plug-and-play and can be dropped into any 2D-trained diffusion-based 3D reconstruction pipeline without additional computational cost. Experiments on several medical imaging problems show that our method can effectively improve performance on 3D medical imaging problems based on 2D diffusion models. Our findings suggest that controlling inter-slice stochasticity is a principled and practically attractive route toward high-fidelity 3D medical imaging with 2D diffusion priors. The code is available at: [https://anonymous.4open.science/r/ICLR-ISCS-3281](https://anonymous.4open.science/r/ICLR-ISCS-3281).
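One simple way to realize "consistent stochasticity" across slices, in the spirit of the plug-and-play strategy described above, is to draw the stochastic noise of each reverse-diffusion step once and reuse it for every slice in the volume instead of sampling it independently per slice. The ancestral-sampling loop below is a generic DDPM-style toy, not the authors' implementation.

```python
import numpy as np

def sample_volume(eps_model, n_slices, shape, betas, share_noise=True, seed=0):
    """Toy DDPM ancestral sampling of a stack of 2D slices with a slice-wise 2D model.
    If share_noise is True, the per-step Gaussian noise is drawn once and broadcast
    to all slices, aligning their stochastic trajectories."""
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.normal(size=(n_slices, *shape))            # starting noise (could also be shared)
    for t in range(len(betas) - 1, 0, -1):
        eps_hat = eps_model(x, t)                       # 2D model applied to every slice
        mean = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps_hat) / np.sqrt(alphas[t])
        if share_noise:
            z = np.broadcast_to(rng.normal(size=(1, *shape)), x.shape)
        else:
            z = rng.normal(size=x.shape)
        x = mean + np.sqrt(betas[t]) * z
    return x

# placeholder noise predictor, included only to show the interface
vol = sample_volume(lambda x, t: np.zeros_like(x), n_slices=8, shape=(16, 16),
                    betas=np.linspace(1e-4, 0.02, 50))
print(vol.shape)
```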
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5257. Learning Molecular Chirality via Chiral Determinant Kernels
💬 18
⭐ 4.00
📍 applications to physical sciences (physics, chemistry, biology, etc.)
🏷️ chirality molecular representation learning axial chirality
Chirality is a fundamental molecular property that governs stereospecific behavior in chemistry and biology. Capturing chirality in machine learning models remains challenging due to the geometric complexity of stereochemical relationships and the limitations of traditional molecular representations that often lack explicit stereochemical encoding. Existing approaches to chiral molecular representation primarily focus on central chirality, relying on handcrafted stereochemical tags or limited 3D encodings, and thus fail to generalize to more complex forms, such as axial chirality. In this work, we introduce \textbf{ChiDeK} (\textbf{Chi}ral \textbf{De}terminant \textbf{K}ernels), a framework that systematically integrates stereogenic information into molecular representation learning. We propose the chiral determinant kernel to encode the SE(3)-invariant chirality matrix and employ cross-attention to integrate stereochemical information from local chiral centers into the global molecular representation. This design enables explicit modeling of chiral-related features within a unified architecture, capable of jointly encoding central and axial chirality. To support the evaluation of axial chirality, we construct a new benchmark for electronic circular dichroism (ECD) and optical rotation (OR) prediction. Across four tasks, including R/S configuration classification, enantiomer ranking, ECD spectrum prediction, and OR prediction, ChiDeK achieves substantial improvements over state-of-the-art baselines, most notably yielding over 7\% higher accuracy on axially chiral tasks on average.
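A signed determinant is the classical SE(3)-invariant way to encode the handedness of a stereocenter, which is presumably the kind of quantity the chirality matrix builds on; the sketch below computes it from toy neighbor coordinates and is not the paper's kernel.

```python
import numpy as np

def chirality_sign(center, neighbors):
    """Sign of det([n1-c, n2-c, n3-c]) for a stereocenter and three ranked
    neighbors: unchanged by rotations and translations, but it flips under
    mirror reflection, so it separates the two enantiomeric configurations."""
    v = np.asarray(neighbors[:3], dtype=float) - np.asarray(center, dtype=float)
    return float(np.sign(np.linalg.det(v)))

center = [0.0, 0.0, 0.0]
nbrs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mirrored = [[-x, y, z] for x, y, z in nbrs]          # reflect through the yz-plane
print(chirality_sign(center, nbrs), chirality_sign(center, mirrored))   # 1.0 -1.0
```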
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5258. INSTANT: Compressing Gradients and Activations for Resource-Efficient Training
💬 3
⭐ 4.00
📍 optimization
🏷️ Gradient Compression Activation Compression Resource-Constraint Training
Deep learning has advanced at an unprecedented pace. This progress has led to a significant increase in model complexity. However, despite extensive research on accelerating inference, training deep models directly within a resource-constrained budget remains a considerable challenge due to its high computational and memory requirements. In this paper, we introduce INSTANT (compressIng gradieNtS and acTivAtions for resource-efficieNt Training), a method designed to address both the computational and the memory bottlenecks of training. INSTANT reduces resource demands during backpropagation by projecting gradients and activations into a low-rank subspace and performing computation within that compressed representation. Experimental results demonstrate that INSTANT achieves a $15\times$ reduction in computational cost and $32\times$ reduction in activation memory with negligible impact on model performance. The code will be made publicly available upon the paper's acceptance.
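The core trick, computing the backward pass inside a compressed representation, can be illustrated for a single linear layer: if the input activations are projected onto a low-rank basis at forward time, the weight gradient can be reconstructed from the compressed activations alone. The SVD-based projection, shapes, and the synthetic low-rank activations below are illustrative; the paper's actual subspace construction and budget allocation are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out, k = 64, 512, 256, 32        # batch size, layer dims, compressed rank

X = rng.normal(size=(B, 16)) @ rng.normal(size=(16, d_in))   # activations (approximately low-rank)
dY = rng.normal(size=(B, d_out))                              # upstream gradient w.r.t. the layer output

grad_exact = X.T @ dY                       # exact weight gradient; storing X costs B*d_in floats

# compressed path: keep only C = X @ P (B*k floats) from the forward pass, where P spans
# the top-k right singular directions of X (one simple choice of subspace)
P = np.linalg.svd(X, full_matrices=False)[2][:k].T            # (d_in, k) orthonormal basis
C = X @ P                                                     # saved instead of X
grad_compressed = P @ (C.T @ dY)                              # approx. P P^T X^T dY

rel_err = np.linalg.norm(grad_exact - grad_compressed) / np.linalg.norm(grad_exact)
print(f"stored floats: {C.size} vs {X.size}, relative gradient error: {rel_err:.2e}")
```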
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5259. Journey to the Centre of Cluster: Harnessing Interior Nodes for A/B Testing under Network Interference
💬 18
⭐ 4.00
📍 causal reasoning
🏷️ causal inference A/B test network interference cluster-level randomization
A/B testing on platforms often faces challenges from network interference, where a unit's outcome depends not only on its own treatment but also on the treatments of its network neighbors. To address this, cluster-level randomization has become standard, enabling the use of network-aware estimators. These estimators typically trim the data to retain only a subset of informative units, achieving low bias under suitable conditions but often suffering from high variance.
In this paper, we first demonstrate that the interior nodes (units whose neighbors all lie within the same cluster) constitute the vast majority of the post-trimming subpopulation. In light of this, we propose directly averaging over the interior nodes to construct the mean-in-interior (MII) estimator, which circumvents the delicate reweighting required by existing network-aware estimators and substantially reduces variance in classical settings. However, we show that interior nodes are often not representative of the full population, particularly in terms of network-dependent covariates, leading to notable bias.
We then augment the MII estimator with a counterfactual predictor trained on the entire network, allowing us to adjust for covariate distribution shifts between the interior nodes and full population.
By rearranging the expression, we reveal that our augmented MII estimator embodies an analytical form of the point estimator within the prediction-powered inference framework~\citep{angelopoulos2023prediction,angelopoulos2023ppi++}. This insight motivates a semi-supervised lens, wherein interior nodes are treated as labeled data subject to selection bias. Extensive and challenging simulation studies demonstrate the outstanding performance of our augmented MII estimator across various settings.
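The construction can be made concrete on a small synthetic graph: interior nodes are those whose neighbors all share their cluster, the MII estimator averages outcomes over treated interior nodes, and the augmented version adds a prediction-powered correction using a counterfactual predictor evaluated on the full population. Everything below (graph, outcome model, predictor) is a toy placeholder, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_clusters = 2000, 50
cluster = rng.integers(0, n_clusters, size=n)
treated = rng.random(n_clusters) < 0.5                 # cluster-level randomization
z = treated[cluster]

# toy network: 4 neighbors per unit, drawn mostly (but not always) from the unit's own cluster
members = [np.flatnonzero(cluster == c) for c in range(n_clusters)]
nbrs = [np.array([rng.choice(members[cluster[i]]) if rng.random() < 0.85 else rng.integers(n)
                  for _ in range(4)]) for i in range(n)]

same = np.array([np.mean(cluster[nb] == cluster[i]) for i, nb in enumerate(nbrs)])
interior = same == 1.0                                 # every neighbor shares the unit's cluster
y = 1.0 + 2.0 * z + 1.5 * same + rng.normal(scale=0.5, size=n)  # outcome with a network-dependent covariate

lab = interior & z                                     # treated interior nodes ("labeled" subpopulation)
mii = y[lab].mean()                                    # mean-in-interior estimate of E[y(1)]

f_hat = 1.0 + 2.0 + 1.5 * same                         # counterfactual predictor of y(1), evaluated on everyone
augmented = f_hat.mean() + (y[lab] - f_hat[lab]).mean()  # prediction-powered (PPI-style) correction
print(f"truth {np.mean(1.0 + 2.0 + 1.5 * same):.3f}  MII {mii:.3f}  augmented {augmented:.3f}")
```

In this toy, interior nodes systematically have the covariate `same` equal to one, so the plain MII estimate is biased toward them while the augmented estimate corrects the shift, mirroring the argument sketched in the abstract.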
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5260. When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency
💬 3
⭐ 4.00
📍 learning on time series and dynamical systems
🏷️ Concept drift Stream learning Data sufficiency Time series
Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER, a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $\theta$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error as the locality parameter increases indicates that the data size is sufficiently informative for retraining.
We also provide a theoretical analysis of CALIPER and show that the algorithm has low per-update time and memory costs. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.
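As a rough illustration of the data-only test (the kernel, the proxy error, the gate threshold, and the tolerance below are guesses rather than the authors' specification): fit a weighted local regression of the next value on the current state over the post-drift window, sweep the locality parameter, require an effective-sample-size gate, and flag sufficiency when the proxy error trend is non-increasing.

```python
import numpy as np

def caliper_like_test(x, thetas, ess_frac=0.1, tol=0.05):
    """Data-only sufficiency check on a post-drift window x[0..T]: for each locality
    parameter theta, fit a weighted local linear regression of x[t+1] on x[t]
    (weights favour states near the latest one) and record a one-step proxy error;
    sufficiency is flagged if an effective-sample-size gate holds for every theta
    and the error trend is (approximately) monotonically non-increasing."""
    s, y = x[:-1], x[1:]
    errs = []
    for theta in thetas:
        w = np.exp(-((s - s[-1]) ** 2) / (2.0 * theta ** 2))      # locality around the latest state
        ess = w.sum() ** 2 / (w ** 2).sum()
        if ess < ess_frac * len(s):
            return False, None                                    # gate not satisfied yet
        A = np.stack([np.ones_like(s), s], axis=1)
        Wm = np.diag(w)
        beta = np.linalg.solve(A.T @ Wm @ A, A.T @ Wm @ y)        # weighted least squares fit
        errs.append(np.average((A @ beta - y) ** 2, weights=w))   # weighted one-step proxy error
    errs = np.array(errs)
    ok = bool(np.all(np.diff(errs) <= tol * np.abs(errs[:-1])))
    return ok, errs

# post-drift AR(1) stream as a toy dynamical system
rng = np.random.default_rng(0)
x = [0.0]
for _ in range(300):
    x.append(0.8 * x[-1] + rng.normal(scale=0.1))
ok, errs = caliper_like_test(np.array(x), thetas=np.array([0.1, 0.2, 0.4, 0.8]))
print("post-drift data sufficient:", ok)
```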
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5261. Neural+Symbolic Approaches for Interpretable Actor-Critic Reinforcement Learning
💬 25
⭐ 4.00
📍 interpretability and explainable AI
🏷️ Interpretable RL Symbolic
The integration of neural networks into actor-critic frameworks has been pivotal in advancing the field of reinforcement learning, enabling agents to perform complex tasks with greater efficiency and adaptability. However, neural network-based actor-critic models remain opaque ``black boxes,'' concealing their decision-making processes and hindering their use in critical applications where transparent and explainable reasoning is essential. This work introduces an innovative adaptation of the actor-critic framework that unites neural networks with rule ensembles to tackle key challenges in reinforcement learning. We harness the computational power, scalability, and adaptability of neural networks to model the critic, while integrating a rule ensemble system for the actor, ensuring transparency and interpretability for decision-making. Our study establishes a theoretical foundation for integrating rule ensembles into the Advantage Actor-Critic (A2C) framework. Experimental results from seven classic and complex environments demonstrate that our proposed method matches or exceeds the performance of representative RL models, including symbolic methods, while offering self-interpretability and transparency.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5262. ConvRec-R1: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
💬 27
⭐ 4.00
📍 foundation or frontier models, including LLMs
🏷️ Conversational recommender system Reinforcement learning with Verifiable Reward
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://anonymous.4open.science/r/ConvRec-R1-5615.
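The rank-level importance ratio can be written down directly: for each rank position in the recommendation list, take the geometric mean of the token probabilities that produced that item under the new and old policies, and form a PPO-style clipped surrogate at the rank level. Shapes, the clipping constant, and the advantage values below are illustrative, not taken from the paper.

```python
import numpy as np

def rank_grpo_surrogate(logp_new, logp_old, advantages, clip=0.2):
    """logp_new / logp_old: lists where entry r holds the per-token log-probabilities
    of the item generated at rank r (variable number of tokens per rank).
    advantages: (R,) rank-level advantages.  The importance ratio at each rank is the
    geometric mean of its token probabilities, and the surrogate is clipped PPO-style."""
    ratios = np.array([np.exp(np.mean(lp_n) - np.mean(lp_o))
                       for lp_n, lp_o in zip(logp_new, logp_old)])
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip, 1.0 + clip) * advantages
    return float(np.minimum(unclipped, clipped).mean())

# toy list of 3 ranked items with 4, 2, and 3 generated tokens respectively
rng = np.random.default_rng(0)
lp_old = [rng.normal(-2.0, 0.3, size=m) for m in (4, 2, 3)]
lp_new = [lp + rng.normal(0.0, 0.1, size=lp.shape) for lp in lp_old]
print(rank_grpo_surrogate(lp_new, lp_old, advantages=np.array([1.0, -0.5, 0.2])))
```

Averaging log-probabilities before exponentiating is what makes the ratio a geometric mean over tokens, keeping the unit of credit assignment at the rank rather than the token or the whole list.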
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5263. CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
💬 12
⭐ 4.00
📍 datasets and benchmarks
🏷️ Code Semantics Benchmark LLM
Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C, and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on state-of-the-art LLMs. Our results show a clear performance gap when the models handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limits models' code reasoning capabilities. Besides the dataset, benchmark, and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post-training. Our code and data are located at \url{https://codesense-bench.github.io/}.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5264. Scalable Random Wavelet Features: Efficient Non-Stationary Kernel Approximation with Convergence Guarantees
💬 9
⭐ 4.00
📍 probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
🏷️ Random features Non-stationary kernels Wavelet features Gaussian Process Kernel approximation
Modeling non-stationary processes, where statistical properties vary across the input domain, is a critical challenge in machine learning; yet most scalable methods rely on a simplifying assumption of stationarity. This forces a difficult trade-off: use expressive but computationally demanding models like Deep Gaussian Processes, or scalable but limited methods like Random Fourier Features (RFF). We close this gap by introducing Random Wavelet Features (RWF), a framework that constructs scalable, non-stationary kernel approximations by sampling from wavelet families. By harnessing the inherent localization and multi-resolution structure of wavelets, RWF generates an explicit feature map that captures complex, input-dependent patterns. Our framework provides a principled way to generalize RFF to the non-stationary setting and comes with a comprehensive theoretical analysis, including positive definiteness, unbiasedness, and uniform convergence guarantees. We demonstrate empirically on a range of challenging synthetic and real-world datasets that RWF outperforms stationary random features and offers a compelling accuracy-efficiency trade-off against more complex models, unlocking scalable and expressive kernel methods for a broad class of real-world non-stationary problems.
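One plausible way to instantiate such a feature map, shown only to convey the flavour of replacing Fourier bases with localized wavelets (the paper's exact sampling distribution and wavelet family are not reproduced here): draw random directions, scales, and translations, apply a mother wavelet such as the Mexican hat, and approximate the kernel by the inner product of the resulting features.

```python
import numpy as np

def random_wavelet_features(X, D=512, seed=0):
    """Map X (n, d) to D random-wavelet features: random directions w, scales a,
    and translations b applied to a Mexican-hat mother wavelet, so that
    k(x, y) is approximated by the inner product of the feature vectors."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = rng.normal(size=(d, D))                      # random projection directions
    a = rng.lognormal(mean=0.0, sigma=0.5, size=D)   # random scales -> multi-resolution structure
    b = rng.normal(size=D)                           # random translations -> input-dependent behaviour
    u = (X @ w - b) / a
    psi = (1.0 - u ** 2) * np.exp(-0.5 * u ** 2)     # Mexican-hat (Ricker) wavelet
    return psi / np.sqrt(D)

X = np.random.default_rng(1).normal(size=(5, 3))
Z = random_wavelet_features(X)
print(np.round(Z @ Z.T, 3))                          # approximate non-stationary kernel matrix
```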
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5265. Neural Hamilton--Jacobi Characteristic Flows for Optimal Transport
💬 13
⭐ 4.00
📍 optimization
🏷️ Optimal Transport Hamilton--Jacobi Equations Method of Characteristics Class-Conditional Optimal Transport
We present a novel framework for solving optimal transport (OT) problems based on the Hamilton--Jacobi (HJ) equation, whose viscosity solution uniquely characterizes the OT map. By leveraging the method of characteristics, we derive closed-form, bidirectional transport maps, thereby eliminating the need for numerical integration. The proposed method adopts a pure minimization framework: a single neural network is trained with a loss function derived from the method of characteristics of the HJ equation. This design guarantees convergence to the optimal map while eliminating adversarial training stages, thereby substantially reducing computational complexity. Furthermore, the framework naturally extends to a wide class of cost functions and supports class-conditional transport. Extensive experiments on diverse datasets demonstrate the accuracy, scalability, and efficiency of the proposed method, establishing it as a principled and versatile tool for OT applications with provable optimality.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5266. FSD-CAP: Fractional Subgraph Diffusion with Class-Aware Propagation for Graph Feature Imputation
💬 25
⭐ 4.00
📍 learning on graphs and other geometries & topologies
🏷️ incomplete graph learning graph feature imputation feature propagation
Imputing missing node features in graphs is challenging, particularly under high missing rates. Existing methods based on latent representations or global diffusion often fail to produce reliable estimates, and may propagate errors across the graph. We propose FSD-CAP, a two-stage framework designed to improve imputation quality under extreme sparsity. In the first stage, a graph-distance-guided subgraph expansion localizes the diffusion process. A fractional diffusion operator adjusts propagation sharpness based on local structure. In the second stage, imputed features are refined using class-aware propagation, which incorporates pseudo-labels and neighborhood entropy to promote consistency. We evaluated FSD-CAP on multiple datasets. With 99.5% of features missing across five benchmark datasets, FSD-CAP achieves average accuracies of 80.06% (structural) and 81.01% (uniform) in node classification, close to the 81.31% achieved by a standard GCN with full features. For link prediction under the same setting, it reaches AUC scores of 91.65% (structural) and 92.41% (uniform), compared to 95.06% for the fully observed case. Furthermore, FSD-CAP demonstrates superior performance on both large-scale and heterophily datasets when compared to other models. Code conducting all experiments can be found at https://anonymous.4open.science/r/FSD-CAP-50E8.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5267. LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters
💬 17
⭐ 4.00
📍 generative models
🏷️ Low-rank Adaption Fine-tuning Smooth manifolds Riemannian optimization Fixed matrix rank manifold LLM Diffusion Models
This work presents a novel, fully Riemannian framework for Low-Rank Adaptation (LoRA) that geometrically treats low-rank adapters by optimizing them directly on the fixed-rank manifold. This formulation eliminates the parametrization ambiguity present in standard Euclidean optimizers. Our framework integrates three key components to achieve this: (1) we derive **Riemannion**, a new Riemannian optimizer on the fixed-rank matrix manifold that generalizes the recently proposed Muon optimizer; (2) we develop a Riemannian gradient-informed LoRA initialization, and (3) we provide an efficient implementation without prominent overhead that uses automatic differentiation to compute arising geometric operations while adhering to best practices in numerical linear algebra. Comprehensive experimental results on both LLM and diffusion model architectures demonstrate that our approach yields consistent and noticeable improvements in convergence speed and final task performance over both standard LoRA and its state-of-the-art modifications.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5268. Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis
💬 19
⭐ 4.00
📍 other topics in machine learning (i.e., none of the above)
🏷️ data quality; probabilistic model; multi-perspective analysis;
A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from 2.87 to 3.59 despite 50\% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81\% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the \textbf{richness of conditioning information}, which constrains the learning problem, and the \textbf{absolute information content} of the training data, which allows the signal from correct information to dominate statistical noise.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5269. Adaptive Nonlinear Compression for Large Foundation Models
💬 16
⭐ 4.00
📍 optimization
🏷️ Model Compression Low-Rank Factorization Large Language Models Vision Model
Despite achieving superior performance, large foundation models (LFMs) have substantial memory requirements, leading to a growing demand for model compression methods.
While low-rank approximation presents a promising hardware-friendly solution, existing linear methods suffer significant information losses due to rank truncation. Nonlinear kernels can enhance expressiveness by operating in higher-dimensional spaces, yet most kernels introduce prohibitive overhead and struggle to support adaptive rank allocation across heterogeneous matrices.
In this paper, we propose a compression method called Nonlinear Low-Rank Approximation with Adaptive Budget Allocation (NLA).
Instead of relying on linear products, we employ piecewise-linear kernels with a forward-pass optimization operator to approximate weight matrices, enhancing the recovery of high-rank weight matrices from low-rank matrices.
Moreover, considering the heterogeneous representation abilities and dynamic sensitivities of different weight matrices, we adaptively allocate the compression ratio of each weight matrix during the re-training process by cubic sparsity scheduling.
Through evaluations on large language models and vision models across various datasets, NLA demonstrates superior performance while achieving a higher compression ratio compared to existing methods.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5270. Sparling: End-to-End Spatial Concept Learning via Extremely Sparse Activations
💬 14
⭐ 4.00
📍 interpretability and explainable AI
🏷️ machine learning sparsity interpretability optimization identifiability
Real-world processes often contain intermediate state that can be modeled as an extremely sparse activation tensor. In this work, we analyze the identifiability of such sparse and local latent intermediate variables, which we call motifs. We prove our Motif Identifiability Theorem, stating that under certain assumptions it is possible to precisely identify these motifs exclusively by reducing end-to-end error. Additionally, we provide the Sparling algorithm, which uses a new kind of informational bottleneck that enforces levels of activation sparsity unachievable using other techniques. We find that extreme sparsity is necessary to achieve good intermediate state modeling empirically. On our synthetic DigitCircle domain as well as the LaTeX-OCR and Audio-MNIST-Sequence domains, we are able to precisely localize the intermediate states up to feature permutation with $>90\%$ accuracy, even though we only train end-to-end.
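The "extremely sparse activation" bottleneck can be pictured as a hard top-k gate on an intermediate tensor, so that only a handful of positions carry signal end-to-end; the generic top-k gate below is a stand-in for the paper's bottleneck, not its exact mechanism.

```python
import numpy as np

def topk_sparse_bottleneck(act, k):
    """Keep only the k largest-magnitude entries of each sample's activation map
    and zero the rest, producing an extremely sparse intermediate state."""
    flat = act.reshape(act.shape[0], -1)
    thresh = -np.sort(-np.abs(flat), axis=1)[:, k - 1:k]   # per-sample k-th largest magnitude
    mask = np.abs(flat) >= thresh
    return (flat * mask).reshape(act.shape), mask.mean()

rng = np.random.default_rng(0)
act = rng.normal(size=(2, 8, 8, 4))                # (batch, H, W, channels)
sparse_act, density = topk_sparse_bottleneck(act, k=5)
print(f"fraction of nonzero activations: {density:.4f}")    # roughly 5 / 256 per sample
```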
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5271. Zeros can be Informative: Masked Binary U-Net for Image Segmentation on Tensor Cores
💬 8
⭐ 4.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ U-Net segmentation binary neural network GPU Tensor Core
Real-time image segmentation is a key enabler for AR/VR, robotics, drones, and autonomous systems, where tight accuracy, latency, and energy budgets must be met on resource-constrained edge devices. While U-Net offers a favorable balance of accuracy and efficiency compared to large transformer-based models, achieving real-time performance on high-resolution input remains challenging due to compute, memory, and power limits. Extreme quantization, particularly binary networks, is appealing for its hardware-friendly operations. However, two obstacles limit practicality: (1) severe accuracy degradation, and (2) a lack of end-to-end implementations that deliver efficiency on general-purpose GPUs.
We make two empirical observations that guide our design. (1) An explicit zero state is essential: applying zero masking to binary U-Net weights during training yields noticeable sparsity. (2) Quantization sensitivity is uniform across layers. Motivated by these findings, we introduce Masked Binary U-Net (MBU-Net), obtained through a cost-aware masking strategy that prioritizes masking where it yields the highest accuracy-per-cost, reconciling accuracy with near-binary efficiency.
To realize these gains in practice, we develop a GPU execution framework that maps MBU-Net to Tensor Cores via a subtractive bit-encoding scheme, efficiently implementing masked binary weights with binary activations. This design leverages native binary Tensor Core BMMA instructions, enabling high throughput and energy savings on widely available GPUs. Across 3 segmentation benchmarks, MBU-Net attains near full-precision accuracy (3\% average drop) while delivering 2.04$\times$ speedup and 3.54$\times$ energy reduction over a 16-bit floating-point U-Net. The code will be released to the public alongside this publication.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5272. IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
💬 31
⭐ 4.00
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ Long-Horizon Agents Deep Research
Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents with average +14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5\% to 42.5\%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5273. LoC-Decomp: LLM Autoformalization via Logical Concept Decomposition and Iterative Feedback Correction
💬 21
⭐ 4.00
📍 neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
🏷️ Autoformalization Automated theorem proving Large language model
Autoformalization, the process of converting natural language mathematical statements into machine-verifiable formal code, plays a critical role in ensuring the reliability of mathematical reasoning generated by large language models (LLMs). Recent studies show that LLMs exhibit strong potential in automating this process, producing formal code for systems such as Lean 4, Coq, and Isabelle. Despite prominent advances, existing LLM-based autoformalization methods remain limited: they lack the ability to provide reliable semantic consistency checks to ensure that the formal code accurately preserves the meaning of the original statement. Furthermore, such methods are unable to support iterative improvement through corrective feedback. To address these limitations, we propose LoC-Decomp, a novel framework that integrates an automatic semantic consistency checker and the Lean 4 compiler to iteratively refine LLM-generated formalizations, ensuring both semantic consistency and syntactic correctness. Our approach introduces three key innovations: __(1)__ A structured and CoT-like formalization template that decomposes complex formalization tasks into modular, foundational components, and systematically assembles them, like building blocks, into a complete formal expression. __(2)__ A semantic self-checking mechanism based on a divide-conquer-merge strategy to detect subtle inconsistencies between the formalization and the original statement. __(3)__ An iterative feedback-driven refinement loop that leverages both semantic and syntactic error signals to guide the LLM in progressively improving the formal output. By integrating these innovations, LoC-Decomp significantly enhances the accuracy of LLM-driven formalization, reduces reliance on human intervention, and moves closer to truly reliable automated reasoning. Extensive experiments on high-school-level and undergraduate-level datasets demonstrate that our approach achieves a significantly higher formalization success rate compared to baseline methods and state-of-the-art (SOTA) models. On the PutnamBench dataset, for instance, our method attains a success rate of 93.09\%, representing an improvement of 18 percentage points over the previous SOTA SFT-based model.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5274. One step further with Monte-Carlo sampler to guide diffusion better
💬 20
⭐ 4.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Conditional Generation; Diffusion Model; Training-free Guidance
Stochastic differential equation (SDE)-based generative models have achieved substantial progress in conditional generation via training-free differentiable loss-guided approaches. However, existing methodologies utilizing posterior sampling typically confront a substantial estimation error, which results in inaccurate guidance gradients and leads to inconsistent generation results. To mitigate this issue, we propose that performing an additional backward denoising step and Monte-Carlo sampling (ABMS) can achieve better guided diffusion; ABMS is a plug-and-play adjustment strategy. To verify the effectiveness of our method, we provide theoretical analysis and propose the adoption of a dual-evaluation framework, which further serves to highlight the critical problem of cross-condition interference prevalent in existing approaches. We conduct experiments across various task settings and data types, mainly including conditional online handwritten trajectory generation, image inverse problems (inpainting, super-resolution, and Gaussian deblurring), and molecular inverse design. Experimental results demonstrate that our approach consistently improves the quality of generated samples across all the different scenarios.
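Read literally, the adjustment amounts to spending one extra reverse-diffusion step before estimating the clean sample and then averaging the guidance gradient over several Monte-Carlo draws rather than a single point estimate. The sketch below shows only that averaging step, for a quadratic guidance loss whose gradient is analytic; the denoiser, loss, and noise scale are toy placeholders, not the paper's method.

```python
import numpy as np

def mc_guidance_gradient(x_t, denoise_step, sigma, loss_grad, n_mc=8, seed=0):
    """Average the guidance gradient over n_mc Monte-Carlo draws: each draw takes
    one extra stochastic denoising step from x_t before the loss gradient is
    evaluated, instead of relying on a single point estimate."""
    rng = np.random.default_rng(seed)
    grads = []
    for _ in range(n_mc):
        x_prev = denoise_step(x_t) + sigma * rng.normal(size=x_t.shape)   # additional backward step
        grads.append(loss_grad(x_prev))
    return np.mean(grads, axis=0)

# toy setup: quadratic guidance loss ||A x - y||^2 with analytic gradient 2 A^T (A x - y)
rng = np.random.default_rng(1)
A, y = rng.normal(size=(4, 16)), rng.normal(size=4)
guide = mc_guidance_gradient(
    x_t=rng.normal(size=16),
    denoise_step=lambda x: 0.9 * x,               # placeholder one-step denoiser update
    sigma=0.1,
    loss_grad=lambda x: 2.0 * A.T @ (A @ x - y),
)
print(guide.shape)
```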
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5275. Low rank adaptation of chemical foundation models generate effective odorant representations
💬 27
⭐ 4.00
📍 foundation or frontier models, including LLMs
🏷️ chemical foundation models protein foundation models low rank adaptation olfaction multi-modal model computational neuroscience
Featurizing odorants to enable robust prediction of their properties is difficult due to the complex activation patterns that odorants evoke in the olfactory system. Structurally similar odorants can elicit distinct activation patterns in both the sensory periphery (i.e., at the receptor level) and downstream brain circuits (i.e., at a perceptual level). Despite efforts to design or discover features for odorants to better predict how they activate the olfactory system, we lack a universally accepted way to featurize odorants. In this work, we demonstrate that feature-based approaches that rely on pre-trained foundation models $\textit{do not}$ significantly outperform classical hand-designed features, but that targeted foundation model fine-tuning can increase model performance beyond these limits. To show this, we introduce a new model that creates olfaction-specific representations: $\textbf{L}$oRA-based $\textbf{O}$dorant-$\textbf{R}$eceptor $\textbf{A}$ffinity prediction with $\textbf{CROSS}$-attention ($\textbf{LORAX}$). We compare existing chemical foundation model representations to hand-designed physicochemical descriptors using feature-based methods and identify large information overlap between these representations, highlighting the necessity of fine-tuning to generate novel and superior odorant representations. We show that LORAX produces a feature space more closely aligned with olfactory neural representation, enabling it to outperform existing models on predictive tasks.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5276. Optimal Robust Subsidy Policies for Irrational Agent in Principal-Agent MDPs
💬 12
⭐ 4.00
📍 reinforcement learning
🏷️ Principal-Agent Problem Markov Decision Process Reinforcement Learning
We investigate a principal-agent problem modeled within a Markov Decision Process, where the principal and the agent have their own rewards. The principal can provide subsidies to influence the agent's action choices, and the agent's resulting action policy determines the rewards accrued to the principal. Our focus is on designing a robust subsidy scheme that maximizes the principal's cumulative expected return, even when the agent displays bounded rationality and may deviate from the optimal action policy after receiving subsidies.
As a baseline, we first analyze the case of a perfectly rational agent and show that the principal's optimal subsidy coincides with the policy that maximizes social welfare, the sum of the utilities of both the principal and the agent. We then introduce a bounded-rationality model: the globally $\epsilon$-incentive-compatible agent, who accepts any policy whose expected cumulative utility lies within $\epsilon$ of the personal optimum. In this setting, we prove that the optimal robust subsidy scheme problem simplifies to a one-dimensional concave optimization. This reduction not only yields a clean analytical solution but also highlights a key structural insight: optimal subsidies are concentrated along the social-welfare-maximizing trajectories. We further characterize the loss in social welfare, i.e., the degradation under the robust subsidy scheme compared to the maximum achievable, and provide an upper bound on this loss. Finally, we investigate a finer-grained, state-wise $\epsilon$-incentive-compatible model. In this setting, we show that under two natural definitions of state-wise incentive-compatibility, the problem becomes intractable: one definition results in a non-Markovian agent action policy, while the other renders the search for an optimal subsidy scheme NP-hard.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5277. ROGA: Scaling Generalist Agents for Office Productivity Tasks via Tool Generation
💬 36
⭐ 4.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Generalist agent Office productivity Tool generation
Automatic tool generation (ATG) has emerged as a key approach to enable the automatic adaptation across diverse tasks within a single generalist agent.
Despite their potential, we argue that current ATG agents, often built on reactive paradigms, fail to effectively adapt to realistic environments requiring long-term reasoning and stateful interaction, particularly in office ecosystems. We empirically show that current ATG agents underperform by up to 27.43\%.
This performance degradation stems from three fundamental limitations of prevailing agent paradigms: (1) a failure to build a coherent world model from long, partially observable contexts; (2) a memory-less execution model where stateless actions fail to track state evolution during iterative tasks; and (3) a static capability generation model focusing on one-shot tool generation for immediate needs, thereby forcing redundant regeneration for similar steps.
To address these fundamental limitations, we propose ROGA, which instantiates a new agent paradigm for long-horizon, stateful environments. ROGA moves beyond simple reactive loops by introducing three foundational algorithmic innovations: (1) \textbf{Active World Modeling}, an iterative process where the agent actively probes the environment to construct its own world model; (2) a \textbf{Persistent Symbolic Memory} that explicitly tracks the state evolution for temporal reasoning; and (3) a \textbf{Dynamic Capability Evolution} model for long-term adaptation and meta-learning on the agent's own capabilities.
Comprehensive experiments on widely used benchmarks show that ROGA consistently outperforms existing ATG agents by up to 13.64\%.
These results underscore ROGA's potential to advance the ATG paradigm, delivering a practical pathway toward building sustainable generalist agents in realistic environments.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5278. Subspace Kernel Learning on Tensor Sequences
💬 3
⭐ 4.00
📍 learning on time series and dynamical systems
🏷️ Kernel tensor subspace learning action recognition
Learning from structured multi-modal data, represented as higher-order tensors, requires capturing complex interactions across modes while remaining computationally efficient. We introduce Uncertainty-driven Kernel Tensor Learning (UKTL), a novel kernel framework for $M$-mode tensors that compares mode-wise subspaces derived from tensor unfoldings, enabling expressive and robust similarity measures. To handle large-scale tensor data, we propose a scalable Nyström kernel linearization with dynamically learned pivot tensors obtained via soft $k$-means clustering. A key innovation of UKTL is its uncertainty-aware subspace weighting, which adaptively down-weights unreliable mode components based on estimated confidence, improving robustness and interpretability in comparisons between input and pivot tensors. Our framework is fully end-to-end trainable and naturally incorporates multi-mode interactions through structured kernel compositions. Extensive evaluations on skeleton-based action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton) show that UKTL achieves state-of-the-art performance, superior generalization, and meaningful mode-wise insights. This work establishes a principled, scalable, and interpretable kernel learning paradigm for structured multi-modal sequences.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5279. Deep Hierarchical Learning with Nested Subspace Networks
💬 17
⭐ 4.00
📍 other topics in machine learning (i.e., none of the above)
🏷️ dynamic neural networks efficient inference adaptive computing deep learning low-rank adaptation
Large neural networks are typically trained for a fixed computational budget, creating a rigid trade-off between performance and efficiency that is ill-suited for deployment in resource-constrained or dynamic environments. Existing approaches to this problem present a difficult choice: training a discrete collection of specialist models is computationally prohibitive, while dynamic methods like slimmable networks often lack the flexibility to be applied to large, pre-trained foundation models. In this work, we propose *Nested Subspace Networks (NSNs)*, a novel architectural paradigm that enables a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets at inference time. The core of our approach is to re-parameterize linear layers to satisfy a nested subspace property, such that the function computed at a given rank is a strict subspace of the function at any higher rank. We show that this entire hierarchy of models can be optimized jointly via an uncertainty-aware objective that learns to balance the contributions of different ranks based on their intrinsic difficulty. We demonstrate empirically that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier. For example, a single NSN-adapted model can achieve a 50\% reduction in inference FLOPs with only a 5 percentage point loss in accuracy. Our findings establish NSNs as a powerful framework for creating the next generation of adaptive foundation models.
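The nested-subspace property for a linear layer can be stated compactly: factor the weight as W = U V and evaluate only the first r components, so that every lower-rank model is literally a prefix of every higher-rank one. The sketch below shows that re-parameterization and the inference-time rank switch; shapes, initialization, and the toy data are illustrative, not the paper's implementation.

```python
import numpy as np

class NestedLinear:
    """Linear layer parameterized as W = U @ V.  Evaluating at rank r uses only
    U[:, :r] and V[:r, :], so the rank-r function lies in a subspace of every
    higher-rank one (the nested-subspace property)."""
    def __init__(self, d_in, d_out, max_rank, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(size=(d_in, max_rank)) / np.sqrt(d_in)
        self.V = rng.normal(size=(max_rank, d_out)) / np.sqrt(max_rank)

    def __call__(self, x, rank):
        return x @ self.U[:, :rank] @ self.V[:rank, :]

layer = NestedLinear(d_in=64, d_out=32, max_rank=16)
x = np.random.default_rng(1).normal(size=(4, 64))
for r in (2, 8, 16):                                 # dial compute up or down at inference time
    print(r, float(np.linalg.norm(layer(x, r))))
```

Because the rank-r output reuses exactly the same leading columns and rows for every r, a single set of parameters serves the entire compute-performance frontier.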
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5280. Strongly Convex Sets in Riemannian Manifolds
💬 18
⭐ 4.00
📍 optimization
🏷️ Strong convexity Riemannian Manifold Frank-Wolfe optimization nonconvex
Strong convexity plays a key role in designing and analyzing convex optimization algorithms and is well-understood in Hilbert spaces. However, the notion of strongly convex sets beyond Hilbert spaces remains unclear. In this paper, we propose various definitions of strong convexity for uniquely geodesic sets in a Riemannian manifold, examine their relationships, introduce tools to identify geodesically strongly convex sets, and analyze the convergence of optimization algorithms over these sets. In particular, we show that the Riemannian Frank-Wolfe algorithm converges linearly when the Riemannian scaling inequalities hold.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5281. RankFlow: Property-aware Transport for Protein Optimization
💬 7
⭐ 4.00
📍 applications to physical sciences (physics, chemistry, biology, etc.)
🏷️ protein language models fitness prediction
A key step in protein optimization is to accurately model the fitness landscape, which maps sequence and structure to functional assay readouts. Previous methods typically predict fitness landscape by directly using likelihoods or embeddings derived from pretrained protein language models (PLMs), which are property-agnostic. In addition, many predictors assume individual mutations have independent effects, thus failing to capture rich interactions among multiple mutations. In this work, we introduce RankFlow, a conditional flow framework that refines PLM representations to be a property-aligned distribution via a tailored energy function. RankFlow captures multi-mutation interactions through learnable embeddings on mutation sets. To align optimization with evaluation protocols, we propose the Rank-Consistent Conditional Flow Loss, a differentiable ranking objective that enforces the correct order of mutants rather than absolute values, which improves out-of-distribution generalization. Finally, we introduce a Property-guided Steering Gate (PSG) that concentrates learning on positions carrying signal for the target property while suppressing unrelated evolutionary biases. Across ProteinGym, PEER, and FLIP benchmarks, RankFlow attains state-of-the-art ranking accuracy and stronger generalization to higher-order mutants.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5282. DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively
💬 29
⭐ 4.00
📍 applications to computer vision, audio, language, and other modalities
🏷️ Automated Scientific Discovery Large Language Models (LLMs) AI Scientist
While previous AI Scientist systems can generate novel findings, they often lack the focus to produce scientifically valuable contributions that address pressing human-defined challenges. We introduce DeepScientist, a system designed to overcome this by conducting goal-oriented, fully autonomous scientific discovery over month-long timelines. It formalizes discovery as a Bayesian Optimization problem, using a cumulative Findings Memory to intelligently balance the exploitation of promising avenues with the exploration of novel hypotheses. Consuming over 20,000 GPU hours, the system generated about 5,000 unique ideas and experimentally validated approximately 1100, ultimately surpassing human-designed 2025 state-of-the-art (SOTA) methods on three frontier AI tasks by 183.7\%, 1.9\%, and 7.9\%. Crucially, this was achieved by autonomously redesigning core methodologies, not merely recombining existing techniques. In a striking demonstration, the system achieved progress on AI text detection in just two weeks that is comparable to three years of cumulative human research. This work provides the first large-scale evidence of an AI achieving discoveries that progressively surpass human SOTA on scientific tasks, producing valuable findings that genuinely push the frontier forward. To facilitate further research into this process, we will open-source all experimental logs and system code.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5283. TusoAI: Agentic Optimization for Scientific Methods
💬 37
⭐ 3.60
📍 applications to physical sciences (physics, chemistry, biology, etc.)
🏷️ AI for Science Agentic AI Code Optimization AutoML
Scientific discovery is often slowed by the manual development of computational tools needed to analyze complex experimental data. Building such tools is costly and time-consuming because scientists must iteratively review literature, test modeling and scientific assumptions against empirical data, and implement these insights into efficient software. Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code, offering new opportunities to accelerate computational method development. Existing LLM-based systems either focus on performing scientific analyses using existing computational methods or on developing computational methods or models for general machine learning without effectively integrating the often unstructured knowledge specific to scientific domains. Here, we introduce TusoAI, an agentic AI system that takes a scientific task description with an evaluation function and autonomously develops and optimizes computational methods for the application. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis, improving performance over a pool of candidate solutions. We conducted comprehensive benchmark evaluations demonstrating that TusoAI outperforms state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks. Applying TusoAI to two key open problems in genetics improved existing computational methods and uncovered new biology missed by previous methods.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5284. AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations
💬 45
⭐ 3.60
📍 applications to computer vision, audio, language, and other modalities
🏷️ Automated Scientific Illustration Agentic AI Text-to-Image
High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, an agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5285. Extending the Context of Pretrained LLMs by Dropping Their Positional Embedding
💬 27
⭐ 3.50
📍 foundation or frontier models, including LLMs
🏷️ LMs Long Context Positional Embeddings Architecture
So far, expensive finetuning beyond the pretraining sequence length has been a prerequisite to effectively extend the context of language models (LM). In this work, we break this key bottleneck by ***Dro**pping the **P**ositional **E**mbeddings of LMs after training (DroPE)*. Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely *removed after pretraining* following a short recalibration phase. Empirically, DroPE yields seamless *zero-shot* context extension *without any long-context finetuning*, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary position embedding scaling methods.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5286. Evaluating SAE interpretability without generating explanations
💬 13
⭐ 3.50
📍 interpretability and explainable AI
🏷️ interpretability explanation sae transcoder
Sparse autoencoders (SAEs) and transcoders have become important tools for machine learning interpretability. However, measuring the quality of the features they uncover remains challenging, and there is no consensus in the community about which benchmarks to use. Most evaluation procedures start by producing a single-sentence explanation for each feature in the sparse coder. These explanations are then evaluated based on how well they enable an LLM to predict the activation of a feature in new contexts. This method makes it difficult to disentangle the explanation generation and evaluation process from the actual interpretability of the features in the sparse coder. In this work, we adapt existing methods to assess the interpretability of sparse coders, with the advantage that they do not require generating natural language explanations as an intermediate step. This enables a more direct and potentially standardized assessment of interpretability. Furthermore, we compare the scores produced by our interpretability metrics with human evaluations across similar tasks and varying setups, offering suggestions for the community on improving the evaluation of these techniques.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5287. Dual Perspectives on Non-Contrastive Self-Supervised Learning
💬 16
⭐ 3.50
📍 unsupervised, self-supervised, semi-supervised, and supervised representation learning
🏷️ Deep learning representation learning self-supervised learning
The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse. Following [Tian et al. 2021], but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
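For reference, the procedures being analyzed can be written out explicitly in the linear case: an online encoder W, a predictor P, and an EMA target W', trained on the stop-gradient loss ||P W x - sg(W' x)||^2. The sketch below is the standard BYOL/SimSiam-style recursion in this setting (single view, arbitrary step sizes), included only to make the analyzed dynamics concrete; it is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lr, tau = 8, 4, 0.05, 0.99
W = 0.1 * rng.normal(size=(k, d))                 # online encoder
P = np.eye(k) + 0.01 * rng.normal(size=(k, k))    # predictor
W_t = W.copy()                                    # EMA target encoder

for _ in range(2000):
    x = rng.normal(size=d)                        # single view; the full method uses augmentations
    z, z_t = W @ x, W_t @ x
    err = P @ z - z_t                             # residual of ||P W x - sg(W' x)||^2
    grad_P = np.outer(err, z)                     # gradient w.r.t. P (up to a factor of 2)
    grad_W = P.T @ np.outer(err, x)               # gradient w.r.t. W; nothing flows through z_t (stop gradient)
    P -= lr * grad_P
    W -= lr * grad_W
    W_t = tau * W_t + (1.0 - tau) * W             # exponential moving average target

print("||W|| =", np.linalg.norm(W), "(a value near zero would indicate collapse)")
```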
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5288. BoreaRL: A Multi-Objective Reinforcement Learning Environment for Climate-Adaptive Boreal Forest Management
💬 20
⭐ 3.50
📍 applications to physical sciences (physics, chemistry, biology, etc.)
🏷️ Multi-Objective Reinforcement Learning (MORL) RL environments Climate-adaptive forest management Boreal forests Permafrost thaw Carbon sequestration Process-based simulator Preference-conditioned policies
Boreal forests store 30-40\% of terrestrial carbon, much in climate-vulnerable permafrost soils, making their management critical for climate mitigation. However, optimizing forest management for both carbon sequestration and permafrost preservation presents complex trade-offs that current tools cannot adequately address. We introduce BoreaRL, the first multi-objective reinforcement learning environment for climate-adaptive boreal forest management, featuring a physically-grounded simulator of coupled energy, carbon, and water fluxes. BoreaRL supports two training paradigms: site-specific mode for controlled studies and generalist mode for learning robust policies under environmental stochasticity. Through evaluation of multi-objective RL algorithms, we reveal a fundamental asymmetry in learning difficulty: carbon objectives are significantly easier to optimize than thaw (permafrost preservation) objectives, with thaw-focused policies showing minimal learning progress across both paradigms. In generalist settings, standard gradient-descent based preference-conditioned approaches fail, while a naive site selection approach achieves superior performance by strategically selecting training episodes. Analysis of learned strategies reveals distinct management philosophies, where carbon-focused policies favor aggressive high-density coniferous stands, while effective multi-objective policies balance species composition and density to protect permafrost while maintaining carbon gains. Our results demonstrate that robust climate-adaptive forest management remains challenging for current MORL methods, establishing BoreaRL as a valuable benchmark for developing more effective approaches. We open-source BoreaRL to accelerate research in multi-objective RL for climate applications.
📎 OpenReview
📄 PDF
🤖 LLM-Analysis
5289. Scaling Behavior of Discrete Diffusion Language Models
💬 13
⭐ 3.50
📍 foundation or frontier models, including LLMs
🏷️ diffusion discrete diffusion diffusion language models scaling scaling laws optimal batch size critical batch size
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor.
Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs).
However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs.
We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate.
Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs.
While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-constrained training environments.
We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
In the process of deriving the scaling laws, we reformulate the discrete diffusion ELBO in terms of signal-to-noise ratio, closing the gap to continuous diffusion theory and simplifying both theory and implementation.
Training code and models will be open-sourced upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5290. CodeGenGuard: A Robust Watermark for Code Generation Models
๐ฌ 28
โญ 3.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Watermarking Code Generation Copyright Protection
Code language models (LMs) represent valuable intellectual property (IP) as their training involves immense investments, including large-scale code corpora, proprietary annotations, extensive computational resources, and specialized designs. Hence, the threat of model IP infringements such as unauthorized redistribution or model theft has become increasingly concerning. While neural network watermarking has been widely studied as a measure to support model ownership verification, watermarking code LMs is particularly challenging due to the seemingly conflicting requirements of code generation: adhering to strict syntactic rules and semantic consistency while allowing flexible changes to embed watermarks, keeping high fidelity of the generated content while being robust to extraction attacks, etc. To resolve these issues, we propose CodeGenGuard, a watermarking framework for code LMs. CodeGenGuard leverages semantic-preserving transformations (SPTs) to encode the watermark and incorporates a dead-code-based data augmentation pipeline to diversify SPT patterns. To improve robustness, we incorporate an efficient dual-LoRA shadow training scheme and an optimizable trigger prompt that learns to extract the watermark from both the watermarked and the shadow models. As most SPTs take place in specific contexts, we implant auxiliary prompts during verification to encourage the generation of the context, further enhancing the detection rate. Evaluation results on representative code generation models demonstrate that CodeGenGuard achieves superior watermarking performance to the state-of-the-art.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5291. Information-based Value Iteration Networks for Decision Making Under Uncertainty
๐ฌ 10
โญ 3.50
๐ reinforcement learning
๐ท๏ธ Reinforcement Learning value iteration networks planning under uncertainty
Deep neural networks that incorporate classic reinforcement learning methods, such as value iteration, into their structure significantly outperform randomly structured networks in learning and generalization. These networks, however, are mostly limited to environments with no or very low uncertainty and do not extend well to partially observable environments. In this paper, we propose a new planning module architecture, the VI$^2$N (Value Iteration with Value of Information Network), that learns to act in novel environments with high perceptual ambiguity. This architecture over-emphasizes reducing uncertainty before exploiting the reward. VI$^2$N can also utilize factorization in environments with mixed observability to decrease the computational complexity of calculating the policy and to facilitate learning. Tested on a range of grid-based navigation tasks, each containing various types of environments with different degrees of observability, our network outperforms other deep architectures. Moreover, VI$^2$N generates interpretable cognitive maps highlighting both rewarding and informative locations. These maps highlight the key states the agent must visit to achieve its goal.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5292. Learning is Forgetting; LLM Training As Lossy Compression
๐ฌ 4
โญ 3.50
๐ learning theory
๐ท๏ธ Compression Information Theory Learning Generalisation LLMs Interpretability
Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue that LLMs are best seen as an instance of lossy compression, where, over training, they learn by retaining only the information in their training data relevant to their objective(s). We show that pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open-weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model's compression, and the information present in it, can predict downstream performance on MMLU-Pro, letting us directly link representational structure to actionable insights about model performance. More generally, the work presented here offers a unified information-theoretic framing for how these models learn that is deployable at scale.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5293. HARP: Hallucination Detection via Reasoning Subspace Projection
๐ฌ 10
โญ 3.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Hallucination detection Subspace Projection SVD
Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained.
Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.
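As a rough illustration of the projection mechanics described above (not HARP's actual subspace identification, which is the paper's contribution), the sketch below applies SVD to a stand-in unembedding matrix and projects hidden states onto a small set of right singular vectors; the matrix, the dimensions, and the choice of which directions to keep are purely hypothetical.

```python
import numpy as np

# Illustrative sketch only: which singular directions form the "reasoning"
# subspace is specific to HARP; here we just show projecting hidden states
# onto a subset of right singular vectors of a toy unembedding matrix.
rng = np.random.default_rng(0)
vocab, d = 1000, 256                      # toy dimensions, not a real model
W_unembed = rng.standard_normal((vocab, d)) / np.sqrt(d)

# Rows of Vt are orthonormal directions in the hidden-state space,
# ordered by singular value.
_, _, Vt = np.linalg.svd(W_unembed, full_matrices=False)

k = int(0.05 * d)                          # keep ~5% of directions, as in the abstract
basis = Vt[-k:]                            # hypothetical choice of directions

hidden_states = rng.standard_normal((8, d))   # e.g. one vector per generated token
features = hidden_states @ basis.T            # (8, k) low-dimensional detector input
print(features.shape)
```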
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5294. Difference Predictive Coding for Training Spiking Neural Networks
๐ฌ 24
โญ 3.50
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Spiking neural networks predictive coding biologically plausible learning neuromorphic computing difference predictive coding local learning rules energy efficiency communication efficiency spike-based learning surrogate gradient alternatives
Predictive coding networks (PCNs) offer a local-learning alternative to backpropagation in which layers communicate residual errors, aligning well with biological computation and neuromorphic hardware. In this work we introduce Difference Predictive Coding (DiffPC), a spike-native PC formulation for spiking neural networks. DiffPC replaces dense floating-point messages with sparse ternary spikes, provides spike-compatible target and error updates, and employs adaptive threshold schedules for event-driven operation. We validate DiffPC on fully connected and convolutional architectures, demonstrating competitive performance on MNIST (99.3\%) and Fashion-MNIST (89.6\%), and outperforming a backpropagation baseline on CIFAR-10. Crucially, this performance is achieved with high communication sparsity, reducing data movement by over two orders of magnitude compared to standard predictive coding. DiffPC thus establishes a faithful, hardware-aligned framework for communication-efficient training on neuromorphic platforms.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5295. Beyond Softmax and Entropy: $f$-Regularized Policy Gradients with Coupled Parametrizations
๐ฌ 16
โญ 3.50
๐ reinforcement learning
๐ท๏ธ policy gradient methods reinforcement learning theory f-divergence Tsallis entropy Shannon entropy
We introduce $\texttt{f-PG}$, a new class of stochastic policy gradient methods regularized by a family of $f$-divergences, including entropy and Tsallis divergences. For each divergence, we employ a $\textit{coupled}$ parameterization, defined by $f$-softargmax, which allows us to establish the first explicit, non-asymptotic, last-iterate convergence rates for stochastic policy gradient.
To derive our analysis, we prove that the $f$-regularized value function is smooth and satisfies a Polyak-Łojasiewicz inequality as a function of $f$-softargmax parameters. To establish the latter, we introduce a general policy improvement operator that restricts optimization to a well-defined policy space that excludes ill-behaved policies. In the case of softmax, this allows us to escape the "gravitational pull" and yields the first $\textit{explicit}$ convergence guarantees for this parameterization, closing a gap in the literature.
Finally, we leverage these rates to derive sample complexity bounds for the unregularized problem and show that $\texttt{f-PG}$ with Tsallis divergences provides a provably better sample complexity/regularization bias trade-off compared to softmax-based policy gradient with entropy regularization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5296. FedMuon: Federated Learning with Bias-corrected LMO-based Optimization
๐ฌ 11
โญ 3.50
๐ optimization
๐ท๏ธ fderated learnig muon linear minimizatio oracle
Recently, a new optimization method based on the linear minimization oracle (LMO), called Muon, has been attracting increasing attention since it can train neural networks faster than the existing adaptive optimization methods, such as Adam.
In this paper, we study how Muon can be utilized in federated learning.
We first show that straightforwardly using Muon as the local optimizer of FedAvg does not work since the LMO is a biased operator.
We then propose FedMuon, which can mitigate this issue and can converge to the stationary point.
We also analyze how solving the LMO approximately affects the convergence rate and find that, surprisingly, FedMuon can converge for any number of Newton-Schulz iterations, while it can converge faster as we solve the LMO more accurately.
Through experiments, we demonstrate that FedMuon can outperform state-of-the-art federated learning methods.
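For readers unfamiliar with Muon-style LMO updates, the sketch below shows the quintic Newton-Schulz iteration used in public Muon implementations to approximately orthogonalize the update direction (i.e., approximately solve the LMO over the spectral-norm ball). FedMuon's bias correction and the FedAvg wrapper are not reproduced, and the coefficients come from those public implementations rather than from this abstract.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest (semi-)orthogonal matrix via a
    quintic Newton-Schulz iteration (coefficients as in public Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]   # iterate on the smaller Gram matrix
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# One Muon-style local update on a weight matrix (momentum omitted for brevity).
W = torch.randn(256, 128)
grad = torch.randn_like(W)
lr = 0.02
W = W - lr * newton_schulz_orthogonalize(grad)
```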
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5297. IR-Agent: Expert-Inspired LLM Agents for Structure Elucidation from Infrared Spectra
๐ฌ 24
โญ 3.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ LLM Agent Infrared Spectroscopy Structure Elucidation
Spectral analysis provides crucial clues for the elucidation of unknown materials. Among various techniques, infrared spectroscopy (IR) plays an important role in laboratory settings due to its high accessibility and low cost. However, existing approaches often fail to reflect expert analytical processes and lack flexibility in incorporating diverse types of chemical knowledge, which is essential in real-world analytical scenarios. In this paper, we propose IR-Agent, a novel multi-agent framework for molecular structure elucidation from IR spectra. The framework is designed to emulate expert-driven IR analysis procedures and is inherently extensible. Each agent specializes in a specific aspect of IR interpretation, and their complementary roles enable integrated reasoning, thereby improving the overall accuracy of structure elucidation. Through extensive experiments, we demonstrate that IR-Agent not only improves baseline performance on experimental IR spectra but also shows strong adaptability to various forms of chemical information.
The source code for IR-Agent is available at https://anonymous.4open.science/r/IR-Agent-ICLR26-CD59.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5298. Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning
๐ฌ 23
โญ 3.50
๐ causal reasoning
๐ท๏ธ Experiment Designs A/B Testing Reinforcement Learning
A/B testing has become a gold standard for modern technological companies to conduct policy evaluation. Yet, its application to time series experiments, where treatments are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition treatment allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5299. NC-Bench and NCfold: A Benchmark and Closed-Loop Framework for RNA Non-Canonical Base-Pair Prediction
๐ฌ 14
โญ 3.50
๐ datasets and benchmarks
๐ท๏ธ RNA secondry structure prediction RNA non-canonical base pair RNA foundation model
RNA secondary structure forms the basis for folding and function, with non-canonical (NC) interactions indispensable for catalysis, regulation, and molecular recognition. Despite their importance, predicting NC base pairs remains challenging due to the absence of a standardized benchmark for systematic evaluation. To address this, we introduce NC-Bench, the first benchmark dedicated to NC base-pair prediction. NC-Bench provides 925 curated RNA sequences with 6,708 high-quality NC annotations, fine-grained edge and orientation classification tasks, and IsoScore-based embedding evaluation, offering a rigorous foundation for systematic assessment. Building on this, we propose NCfold, a dual-branch framework that couples sequence features with structural priors derived from RNA foundation models (RFMs) via Representative Embedding Fusion (REF) and REF-weighted self-attention. This closed-loop design iteratively refines sequence and structure representations, alleviating data sparsity and enhancing predictive accuracy. Experiments on NC-Bench show that NCfold outperforms existing methods, with zero-shot and ablation studies confirming its effectiveness and underscoring the need for NC-specific benchmarks. Together, NC-Bench and NCfold establish a systematic foundation for NC base-pair prediction, advancing our understanding of RNA structure and enabling next-generation RNA-centric applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5300. Controllable diffusion-based generation for multi-channel biological data
๐ฌ 16
โญ 3.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ diffusion model conditional imputation channel attention random-masking guidance imaging mass cytometry
Biological profiling technologies, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate multi-channel data with strong spatial alignment and complex inter-channel relationships. Modeling such data requires generative frameworks that can jointly model spatial structure and channel relationships, while also generalizing across arbitrary combinations of observed and missing channels for practical applications. Existing generative models typically assume low-dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that break spatial correspondence and overlook inter-channel dependencies. This work proposes a unified multi-channel diffusion (MCD) framework for controllable generation of structured biological data with intricate inter-channel relationships. Our model introduces two key innovations: (1) a hierarchical feature injection mechanism that enables multi-resolution conditioning on spatially aligned observed channels, and (2) two complementary channel attention modules to capture inter-channel relationships and recalibrate latent features. To support flexible conditioning and generalization to arbitrary sets of observed channels, we train the model using a random channel masking strategy, enabling it to reconstruct missing channels from any combination of observed channels as the spatial condition. We demonstrate state-of-the-art performance across both spatial and non-spatial biological data generation tasks, including imputation in spatial proteomics and clinical imaging, as well as gene-to-protein prediction in single-cell datasets, and show strong generalizability to unseen conditional configurations.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5301. Inheriting Generalizable Knowledge from LLMs to Diverse Vertical Tasks
๐ฌ 10
โญ 3.50
๐ learning theory
๐ท๏ธ NLP; DL
Large language models (LLMs) have demonstrated remarkable generalization across diverse tasks, suggesting the existence of task-agnostic, generalizable knowledge encoded within them. However, how to systematically extract and evaluate this knowledge remains unexplored. In this work, we innovatively propose MASA (Matrix-level Alignment and Scalable Adaptation), a unified framework for extracting and transferring generalizable knowledge from LLMs. MASA first introduces a lightweight set of gene matrices trained with a dual alignment strategy, combining output alignment and spectral alignment, to capture the generalizable knowledge encoded in the feed-forward networks (FFNs) of LLM. It then employs scalable adaptation to flexibly reshape these gene matrices to match the parameter dimensions of lightweight dense models of various sizes, enabling direct initialization of their FFN layers. To evaluate the inherited knowledge, we measure the downstream performance of lightweight models initialized with MASA across language understanding and dialogue generation tasks spanning diverse vertical domains. Experiments on both dense and Mixture-of-Experts (MoE) source LLMs show that MASA consistently outperforms baselines such as random initialization, pruning, and distillation, yielding lightweight models that achieve stronger performance, require less pre-training data, and converge faster. These results establish MASA as an effective and general framework for extracting and leveraging the generalizable knowledge within LLMs. The code is available at https://anonymous.4open.science/r/MASA-main-6083/.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5302. Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making
๐ฌ 10
โญ 3.50
๐ optimization
๐ท๏ธ decision focused learning decision making stochastic optimization generative models operational research
Decision-focused learning (DFL) integrates predictive models with downstream optimization, directly training machine learning models to minimize decision errors. While DFL has been shown to provide substantial advantages when compared to a counterpart that treats the predictive and prescriptive models separately, it has also been shown to struggle in high-dimensional and risk-sensitive settings, limiting its applicability in real-world settings. To address this limitation, this paper introduces Decision-Focused Generative Learning (Gen-DFL), a novel framework that leverages generative models to adaptively model uncertainty and improve decision quality. Instead of relying on fixed uncertainty sets, Gen-DFL learns a structured representation of the optimization parameters and samples from the tail regions of the learned distribution to enhance robustness against worst-case scenarios. This approach mitigates over-conservatism while capturing complex dependencies in the parameter space. The paper shows, theoretically, that Gen-DFL achieves improved worst-case performance bounds compared to traditional DFL. Empirically, it evaluates Gen-DFL on various scheduling and logistics problems, demonstrating its strong performance against existing DFL methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5303. The Lattice Geometry of Neural Network Quantization: A Short Equivalence Proof of GPTQ and Babai's algorithm
๐ฌ 8
โญ 3.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ quantization lattices GPTQ Babai
We explain how data-driven quantization of a linear unit in a neural network corresponds to solving the closest vector problem for a certain lattice generated by input data. We prove that the GPTQ algorithm is equivalent to Babai's well-known nearest-plane algorithm. We furthermore provide geometric intuition for both algorithms. Lastly, we note the consequences of these results, in particular hinting at the possibility of using lattice basis reduction for improved quantization.
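For context, a minimal NumPy sketch of Babai's nearest-plane algorithm, the classical procedure the paper proves equivalent to GPTQ, is given below; the toy basis and target vector are arbitrary illustrations, not anything from the paper.

```python
import numpy as np

def gram_schmidt(B):
    """Orthogonalize the columns of B (modified Gram-Schmidt, no normalization)."""
    Bstar = B.astype(float).copy()
    for j in range(B.shape[1]):
        for i in range(j):
            mu = Bstar[:, j] @ Bstar[:, i] / (Bstar[:, i] @ Bstar[:, i])
            Bstar[:, j] -= mu * Bstar[:, i]
    return Bstar

def babai_nearest_plane(B, t):
    """Approximate closest vector to t in the lattice generated by the columns of B."""
    Bstar = gram_schmidt(B)
    r = t.astype(float).copy()
    coeffs = np.zeros(B.shape[1])
    for j in reversed(range(B.shape[1])):
        c = round(r @ Bstar[:, j] / (Bstar[:, j] @ Bstar[:, j]))
        coeffs[j] = c
        r -= c * B[:, j]                   # peel off one basis direction at a time
    return B @ coeffs, coeffs

B = np.array([[2.0, 1.0], [0.0, 3.0]])    # toy lattice basis (columns)
t = np.array([2.3, 4.9])                   # vector to round onto the lattice
v, coeffs = babai_nearest_plane(B, t)
print(v, coeffs)
```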
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5304. The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
๐ฌ 11
โญ 3.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ Length Generalization Large Language Models Turing Machine Chain-of-Thought Computable Reasoning Synthetic Dataset
Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLMs).
Although existing studies have predominantly focused on data-driven approaches for particular arithmetic operations or symbolic manipulation tasks, these approaches tend to be task-specific with limited performance on individual tasks.
To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are *computable*, *i.e.*, problems that algorithms can solve, thus can be solved by the Turing machine, which operates over inputs of unbounded length.
From this perspective, this paper proposes **T**uring m**A**chine **I**mitation **L**earning (**TAIL**) to improve the length generalization ability of LLMs.
TAIL uses computer programs to directly synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing machine, which *linearly* expands the reasoning steps into *atomic* states to alleviate shortcut pattern learning and explicit *memory* fetch mechanism to reduce the difficulties of dynamic and long-range data access.
To validate the universality and reliability of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks.
Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B in individual tasks using only synthetic data, surpassing previous methods and DeepSeek-R1.
The experimental results reveal that the key concepts of the Turing machine, rather than human-like thinking styles, are what make TAIL effective for length generalization: the model exhibits read-and-write behaviors in its attention layers that are consistent with the properties of the Turing machine. This work provides a promising direction for future research on learning LLM reasoning from synthetic data.
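A toy sketch of the general idea, synthesizing a step-by-step execution trace of a simple Turing machine, is shown below; the machine (binary increment) and the trace format are hypothetical illustrations, not the paper's actual data pipeline.

```python
# Hypothetical illustration: synthesize a chain-of-thought-style trace by
# imitating a Turing machine that increments a binary number, one atomic
# (state, read, write, move) step per line.

def run_tm(tape: str):
    # transition table: (state, symbol) -> (write, move, next_state)
    delta = {
        ("inc", "1"): ("0", -1, "inc"),   # flip trailing 1s while carrying
        ("inc", "0"): ("1",  0, "halt"),  # absorb the carry
        ("inc", "_"): ("1",  0, "halt"),  # carry past the most significant bit
    }
    cells = {i: s for i, s in enumerate(tape)}
    head, state, trace = len(tape) - 1, "inc", []
    while state != "halt":
        read = cells.get(head, "_")
        write, move, next_state = delta[(state, read)]
        trace.append(f"state={state} head={head} read={read} write={write} "
                     f"move={'L' if move < 0 else 'stay'} next={next_state}")
        cells[head] = write
        head += move
        state = next_state
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, "_") for i in range(lo, hi + 1)), trace

result, trace = run_tm("1011")
print("\n".join(trace))
print("result:", result)   # 1100
```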
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5305. Group Representational Position Embedding
๐ฌ 15
โญ 3.50
๐ learning theory
๐ท๏ธ position encoding group theory
We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms:
(i) \emph{multiplicative} rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and
(ii) \emph{additive} logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$.
In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t\in\mathbb{R}$) acts as $\mathbb{G}(n)=\exp(n\,\omega\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(rd)$ cost per head, respectively.
In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases.
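A minimal numerical sketch of the multiplicative picture: a block-diagonal skew generator with the usual RoPE frequencies is exponentiated and matches a hand-rolled RoPE rotation, and the map composes additively in position. The construction folds the per-plane frequencies into the generator and is only meant to illustrate the RoPE special case, not the learned or non-commuting variants.

```python
import numpy as np
from scipy.linalg import expm

d, base = 8, 10_000.0
freqs = base ** (-np.arange(0, d, 2) / d)        # log-uniform spectrum, one per plane

# Skew-symmetric generator: one 2x2 rotation generator per coordinate pair.
L = np.zeros((d, d))
for k, w in enumerate(freqs):
    i = 2 * k
    L[i, i + 1], L[i + 1, i] = -w, w

def grape(n: int) -> np.ndarray:
    return expm(n * L)                           # orthogonal, norm-preserving

def rope(n: int, x: np.ndarray) -> np.ndarray:
    out = x.copy()
    for k, w in enumerate(freqs):
        i = 2 * k
        c, s = np.cos(n * w), np.sin(n * w)
        out[i], out[i + 1] = c * x[i] - s * x[i + 1], s * x[i] + c * x[i + 1]
    return out

x = np.random.default_rng(0).standard_normal(d)
print(np.allclose(grape(7) @ x, rope(7, x)))       # True: RoPE as a special case
print(np.allclose(grape(3) @ grape(4), grape(7)))  # relative / compositional law
```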
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5306. PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities
๐ฌ 18
โญ 3.50
๐ datasets and benchmarks
๐ท๏ธ LLM Agent Cybersecurity Benchmark AI Safety
The increasing autonomy of Large Language Models (LLMs) necessitates a rigorous evaluation of their potential to aid in cyber offense. Existing benchmarks often lack real-world complexity and are thus unable to accurately assess LLMs' cybersecurity capabilities. To address this gap, we introduce PACEbench, a practical AI cyber-exploitation benchmark built on the principles of realistic vulnerability difficulty, environmental complexity, and cyber defenses. Specifically, PACEbench comprises four scenarios spanning single, blended, chained, and defense vulnerability exploitations. To handle these complex challenges, we propose PACEagent, a novel agent that emulates human penetration testers by supporting multi-phase reconnaissance, analysis, and exploitation.
Extensive experiments with seven frontier LLMs demonstrate that current models struggle with complex cyber scenarios, and none can bypass defenses. These findings suggest that current models do not yet pose a generalized cyber offense threat. Nonetheless, our work provides a robust benchmark to guide the trustworthy development of future models.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5307. Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks
๐ฌ 11
โญ 3.50
๐ datasets and benchmarks
๐ท๏ธ Fair Machine Learning stochastic approximation Augmented Lagrangian Sequential Quadratic Programming benchmarking
The ability to train Deep Neural Networks (DNNs) with constraints is instrumental in improving the fairness of modern machine-learning models. Many algorithms have been analysed in recent years, and yet there is no standard, widely accepted method for the constrained training of DNNs. In this paper, we provide a challenging benchmark of real-world large-scale fairness-constrained learning tasks, built on top of the US Census (Folktables, Ding et al, 2021). We point out the theoretical challenges of such tasks and review the main approaches in stochastic approximation algorithms. Finally, we demonstrate the use of the benchmark by implementing and comparing three recently proposed, but as-of-yet unimplemented, algorithms both in terms of optimization performance, and fairness improvement. We will release the code of the benchmark as a Python package after peer-review.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5308. FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching
๐ฌ 14
โญ 3.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ conditional flow matching precipitation nowcasting generative models spatiotemporal forecasting machine learning in environmental science
Radar-based precipitation nowcasting, the task of forecasting short-term precipitation fields from previous radar images, is a critical problem for flood risk management and decision-making. While deep learning has substantially advanced this field, two challenges remain fundamental: the uncertainty of atmospheric dynamics and the efficient modeling of high-dimensional data. Diffusion models have shown strong promise by producing sharp, reliable forecasts, but their iterative sampling process is computationally prohibitive for time-critical applications. We introduce FlowCast, the first end-to-end probabilistic model leveraging Conditional Flow Matching (CFM) as a direct noise-to-data generative framework for precipitation nowcasting. Unlike hybrid approaches, FlowCast learns a direct noise-to-data mapping in a compressed latent space, enabling rapid, high-fidelity sample generation. Our experiments demonstrate that FlowCast establishes a new state-of-the-art in probabilistic performance while also exceeding deterministic baselines in predictive accuracy. A direct comparison further reveals the CFM objective is both more accurate and significantly more efficient than a diffusion objective on the same architecture, maintaining high performance with significantly fewer sampling steps. This work positions CFM as a powerful and practical alternative for high-dimensional spatiotemporal forecasting.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5309. The Deleuzian Representation Hypothesis
๐ฌ 10
โญ 3.50
๐ interpretability and explainable AI
๐ท๏ธ Mechanistic Interpretability Concept Extraction Explainability
We propose an alternative to sparse autoencoders (SAEs) as a simple and effective unsupervised method for extracting interpretable concepts from neural networks. The core idea is to cluster differences in activations, which we formally justify within a discriminant analysis framework. To enhance the diversity of extracted concepts, we refine the approach by weighting the clustering using the skewness of activations. The method aligns with Deleuze's modern view of concepts as differences. We evaluate the approach across five models and three modalities (vision, language, and audio), measuring concept quality, diversity, and consistency. Our results show that the proposed method achieves concept quality surpassing prior unsupervised SAE variants while approaching supervised baselines, and that the extracted concepts enable steering of a model's inner representations, demonstrating their causal influence on downstream behavior.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5310. HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction
๐ฌ 33
โญ 3.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ spatial transcriptomics pan-cancer modeling pathway-level coherence computational efficiency
Predicting spatial gene expression from H\&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways. HistoPrism not only surpasses prior state-of-the-art models on highly variable genes but, more importantly, also achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns. With strong pan-cancer generalization and improved efficiency, HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5311. Purrception: Variational Flow Matching for Vector-Quantized Image Generation
๐ฌ 18
โญ 3.50
๐ generative models
๐ท๏ธ generative models flow matching vector quantized image generation computer vision variational flow matching
We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k $256 \times 256$ generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
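As a hedged sketch of the general variational-flow-matching recipe for linear probability paths (not Purrception's architecture or training objective), the snippet below turns a categorical posterior over codebook entries into a velocity in the continuous embedding space; the codebook, dimensions, and clamping are illustrative assumptions.

```python
import torch

# Minimal sketch: with a linear path x_t = (1 - t) * x_0 + t * x_1, the
# marginal velocity can be written as (E[x_1 | x_t, t] - x_t) / (1 - t).
# For vector-quantized latents, E[x_1 | x_t, t] is the posterior-weighted
# average of codebook embeddings.
K, D = 512, 16
codebook = torch.randn(K, D)                 # hypothetical VQ codebook

def velocity_from_logits(x_t: torch.Tensor, t: torch.Tensor,
                         logits: torch.Tensor) -> torch.Tensor:
    """logits: (..., K) categorical posterior over codebook indices."""
    posterior = logits.softmax(dim=-1)
    expected_x1 = posterior @ codebook       # (..., D)
    return (expected_x1 - x_t) / (1.0 - t).clamp_min(1e-4)

# One Euler integration step of the induced ODE.
x_t = torch.randn(8, D)
t = torch.full((8, 1), 0.3)
logits = torch.randn(8, K)                   # would come from the trained network
x_next = x_t + 0.05 * velocity_from_logits(x_t, t, logits)
```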
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5312. CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
๐ฌ 7
โญ 3.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ CUDA Optimization Reinforcement Learning LLMs
The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies.
While recent advances in LLMs show promise for code generation, current state-of-the-art models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning (RL) framework for CUDA optimization that employs a novel contrastive RL algorithm.
CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of {\bf ×3.12} with a median speedup of {\bf ×1.42} against default baselines across all 250 CUDA kernels of KernelBench, with peak speedups reaching {\bf ×120}. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates {\bf ×2.77} over Torch Compile, {\bf ×2.88} over Torch Compile with reduce-overhead, and {\bf ×2.81} over CUDA Graph implementations. Furthermore, the model also demonstrates portability across GPU architectures, achieving average speedups of {\bf ×3.85} (median {\bf ×1.32}) on H100, {\bf ×3.13} (median {\bf ×1.31}) on L40, {\bf ×2.51} (median {\bf ×1.18}) on RTX 3090, and {\bf ×2.38} (median {\bf ×1.34}) on H20 despite being optimized specifically for A100.
Beyond these benchmark results, CUDA-L1 demonstrates several properties: CUDA-L1 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance.
These capabilities demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge.
In this process, it identifies CUDA optimization patterns, discovers new techniques, synthesizes them to achieve speedups, and, more importantly, extends the acquired reasoning abilities to new kernels.
This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5313. OpenEstimate: Evaluating LLMs on Probabilistic Estimation with Real-World Data
๐ฌ 20
โญ 3.50
๐ datasets and benchmarks
๐ท๏ธ probabilistic estimation reasoning uncertainty calibration
Real-world settings where language models (LMs) are deployed, in domains spanning healthcare, finance, and other forms of knowledge work, require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce \textsc{OpenEstimate}, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration. Across six frontier models, we find that LM-elicited priors are worth the equivalent of about five samples from the underlying data distribution, and that posteriors computed using LM priors tend to be more accurate than those computed using a naive prior. At the same time, the relationship between model accuracy and confidence is weak across the board, indicating the value of developing new methods to improve calibration. The \textsc{OpenEstimate} benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
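To make the "worth about five samples" reading concrete, here is a small sketch (not the benchmark's protocol) that treats an LM-elicited Gaussian prior as a number of pseudo-observations and combines it with real samples via a conjugate Normal-Normal update; all numbers are made up.

```python
import numpy as np

def posterior_mean_var(prior_mu, prior_sigma, samples, obs_sigma):
    """Conjugate Normal-Normal update for a population mean with known
    observation noise: precisions add, means are precision-weighted."""
    n = len(samples)
    prior_prec = 1.0 / prior_sigma**2
    like_prec = n / obs_sigma**2
    post_var = 1.0 / (prior_prec + like_prec)
    post_mu = post_var * (prior_prec * prior_mu + like_prec * np.mean(samples))
    return post_mu, post_var

# "Worth about five samples": a prior with variance obs_sigma**2 / 5 carries
# the same precision as five observations.
obs_sigma = 10.0
prior_mu, prior_sigma = 52.0, obs_sigma / np.sqrt(5)

rng = np.random.default_rng(0)
samples = rng.normal(loc=48.0, scale=obs_sigma, size=3)
print(posterior_mean_var(prior_mu, prior_sigma, samples, obs_sigma))
```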
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5314. REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering
๐ฌ 24
โญ 3.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ language modeling; representation engineering
Inference-time steering aims to alter an LLM's responses without changing its parameters. A key challenge lies in selecting internal modules that most strongly govern the target behavior; existing approaches often rely on simplistic cues or ad hoc heuristics, leading to suboptimal or unintended effects. In this work, we introduce \modelname{}, a novel framework for identifying behavior-relevant modules (heads or layers) in Transformers. For each module, we train a vector-quantized autoencoder (VQ-AE) on its hidden activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces via a shared, learnable codebook. We quantify each module's behavioral relevance by evaluating how effectively the VQ-AE encodings distinguish between behavior-aligned and behavior-violating responses using a binary classification metric. This relevance score informs both module selection and steering strength. We evaluate \modelname{} across eight LLMs from two model families (\textsc{Llama} and \textsc{Qwen}) and nine datasets spanning truthfulness enhancement, open-domain question answering under knowledge conflicts, and general alignment tasks. \modelname{} enables more effective inference-time interventions, yielding significant improvements on these steering tasks. Notably, it achieves an average relative improvement of 20\% (up to 81.5\%) over the seminal ITI method~\citep{DBLP:conf/nips/0002PVPW23} on truthfulness steering. Moreover, the modules selected by our method exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5315. mR3: Multilingual Rubric-Agnostic Reward Reasoning Models
๐ฌ 31
โญ 3.50
๐ applications to computer vision, audio, language, and other modalities
๐ท๏ธ reward model reasoning rubric
Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to nine times smaller, and its effectiveness is further confirmed through extensive ablation studies. We will release our models and datasets publicly upon acceptance.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5316. Untraceable DeepFakes via Traceable Fingerprint Elimination
๐ฌ 11
โญ 3.50
๐ other topics in machine learning (i.e., none of the above)
๐ท๏ธ DeepFakes Attribution;Adversarial Attack;Generative Model Fingerprint
Recent advancements in DeepFakes attribution technologies have significantly enhanced forensic capabilities, enabling the extraction of traces left by generative models (GMs) in images, making DeepFakes traceable back to their source GMs.
Meanwhile, several attacks have attempted to evade attribution models (AMs) for exploring their limitations, calling for more robust AMs.
However, existing attacks fail to eliminate GMs' traces, thus can be mitigated by defensive measures.
In this paper, we identify that untraceable DeepFakes can be achieved through a multiplicative attack, which can fundamentally eliminate GMs' traces, thereby evading AMs even enhanced with defensive measures.
We design a universal and black-box attack method that trains an adversarial model solely using real data, applicable for various GMs and agnostic to AMs.
Experimental results demonstrate the outstanding attack capability and universal applicability of our method, achieving an average attack success rate (ASR) of 97.08\% against 6 advanced AMs across 12 GMs.
Even in the presence of defensive mechanisms, our method maintains an ASR exceeding 72.39\%.
Our work underscores the potential challenges posed by multiplicative attacks and highlights the need for more robust AMs.
The code is available at \textbf{\textit{https://anonymous.4open.science/r/TEST-F4B1}}.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5317. Best-of-Infinity: Asymptotic Performance of Test-Time Compute
๐ฌ 11
โญ 3.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ LLM test-time compute majority voting LLM ensemble
We study best-of-$N$ for large language models (LLMs) where the selection is based on majority voting. In particular, we analyze the limit $N \to \infty$, which we denote as best-of-$\infty$. While this approach achieves impressive performance in the limit, it requires an infinite test-time budget. To address this, we propose an adaptive generation scheme that selects $N$ based on answer agreement, thereby efficiently allocating inference-time computation. Beyond adaptivity, we extend the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model. The optimal ensemble weighting is formulated and efficiently computed as a mixed-integer linear program. Extensive experiments demonstrate the effectiveness of our approach.
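A minimal sketch of majority-vote best-of-N with an agreement-based adaptive stopping rule is shown below; the stopping threshold, the stand-in sampler, and the omission of the mixed-integer ensemble weighting are all simplifications relative to the paper.

```python
from collections import Counter
import random

def sample_answer() -> str:
    # stand-in for one LLM generation followed by answer extraction
    return random.choices(["42", "41", "43"], weights=[0.6, 0.3, 0.1])[0]

def adaptive_majority_vote(max_samples: int = 64, min_samples: int = 4,
                           agreement: float = 0.7) -> str:
    """Keep sampling until the leading answer holds a large enough share of
    the votes (one simple agreement rule; the paper's rule may differ)."""
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        if n >= min_samples:
            answer, top = counts.most_common(1)[0]
            if top / n >= agreement:       # answers agree strongly enough: stop
                return answer
    return counts.most_common(1)[0][0]

random.seed(0)
print(adaptive_majority_vote())
```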
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5318. Benchmarking ECG Foundational Models: A Reality Check Across Clinical Tasks
๐ฌ 21
โญ 3.50
๐ foundation or frontier models, including LLMs
๐ท๏ธ ECG foundation model benchmark representation learning
The 12-lead electrocardiogram (ECG) is a long-standing diagnostic tool. Yet machine learning for ECG interpretation remains fragmented, often limited to narrow tasks or datasets. Foundation models (FMs) promise broader adaptability, but fundamental questions remain: Which architectures generalize best? How do models scale with limited labels? What explains performance differences across model families? We benchmarked eight ECG FMs on 26 clinically relevant tasks using 12 public datasets comprising 1,650 regression and classification targets. Models were evaluated under fine-tuning and frozen settings, with scaling analyses across dataset sizes. Results show heterogeneous performance across domains: in adult ECG interpretation, three FMs consistently outperformed strong supervised baselines. In contrast, ECG-CPC, a compact structured state-space model, dominated 5 of 7 task categories, demonstrating that architecture matters more than scale. FMs improved label efficiency 2.5-9× over supervised baselines, though scaling behaviors varied across architectures. Representation analysis reveals that models with similar performance learn markedly different internal structures, suggesting multiple viable paths to effective ECG representation. Overall, while FMs show promise for adult ECG analysis, substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. ECG-CPC's strong performance despite being orders of magnitude smaller challenges the assumption that FM quality requires massive scale, highlighting architectural inductive biases as an untapped opportunity.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5319. Decoding Inner Speech with an End-to-End Brain-to-Text Neural Interface
๐ฌ 16
โญ 3.50
๐ applications to neuroscience & cognitive science
๐ท๏ธ brain-computer interfaces neuroscience neural decoding self-supervised learning multimodal transformer masked modeling
Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end BraIn-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio-LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5320. Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Reasoning Large Language Models
๐ฌ 11
โญ 3.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Reasoning Large Language Model Watermark
Reasoning Large Language Models (RLLMs) excelling in complex tasks present unique challenges for digital watermarking, as existing methods often disrupt logical coherence or incur high computational costs. Token-based watermarking techniques can corrupt the reasoning flow by applying pseudo-random biases, while semantic-aware approaches improve quality but introduce significant latency or require auxiliary models. This paper introduces ReasonMark, a novel watermarking framework specifically designed for reasoning-intensive LLMs. Our approach decouples generation into an undisturbed Thinking Phase and a watermarked Answering Phase. We propose a Criticality Score to identify semantically pivotal tokens from the reasoning trace, which are distilled into a Principal Semantic Vector (PSV). The PSV then guides a semantically-adaptive mechanism that modulates watermark strength based on token-PSV alignment, ensuring robustness without compromising logical integrity. Extensive experiments show ReasonMark surpasses state-of-the-art methods by reducing text Perplexity by 0.35, increasing translation BLEU score by 0.164, and raising mathematical accuracy by 0.67 points. These advancements are achieved alongside a 0.34% higher watermark detection AUC and stronger robustness to attacks, all with a negligible increase in latency. This work enables the traceable and trustworthy deployment of reasoning LLMs in real-world applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5321. Never Saddle: Reparameterized Steepest Descent as Mirror Flow
๐ฌ 21
โญ 3.50
๐ learning theory
๐ท๏ธ Implicit bias mirror flow sign gradient descent Adam AdamW steepest descent reparameterization diagonal linear networks
How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods, including sign descent (which is closely related to Adam), we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, an uncommon regime in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms by which steepest descent can aid modern optimization.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5322. CFO: Learning Continuous-Time PDE Dynamics via Flow-Matched Neural Operators
๐ฌ 17
โญ 3.50
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Neural PDE Solver Operator Learning Flow Matching Machine Learning Long Rollouts
Neural operator surrogates for time-dependent partial differential equations (PDEs) conventionally employ autoregressive prediction schemes, which accumulate error over long rollouts and require uniform temporal discretization.
We introduce the Continuous Flow Operator (CFO), a framework that learns continuous-time PDE dynamics without the computational burden of standard continuous approaches, e.g., neural ODE. The key insight is repurposing flow matching to directly learn the right-hand side of PDEs without backpropagating through ODE solvers. CFO fits temporal splines to trajectory data, using finite-difference estimates of time derivatives at knots to construct probability paths whose velocities closely approximate the true PDE dynamics. A neural operator is then trained via flow matching to predict these analytic velocity fields. This approach is inherently time-resolution invariant: training accepts trajectories sampled on arbitrary, non-uniform time grids while inference queries solutions at any temporal resolution through ODE integration. Across four benchmarks (Lorenz, 1D Burgers, 2D diffusion-reaction, 2D shallow water), CFO demonstrates superior long-horizon stability and remarkable data efficiency. CFO trained on only 25% of irregularly subsampled time points outperforms autoregressive baselines trained on complete data, with relative error reductions up to 87%. Despite requiring numerical integration at inference, CFO achieves competitive efficiency, outperforming autoregressive baselines using only 50% of their function evaluations, while uniquely enabling reverse-time inference and arbitrary temporal querying.
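The sketch below illustrates, under simplifying assumptions, how spline-based velocity targets can be built from an irregularly sampled trajectory (here a toy 1-D signal); the paper's finite-difference construction at knots and the neural-operator training loop are not reproduced.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Minimal sketch (not the paper's exact construction): fit a temporal spline
# to an irregularly sampled trajectory and use its derivative as the
# regression target for a flow-matching loss on the velocity field.
rng = np.random.default_rng(0)
t_knots = np.sort(rng.uniform(0.0, 2.0, size=12))        # non-uniform time grid
u_knots = np.sin(3.0 * t_knots)[:, None]                  # toy 1-D "PDE state"

spline = CubicSpline(t_knots, u_knots, axis=0)
velocity = spline.derivative()

# Training pairs for the neural operator: (t, u(t)) -> du/dt(t), queryable at
# arbitrary times, so training does not require a uniform step size.
t_query = rng.uniform(t_knots[0], t_knots[-1], size=5)
inputs, targets = spline(t_query), velocity(t_query)
print(inputs.shape, targets.shape)
```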
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5323. Certified Evaluation of Model-Level Explanations for Graph Neural Networks
๐ฌ 21
โญ 3.50
๐ interpretability and explainable AI
๐ท๏ธ Model-Level Explanations for GNNs Theory of XAI Evaluation of GNN Explainablity
Model-level explanations for Graph Neural Networks (GNNs) aim to identify class-discriminative motifs that capture how a classifier recognizes a target class. Because the true motifs relied on by the classifier are unobservable, most approaches evaluate explanations by their target class score. However, class score alone is not sufficient as high-scoring explanations may be pathological or may fail to reflect the full range of motifs recognized by the classifier. To bridge this gap, this work introduces sufficiency risk as a formal criterion for whether explanations adequately represent the classifierโs reasoning, and derives distribution-free certificates that upper-bound this risk. Building on this foundation, three metrics are introduced: Coverage, Greedy Gain Area (GGA), and Overlap which operationalize the certificates to assess sufficiency, efficiency, and redundancy in explanations. To ensure practical utility, finite-sample concentration bounds are developed for these metrics, providing confidence intervals that enable statistically reliable comparison between explainers. Experiments on synthetic data and with three state-of-the-art explainers on four real-world datasets demonstrate that these metrics reveal differences in explanation quality hidden by class scores alone. Designed to complement class score, they constitute the first theoretically certified framework for evaluating model-level explanations of GNNs.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5324. TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex
๐ฌ 21
โญ 3.33
๐ applications to neuroscience & cognitive science
๐ท๏ธ VAE generative models biological vision neuroscience
The brain interprets visual information through learned regularities, a computation formalized as performing probabilistic inference under a prior. The visual cortex establishes priors for this inference, some of which are delivered through widely established top-down connections that inform low-level cortices about statistics represented at higher levels in the cortical hierarchy.
While evidence supports that adaptation leads to priors reflecting the structure of natural images, it remains unclear if similar priors can be flexibly acquired when learning a specific task.
To investigate this, we built a generative model of V1 that we optimized for performing a simple discrimination task and analyzed it along with large scale recordings from mice performing an analogous task.
In line with recent successful approaches, we assumed that neuronal activity in V1 can be identified with latent posteriors in the generative model, providing an opportunity to investigate the contributions of task-related priors to neuronal responses. To obtain a flexible test bed for this analysis, we extended the VAE formalism so that a task can be flexibly and data-efficiently acquired by reusing previously learned representations.
Task-specific priors learned by this Task-Amortized VAE were used to investigate biases in mice and model when presenting stimuli that violated the trained task statistics.
A mismatch between learned task statistics and incoming sensory evidence showed signatures of uncertainty about the stimulus category in the posterior of TAVAE, reflecting properties of the bimodal response profiles observed in V1 recordings.
The task-optimized generative model could account for various characteristics of V1 population activity, including within-day updates to the population responses. Our results confirm that flexible task-specific contextual priors can be learned on-demand by the visual system and can be deployed as early as the entry level of the visual cortex.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5325. WARC-Bench: Web Archive based Benchmark for GUI Subtask Executions
๐ฌ 10
โญ 3.33
๐ datasets and benchmarks
๐ท๏ธ web navigation benchmark reinforcement learning
Training web agents to navigate complex, real-world websites requires them to master subtasks: short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5326. Off-Policy Safe Reinforcement Learning with Cost-Constrained Optimistic Exploration
๐ฌ 13
โญ 3.33
๐ reinforcement learning
๐ท๏ธ constrained reinforcement learning safe reinforcement learning safe exploration epistemic uncertainty quantification
When formulating safety as limits of cumulative cost, safe reinforcement learning (RL) learns policies that maximize rewards subject to these constraints during both data collection and deployment. While off-policy methods offer high sample efficiency, their application to safe RL faces substantial challenges from constraint violations caused by the cost-agnostic exploration and the underestimation bias in the cost value function. To address these challenges, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy primal-dual safe RL method that integrates cost-bounded exploration and conservative distributional RL. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space, and adaptively adjusts the trust region to control constraint violation in exploration. Second, we adopt truncated quantile critics to mitigate the underestimation bias in costs. The quantile critics also quantify distributional, risk-sensitive epistemic uncertainty for guiding exploration. Experiments across velocity-constrained robot locomotion, safe navigation, and complex autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive safety performance during evaluation, and controlled data collection cost in exploration. The results highlight the proposed method as a promising solution for safety-critical RL.
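An illustrative sketch of the truncated-quantile cost critic ingredient, assuming a TQC-style rule where the most optimistic (lowest) cost quantiles are dropped before averaging to counteract underestimation; the paper's exact truncation rule, network sizes, and uncertainty estimator may differ.

```python
# Illustrative only: distributional cost critic with truncated quantiles.
import torch
import torch.nn as nn

class QuantileCostCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, n_quantiles=25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, n_quantiles),
        )

    def forward(self, obs, act):
        q = self.net(torch.cat([obs, act], dim=-1))   # (B, n_quantiles)
        return torch.sort(q, dim=-1).values            # sorted quantile estimates

def truncated_cost_estimate(critic, obs, act, drop_lowest=5):
    """Drop the most optimistic (lowest) cost quantiles, then average."""
    q = critic(obs, act)
    return q[..., drop_lowest:].mean(dim=-1)
```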
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5327. Internal Evaluation of Density-Based Clusterings with Noise
๐ฌ 20
โญ 3.33
๐ datasets and benchmarks
๐ท๏ธ Evaluation Clustering Unsupervised Learning
Evaluating the quality of a clustering result without access to ground truth labels is fundamental for research in data mining.
However, most cluster validation indices (CVIs) do not consider the noise assignments by density-based clustering methods like DBSCAN or HDBSCAN, even though the ability to correctly determine noise is paramount to successful clustering.
In this paper, we propose DISCO, a **D**ensity-based **I**nternal **S**core for **C**lusterings with n**O**ise, the first CVI to explicitly assess the *quality* of noise assignments rather than merely counting them.
DISCO is based on the Silhouette Coefficient, but adopts density-connectivity to evaluate clusters of arbitrary shapes and adds explicit noise evaluation: it rewards correctly assigned noise labels and penalizes noise labels where a cluster label would have been more appropriate.
The pointwise definition of DISCO allows for the seamless integration of noise evaluation into the final clustering evaluation, while also enabling explainable evaluations of the clustered data.
In contrast to most state-of-the-art CVIs, DISCO is well-defined and also covers edge cases that regularly appear as output from clustering algorithms, such as singleton clusters or a single cluster plus noise.
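For context, a minimal sketch of the baseline behavior DISCO improves on: the classic pointwise Silhouette Coefficient simply discards DBSCAN's noise points (label -1) instead of scoring them. The dataset and DBSCAN parameters below are arbitrary illustration choices, and this is not DISCO itself.

```python
# Baseline sketch: silhouette computed only on clustered points, noise ignored.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # noise points get label -1

clustered = labels != -1
s = silhouette_samples(X[clustered], labels[clustered])   # pointwise silhouette
print(f"mean silhouette over clustered points: {s.mean():.3f}; "
      f"{(~clustered).sum()} noise points left unscored")
```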
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5328. Fine-Tuning Diffusion Models via Intermediate Distribution Shaping
๐ฌ 16
โญ 3.33
๐ generative models
๐ท๏ธ diffusion fine-tuning reinforcement learning
Diffusion models are widely used for generative tasks across domains. While pre-trained diffusion models effectively capture the training data distribution, it is often desirable to shape these distributions using reward functions to align with downstream applications. Policy gradient methods, such as Proximal Policy Optimization (PPO), are widely used in the context of autoregressive generation. However, the marginal likelihoods required for such methods are intractable for diffusion models, leading to alternative proposals and relaxations. In this context, we unify variants of Rejection sAmpling based Fine-Tuning (RAFT) as GRAFT, and show that this induces the same solution as the PPO objective with reshaped rewards. We then introduce P-GRAFT to shape distributions at intermediate noise levels and demonstrate empirically that this can lead to more effective fine-tuning. We mathematically explain this via a bias-variance tradeoff. Motivated by this, we propose inverse noise correction to improve flow models without leveraging explicit rewards. We empirically evaluate our methods on text-to-image (T2I) generation, layout generation, molecule generation and unconditional image generation. Notably, our framework, applied to Stable Diffusion 2, improves over policy gradient methods on popular T2I benchmarks in terms of VQAScore and shows an 8.81% relative improvement over the base model. For unconditional image generation, inverse noise correction improves FID of generated images at lower FLOPs/image.
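A schematic sketch of one rejection-sampling fine-tuning round in the spirit of RAFT (which the paper unifies as GRAFT); the sampler, reward, and diffusion loss below are placeholder callables, not the authors' implementation.

```python
# Sketch of a single RAFT-style round: sample, keep top-reward samples, fine-tune.
import torch

def raft_round(model, sample_fn, reward_fn, diffusion_loss_fn, optimizer,
               n_samples=256, keep_frac=0.1):
    # 1) Generate candidates with the current model.
    with torch.no_grad():
        xs = sample_fn(model, n_samples)       # (N, ...) generated samples
        rewards = reward_fn(xs)                 # (N,) scalar rewards
    # 2) Rejection step: keep only the top-reward fraction.
    k = max(1, int(keep_frac * n_samples))
    accepted = xs[rewards.topk(k).indices]
    # 3) Fine-tune on accepted samples with the ordinary diffusion objective.
    optimizer.zero_grad()
    loss = diffusion_loss_fn(model, accepted)
    loss.backward()
    optimizer.step()
    return loss.item()
```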
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5329. Drugging the Undruggable: Benchmarking and Modeling Fragment-Based Screening
๐ฌ 19
โญ 3.33
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Drug Discovery Representation Learning Virtual Screening fragment-based drug design
A significant portion of disease-relevant proteins remain undruggable due to shallow, flexible, or otherwise ill-defined binding pockets that hinder conventional molecule screening. Fragment-based drug discovery (FBDD) offers a promising alternative, as small, low-complexity fragments can flexibly engage shallow, transient, or cryptic binding pockets that are often inaccessible to conventional drug-like molecules. However, fragment screening remains difficult due to weak binding signals, limited experimental throughput, and a lack of computational tools tailored for this setting. In this work, we introduce FragBench, the first benchmark for fragment-level virtual screening on undruggable targets. We construct a high-quality dataset through multi-agent LLM-human collaboration and interaction-based fragment labeling. To address the core modeling challenge, we propose a novel tri-modal contrastive learning framework, FragCLIP, that jointly encodes fragments, full molecules, and protein pockets. Our method significantly outperforms baselines like docking software and other ML-based methods. Moreover, we demonstrate that retrieved fragments can be effectively expanded or linked into larger compounds with improved predicted binding affinity, supporting their utility as viable starting points for drug design.
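Illustrative only: one common way to set up a tri-modal contrastive objective is pairwise InfoNCE over fragment, molecule, and pocket embeddings. The encoders, temperature, and equal loss weighting below are assumptions; FragCLIP's actual architecture and loss are described in the paper.

```python
# Sketch of a pairwise InfoNCE objective over three modalities.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_loss(frag_emb, mol_emb, pocket_emb):
    return (info_nce(frag_emb, pocket_emb) +
            info_nce(mol_emb, pocket_emb) +
            info_nce(frag_emb, mol_emb)) / 3.0
```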
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5330. RigidSSL: Rigidity-based Geometric Pretraining for Protein Generation
๐ฌ 10
โญ 3.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Protein design Self-supervised Learning 3D Geometry Rigidity Flow matching SE(3)-equivariance
Protein design stands as one of biology's most important frontiers, with the potential to transform medicine, advance human health, and drive sustainability. Protein generation, a central task in protein design, has been greatly accelerated by AI-driven models such as FoldFlow, MultiFlow, and AlphaFlow, which build on the residue-wise rigidity-based modeling pioneered by AlphaFold2. Residue-wise rigid-body representations reduce structural dimensionality while enforcing chemical constraints, enabling more efficient and physically consistent protein structure generation than all-atom modeling. Despite these advances, existing models often underutilize the vast structural information available in large-scale protein datasets. This highlights the importance of pretraining, which can provide richer representations and improve generalization across diverse protein design tasks. More importantly, the challenge lies in how to fully exploit abundant, low-cost unlabeled protein datasets using unsupervised pretraining. We introduce RigidSSL, a rigidity-based pretraining framework for proteins. RigidSSL canonicalizes structures into an inertial frame, employs a two-phase workflow combining large-scale perturbations and molecular dynamics views, and applies a rigid-body flow matching objective with Invariant Point Attention to capture global geometry. This enables learning stable, geometry-aware representations that improve downstream protein generation. To evaluate the effectiveness of RigidSSL, we conduct quantitative experiments on the protein generation task. Empirically, RigidSSL outperforms previous state-of-the-art geometric pretraining algorithms, leading to improvements in unconditional generation across all metrics, including designability, novelty, and diversity, for lengths up to 800 residues.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5331. Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs
๐ฌ 24
โญ 3.00
๐ generative models
๐ท๏ธ Online algorithm Speculative decoding efficient LLM
Speculative decoding is widely used to accelerate large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the best draft model in hindsight for each query, in terms of either the token acceptance probability or the expected acceptance length. In particular, we show that we can accurately evaluate all draft models, rather than only the chosen one, without incurring additional queries to the target model, which allows us to improve exponentially over the existing bandit-based approach as the number of draft models increases. Our approach is generically applicable to any speculative decoding method (single-draft, multi-draft, and draft-tree). Moreover, we design system-efficient versions of the online learners and demonstrate that the overhead in computation and latency can be substantially reduced. We conduct extensive experiments on open-source LLMs and diverse datasets, demonstrating that our methods substantially outperform the state-of-the-art EAGLE3 and the BanditSpec baseline across a variety of domains where specialized domain-expert drafters are available, especially when long reasoning chains are required.
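A hedged illustration of why all drafters can be scored without extra target calls: in vanilla speculative decoding, a draft token x proposed from drafter distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution that must be computed anyway for verification. Given p, the expected acceptance of every drafter at that position reduces to a cheap sum; this reflects the standard acceptance rule, not necessarily the paper's exact estimator.

```python
# Sketch: score every candidate drafter from one target forward pass.
import torch

def expected_acceptance(target_probs, drafter_probs):
    """E_{x~q}[min(1, p(x)/q(x))] = sum_x min(p(x), q(x))."""
    return torch.minimum(target_probs, drafter_probs).sum(dim=-1)

def score_all_drafters(target_probs, drafter_probs_list):
    # target_probs: (vocab,) from the target pass already needed for verification;
    # drafter_probs_list: one (vocab,) distribution per candidate draft model.
    return [expected_acceptance(target_probs, q) for q in drafter_probs_list]
```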
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5332. LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
๐ฌ 16
โญ 3.00
๐ foundation or frontier models, including LLMs
๐ท๏ธ Large Language Models Reasoning Diffusion Models Latent Reasoning
Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLMs' autoregressive decoding may limit their ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration of diverse solutions. In this paper, we propose *LaDiR* (**La**tent **Di**ffusion **R**easoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models while operating effectively without large-scale pretraining. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent *thought tokens* with a blockwise bidirectional attention mask, enabling longer-horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories and lets the model plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5333. High-Probability Bounds for the Last Iterate of Clipped SGD
๐ฌ 22
โญ 3.00
๐ optimization
๐ท๏ธ Stochastic Optimization
We study the problem of minimizing a convex objective when only noisy gradient estimates are available. Under the mild assumption that the stochastic gradients have finite $\alpha$-th moments for some $\alpha \in (1,2]$, we show that the last iterate of clipped stochastic gradient descent (Clipped-SGD) achieves high-probability convergence of order $1/K^{(2\alpha-2)/(3\alpha)}$ on smooth objectives. Finally, we provide empirical results that support and complement our theoretical analysis.
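A minimal sketch of the Clipped-SGD update analyzed here: the stochastic gradient is rescaled to lie within an l2 ball of radius `clip` before the usual SGD step. The step size and clipping level are placeholders; in the paper they are tuned to the horizon K and the moment parameter alpha.

```python
# Sketch of one clipped SGD step over a list of parameter tensors.
import torch

def clipped_sgd_step(params, grads, lr=0.01, clip=1.0):
    # Global l2 norm of the stochastic gradient.
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, clip / (norm.item() + 1e-12))
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * scale * g
```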
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5334. Residual Feature Integration is Sufficient to Prevent Negative Transfer
๐ฌ 9
โญ 3.00
๐ learning theory
๐ท๏ธ deep transfer learning theory negative transfer provable safe transfer mechanism
Transfer learning has become a central paradigm in modern machine learning, yet it suffers from the long-standing problem of negative transfer, where leveraging source representations can harm rather than help performance on the target task. Although empirical remedies have been proposed, there remains little theoretical understanding of how to reliably avoid negative transfer. In this article, we investigate a simple yet remarkably effective strategy: augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals overlooked by models pretrained on the source data. We show that this residual feature integration strategy is sufficient to provably prevent negative transfer, by establishing theoretical guarantees that its convergence rate is no worse than that of training from scratch, up to logarithmic factors, over the informative class of target distributions, and that the rate transitions seamlessly from nonparametric to near-parametric when source representations are informative. To our knowledge, this is the first theoretical work that ensures protection against negative transfer. We carry out extensive numerical experiments across image, text and tabular benchmarks, and empirically verify that the method consistently safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance. We additionally demonstrate that this residual integration mechanism uniquely supports adapt-time multimodality extension, enabling a pretrained single-cell foundation model to incorporate spatial signals for lymph-node anatomical classification despite the source model being trained without them. Our study thus advances the theory of safe transfer learning, and provides a principled approach that is simple, robust, architecture-agnostic, and broadly applicable.
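A hedged sketch of the residual feature integration idea: a frozen source encoder supplies transferred features, a small trainable target-side encoder captures residual signal, and the prediction head sees both. The module sizes and the concatenation rule below are placeholders, not the paper's specification.

```python
# Sketch: frozen source features + trainable residual target encoder + joint head.
import torch
import torch.nn as nn

class ResidualIntegrationModel(nn.Module):
    def __init__(self, source_encoder, in_dim, src_dim, res_dim=64, n_classes=10):
        super().__init__()
        self.source = source_encoder.eval()
        for p in self.source.parameters():            # freeze pretrained source features
            p.requires_grad_(False)
        self.target = nn.Sequential(                   # trainable residual encoder
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, res_dim))
        self.head = nn.Linear(src_dim + res_dim, n_classes)

    def forward(self, x):
        with torch.no_grad():
            f_src = self.source(x)                     # frozen transferred features
        f_res = self.target(x)                         # residual target-side features
        return self.head(torch.cat([f_src, f_res], dim=-1))
```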
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5335. Overtone: Cyclic Patch Modulation for Cleaner, Faster Physics Emulators
๐ฌ 9
โญ 3.00
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ autoregressive models partial differential equations tokenization artifacts compute-adaptive inference
Transformer-based PDE surrogates achieve remarkable performance but face two key challenges: fixed patch sizes cause systematic error accumulation at harmonic frequencies, and computational costs remain inflexible regardless of problem complexity or available resources. We introduce Overtone, a unified solution through dynamic patch size control at inference. Overtone's key insight is that cyclically modulating patch sizes during autoregressive rollouts distributes errors across the frequency spectrum, preventing the systematic harmonic artifacts that plague fixed-patch models. We implement this through two architecture-agnostic modules, CSM (Cyclic Stride Modulator, using dynamic stride modulation) and CKM (Cyclic Kernel Modulator, using dynamic kernel resizing), that together provide both harmonic mitigation and compute-adaptive deployment. The harmonic mitigation alone yields up to 40% error reduction in long rollouts, while the flexible tokenization allows users to trade accuracy for speed dynamically based on computational constraints. Applied to challenging 2D and 3D PDE benchmarks, a single Overtone model matches or exceeds multiple fixed-patch baselines across all compute budgets.
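A hedged sketch of cyclic patch modulation during an autoregressive rollout: the patch size used to tokenize the field is cycled every step so that patch-boundary errors do not accumulate at a single set of harmonic frequencies. The `model` callable and the particular cycle (8, 12, 16) are placeholder assumptions, and the field resolution is assumed divisible by each patch size.

```python
# Sketch: cycle the tokenization patch size across rollout steps.
import itertools
import torch
import torch.nn.functional as F

def patchify(field, patch):
    """field: (B, C, H, W) -> (B, num_patches, C*patch*patch); H, W must be divisible by patch."""
    return F.unfold(field, kernel_size=patch, stride=patch).transpose(1, 2)

def rollout(model, state, steps, patch_cycle=(8, 12, 16)):
    sizes = itertools.cycle(patch_cycle)
    for _ in range(steps):
        patch = next(sizes)
        tokens = patchify(state, patch)
        state = model(tokens, patch_size=patch)   # placeholder surrogate predicts the next field
    return state
```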
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5336. Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems
๐ฌ 13
โญ 3.00
๐ generative models
๐ท๏ธ Generative Modeling Physics-Informed Machine Learning Inverse Problems Parameter Identification
We present a framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems. Starting from a model trained on low-fidelity or observational data, we apply a differentiable post-training procedure that minimizes weak-form residuals of governing partial differential equations (PDEs), promoting physical consistency and adherence to boundary conditions without distorting the underlying learned distribution. To infer unknown physical inputs, such as source terms, material parameters, or boundary data, we augment the generative process with a learnable latent parameter predictor and propose a joint optimization strategy. The resulting model produces physically valid field solutions alongside plausible estimates of hidden parameters, effectively addressing ill-posed inverse problems in a data-driven yet physics-aware manner. We validate our method on canonical PDE problems, demonstrating improved satisfaction of physical constraints and accurate recovery of latent coefficients. Further, we confirm cross-domain utility through fine-tuning of natural-image models. Our approach bridges generative modelling and scientific inference, opening new avenues for simulation-augmented discovery and data-efficient modelling of physical systems.
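A hedged sketch of what a weak-form PDE penalty can look like, here for the Poisson problem -lap(u) = f on the unit square with sine test functions that vanish on the boundary; the paper's weak-form residuals, governing equations, and weighting are more general than this toy example.

```python
# Sketch: weak-form residual of -lap(u) = f on a uniform grid, usable as a fine-tuning penalty.
import math
import torch

def weak_poisson_residual(u, f, n_test=8):
    """u, f: (B, N, N) fields sampled on a uniform grid over [0, 1]^2."""
    B, N, _ = u.shape
    h = 1.0 / (N - 1)
    xs = torch.linspace(0, 1, N)
    X, Y = torch.meshgrid(xs, xs, indexing="ij")
    # Central-difference gradients of the generated field u.
    ux = torch.gradient(u, spacing=h, dim=1)[0]
    uy = torch.gradient(u, spacing=h, dim=2)[0]
    res = 0.0
    for k in range(1, n_test + 1):
        phi = torch.sin(k * math.pi * X) * torch.sin(k * math.pi * Y)   # boundary-vanishing test fn
        px = torch.gradient(phi, spacing=h, dim=0)[0]
        py = torch.gradient(phi, spacing=h, dim=1)[0]
        # Weak residual: integral of grad(u).grad(phi) - f*phi over the domain.
        r = ((ux * px + uy * py) - f * phi).sum(dim=(1, 2)) * h * h
        res = res + r ** 2
    return (res / n_test).mean()
```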
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5337. Tab-MIA: A Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMs
๐ฌ 12
โญ 3.00
๐ datasets and benchmarks
๐ท๏ธ Membership Inference Attacks Tabular Data Large Language Models Privacy Leakage Table Encoding QLoRA Data Memorization Structured Data Benchmarking Model Vulnerability
Large language models (LLMs) are increasingly trained on tabular data, which, unlike unstructured text, often contains personally identifiable information (PII) in a highly structured and explicit format. As a result, privacy risks arise, since sensitive records can be inadvertently retained by the model and exposed through data extraction or membership inference attacks (MIAs). While existing MIA methods primarily target textual content, their efficacy and threat implications may differ when applied to structured data, due to its limited content, diverse data types, unique value distributions, and column-level semantics. In this paper, we present Tab-MIA, a benchmark dataset for evaluating MIAs on tabular data in LLMs and demonstrate how it can be used. Tab-MIA comprises five data collections, each represented in six different encoding formats. Using our Tab-MIA benchmark, we conduct the first evaluation of state-of-the-art MIA methods on LLMs fine-tuned with tabular data across multiple encoding formats. In the evaluation, we analyze the memorization behavior of pretrained LLMs on structured data derived from Wikipedia tables. Our findings show that LLMs memorize tabular data in ways that vary across encoding formats, making them susceptible to extraction via MIAs. Even when fine-tuned for as few as three epochs, models exhibit high vulnerability, with AUROC scores approaching 90% in most cases. Tab-MIA enables systematic evaluation of these risks and provides a foundation for developing privacy-preserving methods for tabular data in LLMs.
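For intuition, a hedged sketch of the simplest loss-threshold membership inference baseline on a table row serialized as text; the serialization format, model checkpoint name, and thresholding are placeholder assumptions, and Tab-MIA itself benchmarks a range of MIA methods and encodings.

```python
# Sketch: loss-based membership score for a serialized table row under a fine-tuned causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def row_to_text(row: dict) -> str:
    # One possible serialization; Tab-MIA covers several encoding formats.
    return ", ".join(f"{k}: {v}" for k, v in row.items())

@torch.no_grad()
def membership_score(model, tokenizer, row):
    ids = tokenizer(row_to_text(row), return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss      # mean per-token negative log-likelihood
    return -loss.item()                     # higher score => more likely a training member

# Hypothetical usage (checkpoint name is a placeholder):
# tok = AutoTokenizer.from_pretrained("my-finetuned-llm")
# lm = AutoModelForCausalLM.from_pretrained("my-finetuned-llm")
# score = membership_score(lm, tok, {"name": "Alice", "age": 34})
```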
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5338. Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
๐ฌ 33
โญ 3.00
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ Vision-Language Model Concentrated Attention Adversarial Robustness
To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Similar to a differential amplifier, FDA computes both the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former to obtain more aligned and robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields average ASR drops of 18%/13%/53% with only 0.2%/0.3%/0.6% performance drops on the 3 tested models for retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, along with in-depth ablation studies and analysis. Code will be made publicly available.
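A hedged sketch of the differential idea described above: within an attention head, the attention mass directed at function-word positions is subtracted from the full attention before value aggregation. The scaling factor, clamping, and renormalization below are illustrative assumptions, not necessarily FDA's exact formulation.

```python
# Sketch: subtract function-word attention from full attention inside a head.
import torch
import torch.nn.functional as F

def de_attention(q, k, v, function_word_mask, alpha=1.0):
    """q, k, v: (B, H, L, D); function_word_mask: (B, L) bool, True at function-word tokens."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5        # (B, H, L, L)
    attn = F.softmax(scores, dim=-1)
    fw = attn * function_word_mask[:, None, None, :]            # attention onto function-word keys
    adjusted = (attn - alpha * fw).clamp(min=0.0)
    adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return adjusted @ v
```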
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5339. IAGA: Identity-Aware Gaussian Approximation for Efficient 3D Molecular Generation
๐ฌ 15
โญ 2.67
๐ applications to physical sciences (physics, chemistry, biology, etc.)
๐ท๏ธ Gaussian approximation data identity efficient generation 3D molecular generation
Gaussian Probability Path based Generative Models (GPPGMs) generate data by reversing a stochastic process that progressively corrupts samples with Gaussian noise. Despite state-of-the-art results in 3D molecular generation, their deployment is hindered by the high cost of long generative trajectories, often requiring hundreds to thousands of steps during training and sampling. In this work, we propose a principled method, named GAGA, to improve generation efficiency without sacrificing the training granularity or inference fidelity of GPPGMs. Our key insight is that different data modalities attain sufficient Gaussianity at markedly different steps during the forward process. Based on this observation, we analytically identify a characteristic step at which molecular data attains sufficient Gaussianity, after which the trajectory can be replaced by a closed-form Gaussian approximation. Unlike existing accelerators that coarsen or reformulate trajectories, our approach preserves full-resolution learning dynamics while avoiding redundant transport through truncated distributional states. Experiments on 3D molecular generation benchmarks demonstrate that our GAGA achieves substantial improvements in both generation quality and computational efficiency.
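A hedged illustration of the "characteristic step" idea: under a standard variance-preserving forward process, q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I), so once sqrt(abar_t) is negligible the marginal is essentially N(0, I) and the remaining forward trajectory admits a closed-form Gaussian replacement. The linear schedule and tolerance below are placeholders, not the paper's per-modality criterion.

```python
# Sketch: locate the step after which the forward marginal is effectively standard Gaussian.
import torch

def characteristic_step(betas, tol=1e-3):
    abar = torch.cumprod(1.0 - betas, dim=0)
    signal = abar.sqrt()                        # remaining data signal at each step
    idx = (signal < tol).nonzero()
    return int(idx[0]) if idx.numel() else len(betas)

betas = torch.linspace(1e-4, 2e-2, 1000)        # a common linear noise schedule
t_star = characteristic_step(betas)
print(f"steps beyond t* = {t_star} can be replaced by sampling N(0, I) directly")
```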
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5340. PLAGUE: Plug-and-play Framework for Lifelong Adaptive Generation of Multi-turn Exploits
๐ฌ 14
โญ 2.50
๐ alignment, fairness, safety, privacy, and societal considerations
๐ท๏ธ LLM Red-Teaming Agentic AI
Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency, and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. Particularly, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization, and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5341. DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
๐ฌ 17
๐ unsupervised, self-supervised, semi-supervised, and supervised representation learning
๐ท๏ธ Visual Representation Learning Vision Language Models Document Understanding Web Agents
While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis
5342. EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer
๐ฌ 21
๐ generative models
๐ท๏ธ Video Generation Human Motion Generation
Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities.
Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks.
To facilitate the training of a model with these capabilities, we construct *HuMoVe*, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs.
Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation. The entire project will be open-sourced.
๐ OpenReview
๐ PDF
๐ค LLM-Analysis